heavyai / heavydb

HeavyDB (formerly OmniSciDB)
https://heavy.ai
Apache License 2.0
2.95k stars 448 forks source link

Exception: Cannot use a fast path for COUNT distinct #90

Open fleapapa opened 7 years ago

fleapapa commented 7 years ago

This query generates an error as title.

select count(distinct floor(trip_distance)) from trips;

The query works with other databases. What should i change to it to make it work with MapD?

Schema of trips table:

CREATE TABLE trips ( medallion TEXT ENCODING DICT, hack_license TEXT ENCODING DICT, vendor_id TEXT ENCODING DICT, rate_code_id SMALLINT, store_and_fwd_flag TEXT ENCODING DICT, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP,
passenger_count SMALLINT, trip_time_in_secs INTEGER, trip_distance DECIMAL(14,2), pickup_longitude DECIMAL(14,2), pickup_latitude DECIMAL(14,2), dropoff_longitude DECIMAL(14,2), dropoff_latitude DECIMAL(14,2) ) WITH (FRAGMENT_SIZE=75000000);

asuhan commented 7 years ago

You can use select count(distinct cast(trip_distance as integer)) from trips; or select approx_count_distinct(floor(trip_distance)) from trips; if a good (within 4%) approximation is ok. The original query would also run with the watchdog disabled, but it'd use a slow, std::set based implementation.

In order to make your query run unchanged, we just need to add range information for the FLOOR function. Not hard at all, I might do it next week.

fleapapa commented 7 years ago

Thanks. Because query "select count(distinct CEIL(trip_distance)) from trips;" doesn't work for fast path and cast seems not a substitute for CEIL(), CEIL function also need range information. Actually MOD and most other functions need the information, too.

fleapapa commented 7 years ago

BTW, another minor issue is that ROUND() seems not behaving as documented:

mapdql> select count(distinct ROUND(trip_distance)) from trips; Exception: Exception occurred: org.apache.calcite.runtime.CalciteContextException: From line 1, column 23 to line 1, column 44: Invalid number of arguments to function 'ROUND'. Was expecting 2 arguments

It seems treated as ROUND_TO_DIGIT (x, y):

mapdql> select count(distinct ROUND(trip_distance,0)) from trips; Exception: Cannot use a fast path for COUNT distinct

asuhan commented 7 years ago

Not sure what version you're using, we've removed ROUND altogether in 6f2730e5a5027b5d37eedbbf65c244e700c76436.

fleapapa commented 7 years ago

The version is MapD Server 3.2.1-20170826-327e7ee; it's older than 5 days ago :)

asuhan commented 7 years ago

Ok, that's before we've removed it. Yes, all functions you've mentioned need range information for precise count distinct using bitmap implementation.