Open l1t1 opened 1 year ago
Thanks for bringing this up! Yes, Hyper can use multiple cores. In this particular case, though, the input is so small that Hyper will not benefit much from multi-threading. Hyper's full performance is only unleashed on data sets much larger than 17 megabytes. I would recommend testing Hyper with data sizes of at least a couple of gigabytes.
However, I guess your actual question is not really about multi-core anyway. You are probably wondering: "Why is Hyper slower than DuckDB on those queries?" Let's take a closer look at this 🙂
The trick is to use CREATE TEMPORARY EXTERNAL TABLE. The difference is:
- With FROM 'file/path.parquet', Hyper assumes that you will be accessing the file only once. Hence, Hyper will close and open the file multiple times. Each time, Hyper needs to rediscover the metadata.
- With CREATE EXTERNAL TABLE, you tell Hyper that you are planning to access the table multiple times. Hyper will keep the metadata about your file in cache.

Here is an updated benchmark:
import time
from tableauhyperapi import HyperProcess, Telemetry, Connection

with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(endpoint=hyper.endpoint) as connection:
        # Declare the Parquet file once as an external table
        a=connection.execute_command("CREATE TEMPORARY EXTERNAL TABLE tripdata FOR './yellow_tripdata_2021-06.parquet'")
        # Time each query; the first count(1) is run twice on purpose
        t=time.time()
        a=connection.execute_scalar_query("select count(1) from tripdata")
        print(time.time()-t, ": ", a)
        t=time.time()
        a=connection.execute_scalar_query("select count(1) from tripdata")
        print(time.time()-t, ": ", a)
        t=time.time()
        a=connection.execute_list_query("select passenger_count,count(1) from tripdata group by passenger_count order by 1")
        print(time.time()-t, ": ", a)
        t=time.time()
        a=connection.execute_list_query("select passenger_count,sum(trip_distance) from tripdata group by passenger_count order by 1")
        print(time.time()-t, ": ", a)
Note how I first declared an external table, and then used it in the following queries.
This gives me the following numbers:
- select count(1) from tripdata (1st time): 0.521 seconds
- select count(1) from tripdata (2nd time): 0.006 seconds
- select passenger_count,count(1) from tripdata group by passenger_count order by 1: 0.100 seconds
- select passenger_count,sum(trip_distance) from tripdata group by passenger_count order by 1: (no timing given)
Note how the first run of the first query is rather slow. This is because Hyper computes some statistics on the external table the first time you access it. Those statistics are important to Hyper's optimizer so that it picks a good query plan. For the simple queries we are benchmarking here, those statistics won't make much of a difference, but for more complex join queries, they are vital.
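To separate that one-time cost from steady-state query time, a benchmark can run each query several times and report the first (cold) run apart from the later (warm) runs. The following is a minimal sketch of that idea, not part of the original benchmark: `time_repeated` and `FakeTable` are hypothetical names, and `FakeTable` only simulates a one-time cost on first access so the example runs without a database.

```python
import time


def time_repeated(run_query, repeats=3):
    """Call run_query() `repeats` times and return the elapsed seconds per call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings


# Hypothetical stand-in for a database table: the first call pays a
# one-time cost, later calls are cheap.
class FakeTable:
    def __init__(self):
        self.stats_ready = False

    def count(self):
        if not self.stats_ready:
            time.sleep(0.05)  # simulated one-time work on first access
            self.stats_ready = True
        return 42


table = FakeTable()
timings = time_repeated(table.count)
cold, warm = timings[0], timings[1:]
print(f"cold: {cold:.3f}s, best warm: {min(warm):.6f}s")
```

With Hyper, `run_query` could be something like `lambda: connection.execute_scalar_query("select count(1) from tripdata")`, reusing the connection from the benchmark above.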
The updated performance numbers of Hyper are already much closer to DuckDB. They are still slightly slower. We could tune Hyper further, but I am not sure this would make sense: your benchmark data is pretty small, and Hyper is tuned towards larger data sets. I would be interested in the performance your benchmark yields on larger data sets.
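For reference, the two access patterns compared in this reply differ only in the SQL text that is sent. Here is a small sketch that builds both statements as strings, without needing a running Hyper process. The helper names are hypothetical, and the table name is assumed to be a simple identifier that needs no quoting.

```python
def sql_string_literal(value: str) -> str:
    """Quote a value as a SQL string literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def ad_hoc_count_query(parquet_path: str) -> str:
    """Build a query that reads the file directly in the FROM clause."""
    return f"select count(1) from {sql_string_literal(parquet_path)}"


def external_table_statement(table_name: str, parquet_path: str) -> str:
    """Build the statement that declares the file once under a table name."""
    # Assumption: table_name is a plain identifier and is inserted unquoted.
    return (
        f"CREATE TEMPORARY EXTERNAL TABLE {table_name} "
        f"FOR {sql_string_literal(parquet_path)}"
    )


path = "./yellow_tripdata_2021-06.parquet"
print(ad_hoc_count_query(path))
print(external_table_statement("tripdata", path))
```

The second statement is the one passed to `connection.execute_command` in the benchmark; later queries then refer to `tripdata` by name.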
@vogelsgesang Thank you for your detailed reply, I learned a lot. It is also unfair to compare Python modules with native binaries. One more question: does the Hyper database have a CLI? I found a hyperd server in Tableau Desktop, but no client.
Test of the Python duckdb module:
import time
import duckdb

# Expose the Parquet file as a view so each query can refer to it by name
duckdb.sql("CREATE view tripdata as select * from 'd:/yellow_tripdata_2021-06.parquet'")
if 1==1:
    t=time.time()
    a=duckdb.sql("select count(1) from tripdata")
    print(time.time()-t, ":\n", a)
    t=time.time()
    a=duckdb.sql("select count(1) from tripdata")
    print(time.time()-t, ":\n", a)
    t=time.time()
    a=duckdb.sql("select passenger_count,count(1) from tripdata group by passenger_count order by 1")
    print(time.time()-t, ":\n", a)
    t=time.time()
    a=duckdb.sql("select passenger_count,sum(trip_distance) from tripdata group by passenger_count order by 1")
    print(time.time()-t, ":\n", a)
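One caveat about these timings may be worth checking. As far as I know, `duckdb.sql` returns a lazily evaluated relation, and `print` computes `time.time()-t` before it converts `a` to text, so the measured interval may not include most of the query execution. The sketch below illustrates the effect with a hypothetical `LazyResult` stand-in rather than DuckDB itself; with DuckDB, calling `a.fetchall()` before stopping the timer would be one way to force execution inside the timed region.

```python
import time


class LazyResult:
    """Hypothetical stand-in for a lazily evaluated query result."""

    def __init__(self, compute):
        self._compute = compute

    def fetchall(self):
        # The real work happens only when rows are requested.
        return self._compute()


def slow_query():
    time.sleep(0.05)  # simulated query execution
    return [(1,)]


# Timing only the construction misses the execution cost.
t = time.perf_counter()
lazy = LazyResult(slow_query)
construct_seconds = time.perf_counter() - t

# Timing construction plus fetchall() includes the execution cost.
t = time.perf_counter()
rows = LazyResult(slow_query).fetchall()
full_seconds = time.perf_counter() - t

print(f"construct only: {construct_seconds:.6f}s, with fetch: {full_seconds:.6f}s")
```

If that applies here, the DuckDB and Hyper numbers are not directly comparable, because the Hyper calls `execute_scalar_query` and `execute_list_query` return materialized results.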
A query on DuckDB's CREATE VIEW is faster than one on Hyper's CREATE TEMPORARY EXTERNAL TABLE.
returns
while the DuckDB CLI on the same machine, querying the same file