can hyper python API use multi-core？

l1t1 commented 1 year ago

from tableauhyperapi import HyperProcess, Telemetry, Connection

with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(endpoint=hyper.endpoint) as connection:
        import time
        t=time.time()
        a=connection.execute_scalar_query("select count(1) from 'd:/yellow_tripdata_2021-06.parquet'")
        print(a,time.time()-t)
        t=time.time()
        a=connection.execute_list_query("select passenger_count,count(1) from 'd:/yellow_tripdata_2021-06.parquet'group by passenger_count order by 1")
        print(a,time.time()-t)
        t=time.time()
        a=connection.execute_list_query("select passenger_count,sum(trip_distance) from 'd:/yellow_tripdata_2021-06.parquet'group by passenger_count order by 1")
        print(a,time.time()-t)

returns

2834264 0.18601059913635254
[[0, 66636], [1, 1968440], [2, 412798], [3, 108634], [4, 40950], [5, 67686], [6, 45562], [7, 11], [8, 5], [9, 4], [None, 123538]] 0.20101165771484375
[[0, 172554.11], [1, 5797179.629999995], [2, 1341309.7100000011], [3, 343928.14999999997], [4, 134748.31000000006], [5, 204493.66000000003], [6, 13989
3.91], [7, 33.44], [8, 9.17], [9, 0.0], [None, 11517949.330000013]] 0.2130122184753418

while duckdb CLI on same machine query same file

D select passenger_count,count(1) from 'd:/yellow_tripdata_2021-06.parquet'group by passenger_count order by 1;
┌─────────────────┬──────────┐
│ passenger_count │ count(1) │
│      int32      │  int64   │
├─────────────────┼──────────┤
│               0 │    66636 │
│               1 │  1968440 │
│               2 │   412798 │
│               3 │   108634 │
│               4 │    40950 │
│               5 │    67686 │
│               6 │    45562 │
│               7 │       11 │
│               8 │        5 │
│               9 │        4 │
│                 │   123538 │
├─────────────────┴──────────┤
│ 11 rows          2 columns │
└────────────────────────────┘
Run Time (s): real 0.197 user 0.171601 sys 0.000000
D select passenger_count,count(1) from 'd:/yellow_tripdata_2021-06.parquet'group by passenger_count order by 1;
┌─────────────────┬──────────┐
│ passenger_count │ count(1) │
│      int32      │  int64   │
├─────────────────┼──────────┤
│               0 │    66636 │
│               1 │  1968440 │
│               2 │   412798 │
│               3 │   108634 │
│               4 │    40950 │
│               5 │    67686 │
│               6 │    45562 │
│               7 │       11 │
│               8 │        5 │
│               9 │        4 │
│                 │   123538 │
├─────────────────┴──────────┤
│ 11 rows          2 columns │
└────────────────────────────┘
Run Time (s): real 0.074 user 0.156001 sys 0.046800
D select passenger_count,sum(trip_distance) from 'd:/yellow_tripdata_2021-06.parquet'group by passenger_count order by 1
> ;
┌─────────────────┬────────────────────┐
│ passenger_count │ sum(trip_distance) │
│      int32      │       double       │
├─────────────────┼────────────────────┤
│               0 │  172554.1099999999 │
│               1 │  5797179.629999994 │
│               2 │ 1341309.7100000044 │
│               3 │ 343928.15000000084 │
│               4 │ 134748.30999999997 │
│               5 │ 204493.66000000027 │
│               6 │ 139893.91000000006 │
│               7 │              33.44 │
│               8 │               9.17 │
│               9 │                0.0 │
│                 │ 11517949.330000013 │
├─────────────────┴────────────────────┤
│ 11 rows                    2 columns │
└──────────────────────────────────────┘
Run Time (s): real 0.079 user 0.296402 sys 0.140401

l1t1 commented 1 year ago

data: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-06.parquet

vogelsgesang commented 1 year ago

Thanks for bringing this up! Yes, Hyper can use multiple cores. In this particular case, the input set is so tiny, that Hyper will not benefit much from multi-threading, though. Hyper's full performance will only be unleashed on much bigger data sets than 17 megabyte. I would recommend testing Hyper with data sizes of at least a couple of gigabytes.

However, I guess your actual question is not about multi-core anyway. I guess you are rather wondering: "Why is Hyper slower than DuckDB on those queries?". Let's take a closer look at this 🙂

The trick is to use CREATE TEMPORARY EXTERNAL TABLE. The difference is:

With the adhoc-syntax FROM 'file/path.parquet, Hyper assumes that you will be accessing the file only once. Hence, Hyper will close and open the file multiple times. Each time, Hyper needs to rediscover the metadata on every query.
With CREATE EXTERNAL TABLE, you tell Hyper that you are planning to access the table multiple times. Hyper will keep the meta data about your file in cache.

Here is an updated benchmark:

import time
from tableauhyperapi import HyperProcess, Telemetry, Connection

with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(endpoint=hyper.endpoint) as connection:
        a=connection.execute_command("CREATE TEMPORARY EXTERNAL TABLE tripdata FOR './yellow_tripdata_2021-06.parquet'")

        t=time.time()
        a=connection.execute_scalar_query("select count(1) from tripdata")
        print(time.time()-t, ": ", a)

        t=time.time()
        a=connection.execute_scalar_query("select count(1) from tripdata")
        print(time.time()-t, ": ", a)

        t=time.time()
        a=connection.execute_list_query("select passenger_count,count(1) from tripdata group by passenger_count order by 1")
        print(time.time()-t, ": ", a)

        t=time.time()
        a=connection.execute_list_query("select passenger_count,sum(trip_distance) from tripdata group by passenger_count order by 1")
        print(time.time()-t, ": ", a)

Note how I first declared an external table, and then used it in the following queries.

This gives me the following numbers:

select count(1) from tripdata (2nd time)`: 0.521
select count(1) from tripdata (2nd time): 0.006 seconds
select passenger_count,count(1) from tripdata group by passenger_count order by 1: 0.100
select passenger_count,sum(trip_distance) from tripdata group by passenger_count order by 1

Note how the first time we run the first query is rather slow. This is because Hyper computes some statistics on the external table the firs time you access it. Those statistics are important to Hyper's optimizer such that it will pick a good query plan. For the simple queries we are benchmarking here, those statistics won't make much of a difference, but for more complex join queries, those statistics are vital.

The updated performance numbers of Hyper are already much closer to DuckDB. Still slightly slower - we could tune Hyper further but I am not sure this would make sene. Your benchmark data is pretty small and Hyper is more tuned towards larger data sets. I would be interested in which performance your benchmark yields on larger data sets

l1t1 commented 1 year ago

@vogelsgesang thank you for your detailed reply, I learned a lot. it's unfair to compare the python modules with native binaries too, one more question, does hyper database have the CLI? I found a hyperd server in the tableau desktop, but no client.

l1t1 commented 1 year ago

both of following articles use python modules. https://aetperf.github.io/2023/03/30/TPC-H-benchmark-of-Hyper,-DuckDB-and-Datafusion-on-Parquet-files.html https://aetperf.github.io/2023/04/27/TPC-H-benchmark-of-DuckDB-and-Hyper-on-native-files.html

l1t1 commented 1 year ago

test of python module duckdb

import time
import duckdb

duckdb.sql("CREATE view tripdata as select * from 'd:/yellow_tripdata_2021-06.parquet'")

if 1==1:
        t=time.time()
        a=duckdb.sql("select count(1) from tripdata")
        print(time.time()-t, ":\n", a)

        t=time.time()
        a=duckdb.sql("select count(1) from tripdata")
        print(time.time()-t, ":\n", a)

        t=time.time()
        a=duckdb.sql("select passenger_count,count(1) from tripdata group by passenger_count order by 1")
        print(time.time()-t, ":\n", a)

        t=time.time()
        a=duckdb.sql("select passenger_count,sum(trip_distance) from tripdata group by passenger_count order by 1")
        print(time.time()-t, ":\n", a)

query on CREATE view of duckdb is faster than CREATE TEMPORARY EXTERNAL TABLE of hyper

tableau / hyper-api-samples

can hyper python API use multi-core？ #99