pola-rs / tpch


The rest of the queries? #39

Open marsupialtail opened 1 year ago

marsupialtail commented 1 year ago

Polars can run them for sure. Do you want a contribution?

ritchie46 commented 1 year ago

That would be great!

marsupialtail commented 1 year ago

@ritchie46 I have started but ran into a problem. Here is how I wrote query 13:

import polars

ref_customer = polars.read_csv("/home/ziheng/tpc-h/customer.tbl", sep="|")
ref_orders = polars.read_csv("/home/ziheng/tpc-h/orders.tbl", sep="|").filter(
    ~(polars.col("o_comment").str.contains("special") & polars.col("o_comment").str.contains("requests"))
)
ref = (
    ref_customer.join(ref_orders, left_on="c_custkey", right_on="o_custkey", how="left")
    .with_column(polars.col("o_orderkey").is_not_null().alias("o_orderkey_1"))
    .groupby("c_custkey").agg([polars.col("o_orderkey_1").sum()])
    .groupby("o_orderkey_1").count()
    .sort("count", reverse=True)
)
# .sort("o_orderkey_1", reverse=True)

However, this gives wrong results. Any suggestions?

marsupialtail commented 1 year ago

Never mind, I know what the problem is. I need to make sure "special" comes before "requests", so I have to use a regex.
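
To illustrate the ordering issue, here is a minimal sketch using only Python's standard `re` module. It assumes Q13's predicate is the SQL pattern `o_comment not like '%special%requests%'`. The sample comments and the helper names are made up for illustration and are not taken from the TPC-H data. Two independent substring checks also exclude comments where "requests" appears before "special", whereas a single ordered pattern does not:

```python
import re

# SQL '%special%requests%' means: "special", then any characters, then "requests".
# re.DOTALL lets "." also match newlines, mirroring how "%" matches any character.
ORDERED = re.compile(r"special.*requests", re.DOTALL)


def excluded_by_two_contains(comment: str) -> bool:
    """Mimics the two separate contains() checks: word order is ignored."""
    return "special" in comment and "requests" in comment


def excluded_by_ordered_pattern(comment: str) -> bool:
    """Mimics the SQL LIKE pattern: "special" must come before "requests"."""
    return ORDERED.search(comment) is not None


# Hypothetical sample comments, invented for this sketch.
comments = [
    "special packages; requests sleep",  # both words, right order
    "requests nag special deposits",     # both words, wrong order
    "furiously final accounts",          # neither word
]

for c in comments:
    print(c, excluded_by_two_contains(c), excluded_by_ordered_pattern(c))
```

Only the second comment is treated differently by the two checks, which is enough to skew the per-customer order counts. In Polars, `str.contains` interprets its argument as a regular expression by default, so a pattern such as `special.*requests` should be usable in the filter directly.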

ghuls commented 1 year ago

A pandas implementation of the 22 queries: https://gist.github.com/UranusSeven/55817bf0f304cc24f5eb63b2f1c3e2cd

stinodego commented 4 months ago

Polars / pyspark / DuckDB have full query coverage. We should still include the pandas queries. Perhaps the link above could help.