prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Update the TPCDS connector to be compliant with the latest spec and dsdgen #22932

Open yingsu00 opened 4 weeks ago

yingsu00 commented 4 weeks ago

The TPCDS specification and dsdgen have been updated several times in the last few years, but the Presto TPCDS connector hasn't been updated for a long time. We recently noticed several changes:

  1. The row count error margin was removed. Spec v2.1.0 allowed a 0.01% difference in the row counts, but the margin was removed in 2019. Now all row counts must exactly match those of the latest dsdgen.
  2. The old dsdgen generates a different column order than the spec for the modification data sets. This bug should have been fixed in 2021.
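
To make the two changes above concrete, here is a rough sketch of the checks involved. It is not taken from the connector or from dsdgen: the function names, the row counts, and the column names in the usage note are all made up for illustration. The margin check uses integer arithmetic so that the 0.01% boundary is not affected by floating-point rounding.

```python
def within_margin(expected, actual):
    """Old-style check: accept a relative row count difference of up to 0.01%.

    0.01% is 1/10,000, so |actual - expected| / expected <= 1/10,000 is
    rewritten as an integer comparison.
    """
    return abs(actual - expected) * 10_000 <= expected


def exact_match(expected, actual):
    """New-style check: the row count must equal the reference count exactly."""
    return actual == expected


def first_column_mismatch(spec_columns, generated_columns):
    """Return the index of the first position where the column order differs.

    Returns None when both lists are identical. If one list is a strict
    prefix of the other, the index of the first missing column is returned.
    """
    for index, (spec, generated) in enumerate(zip(spec_columns, generated_columns)):
        if spec != generated:
            return index
    if len(spec_columns) != len(generated_columns):
        return min(len(spec_columns), len(generated_columns))
    return None
```

For example, with a hypothetical reference count of 1,000,000 rows, a table with 1,000,050 rows would have passed the old margin check but fails the exact check. Likewise, `first_column_mismatch(["col_a", "col_b", "col_c"], ["col_a", "col_c", "col_b"])` reports index 1 as the first out-of-order column.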

Expected Behavior or Use Case

The Presto TPCDS connector should generate exactly the same data as the latest spec and dsdgen.
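
One possible way to verify this is to diff the connector output against a reference file row by row. The sketch below assumes the reference file is pipe-delimited with a trailing delimiter on each line, and that connector rows are available as sequences of strings with NULLs rendered as empty strings. Both are assumptions to check against the actual dsdgen output, and `parse_line` and `first_row_difference` are hypothetical helper names.

```python
from itertools import zip_longest


def parse_line(line):
    """Split one reference line into fields.

    Assumption: fields are separated by '|' and each line ends with one
    trailing '|', which is dropped before splitting.
    """
    line = line.rstrip("\n")
    if line.endswith("|"):
        line = line[:-1]
    return line.split("|")


def first_row_difference(reference_lines, connector_rows):
    """Return the index of the first row that differs, or None if all match.

    A missing row on either side (different row counts) also counts as a
    difference, so no error margin is tolerated.
    """
    pairs = zip_longest(reference_lines, connector_rows)
    for index, (reference, row) in enumerate(pairs):
        if reference is None or row is None:
            return index
        if parse_line(reference) != list(row):
            return index
    return None
```

Calling `first_row_difference(open(path), rows)` streams both sides, so large tables do not need to be held in memory.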

Presto Component, Service, or Connector

The Presto Java-based TPCDS connector

Possible Implementation

Example Screenshots (if appropriate):

[Screenshot 2024-06-05 at 17 34 21]
[Screenshot 2024-06-05 at 17 48 49]

Context

yingsu00 commented 4 weeks ago

cc @yzhang1991 @aditi-pandit @pratyakshsharma @nmahadevuni