prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.06k stars 5.38k forks source link

TPCH connector return different results on prestissimo for l_quantity #24010

Closed kewang1024 closed 6 days ago

kewang1024 commented 1 week ago
select l_quantity, l_orderkey from tpch.sf1.lineitem where l_orderkey = 321030;

Java

_quantity | l_orderkey 
------------+------------
       0.04 |     321030 
       0.17 |     321030 
       0.04 |     321030 
       0.04 |     321030 
       0.48 |     321030

Prestissimo

l_quantity | l_orderkey
------------+------------
        4.0 |     321030
       17.0 |     321030
        4.0 |     321030
        4.0 |     321030
       48.0 |     321030
kewang1024 commented 1 week ago

cc: @amitkdutta, @aditi-pandit

amitkdutta commented 1 week ago

CC: @pramodsatya

I just had a chat with @aditi-pandit, she confirmed its a data generation issue - rather than a table reader bug. Its coming from here https://github.com/facebookincubator/velox/blob/main/velox/tpch/gen/TpchGen.cpp#L80

pramodsatya commented 1 week ago

Yes this is a data generation issue and the velox Tpch connector seems to be generating incorrect data (when compared with Presto). For the column l_quantity, this seems to be because of the decimalToDouble function. The data generated with the Prestissimo Tpch connector for other columns in lineitem such as l_partkey and l_suppkey, and columns in other tables such as part, also do not match with that generated by Presto so this needs further investigation. We need to compare the data generated by the velox connector against the specs and against Duckdb to ascertain if the issue is with Presto or the Duckdb dbgen tool used by the velox connector. Will update the issue with my findings from this comparison.

Also as per the Tpch specs, l_quantity and other columns which are of type double in Presto should have type Decimal instead (found an old issue reg this: https://github.com/prestodb/presto/issues/3557). So the table schema generated by Presto Tpch connector also needs to be modified. I noticed that class TpchColumnType, from the airlift dependency used by the Presto Tpch connector, doesn't support Decimal types so also need to look at how Decimal support would be added here.

amitkdutta commented 1 week ago

Thanks @pramodsatya. Is this one similar issue as well? https://github.com/prestodb/presto/issues/24011

pramodsatya commented 1 week ago

Thanks @pramodsatya. Is this one similar issue as well? #24011

Yes @amitkdutta, the checksum mismatch here appears to be because of data mismatch between Presto and Prestissimo Tpch connectors for column o_comment, this mismatch is seen in Varchar columns from other tables as well. @majetideepak, could you please share if the data mismatch for Varchar columns is expected and is a known issue?

majetideepak commented 6 days ago

@pramodsatya, I know the Varchar data is mismatched as well. I think it should be the same as Java. Can you confirm this against DuckDB as well?

majetideepak commented 6 days ago

decimalToDouble

@pramodsatya can we fix this to bring parity with Presto Java? I think double would be okay since the decimal scale seems to be atmost 2.

pramodsatya commented 6 days ago

@pramodsatya, I know the Varchar data is mismatched as well. I think it should be the same as Java. Can you confirm this against DuckDB as well?

@majetideepak, yes the Varchar data is mismatched too and it differs from Presto Java, this is from checking data with SF1 for following columns from three tables: p_comment in part, o_comment in orders and l_comment in lineitem. The data generated by Presto Java matches that of the spec, and the data generated by Prestissimo matches that of Duckdb; so it looks like the data generated by Prestissimo (and Duckdb) is incorrect.

pramodsatya commented 6 days ago

decimalToDouble

@pramodsatya can we fix this to bring parity with Presto Java? I think double would be okay since the decimal scale seems to be atmost 2.

Sure, I opened a PR in velox to fix the data generated for column l_quantity. Could you please help review the change? https://github.com/facebookincubator/velox/pull/11511

Upon checking the Tpch SF1 data generated by Prestissimo and Presto and comparing them against Duckdb and the spec, it looks like there is a parity only for Varchar columns and the column l_quantity. The data mismatch for other columns in lineitem noticed before was with tiny scale factor and is not seen with SF1.

majetideepak commented 6 days ago

The data generated by Presto Java matches that of the spec, and the data generated by Prestissimo matches that of Duckdb; so it looks like the data generated by Prestissimo (and Duckdb) is incorrect.

@pramodsatya Can you clarify this further? How different are the Java and Prestissimo results after your fix? All SF should have the same data.

pramodsatya commented 6 days ago

The data generated by Presto Java matches that of the spec, and the data generated by Prestissimo matches that of Duckdb; so it looks like the data generated by Prestissimo (and Duckdb) is incorrect.

@pramodsatya Can you clarify this further? How different are the Java and Prestissimo results after your fix? All SF should have the same data.

@majetideepak, apologies for the confusion. This comment was only with respect to the Varchar columns containing comments. After this fix, except for the Varchar columns with the field comment, rest of the columns generated with Presto and Prestissimo match for SF1. The incorrect Varchar columns generated by Duckdb's dbgen would still need to be fixed (for both SF1 and SF 0.01). For tiny SF, there is a parity between Presto and Prestissimo for the Varchar columns with the field comment, and for other lineitem columns such as l_partkey and l_suppkey. However, the data generated with Presto Java matches the spec and the mismatch in Prestissimo appears to be because of Duckdb's dbgen tool. Tiny SF doesn't seem to be supported in Duckdb's Tpch extension; the data generated for tiny SF by Prestissimo is the same as the SF1 data for columns such as l_partkey and l_suppkey but with reduced row count. While working on the Tpcds connector as well, we had noticed the minimum scale factor supported by Duckdb was 1 and changes had to be made in Duckdb's dsdgen to generate data with tiny SF. Likely similar changes are needed for dbgen too.

Is it fine if we take up all these dbgen fixes in a separate PR?

aditi-pandit commented 5 days ago

@pramodsatya : Yes, lets handle the varchar issues in https://github.com/prestodb/presto/issues/24011