kewang1024 closed this issue 6 days ago
cc: @amitkdutta, @aditi-pandit
CC: @pramodsatya
I just had a chat with @aditi-pandit; she confirmed it's a data generation issue rather than a table reader bug. It's coming from here: https://github.com/facebookincubator/velox/blob/main/velox/tpch/gen/TpchGen.cpp#L80
Yes, this is a data generation issue: the velox Tpch connector seems to be generating incorrect data (when compared with Presto). For the column `l_quantity`, this seems to be because of the `decimalToDouble` function. The data generated with the Prestissimo Tpch connector for other columns in `lineitem`, such as `l_partkey` and `l_suppkey`, and for columns in other tables such as `part`, also does not match the data generated by Presto, so this needs further investigation. We need to compare the data generated by the velox connector against the spec and against DuckDB to ascertain whether the issue is with Presto or with the DuckDB dbgen tool used by the velox connector. Will update the issue with my findings from this comparison.
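For context on the `decimalToDouble` path mentioned above, here is a minimal sketch of the conversion that dbgen-style generators rely on: decimal columns are carried internally as scaled integers and divided out when converting to a floating-point column. The function name and scale below are illustrative only, not Velox's actual implementation:

```python
def decimal_to_double(raw: int, scale: int = 2) -> float:
    """Interpret `raw` as a fixed-point value with `scale` fractional digits.

    Illustrative stand-in for a decimal-to-double conversion; dbgen emits
    e.g. 1750 for the quantity 17.50 when the scale is 2.
    """
    return raw / (10 ** scale)

print(decimal_to_double(1750))  # 17.5
```

A bug at this boundary (wrong scale, or skipping the division) would make every value of the column diverge from Presto Java while still looking structurally valid, which matches the symptom described above.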
Also, as per the TPC-H spec, `l_quantity` and other columns which are of type `double` in Presto should have type `Decimal` instead (found an old issue regarding this: https://github.com/prestodb/presto/issues/3557). So the table schema generated by the Presto Tpch connector also needs to be modified. I noticed that the class `TpchColumnType`, from the airlift dependency used by the Presto Tpch connector, doesn't support `Decimal` types, so we also need to look at how `Decimal` support would be added there.
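A quick illustration of why the spec mandates a decimal type rather than `double` for such columns: binary doubles cannot represent most scale-2 fractions exactly, while a decimal type can. Python's `decimal` module stands in for SQL `DECIMAL` here:

```python
from decimal import Decimal

# Decimal arithmetic is exact for scale-2 money-like values...
assert Decimal("0.10") + Decimal("0.20") == Decimal("0.30")

# ...while binary doubles accumulate representation error.
assert 0.1 + 0.2 != 0.3
print(0.1 + 0.2)  # 0.30000000000000004
```

For TPC-H's small scales the error is usually hidden by formatting, which is why `double` has worked in practice, but exact checksums and spec conformance argue for `Decimal`.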
Thanks @pramodsatya. Is this one similar issue as well? https://github.com/prestodb/presto/issues/24011
Yes @amitkdutta, the checksum mismatch here appears to be because of a data mismatch between the Presto and Prestissimo Tpch connectors for the column `o_comment`; this mismatch is seen in `Varchar` columns from other tables as well.

@majetideepak, could you please share whether the data mismatch for `Varchar` columns is expected and is a known issue?
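For readers unfamiliar with how a single differing `Varchar` column surfaces as a checksum mismatch, here is a toy order-independent column checksum. This is a stand-in scheme for illustration only; Presto's verifier uses its own checksum functions:

```python
import hashlib

def column_checksum(values) -> int:
    # Sum a 64-bit hash of each value so the result is independent of
    # row order; any single differing cell changes the total.
    total = 0
    for v in values:
        h = hashlib.sha256(str(v).encode()).digest()
        total = (total + int.from_bytes(h[:8], "big")) % (1 << 64)
    return total

# Hypothetical sample comment strings, differing in one value.
java_comments = ["carefully final deposits", "quick brown packages"]
cpp_comments = ["carefully final deposits", "quick brown parcels"]
assert column_checksum(java_comments) != column_checksum(cpp_comments)
```

Because the checksum aggregates the whole column, it flags the mismatch even when row counts and schemas agree, which is exactly the failure mode seen in #24011.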
@pramodsatya, I know the Varchar data is mismatched as well. I think it should be the same as Java. Can you confirm this against DuckDB as well?
> `decimalToDouble`

@pramodsatya, can we fix this to bring parity with Presto Java? I think `double` would be okay since the decimal scale seems to be at most 2.
> @pramodsatya, I know the Varchar data is mismatched as well. I think it should be the same as Java. Can you confirm this against DuckDB as well?

@majetideepak, yes the Varchar data is mismatched too, and it differs from Presto Java. This is from checking data at SF1 for the following columns from three tables: `p_comment` in `part`, `o_comment` in `orders`, and `l_comment` in `lineitem`. The data generated by Presto Java matches the spec, and the data generated by Prestissimo matches DuckDB; so it looks like the data generated by Prestissimo (and DuckDB) is incorrect.
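The three-way comparison described above can be mechanized with a small diff helper over the generators' row output. This is a sketch; the column names are real `lineitem` columns, but the sample rows are made up:

```python
def first_mismatch(rows_a, rows_b, columns):
    """Return (row_index, column, a_value, b_value) for the first cell
    that differs between two generators' outputs, or None if they agree."""
    for i, (ra, rb) in enumerate(zip(rows_a, rows_b)):
        for col, a, b in zip(columns, ra, rb):
            if a != b:
                return (i, col, a, b)
    return None

cols = ["l_orderkey", "l_quantity", "l_comment"]
presto = [["1", "17.00", "egular courts above the"]]
prestissimo = [["1", "17.00", "egular courts above"]]
print(first_mismatch(presto, prestissimo, cols))
# (0, 'l_comment', 'egular courts above the', 'egular courts above')
```

Running the same helper with the spec's reference data as one side tells you which generator diverged, which is the ascertainment step mentioned earlier in the thread.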
decimalToDouble
@pramodsatya can we fix this to bring parity with Presto Java? I think double would be okay since the decimal scale seems to be atmost 2.
Sure, I opened a PR in velox to fix the data generated for column l_quantity
. Could you please help review the change?
https://github.com/facebookincubator/velox/pull/11511
Upon checking the TPC-H SF1 data generated by Prestissimo and Presto and comparing them against DuckDB and the spec, it looks like there is a mismatch only for the `Varchar` columns and the column `l_quantity`. The data mismatch for the other columns in `lineitem` noticed before was with the `tiny` scale factor and is not seen with SF1.
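For reference, TPC-H table cardinalities scale linearly with the scale factor for most tables (per the TPC-H spec, `orders` has SF × 1,500,000 rows), which is why the `tiny` (SF 0.01) and SF1 data sets differ in size and must be validated separately:

```python
def expected_orders_rows(sf: float) -> int:
    # Per the TPC-H spec, the orders table has SF * 1,500,000 rows.
    return round(sf * 1_500_000)

assert expected_orders_rows(1) == 1_500_000
assert expected_orders_rows(0.01) == 15_000  # the "tiny" scale factor
```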
> The data generated by Presto Java matches that of the spec, and the data generated by Prestissimo matches that of DuckDB; so it looks like the data generated by Prestissimo (and DuckDB) is incorrect.

@pramodsatya Can you clarify this further? How different are the Java and Prestissimo results after your fix? All SFs should have the same data.
> The data generated by Presto Java matches that of the spec, and the data generated by Prestissimo matches that of DuckDB; so it looks like the data generated by Prestissimo (and DuckDB) is incorrect.
>
> @pramodsatya Can you clarify this further? How different are the Java and Prestissimo results after your fix? All SFs should have the same data.
@majetideepak, apologies for the confusion. That comment was only with respect to the `Varchar` columns containing `comment` fields. After this fix, except for the `Varchar` columns with the field `comment`, the rest of the columns generated with Presto and Prestissimo match for SF1. The incorrect `Varchar` columns generated by DuckDB's dbgen would still need to be fixed (for both SF1 and SF 0.01).
For the tiny SF, there is a mismatch between Presto and Prestissimo for the `Varchar` columns with the field `comment`, and for other `lineitem` columns such as `l_partkey` and `l_suppkey`. The data generated with Presto Java matches the spec, and the mismatch in Prestissimo appears to be because of DuckDB's dbgen tool. The tiny SF doesn't seem to be supported in DuckDB's TPC-H extension; the data generated for the tiny SF by Prestissimo is the same as the SF1 data for columns such as `l_partkey` and `l_suppkey`, but with a reduced row count. While working on the Tpcds connector, we had noticed that the minimum scale factor supported by DuckDB was 1, and changes had to be made in DuckDB's dsdgen to generate data with the tiny SF. Similar changes are likely needed for dbgen too.
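To see why simply truncating SF1 output is not the same as generating tiny-SF data, note that key ranges themselves scale with SF: per the TPC-H spec, `part` has SF × 200,000 rows, so `l_partkey` references should stay within that range. A toy sketch (not dbgen's actual generator) makes the difference visible:

```python
import random

def gen_partkeys(sf: float, n: int, seed: int = 0):
    # Per the TPC-H spec, part has SF * 200,000 rows, so partkey
    # references are drawn from [1, SF * 200,000].
    rng = random.Random(seed)
    limit = max(1, int(sf * 200_000))
    return [rng.randint(1, limit) for _ in range(n)]

# Properly generated tiny-SF data stays within the tiny key range...
tiny = gen_partkeys(0.01, 1000)
assert max(tiny) <= 2_000

# ...whereas a truncated prefix of SF1 data does not.
sf1_prefix = gen_partkeys(1.0, 1000)
assert max(sf1_prefix) > 2_000
```

This is consistent with the observation above that Prestissimo's tiny-SF `l_partkey` values look like SF1 data with fewer rows.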
Is it fine if we take up all these dbgen fixes in a separate PR?
@pramodsatya: Yes, let's handle the varchar issues in https://github.com/prestodb/presto/issues/24011