prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
15.92k stars 5.33k forks source link

Result mismatches between Presto java and native for UUID types #23311

Open aditi-pandit opened 1 month ago

aditi-pandit commented 1 month ago

Queries with UUID have different results in Java vs Native

Your Environment

Expected Behavior

UUIDs are typically stored as VARCHAR or VARBINARY values in HMS/DWRF/Parquet etc. They are cast as UUID in subsequent SQL.

Presto Java uses Java UUID to represent UUID, whereas the block representation for native is int128. When getting results from presto-cli the order of bytes is flipped. This leads to misrepresentations.

Steps to Reproduce

presto:tpch_old> CREATE TABLE a AS SELECT c_uuid  FROM ( VALUES (null), ('33355449-2c7d-43d7-967a-f53cd23215ad'), ('eed9f812-4b0c-472f-8a10-4ae7bff79a47'), ('f768f36d-4f09-4da7-a298-3564d8f3c986')) AS x (c_uuid)

Presto Java 
select CAST(c_uuid as uuid) from a;
                _col0                 
--------------------------------------
 NULL                                 
 33355449-2c7d-43d7-967a-f53cd23215ad 
 eed9f812-4b0c-472f-8a10-4ae7bff79a47 
 f768f36d-4f09-4da7-a298-3564d8f3c986 
(4 rows)

Presto C++

select CAST(c_uuid as uuid) from a;
                _col0     
--------------------------------------
 NULL          
d7437d2c-4954-3533-ad15-32d23cf57a96
2f470c4b-12f8-d9ee-479a-f7bfe74a108a
a74d094f-6df3-68f7-86c9-f3d8643598a2

SELECT CAST(CAST(c_uuid AS uuid) AS VARCHAR) returns the same results however.
                _col0                 
--------------------------------------
 NULL                                 
 33355449-2c7d-43d7-967a-f53cd23215ad 
 eed9f812-4b0c-472f-8a10-4ae7bff79a47 
 f768f36d-4f09-4da7-a298-3564d8f3c986 
(4 rows)
aditi-pandit commented 1 month ago

@mbasmanova @amitkdutta @tdcmeehan

amitkdutta commented 1 month ago

Cc @kgpai @kagamiori

aditi-pandit commented 1 month ago

The fix could be in the PrestoSerializer/Deserializer for UUID to use the byte-ordering needed by the Presto Java side. Prototyping a fix.

tdcmeehan commented 1 month ago

I wonder if this is a correctness bug, since the encodings of UUIDs are a well-defined format: https://datatracker.ietf.org/doc/html/rfc9562. i.e. if anyone relies on comparing on persisted UUIDs, this might be incorrect since it seems the byte order is revered on either the Java or C++ side?

BryanCutler commented 1 month ago

Hi @mohsaka , please let me know if there was anything I could help out with this issue, thanks.

mohsaka commented 1 month ago

Hi @BryanCutler Feel free to take over the issue. I am not actively working on it.

mohsaka commented 4 hours ago

So I think the underlying issue is actually with HUGEINT in presto that was never discovered since HUGEINT isn't a type in presto.

@BryanCutler Pointed out that its actually not reversed but rather reversed and swapped 64 bit words.

The main issue is that the HUGEINT type in presto is represented by an array of longs in the form of [MSW, LSW]. However in Velox, we store it in the form of Little Endian, [LSW,MSW].

Therefore when we send it over to presto, we are sending it over as an array of two longs or 64 bit words. Which means [LSW,MSW]. Then the serializer takes into account the endian reversal which results in us having [Reversed LSW, Reversed MSW].

So I recommended that he fix the HUGEINT issue first, then fix the reversal issue by putting the proper HUGEINT decimal value in so that when the endian is flipped we get the proper value in Java presto.