techascent / tmducken

tech.ml.dataset integration with duckdb
MIT License
58 stars 5 forks source link

Fixed using uuid data type in duckdb #35

Closed lins05 closed 1 month ago

lins05 commented 2 months ago

I encountered an issue that when querying from an existing duckdb table where there is a uuid column, the first character of the returned uuid is always incorrect compared to querying directly from duckdb cli.

But it has a pattern: "8" is changed to "0", and "6" is changed to "e", so the MSB is always toggled.

The change in the PR (toggle the MSB in the code) could fix it, but I'm still not 100% sure what is going wrong 🤷

I added a unit test that writes an uuid value and get it back by querying select ... where id = .., which would fail in current master code.

In duckdb, the uuid is represented as a duckdb_hugeint:

typedef struct {
    uint64_t lower;
    int64_t upper;
} duckdb_hugeint;

First I though it's a bug in Unsafe, since the code uses Unsafe.putLong and Unsafe.getLong to r/w the 8-bytes upper value, could it be the wire implementation of java unsafe code has some bug?

To verify that, I tried to change the Unsafe.putLong to 8 calls of Unsafe.putByte but the data written to duckdb is still incorrect in the same manner!

So I guess the java implementation of UUID could have some issues in constructing and returning the upper 8 Bytes?

harold commented 2 months ago

Surprising and interesting! Thanks for this.

cnuernber commented 2 months ago

This is great and is definitely going in assuming the entire suite of unit tests do still pass. It is odd that we haven't seen this yet as our existing unit tests do include the uuid datatype.

lins05 commented 2 months ago

@cnuernber The existing uuid test only involves read/write both from clj land, so if one inspects the value from the duckdb side you'll find the value written is wrong (MSB is flipped).

The existing test could pass only because it writes to duckdb and read it back, and during the read the MSB is flipped again so clj code could see the correct value.

cnuernber commented 1 month ago

Understood - thanks!!