trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.31k stars 2.97k forks source link

Varbinary bit manipulations #23682

Open brunokim opened 1 week ago

brunokim commented 1 week ago

tl;dr: can bitwise_* functions be overloaded to varbinary types?

I wanted to generate an UUIDv4 from the first 128 bits of the output of a SHA256 hash. This requires some bit manipulations according to RFC4122:

To accomplish that with Trino, I managed to do the following:

  1. Split the varbinary into 4 bigints, each containing only 32 bits of the input and left-padded with 32 zeros.
  2. Use bitwise_and and bitwise_or to reset the affected bits, and then set the desired values.
  3. Extract the lower half of bigints and concat them to recreate a 128-bit varbinary.

The reason to use 32 bits within a 64-bit integer is to be able to set the MSB, perhaps because hexadecimals are always considered positive? In the commented code below is my attempt to use full 64 bits, which fails because the MSB is set on the integer literals.

WITH
t1 AS (
    SELECT sha256(to_utf8('trino 🐰')) AS hash
),
t2 AS (
    SELECT
        hash
        -- Using int32
        , from_big_endian_64(X'0000 0000' || substr(hash, 1, 4)) AS octets_0_to_3
        , from_big_endian_64(X'0000 0000' || substr(hash, 5, 4)) AS octets_4_to_7
        , from_big_endian_64(X'0000 0000' || substr(hash, 9, 4)) AS octets_8_to_11
        , from_big_endian_64(X'0000 0000' || substr(hash, 13, 4)) AS octets_12_to_15
        -- Using int64
        , from_big_endian_64(substr(hash, 1, 8)) AS octets_0_to_7
        , from_big_endian_64(substr(hash, 9, 8)) AS octets_8_to_15
    FROM t1
),
t3 AS (
    SELECT
        hash
        -- Using int32
        , octets_0_to_3
        , bitwise_or(bitwise_and(octets_4_to_7, 0xFFFF_0FFF), 0x0000_4000) AS octets_4_to_7
        , bitwise_or(bitwise_and(octets_8_to_11, 0x3FFF_FFFF), 0x8000_0000) AS octets_8_to_11
        , octets_12_to_15
        -- Using int64
        --, bitwise_or(bitwise_and(octets_0_to_7, 0xFFFF_FFFF_FFFF_0FFF), 0x0000_0000_0000_4000) AS octets_0_to_7
        --, bitwise_or(bitwise_and(octets_8_to_15, 0x3FFF_FFFF_FFFF_FFFF), 0x8000_0000_0000_0000) AS octets_8_to_15
    FROM t2
),
t4 AS (
    SELECT
        hash
        -- Using int32
        , substr(to_big_endian_64(octets_0_to_3), 5, 4)
        || substr(to_big_endian_64(octets_4_to_7), 5, 4)
        || substr(to_big_endian_64(octets_8_to_11), 5, 4)
        || substr(to_big_endian_64(octets_12_to_15), 5, 4)
        AS uuid_octets_1
        -- Using int64
        --, to_big_endian_64(octets_0_to_7) || to_big_endian_64(octets_8_to_15)
        --AS uuid_octets_2
    FROM t3
),
t5 AS (
    SELECT
        to_hex(hash) AS hash_hex
        , CAST(CAST(uuid_octets_1 AS uuid) AS varchar) AS uuidv4_1
        --, CAST(CAST(uuid_octets_2 AS uuid) AS varchar) AS uuidv4_2
    FROM t4
)
SELECT * FROM t5;

This is the result

hash_hex: 13C1C0C6 818A 593E 5016 F2A4B463937B4592500D83F3F3B3134BF0CCA501F0B2
uuidv4_1: 13c1c0c6-818a-493e-9016-f2a4b463937b

I wish bitwise_and and bitwise_or were available to varbinary types, so this could have been only

bitwise_or(
  bitwise_and(
    substr(sha256(to_utf8('trino 🐰')), 1, 16),
    X'FFFF FFFF FFFF 0FFF 3FFF FFFF FFFF FFFF'),
  X'0000 0000 0000 4000 8000 0000 0000 0000')

Some more feature requests I surfaced from this exercise:

mosabua commented 3 days ago

It would be great if you can send pull requests for the documentation improvements.

With regards to the function I think it would be best for @martint to chime in and confirm if this would be okay to implement. If yes, then you could also go ahead and send a PR for that ideally.

brunokim commented 3 days ago

I'd love to contribute! I may be able to work on this in November. Can you point me where are the docs and bitwise implementation, in the meantime?

mosabua commented 3 days ago

Look at BitwiseFunctions.java in io.trino.operator.scalar

Also https://trino.io/development/process

wendigo commented 2 days ago

The open question is how these functions are meant to work if the both varbinary have different lengths, i.e. 000011 & 0010100000

martint commented 2 days ago

We should add support for varbinary, but I would start conservatively and disallow varbinaries of different lengths. We can expand later. In particular, we're going to have to look into how SQL does conversions between varbinary(n) with different n (currently, Trino only supports varbinary without length indicator). The semantics of the bitwise_* functions should be consistent with those behaviors.