Implement xxhash non-zero seed support

nyurik commented 8 months ago

I am not certain this is needed, but if anyone knows more about this, and finds it useful, putting down some thoughts on how it can be done.

The xxhash functions xxh32, xxh64, xxh3_64, xxh3_128 support, and in case of xxh32 and xxh64 require (in their underlying implementation) to provide a seed value (u32 for xxh32, and u64 for the rest). Currently, the seed value is always set to 0.

Some users may need to have a seed value at least for xxh32, xxh64 for them to match expectations. For that, we could provide additional non-zero-seeded SQL functions:

SEEDED_XXH32(seed, value1, value2, ...) - the first argument is treated as an integer seed value, the rest are processed as with all other scalar hashing functions. NULL will be returned if seed value is NULL, regardless of the rest of the values.
SEEDED_XXH32_CONCAT(seed, column1, column2, ...) - aggregated/window function, will only process the seed value on the first row, and ignore it on the subsequent ones. Not sure this is what the users will expect though.

gmottajr commented 8 months ago

Hi there! 👋

I've been pondering over the xxhash function enhancements you mentioned, not sure if I totally got your point, as my first thoughts, at the top of my head, I would come with something like ...

What if we introduce a unified function, let's call it seeded_xx, which can handle both xxh32 and xxh64 hashing? This function could take an enum parameter to specify the desired hash type. 🤔

Here's a quick breakdown of the idea:

Unified Function: seeded_xx serves as a one-stop-shop for hashing, reducing the complexity for the end-user.
Hash Type Enum: We can define an enum, say HashType, with variants like XX32 and XX64 and the others. This enum will dictate which hashing algorithm seeded_xx should use. 🧩
Flexible and Maintainable: This approach encapsulates the variations between the 32-bit and 64-bit implementations, making our codebase more flexible and easier to maintain. 🛠️

Here’s a rough sketch of what the implementation might look like in Rust:

enum HashFunctionType {
    XXH32,
    XXH64,
    XXH3_64,
    XXH3_128,
}

fn hash_with_function(hash_function: HashFunctionType, data: &[u8]) -> u128 {
    match hash_function {
        HashFunctionType::XXH32 => {
            let seed: u32 = 0;  // Hard-coded seed for XXH32
            xxh32::hash_with_seed(data, seed) as u128
        },
        HashFunctionType::XXH64 => {
            let seed: u64 = 0;  // Hard-coded seed for XXH64
            xxh64::hash_with_seed(data, seed) as u128
        },
        HashFunctionType::XXH3_64 => {
            let seed: u64 = 0;  // Hard-coded seed for XXH3_64
            xxh3_64::hash_with_seed(data, seed) as u128
        },
        HashFunctionType::XXH3_128 => {
            let seed: u128 = 0;  // Hard-coded seed for XXH3_128
            xxh3_128::hash_with_seed(data, seed)
        },
    }
}

// Usage examples
let hash32 = seeded_xx(HashType::XX32, 12345, b"hello world");
let hash64 = seeded_xx(HashType::XX64, 12345, b"hello world");

A few points to consider: Type Conversion: For XX32, we'll need to handle the conversion of the seed from u64 to u32. Performance: We should ensure this abstraction doesn’t introduce any significant overhead, given xxhash's reputation for speed. 🚀

I believe this approach keeps things user-friendly while maintaining the flexibility and robustness of our implementation. What do you all think? Looking forward to your feedback and insights! 🌟

Cheers,

nyurik commented 8 months ago

thx @gmottajr for the suggestion.

From the implementation perspective, we clearly should unifie the design, just like i did with most other implementation here - e.g. there is a shared code to instantiate hasher, keep track on the current state when using aggregate hasher, and finalize it - all done via Digest trait.

From the usage perspective, it would break a lot of common practices, without really gaining us much. It is easier to write SELECT MD5('foo') rather than SELECT HASH(MD5, 'foo') -- enums are not very good in the SQL. The same goes for the seeder variants etc - easier to create dedicated functions. Besides, it would probably be a string 'MD5' - so we would have to do some string matching before we even get to the hashing part - not much value especially because various tooling tend to support SQL validation, and can tell if the function is there, but won't tell if a parameter is incorrect. So in short - far more problems without any real benefit.

gmottajr commented 8 months ago

Hi Yuri, Thank you for your feedback on my suggestion. That makes sense, I understand when you mention the significance of ease of use in SQL contexts and the importance of compatibility with existing tooling and practices. I got it, you have a point on your perspective on maintaining straightforward SQL statements and avoiding complications with enums and string matching.

Anyway, I'm eager to contribute to the project in ways that align with its current needs and goals. In case you want to proceed with this change or If there are other areas where you think my skills could be useful, please let me know. I'm open to any suggestions or areas where my efforts could add value. Best regards,

nyurik commented 8 months ago

@gmottajr thank you, glad to see so much interest in my projects :) As a simple starter, perhaps try https://github.com/maplibre/martin/issues/1093 ? At least the first two items are very easy to do. Let me know if you run into any problems. You can also ping me on OSM-US Slack (see link in any MapLibre's or Martin readme)

nyurik / sqlite-hashes

Implement xxhash non-zero seed support #9