viega / hatrack

Fast, multi-reader, multi-writer, lockless data structures for parallel programming

Getting "malloc: Double free" error when overwriting keys too often #8

Open mallman opened 1 year ago

mallman commented 1 year ago

I haven't quite nailed down the exact circumstances in which this bug occurs, but I think you can reliably reproduce it by modifying examples/basic.c to add multiple calls to hatrack_dict_put(envp_dict, env_key, env_val) in the same loop iteration when populating envp_dict, like so:

    while (envp[i]) {
        p       = envp[i];
        env_eq  = strchr(p, '=');
        env_key = strndup(p, env_eq - p);
        env_val = strdup(env_eq + 1);

        hatrack_dict_put(envp_dict, env_key, env_val);
        hatrack_dict_put(envp_dict, env_key, env_val);
        hatrack_dict_put(envp_dict, env_key, env_val);
        hatrack_dict_put(envp_dict, env_key, env_val);
        hatrack_dict_put(envp_dict, env_key, env_val);
        hatrack_dict_put(envp_dict, env_key, env_val);
        hatrack_dict_put(envp_dict, env_key, env_val);
        hatrack_dict_put(envp_dict, env_key, env_val);
        hatrack_dict_put(envp_dict, env_key, env_val);

        i++;
    }

Obviously, this depends on the number of environment variables. I have about 40. Just add more calls to hatrack_dict_put(envp_dict, env_key, env_val) if you still don't trigger the bug.

I think this is related to the value of HATRACK_RETIRE_FREQ, because the bug occurs in the function call to mmm_empty() in mmm_retire() in mmm.c.

I'm running this on an M1 Mac with macOS 13.4 and Xcode 14.3.1's clang:

Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

I configured hatrack from scripts/config-debug.

Thank you!

viega commented 11 months ago

Hey I'm really sorry to have taken so long to notice this. I don't know why I didn't get the notification; I just saw it right now.

My recollection of that API is that when you pass a value into it, it takes ownership of the memory cell passed in by default (unless you provide an ejection callback), and will free it when the item is ejected. If you're putting in the same value multiple times, you need to stick each copy of the value in a different memory cell.

Looking at the example, I would absolutely expect the above loop to lead to a crash even in a single-threaded program, because when you insert the memory cell the second time, it marks the first for deletion once it can 'prove' that no other threads might be holding a reference to it.
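
For illustration, here's roughly what that loop would need to look like under that ownership model, giving each put its own copies (just a sketch, assuming the envp_free_handler in basic.c frees both the ejected key and value):

    while (envp[i]) {
        p       = envp[i];
        env_eq  = strchr(p, '=');
        env_key = strndup(p, env_eq - p);
        env_val = strdup(env_eq + 1);

        hatrack_dict_put(envp_dict, env_key, env_val);

        /* The second put ejects the first item; since the ejection
         * handler frees that item's key and value, each put must get
         * its own freshly allocated copies. */
        hatrack_dict_put(envp_dict,
                         strndup(p, env_eq - p),
                         strdup(env_eq + 1));

        i++;
    }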

Hope that's helpful. I'll leave this open for a while and will spend more time on it soon just to double check.

And I'm going to look at my settings right now to make sure I am getting any tickets right away, really sorry about that!

mallman commented 11 months ago

Hi John. No worries. I'm glad to hear from you. I thought the project might be abandoned, and it looks like great work.

Would it be an error to reuse the same key to set different values? I'm not sure the fact that the values are the same in my code sample is germane to my original issue.

I set hatrack aside when I hit this stumbling block. Returning to my test case, I'm no longer getting the "double free" error.

Can you try running the example?

One thing that's changed on my side since I reported this is that I've updated my compiler:

Apple clang version 15.0.0 (clang-1500.0.40.1)
Target: arm64-apple-darwin22.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

I just tried again with clang 14.0.3, and still don't get an error. Hmmmm....

viega commented 11 months ago

Absolutely.

I just ran the example with the extra calls to re-add the same environment variable. That does cause double frees, as I would expect:

viega@UpDog examples % ./basic 
Freeing: TMPDIR: /var/folders/6n/n6z_d_2952g_kdzs7bfz4q8w0000gn/T/
Freeing: `A: "
basic(99960,0x1f14e1e00) malloc: *** error for object 0x600000044160: pointer being freed was not allocated
basic(99960,0x1f14e1e00) malloc: *** set a breakpoint in malloc_error_break to debug
zsh: abort      ./basic

I don't have an environment variable named A, of course.

If you look at the ejection handler in that file, envp_free_handler, it frees the memory allocated for both the key and the value at some point after there's an overwrite.

So if you insert the same memory cell multiple times, it'll get ejected multiple times, and it'll get freed multiple times.

The handler getting called indicates that the table no longer references the ejected cell, but hatrack can't know whether you've reused that cell elsewhere.

So if you are re-using values, then freeing when the handler gets called is the wrong behavior; you might use an atomic reference count in this situation (with atomic_fetch_add()), bumping the count when inserting, lowering when the handler is called, and only freeing if it gets to 0.

Or, just wait till the table is torn down.

If you don't register an ejection handler, it won't try to free the cells.

As for reusing keys, it's the exact same mechanism-- if you want to be able to reuse them you can. If you don't want to ever free anything during the lifetime of the table, don't worry about registering a handler. But if you do want to free things, then you need to manage how to deal with items that might be duplicates.
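
If you did go the reference-counting route mentioned above, a minimal sketch might look something like the following (hypothetical helper code, not part of hatrack; it assumes C11 atomics, with rc_release() called from your ejection handler):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* A shared value wrapper: bump the count before every
     * hatrack_dict_put() of this cell, and drop it from the ejection
     * handler. The payload is freed only when the last reference held
     * by the table goes away. */
    typedef struct {
        _Atomic uint64_t refs;
        void            *payload;
    } rc_cell_t;

    static rc_cell_t *
    rc_new(void *payload)
    {
        rc_cell_t *cell = malloc(sizeof(rc_cell_t));

        atomic_init(&cell->refs, 0);
        cell->payload = payload;

        return cell;
    }

    /* Call once per insertion of this cell. */
    static void
    rc_retain(rc_cell_t *cell)
    {
        atomic_fetch_add(&cell->refs, 1);
    }

    /* Call from the ejection handler; frees on the last release. */
    static void
    rc_release(rc_cell_t *cell)
    {
        if (atomic_fetch_sub(&cell->refs, 1) == 1) {
            free(cell->payload);
            free(cell);
        }
    }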

mallman commented 11 months ago

Somehow I didn't notice the registration of the ejection handler. This makes sense now.

Obviously, this is not the exact code that led me to file this bug in the first place. If I come across the code that I was having trouble with, I will re-examine it. It's possible I was misusing the API.

Thanks for your help.

mallman commented 11 months ago

BTW, I found out why I was unable to reproduce the issue yesterday. I had commented out the code in hatrack that frees the memory. After undoing that change, I was able to reliably get the double free error.

viega commented 11 months ago

Thanks, if you do have any issues, just let me know. I have made sure this repo gets flagged, so I shouldn't miss it in the future.

mallman commented 11 months ago

Hi John. I found the original code that led me to file this issue, and it's still crashing even though I don't think it should be. I'm going to re-open this ticket, though we may end up with a different one.

I will look at how I can provide a test case that more accurately reflects the problem I'm still having. In the meantime, though, I'll mention that the exact place where my code is crashing is line 625 in

https://github.com/viega/hatrack/blob/1c66f8b0f485d2b0d9e09bd075e6d2c16d46414e/src/hash/dict.c#L617-L628

The problem is that callback is NULL. This is irrespective of whether I assign a free handler or not. The variable dict in this context is not the same as the dict I'm allocating from my program.

I would have to dig deeper to figure out what dict in that function actually is, but can you have a look at that code and verify that dict->free_handler should be non-null as that code assumes?

viega commented 11 months ago

Sounds good, I will take a look (tho maybe not till tonight or in the morning); I assume if you do register a callback it's fine? The callback doesn't need to free anything...

I thought I'd made the callback optional, but my memory could also be hazy, I'm getting older :)

mallman commented 11 months ago

Sounds good, I will take a look (tho maybe not till tonight or in the morning); I assume if you do register a callback it's fine? The callback doesn't need to free anything...

I am setting a callback, yes. As I mentioned though, in the function I referenced above, dict is not the same as the dict my code inits, and dict->free_handler is NULL for some reason... 🤔

I thought I'd made the callback optional, but my memory could also be hazy, I'm getting older :)

Join the club. (BTW, I appreciate the reference to Zork.)

Here's a complete code sample that crashes. It's based on basic.c, but pared down:

#include <hatrack.h>
#include <string.h>
#include <stdio.h>
#include <stdbool.h>

static void
envp_free_handler(hatrack_dict_t *unused, hatrack_dict_item_t *item)
{
    /* Intentionally frees nothing; just dereferences the ejected item. */
    void *key = item->key;
    return;
}

int
main(int argc, char *argv[], char *envp[])
{
    hatrack_dict_t *envp_dict;
    uint64_t        i;
    char           *env_eq; // Pointer to the equals mark.
    char           *env_key;
    char           *env_val;
    char           *p;

    envp_dict = hatrack_dict_new(HATRACK_DICT_KEY_TYPE_CSTR);

    hatrack_dict_set_free_handler(envp_dict,
                                  (hatrack_mem_hook_t)envp_free_handler);

    i = 0;

    while (envp[i]) {
        p       = envp[i];
        env_eq  = strchr(p, '=');
        env_key = strndup(p, env_eq - p);
        env_val = strdup(env_eq + 1);

        hatrack_dict_put(envp_dict, env_key, env_val);
        hatrack_dict_remove(envp_dict, env_key);

        i++;
    }

    hatrack_dict_delete(envp_dict);

    return 0;
}

There are two things in particular I want to point out in this example. First, I'm setting a free handler, but it doesn't actually free anything. However, it does access item. Second, I'm removing the key after I put it.

From what I understand about the API, this should be safe. My code isn't freeing env_key anywhere, including in the free handler. So my expectation is that hatrack shouldn't free env_key either. Further, I'm assuming that it's safe to reference item in the free handler.

(Some detailed API docs would be very welcome, BTW. 😄)

I appreciate your assistance.

viega commented 11 months ago

This is useful, thank you! I'll get to it by morning.

I definitely didn't quite finish things up before going back to work, and that includes documentation :( In addition to docs, I really should strip everything down to what the core library should be, as opposed to the comparative algorithms, and rm the stuff that never got finished.

I'll try to get to that stuff too. I probably will be using this library in a new project again soon, so I'll keep this in mind.

mallman commented 11 months ago

Thanks. It's no rush, really. I'm just on a roll now with this hatrack stuff, but I need to put my attention elsewhere, too.

viega commented 11 months ago

Okay, I might take another couple days then, it's turning out to be a busy week :)

mallman commented 5 months ago

Okay, I might take another couple days then, it's turning out to be a busy week :)

Hi @viega. I thought I'd just follow up with you to see how you're feeling about this project.

From my experience, hatrack is wonderful software. Its lack of API documentation makes its adoption more challenging than it need be. Adopters like me will struggle with problems that may come down to misunderstandings about how to use the API. Can you devote some time to shore up the documentation? It would help me debug my own issue here and decide whether this is an issue of misusing the API or an actual bug in hatrack.

Cheers.

(BTW, I'm using hatrack from a Swift project. Bet you never expected a Swift programmer to use hatrack.)

viega commented 5 months ago

@mallman: Thanks for reaching out again. We've been making good use of hatrack ourselves, and I've done a little bit of work on the library, but have done it all in a whole new repo, because of the other things I'm building around it.

To my recollection, I was unable to reproduce your problem, and was using it in a way pretty consistent with what you said.

My biggest problem in this repo had been fighting to be able to build reliably across all the architectures we're using. In the new repo, I migrated to Meson, which has been much easier than my lifelong fights with CMake and Autotools.

If you have meson installed, I believe you should be able to check out: https://github.com/crashappsec/libcon4m

And then I believe you can just run ./dev hash and it should create a hash directory, and build just the hatrack portions (the library it leaves behind will be called libcon4m.a now though; feel free to name it back).

A tiny bit of the other stuff in that repo that isn't built with that target may or may not be interesting to you. Primarily, there's a simple interface for garbage collecting. Right now it's fairly straightforward and it basically gives each thread its own heap, and isn't yet handling cross-heap pointers. But it is meant to help remove the burden of cross-language memory management... just call con4m_gc_thread_collect() when you want to clean up unused bits and bobs.

I'm not sure yet when I'm going to back-port the hatrack bits to here. I think it depends on where the other project ends up, because there are some advantages to having them together while developing, and I don't want to have to worry about syncing two versions of it.

And on documentation, I 100% agree, though do note that, even though we are using it all, I still haven't really announced any of the work.

What exactly would help you documentation wise? I can try to prioritize it.

mallman commented 5 months ago

Hi @viega. I'm getting back to my hatrack code, using the libcon4m codebase. I'm running into some trouble with double-frees, but before I dive deeper into that I'd like to report an issue with c4test. I'll do that in the other repo.

mallman commented 5 months ago

Hi @viega. I have another issue to report for hatrack. Should I report it in this repo or in the libcon4m repo?

Basically, it appears that distinct keys with identical hash values overwrite each other in a dict.

viega commented 5 months ago

Hey there!

First, let me say that it's an intentional (and very common) choice with dictionaries to treat the hash value as a unique stand-in for the key, meaning if there is a collision, then the two keys will be treated as identical.

This is incredibly common with production hash tables because, if you select the right hash algorithm, the odds of it happening are infinitesimal. For instance, using SHA-256 and truncating the result to a 128 bit key would result in a collision on average once every 2^64 hashes.

However, for hash tables, it's common to skip the cryptographic hash function to save cycles. I selected XXH128 to package in because it's incredibly fast and seems to be collision resistant given non-malicious inputs. The hash speed should be about 10x faster than using SHA-256. But SHA-256 isn't actually particularly expensive.

So I guess I'd ask:

  1. Are you using a custom hash?
  2. If you're using the built-in XXH128, can you share colliding inputs? If it's going to be an issue, I'll happily swap out for SHA256. The test harness memoizes the hashes anyway, so as to not measure the impact of the key hash.

Switching to SHA-256 would definitely guarantee that you wouldn't have collisions for all practical purposes (it would essentially require a break of SHA-256 to raise the odds of a collision to any meaningful value).

As for repo, either place is fine. I have set both up to get me notifications :)

mallman commented 5 months ago

Interesting. With the hash tables I've seen, they either claim that distinct keys map to distinct values, or I've made that assumption. At least in the theory of hash tables I've seen, there are two common approaches to handling hash collisions—chaining and open addressing. Both ensure that distinct inputs are always treated as distinct keys. If an implementation breaks that contract, I wonder why it doesn't just state in its API documentation that keys with equal hash values are treated as equal keys. Then there would be no confusion about why two different keys with the same hash value are treated as the same key.

In my particular case, I'm using a custom hash that returns 0 for every input. This isn't part of any real code; it's part of a unit test to validate assumptions. The assumption I'm trying to validate is that distinct values with identical hash codes are treated as distinct keys in a hatrack dict. Actually, this wasn't one of my original unit tests, because I usually don't test a library API. I caught this serendipitously when I read a line in the source code about keys being compared by hash, and that made me want to test this assumption of mine.

My unit tests are for my Swift wrapper around hatrack. I'm still getting double-free related crashes for one specific test case: that where I call hatrack_dict_replace for a key that doesn't already exist. But that's a different issue I'm trying to debug.

Thanks for clearing this up, though this does make me a little nervous about using hatrack with custom hashes.

I've seen one hash table implementation which avoided the issue of resolving hash collisions by restricting hash keys to 64-bit integers, and guaranteeing that those keys never collide.

One question... if I use the built-in HATRACK_DICT_KEY_TYPE_INT key type, does hatrack guarantee that distinct integers are treated as distinct keys? Or is that still probabilistic?

Thank you.

mallman commented 5 months ago

For custom keys, I'm using Swift's standard hashing algorithm. That algorithm is not documented by the API, so clients can't know how likely it is for two different keys to have identical hashes.

The trouble with requiring clients to implement a high-quality custom hash is that I don't want to make that part of my API. I want clients of my API to be able to use any hash algorithm for Swift data structures, so long as identical keys have identical hashes.

To put it another way, I want to make it as easy as possible for clients to use my hatrack wrapper. For custom key types, the easiest thing for clients to do is use the built-in hash value algorithm.

viega commented 5 months ago

Well, the default Hatrack hash table uses open addressing. There's a sophisticated caching mechanism for linear probing that I based somewhat loosely on hopscotch hashing (hopscotch won't work directly for something lock-free). The details are documented at the top of crown.c.

I do understand the confusion now, though, and I should be able to explain it. But first, let me try to level set us (I'm sure you already understand the below, just want to be clear and build up to the communication issue).

  1. The hash value is a function of the key. Ideally, this would be fast to compute, minimize collisions, and be uniformly distributed.
  2. The hash value is used to determine an initial bucket for consideration. If it's empty, it gets used.
  3. When we 'use' a bucket, we write the hash value, PLUS the key/item.
  4. When we find our initial bucket is already occupied, general practice is to just compare the hash values of the items, NOT do a full equality check of the items. In practice, this is a non-issue, especially w/ 128-bit hash keys and a good algorithm.
  5. There is a second function applied iteratively, until either a) an empty bucket is found, or sometimes b) some threshold is reached that leads us to resize the table. There are a lot of options for that function, including using a second hash function, but linear probing tends to be good due to cache locality. The caching mechanism did give me consistent, modest improvements to performance.

So the major confusion we've had here is that really, we can see TWO types of collisions:

  1. Where two different keys hash to the same initial bucket.
  2. Where two different keys yield the same hash value.

#1 is dealt with via the probing strategy (or chaining, but that generally has a big negative impact on performance).

As for #2, I'm sure there are hash tables somewhere that always do a deep-compare of objects, but that's not going to be fast (especially for large objects), and in parallelized environments, is somewhat impractical anyway.

Of course, a type 2 collision is always a type 1 collision too, but type 1 collisions are generally not type 2 collisions ever in practice. Hope that clears things up.
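
To make points 2 through 5 concrete, here's a deliberately simplified, single-threaded sketch of an open-addressing lookup that compares stored hash values rather than the keys themselves (invented names and types; hatrack's real, lock-free version lives in crown.c):

    #include <stddef.h>
    #include <stdint.h>

    /* Toy bucket: the 128-bit hash value is stored alongside the item.
     * An all-zero hash value marks an empty bucket in this example. */
    typedef struct {
        uint64_t h_lo;
        uint64_t h_hi;
        void    *item;
    } toy_bucket_t;

    /* n_buckets must be a power of two so masking works as a modulus. */
    void *
    toy_lookup(toy_bucket_t *buckets, uint64_t n_buckets,
               uint64_t h_lo, uint64_t h_hi)
    {
        uint64_t ix = h_lo & (n_buckets - 1);      /* initial bucket    */

        for (uint64_t probes = 0; probes < n_buckets; probes++) {
            toy_bucket_t *b = &buckets[ix];

            if (b->h_lo == 0 && b->h_hi == 0) {
                return NULL;                       /* empty: not found  */
            }
            if (b->h_lo == h_lo && b->h_hi == h_hi) {
                return b->item;       /* hash values match: same "key"  */
            }
            ix = (ix + 1) & (n_buckets - 1);       /* linear probe      */
        }

        return NULL;
    }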

In terms of using int values directly to guarantee no type 2 collisions... you COULD do that, but I wouldn't recommend it, because you'll definitely pay a price. Most hash tables I've seen that started out that way and then saw use at scale ended up changing because it had a noticeable, measurable impact.

The place where linear probing can fail is if your hash values are NOT uniformly distributed. For instance, if you are hashing integers, these tend to NOT be particularly evenly distributed in real world scenarios, leading to lots of collisions and thus reduced performance.

There are alternative strategies. You could also have the upper 64 bits be the integer directly, and have the other 64 bits come from a hash if you like (I have tried that, and the performance is fine as long as you pick the upper bits; if you replace the lower ones you're back to abnormally high collision rates). Or, you could take the number, and AES-encrypt it, which guarantees no collisions. But I'd still expect that to be slower than XXH-128.

But the world does generally feel comfortable with the probabilities. And I know them well in the cryptography world-- for instance, you'll notice that the most common AES mode today, AES-GCM has a probabilistic bound... and also has my name as a co-author.

For instance, let's assume I'm adding 1M hash keys a second continuously for a year. That's 31,536,000 * 1,000,000. Clearly, that's an absolutely unrealistically high load by a huge margin.

Yet, it would take that workload more than half a million years to get to the point where there was even a 50% chance that there were ever two DIFFERENT values used that yielded the same hash key.
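
Spelling out that back-of-the-envelope math, using the rough 2^64 birthday bound for a 128-bit hash mentioned earlier:

    1,000,000 hashes/sec x 31,536,000 sec/year ~= 3.2 x 10^13 hashes/year
    ~50% collision chance at roughly 2^64      ~= 1.8 x 10^19 hashes
    1.8 x 10^19 / 3.2 x 10^13                  ~= 580,000 years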

In practice, nobody should make a tradeoff based on worrying about collisions on 128-bit values, as long as the hash is trusted enough. There, SHA-256 definitely fits the bill. And, there are plenty of non-cryptographic hashes that are widely deployed for this use case, XXH being high among them.

But, I will admit that XXH being non-cryptographic, it's more reasonable to think the uniformity assumption might not apply. If you really wanted to be conservative, paying the price to use SHA-256 would be worth it, especially if you memoize, and this is something I've definitely considered.

Now, as for trying to provide a simple interface to API clients, I have wrapped Hatrack in languages with strong type systems, and been able to do that. The rest of libcon4m also does the same thing.

Basically, even though it looks like the pointer hash and int hash are different, on 64-bit machines, they are 100% the same. In most contexts, either you have a primitive value like an int or float, or else the memory address of storage is generally going to be a fine stand-in for equality.

That breaks down when you're in an environment where different instantiations need to test as equal when they are structurally equivalent, and when it's not unlikely to have multiple objects that are equal.

The place where that tends to be important is with strings. But whatever the case, it's better to hash the whole object than to compare copies.

That's particularly true because you can memoize the hash. The OBJ variant of the hash function constants is explicitly used to help hatrack find the cached hash (and this facility is used in libcon4m). That way, you only ever touch the whole object once whenever you can memoize it (this makes the most sense for immutable objects, obviously).

So when you use the off-the-shelf options, you are getting one of two things:

    XXH3_128bits(&key, sizeof(uint64_t));

or

    XXH3_128bits(key, strlen(key));

Now, it is on my (very long) TODO list to allow the length of the key to be a second parameter. For instance, UTF-32 strings currently need either the custom hash option or to use the by-reference compare, because they almost always contain lots of null characters, resulting in strlen() being inappropriate.

But my advice is to automatically select the function based on Swift's type system. For instance, in other languages, here's what I did:

  1. Default to hashing the 64-bit value for integers and the 64-bit pointer for almost every other type. Most people are happy with this most of the time. Again, this is one function for almost everything.
  2. Use the other hash function for UTF-8 or ASCII strings.
  3. If Swift already gives hash values for objects, it's perfectly fine to use those, depending I'd say on how it comes up with them for strings. I'd assume integers and floats might not explicitly have such things, in which case I'd use XXH there to be safe.
  4. It's similarly fine to let people choose their own hash function (which hatrack does support).

As for the hatrack_dict_replace, the functional test for it definitely tests calling it when there is no item there. I just tried a couple of variants, and can't figure out how you might be getting a double free in that scenario (I had previously been thinking you might be freeing it once via ejection handler, and once when it gets returned from replace, but it doesn't sound like that's what's happening).

I'd be happy to find a time to pair with you on it if you did want to look at it together.

And I hope this is all helpful!

mallman commented 5 months ago

Now, as for trying to provide a simple interface to API clients, I have wrapped Hatrack in languages with strong type systems, and been able to do that.

Is this work public? I'd like to see.

That breaks down when you're in an environment where different instantiations need to test as equal when they are structurally equivalent, and when it's not unlikely to have multiple objects that are equal.

In my own use case, I'm using a hash table to canonicalize database entities by primary key (also called "uniq-ing"). The "primary key" in the codebase is actually a pair consisting of a library identifier and database primary key, because the app can work with entities from more than one library database with the same schema concurrently. These identifiers are frequently constructed outside the context of the persistence environment itself. Comparison by value is essential. I'm using the standard Swift mechanism for computing their hash, but I could easily customize that. Also, ATM I am not memo-izing the hash, even though I could. (The objects are immutable, after all.) I don't recall if I've tested various schemes for performance. I may revisit that topic.

Regarding your advice for selecting a hash function:

1- Default to the hashing of the 64-bit value for integers and the 64-bit pointer for almost every other type. Most people are happy with this most of the time. Again, this is one function for almost everything.

I'm doing that for integer dictionaries, using HATRACK_DICT_KEY_TYPE_INT as the dict key type. I don't have an implementation for objects compared by pointer, because that's an unusual case in Swift. However, it could be done easily.

2- Use the other hash function for UTF-8 or ASCII strings.

I'm converting Swift strings (which are Unicode) to NULL-terminated UTF-8 C strings and using HATRACK_DICT_KEY_TYPE_CSTR as the dict key type.

3- It's perfectly fine if Swift already gives hash values for objects to use those, depending I'd say on how it comes up with them for strings. I'd assume integers and floats might not explicitly have such things, in which case I'd use XXH there to be safe.

The Swift wrapper doesn't hash anything itself. For custom key types, it requires the client to provide their own hash value. It then passes that hash value directly to a HATRACK_DICT_KEY_TYPE_OBJ_CUSTOM dict. This is where I'm counting on clients to provide good hashes, and I'm assuming that Swift computes good hashes by default. Whether they're uniformly distributed is not specified by the Swift API, but the API documentation does provide this nugget:

However, the underlying hash algorithm is designed to exhibit avalanche effects: slight changes to the seed or the input byte sequence will typically produce drastic changes in the generated hash value.

I'm going to take a closer look at my problem with the hatrack_dict_replace.

I'd be happy to find a time to pair with you on it if you did want to look at it together.

That would be fantastic, though I want to spend a little more time doing my own legwork.

And I hope this is all helpful!

This is proving to be very fruitful for me, thank you. I hope you find this helpful as well.

viega commented 5 months ago

Sure, here's a link to a wrapping for Nim:

https://github.com/crashappsec/nimutils/blob/dev/nimutils/crownhash.nim

It was a bit quick and dirty, but it is working well for us.

Many programming languages have an extensible facility for custom types to provide hashes so they can be used as dictionary keys (Python is one where I used to be very familiar with the details of the implementation). While they often use fine algorithms, they do tend to produce 64-bit hash values. Unfortunately, Swift falls in that category, per my googling.

But those languages still do use it for equality testing (as Swift seems to).

This is not within my margin of safety, personally. When using 64-bit hashes, no matter how good the function is, you're down to the equivalent of only 32 bits of security (meaning a collision is expected somewhere near 4B hashes or so). So for most apps the odds are going to be 1 in many millions, but if I can imagine there might be cases that use that many keys, then I am not personally comfortable.

But again, it does seem like Swift's team itself is fine with it, and you're probably fine to use that as the hash key, as they seem to do internally.

Anyway, I'm definitely enjoying discussing, and happy to find time to help whenever. Just let me know when you want to engage.