HashMap With External Storage

tustvold commented 1 month ago

I will start by saying that the removal of raw_entry makes a lot of sense, and forcing code to instead use some combination entry_ref and Equivalent newtypes is definitely an improvement where possible. I was not aware of these APIs and forcing me to use them has improved my code :+1:. However, there is one use-case that is currently possible with raw_entry_mut, but doesn't appear to be possible with the remaining APIs.

In arrow-rs we want to intern strings, with the interned strings allocated from a contiguous buffer - see here. This precise formulation is perhaps somewhat esoteric, but I don't think it that unusual to want to maintain an index on top of some separate data structure that stores the actual records.

For example, you might have a list of records and want to provide efficient lookups based on two or more different attributes without duplicating the record data. A naive way to achieve this would be to maintain a Vec<Record> and then two or more HashMap<K1, usize> and HashMap<K2, usize>, where the usize value is the index in the records Vec of the Record. However, this approach duplicates the key data into each of the hashmaps, which could be prohibitively expensive.

The trick / hack arrow-rs uses is to exploit the ability of raw_entry_mut to provide the hash and equality function at point of use, see here. This allows storing the data separately, without introducing self-referential borrows between the hashmap entries and the value storage.

I'm not really sure what the solution is here, and it is perfectly reasonable for this to be considered beyond the scope of this crate, but I thought I would at least articulate the use-case.

clarfonthey commented 1 month ago

So, this kind of API is something we do want to offer on HashTable in the future, but isn't at the moment. I agree with you that the ability to store literally nothing besides the control buffer (indicating which slots are filled) in the hash table is reasonable, and that feels like a use case we want to support.

tustvold commented 1 month ago

Thank you for the extremely fast response! How would you recommend we proceed then, we could update to 0.15.0 using the raw-entry feature, but if this is something that is going to be removed this feels like it is just kicking the can down the road... Given how common hashmaps are, and how much codegen is associated with them, I'd very much like to avoid a situation where we end up stuck on an old version for a prolonged period of time, and consequently bloat our user's binaries and compile times.

Perhaps you could give some indication of what the timeline for raw-entry's removal looks like, will it be gone in the next release, or in 2 years time 😅

cuviper commented 1 month ago

@tustvold Your use case seems a lot like what indexmap does with HashTable<usize> now, and RawTable<usize> before. I think you could do it the same way as that with your multiple different key lookups, no?

I agree with you that the ability to store literally nothing besides the control buffer (indicating which slots are filled) in the hash table is reasonable, and that feels like a use case we want to support.

I don't think that's what they're doing -- they also insert an index associated with each slot. You need something else, because the 7 bits of the hash are insufficient.

tustvold commented 1 month ago

I think you could do it the same way as that with your multiple different key lookups, no?

How does one override the Hash and Equality to use the values stored externally, as opposed to the usize itself? I'm afraid I am not familiar with the internals of indexmap to which you refer

cuviper commented 1 month ago

HashTable methods also require you to provide the hash value and equality methods manually. If anything, this is a lot safer than trying to do it with HashMap<usize, (), ()> where you might accidentally use the wrong thing, although S = () is a good safety net.

The main part to look at in indexmap is here: https://github.com/indexmap-rs/indexmap/blob/master/src/map/core.rs

But you should also look through the HashTable API directly - it's not big: https://docs.rs/hashbrown/latest/hashbrown/struct.HashTable.html

tustvold commented 1 month ago

Oh, that looks perfect. I wasn't aware of that new API, thank you

Edit: That worked like a charm - thank you for the pointers - https://github.com/apache/arrow-rs/pull/6537

tustvold commented 1 month ago

Perhaps the deprecation note on the raw-entry feature flag could point to the HashTable API to make it more discoverable, happy to file a PR if you think this is worthwhile? E.g. changing # Enables the deprecated RawEntry API. to # Enables the RawEntry API deprecated in favour of HashTable or something. It wasn't obvious to me that HashTable is intended to replace this, and even though scrolling back you mentioned HashTable my eyes auto-corrected this to HashMap.

Edit: although perhaps then people will miss entry_ref and similar... Perhaps what is really needed is a migration guide...

clarfonthey commented 1 month ago

Yeah, the main purpose of deprecating RawTable is so we can effectively completely get rid of it, and make HashTable the one good implementation. At least this is hopefully clear in the changelog:

This update contains breaking changes that remove the raw API with the hope of centralising on the HashTable API in the future. You can follow the discussion and progress in #545 to discuss features you think should be added to this API that were previously only possible on the raw API.

rust-lang / hashbrown

HashMap With External Storage #573