Closed tustvold closed 1 month ago
So, this kind of API is something we do want to offer on HashTable
in the future, but isn't at the moment. I agree with you that the ability to store literally nothing besides the control buffer (indicating which slots are filled) in the hash table is reasonable, and that feels like a use case we want to support.
Thank you for the extremely fast response! How would you recommend we proceed then, we could update to 0.15.0 using the raw-entry feature, but if this is something that is going to be removed this feels like it is just kicking the can down the road... Given how common hashmaps are, and how much codegen is associated with them, I'd very much like to avoid a situation where we end up stuck on an old version for a prolonged period of time, and consequently bloat our user's binaries and compile times.
Perhaps you could give some indication of what the timeline for raw-entry's removal looks like, will it be gone in the next release, or in 2 years time 😅
@tustvold Your use case seems a lot like what indexmap
does with HashTable<usize>
now, and RawTable<usize>
before. I think you could do it the same way as that with your multiple different key lookups, no?
I agree with you that the ability to store literally nothing besides the control buffer (indicating which slots are filled) in the hash table is reasonable, and that feels like a use case we want to support.
I don't think that's what they're doing -- they also insert an index associated with each slot. You need something else, because the 7 bits of the hash are insufficient.
I think you could do it the same way as that with your multiple different key lookups, no?
How does one override the Hash and Equality to use the values stored externally, as opposed to the usize
itself? I'm afraid I am not familiar with the internals of indexmap to which you refer
HashTable
methods also require you to provide the hash value and equality methods manually. If anything, this is a lot safer than trying to do it with HashMap<usize, (), ()>
where you might accidentally use the wrong thing, although S = ()
is a good safety net.
The main part to look at in indexmap
is here:
https://github.com/indexmap-rs/indexmap/blob/master/src/map/core.rs
But you should also look through the HashTable
API directly - it's not big:
https://docs.rs/hashbrown/latest/hashbrown/struct.HashTable.html
Oh, that looks perfect. I wasn't aware of that new API, thank you
Edit: That worked like a charm - thank you for the pointers - https://github.com/apache/arrow-rs/pull/6537
Perhaps the deprecation note on the raw-entry feature flag could point to the HashTable API to make it more discoverable, happy to file a PR if you think this is worthwhile? E.g. changing # Enables the deprecated RawEntry API.
to # Enables the RawEntry API deprecated in favour of HashTable
or something. It wasn't obvious to me that HashTable
is intended to replace this, and even though scrolling back you mentioned HashTable
my eyes auto-corrected this to HashMap
.
Edit: although perhaps then people will miss entry_ref and similar... Perhaps what is really needed is a migration guide...
Yeah, the main purpose of deprecating RawTable
is so we can effectively completely get rid of it, and make HashTable
the one good implementation. At least this is hopefully clear in the changelog:
This update contains breaking changes that remove the
raw
API with the hope of centralising on theHashTable
API in the future. You can follow the discussion and progress in #545 to discuss features you think should be added to this API that were previously only possible on theraw
API.
I will start by saying that the removal of raw_entry makes a lot of sense, and forcing code to instead use some combination entry_ref and Equivalent newtypes is definitely an improvement where possible. I was not aware of these APIs and forcing me to use them has improved my code :+1:. However, there is one use-case that is currently possible with raw_entry_mut, but doesn't appear to be possible with the remaining APIs.
In arrow-rs we want to intern strings, with the interned strings allocated from a contiguous buffer - see here. This precise formulation is perhaps somewhat esoteric, but I don't think it that unusual to want to maintain an index on top of some separate data structure that stores the actual records.
For example, you might have a list of records and want to provide efficient lookups based on two or more different attributes without duplicating the record data. A naive way to achieve this would be to maintain a
Vec<Record>
and then two or moreHashMap<K1, usize>
andHashMap<K2, usize>
, where theusize
value is the index in the recordsVec
of theRecord
. However, this approach duplicates the key data into each of the hashmaps, which could be prohibitively expensive.The trick / hack arrow-rs uses is to exploit the ability of raw_entry_mut to provide the hash and equality function at point of use, see here. This allows storing the data separately, without introducing self-referential borrows between the hashmap entries and the value storage.
I'm not really sure what the solution is here, and it is perfectly reasonable for this to be considered beyond the scope of this crate, but I thought I would at least articulate the use-case.