greg7mdp / parallel-hashmap

A family of header-only, very fast and memory-friendly hashmap and btree containers.
https://greg7mdp.github.io/parallel-hashmap/
Apache License 2.0

Need help speeding up large hash map #233

Closed Jonahowns closed 8 months ago

Jonahowns commented 8 months ago

Hello,

I am currently using your flat_hash_map in an LSH Forest implementation. My main limitation is that the dataset I want to use is quite large (86 million vectors of 2048 elements each), which unsurprisingly does not fit in RAM. My current implementation splits the hash map across multiple files of ~6 GB each. The parallel hash map is of type phmap::flat_hash_map<std::vector, std::vector>, and each individual hash map is serialized and deserialized exactly as you've written previously in examples/serialize.cc.

Ideally I could load these files much faster than my current implementation does (~4 s per file). I can't easily change the types away from vectors, which is what would let me use the dump/load in your implementation. I took a look at doing it myself but am pretty out of my depth. Could you either a) help me dump/load a flat_hash_map whose key and value types are vectors, or b) suggest an alternative?
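(For readers following along, a declaration roughly like the sketch below is presumably what is involved. The element types and the hash functor are assumptions for illustration only; the issue text just says the key and value are vectors, and as far as I know phmap does not hash std::vector out of the box, so the original project must supply its own hash.)

```cpp
#include <cstdint>
#include <vector>
#include <parallel_hashmap/phmap.h>

// Hypothetical element types -- the issue only says "vector".
using Key   = std::vector<uint8_t>;   // hash signature produced by the LSH step
using Value = std::vector<uint32_t>;  // indices of the data points matching that signature

// std::vector has no default hash, so a functor like this (or the one from the
// original project) has to be supplied. This one is FNV-1a over the raw bytes.
struct VectorHash {
    size_t operator()(const Key& v) const {
        size_t h = 14695981039346656037ull;
        for (uint8_t b : v) { h ^= b; h *= 1099511628211ull; }
        return h;
    }
};

using Table = phmap::flat_hash_map<Key, Value, VectorHash>;
```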

Please let me know what information I could provide that would help. Thank you :)

greg7mdp commented 8 months ago

Hi @Jonahowns , thank you for using phmap.

You mention that you use vectors of 2048 elements each. If the size is constant, then you can use std::array<uint32_t, 2048> instead of std::vector<uint32_t>.

std::array is a trivially copyable type, so it would work fine with phmap_dump. That would take care of the values.
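(A sketch of what that could look like, modeled on the dump/load example in the repo's README; the uint64_t key is only a placeholder assumption here so that both key and mapped type are trivially copyable.)

```cpp
#include <array>
#include <cstdint>
#include <parallel_hashmap/phmap.h>
#include <parallel_hashmap/phmap_dump.h>

int main() {
    // Assumed types for illustration: trivially copyable key and value.
    using Row = std::array<uint32_t, 2048>;
    phmap::flat_hash_map<uint64_t, Row> table;
    table[42] = Row{};   // fill as needed

    {   // dump to disk (a raw copy of the table's storage, hence fast)
        phmap::BinaryOutputArchive ar_out("./table.bin");
        table.phmap_dump(ar_out);
    }

    phmap::flat_hash_map<uint64_t, Row> loaded;
    {   // load it back
        phmap::BinaryInputArchive ar_in("./table.bin");
        loaded.phmap_load(ar_in);
    }
}
```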

But you mention that the key of your flat_hash_map is std::vector<uint8_t>. Do you really need a std::vector here? What does the key represent?

Jonahowns commented 8 months ago

I apologize for being misleading: while the size of the original data is 2048 for one of my datasets, that is not the size of the values in the flat_hash_map. The flat_hash_map values contain the indices of the data which match the particular hash given by the key. This requires something dynamically allocated such as a vector or a smart pointer, and either way it is not trivially copyable.

I am not 100% clear on the implementation details as I am modifying someone else's code. I think the key does need to be a std::vector<uint8_t>, as it is a hashed version of the data whose size depends on both the data size and the number of "buckets" used.

As far as I can tell from the code, there is no other datatype that would work for me and be trivially copyable.

greg7mdp commented 8 months ago

No problem. Is the code open source? I'd like to have a look if possible.

You wrote: "The flat_hash_map values contain the indices of the data which match the particular hash given by the key. This requires something dynamically allocated such as a vector or a smart pointer."

I'm not quite understanding why an index has to be a vector.

Jonahowns commented 8 months ago

Sure, the original implementation is here and uses the sparsepp sparse hash map. My implementation updates that code to use your parallel hash map with minimal modifications. My goal is to be able to quickly save/load the individual tables in the hashtables_ member.

Doesn't it have to be a vector (or other dynamically allocated storage) because the number of indices stored per key is only determined at runtime?

Jonahowns commented 8 months ago

I ended up manually writing out the contents to a couple of binary files and using those directly. It is much faster than the previous implementation and requires no serialization/deserialization at runtime.
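(The exact file layout isn't shown in the issue. As an illustration only, a length-prefixed binary record like the sketch below is one common way to write such a table out by hand, with the key/value types assumed from earlier in the thread.)

```cpp
#include <cstdint>
#include <fstream>
#include <vector>

// Assumed entry types for illustration (see earlier in the thread).
using Key   = std::vector<uint8_t>;
using Value = std::vector<uint32_t>;

// Write one key/value pair as: key size, key bytes, value size, value elements.
void write_entry(std::ofstream& out, const Key& k, const Value& v) {
    uint64_t ks = k.size(), vs = v.size();
    out.write(reinterpret_cast<const char*>(&ks), sizeof(ks));
    out.write(reinterpret_cast<const char*>(k.data()), ks);
    out.write(reinterpret_cast<const char*>(&vs), sizeof(vs));
    out.write(reinterpret_cast<const char*>(v.data()), vs * sizeof(uint32_t));
}

// Read one entry back; returns false at end of file.
bool read_entry(std::ifstream& in, Key& k, Value& v) {
    uint64_t ks = 0, vs = 0;
    if (!in.read(reinterpret_cast<char*>(&ks), sizeof(ks))) return false;
    k.resize(ks);
    in.read(reinterpret_cast<char*>(k.data()), ks);
    in.read(reinterpret_cast<char*>(&vs), sizeof(vs));
    v.resize(vs);
    in.read(reinterpret_cast<char*>(v.data()), vs * sizeof(uint32_t));
    return bool(in);
}
```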

greg7mdp commented 8 months ago

Oh man, I'm sorry, I meant to look into this and forgot all about it. Glad you found a good solution to your problem! Don't hesitate to ask again if you have a question; I'll try my best to be more responsive.