Open · spkrka opened this issue 9 years ago
Sounds like a good strategy, unfortunately I don't have any leftover time either. Perhaps we should offer this up as something fun for algorithmically inclined people to have a go at? We could give access to a machine and a test dataset.
Yes, if anyone is randomly reading the issues here, feel free to implement this and submit a pull request :)
I don't think a test dataset or machine is necessary; you can just generate a fake dataset and make the maximum memory usage configurable.
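For anyone who wants to try: a throwaway generator along these lines should be enough to reproduce the memory pressure locally. This is just a sketch; the entry count, key format, and the header path are my assumptions, so check sparkey.h for the exact writer signatures.

```c
/* Sketch of generating a synthetic dataset to reproduce the issue.
 * Entry count and key/value shapes are arbitrary; only the number of
 * entries matters for the hash index size. */
#include <sparkey/sparkey.h> /* header path may differ depending on install */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
  sparkey_logwriter *writer;
  if (sparkey_logwriter_create(&writer, "fake.spl", SPARKEY_COMPRESSION_NONE, 0)
      != SPARKEY_SUCCESS) {
    return 1;
  }
  /* Enough entries to push the hash index past a configured memory cap. */
  for (uint64_t i = 0; i < 2000000000ULL; i++) {
    char key[32];
    int keylen = snprintf(key, sizeof key, "key-%llu", (unsigned long long) i);
    const char *value = "x"; /* values can be tiny */
    sparkey_logwriter_put(writer, (uint64_t) keylen, (const uint8_t *) key,
                          strlen(value), (const uint8_t *) value);
  }
  sparkey_logwriter_close(&writer);
  /* This is the step that currently needs the whole index in memory. */
  return sparkey_hash_write("fake.spi", "fake.spl", 0) == SPARKEY_SUCCESS ? 0 : 1;
}
```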
Did this ever get resolved?
No, this has not been implemented yet.
Please note in the README that the hash index needs to be smaller than the available memory.
Done!
Currently, writing the hash index requires a single malloc of the entire index size. If the index is larger than the available memory, the operation fails.
We have a use case requiring more than 1.65 billion entries, which results in a 32 GB+ hash index. This is larger than the available memory on the machine that builds it.
We could replace the malloc with mmapping the file itself, but that would lead to a lot of random reads and writes from disk, which would be horribly slow.
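For context, a minimal sketch of what that mmap-based replacement could look like (illustrative only, not the actual sparkey internals): grow the hash file to its full size with ftruncate and map it shared, so the kernel pages the index in and out instead of needing it all in RAM. The problem is that hash insertion touches the mapping at random offsets, which becomes random disk I/O once the mapping no longer fits in memory.

```c
/* Illustrative sketch only -- not the actual sparkey implementation.
 * Replaces `uint8_t *index = malloc(index_size);` with a file-backed
 * mapping so the OS can page the index in and out of memory. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static uint8_t *map_index_file(const char *filename, uint64_t index_size) {
  int fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) { perror("open"); return NULL; }

  /* Grow the file to the full index size up front. */
  if (ftruncate(fd, (off_t) index_size) != 0) {
    perror("ftruncate");
    close(fd);
    return NULL;
  }

  /* MAP_SHARED writes go back to the file; random insertions into a
   * mapping much larger than RAM cause heavy paging, which is why this
   * is slow without a smarter (e.g. chunk-and-merge) strategy. */
  void *p = mmap(NULL, index_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd); /* the mapping stays valid after closing the descriptor */
  if (p == MAP_FAILED) { perror("mmap"); return NULL; }
  return (uint8_t *) p;
}
```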
I have an idea to reduce this limitation, but no time to do it at the moment.
For sorting the large set of entries, split it into reasonably small chunks and sort those in memory. Then run a tree of sequential file merges.
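In other words, an external merge sort over the entries. A rough sketch of the two phases, with hypothetical helper names and the entries simplified to bare 64-bit values (real index entries would also carry the offset into the log file):

```c
/* Hypothetical sketch of external sorting for the hash entries. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b) {
  uint64_t x = *(const uint64_t *) a, y = *(const uint64_t *) b;
  return (x > y) - (x < y);
}

/* Phase 1: read the input in chunks that fit in memory, sort each chunk
 * with qsort, and write it to its own temporary run file. */
static void sort_chunk_to_file(uint64_t *buf, size_t n, const char *path) {
  qsort(buf, n, sizeof(uint64_t), cmp_u64);
  FILE *out = fopen(path, "wb");
  fwrite(buf, sizeof(uint64_t), n, out);
  fclose(out);
}

/* Phase 2: merge two sorted run files sequentially into one.  Applying
 * this pairwise until a single file remains gives the "tree of
 * sequential file merges". */
static void merge_two_files(const char *a_path, const char *b_path,
                            const char *out_path) {
  FILE *a = fopen(a_path, "rb"), *b = fopen(b_path, "rb");
  FILE *out = fopen(out_path, "wb");
  uint64_t va, vb;
  int has_a = fread(&va, sizeof va, 1, a) == 1;
  int has_b = fread(&vb, sizeof vb, 1, b) == 1;
  while (has_a && has_b) {
    if (va <= vb) {
      fwrite(&va, sizeof va, 1, out);
      has_a = fread(&va, sizeof va, 1, a) == 1;
    } else {
      fwrite(&vb, sizeof vb, 1, out);
      has_b = fread(&vb, sizeof vb, 1, b) == 1;
    }
  }
  while (has_a) { fwrite(&va, sizeof va, 1, out); has_a = fread(&va, sizeof va, 1, a) == 1; }
  while (has_b) { fwrite(&vb, sizeof vb, 1, out); has_b = fread(&vb, sizeof vb, 1, b) == 1; }
  fclose(a); fclose(b); fclose(out);
}
```

Every pass over the run files is purely sequential I/O, so the disk access pattern stays cheap even when the data is far larger than memory.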