vigna / sux

Succinct data structures in C/C++
Other
82 stars 17 forks source link

Recsplit core dump on large keyfiles #3

Closed weaversa closed 5 years ago

weaversa commented 5 years ago

Here I try to build an MPFH for 2^30 keys using recsplit. I'm running on an Amazon EC2 instance of type m5.4xlarge. Such an instance has 16 cores and 64 GiB of memory.

$ wc -l test.keys 
1073741823 test.keys
$ bin/recsplit_load_8 test.keys test.mphf
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
vigna commented 5 years ago

So the map was built. I'd say your key file is too large (but you don't know the size). Why do you want to load the entire file in memory?

weaversa commented 5 years ago

I'm sorry. I was a bit misleading. I should have said I was trying to query such an MPHF. You are right that the MPFH building tool claims success, and recsplit_load gives an abort. Maybe the map produced by the build tool is bad? Maybe the query tool processes keys in a less memory efficient way than the build tool? I'm not familiar enough with the project to say why, just report the abort. I feel that the list of keys I used is agnostic to the bug, so long as the list is large. If you are unable to reproduce, I am happy to send along my list of keys to assist diagnosis.

Thank you!

Here is the full run:

$ wc -l test.keys 
1073741823 test.keys
$ bin/recsplit_dump_8 test.keys 100 test.mphf
Building...
Elias-Fano cumul sizes:  7.469788 bits/bucket
Elias-Fano cumul bits:   6.376042 bits/bucket
Elias-Fano cumul sizes:  0.074698 bits/key
Elias-Fano cumul bits:   0.063760 bits/key
Rice-Golomb descriptors: 1.656336 bits/key
Total bits:              1.794794 bits/key
Construction time: 2078.431 s, 1936 ns/key
$ bin/recsplit_load_8 test.keys test.mphf
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
vigna commented 5 years ago

On 8 Nov 2019, at 16:52, weaversa notifications@github.com wrote:

I'm sorry. I was a bit misleading. I should have said I was trying to query such an MPHF. You are right that the MPFH building tool claims success, and recsplit_load gives an abort. Maybe the map produced by the build tool is bad? Maybe the query tool processes keys in a less memory efficient way than the build tool? I'm not familiar enough with the project to say why, just report the abort. I feel that the list of keys I used is agnostic to the bug, so long as the list is large. If you are unable to reproduce, I am happy to send along my list of keys to assist diagnosis.

No, the problem is that recsplit_load loads the whole keyset into memory. The problem is not the map, it's the keys. Use shuf to shuffle the keys, head to extract 10 million keys from the shuffled file and do your benchmarking with that...

Ciao,

                seba