jermp / sshash

A compressed, associative, exact, and weighted dictionary for k-mers.
MIT License

error with 64-bit hash code #16

Closed. PandorasRed closed this issue 2 years ago.

PandorasRed commented 2 years ago

Hello,

I am getting this error:

terminate called after throwing an instance of 'std::runtime_error' what(): Using 64-bit hash codes with more than 2^30 keys can be dangerous due to collisions: use 128-bit hash codes instead.

and I can't find how to use 128-bit hash codes when building.
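(An editorial aside, not part of the original exchange: the warning follows from the standard birthday bound. Hashing $n$ keys into 64-bit codes gives an expected number of colliding pairs of roughly

$\mathbb{E}[\text{collisions}] \approx \binom{n}{2} \cdot 2^{-64} \approx n^2 / 2^{65},$

which is already about $1/32$ at $n = 2^{30}$ and grows quadratically beyond that, whereas 128-bit codes make it negligible.)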

jermp commented 2 years ago

Hello @PandorasRed, if you want to use 128-bit hash codes, pull the latest commit and just uncomment this line https://github.com/jermp/sshash/blob/master/include/util.hpp#L26 (and comment the previous one, of course).
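For readers following along, the change amounts to switching the base_hasher_type typedef in include/util.hpp from the 64-bit hasher to the 128-bit one. A minimal sketch of the toggle is below; the hasher type names here are assumptions, so check the actual file.

```cpp
// include/util.hpp (illustrative sketch; the real hasher type names may differ)

// 64-bit hash codes (default): comment this line out ...
// typedef murmurhash2_64 base_hasher_type;

// ... and uncomment the 128-bit variant instead:
typedef murmurhash2_128 base_hasher_type;
```

Remember to rebuild the project afterwards so the new hasher is actually compiled in.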

PandorasRed commented 2 years ago

Thank you for this. I'll come back to you when it finishes to tell you whether it works correctly (it took 3 days last time before throwing the error).

jermp commented 2 years ago

Three days? How many k-mers do you have?

PandorasRed commented 2 years ago

Oh sorry, I misread my log: it was the previous process that took most of the time; SSHash itself only took 40 minutes. But there were a lot of k-mers. The software reported this along with the previous error: num_kmers belonging to buckets of size > 64 and <= 128: 1 873 217 528

and the total number of k-mers is 14 285 321 462
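(Editorial note for context: the guard in the error message is a simple threshold, and the skew-index partition reported above already exceeds it on its own, since $2^{30} = 1\,073\,741\,824 < 1\,873\,217\,528$; so the 64-bit check trips for that partition alone, before even considering the full 14.3 B k-mers.)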

jermp commented 2 years ago

Ok, yeah, that makes sense. I've also tried very large inputs, as described here: https://github.com/jermp/sshash#large-scale-benchmark. Let me know!

PandorasRed commented 2 years ago

I still get the same error even with the modification you committed.

The full log output from the software is:

num_written_tuples = 500000000 num_written_tuples = 550000000 num_written_tuples = 600000000 num_written_tuples = 650000000 num_written_tuples = 700000000 num_written_tuples = 750000000 num_written_tuples = 800000000 num_written_tuples = 850000000 num_written_tuples = 900000000 num_written_tuples = 950000000 num_written_tuples = 1000000000 num_written_tuples = 1050000000 num_written_tuples = 1100000000 num_written_tuples = 1150000000 num_written_tuples = 1200000000 num_written_tuples = 1250000000 num_written_tuples = 1300000000 num_written_tuples = 1350000000 num_written_tuples = 1400000000 num_written_tuples = 1450000000 num_written_tuples = 1500000000 num_written_tuples = 1550000000 num_written_tuples = 1600000000 num_written_tuples = 1650000000 num_written_tuples = 1700000000 num_written_tuples = 1750000000 num_written_tuples = 1800000000 num_written_tuples = 1850000000 num_written_tuples = 1900000000 num_written_tuples = 1950000000 num_written_tuples = 2000000000 num_written_tuples = 2050000000 num_written_tuples = 2100000000 num_written_tuples = 2150000000 num_written_tuples = 2200000000 num_written_tuples = 2250000000 num_written_tuples = 2288372586
=== step 2: 'build_minimizers' 212.192 [sec] (11.4596 [ns/kmer])
bits_per_offset = ceil(log2(33071374881)) = 35
m_buffer_size 20833333
sorting buffer...
saving to file './sshash.tmp.run_1660315183221372230.bucket_pairs.0.bin'...
num_singletons 2554790/22645391 (11.2817%)
=== step 3: 'build_index' 36.8149 [sec] (1.98823 [ns/kmer])
max_num_super_kmers_in_bucket 118558
log2_max_num_super_kmers_in_bucket 17
num_buckets_in_skew_index 7255244/22645391 (32.0385%)
num_partitions 7
computing partitions...
num_kmers belonging to buckets of size > 64 and <= 128: 1873217528

terminate called after throwing an instance of 'std::runtime_error' what(): Using 64-bit hash codes with more than 2^30 keys can be dangerous due to collisions: use 128-bit hash codes instead.

jermp commented 2 years ago

Have you uncommented the line that I pointed you to?

PandorasRed commented 2 years ago

Yes, I redid a clean install to be sure, and after uncommenting the line it throws the same error.

jermp commented 2 years ago

I cannot see why. If you use a different base_hasher as I indicated, the check at https://github.com/jermp/sshash/blob/master/include/util.hpp#L282 should not trigger. Can you check the value of the expression sizeof(base_hasher_type::hash_type) * 8 on your end?
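If it helps, one way to check that value is a tiny translation unit compiled against the sshash headers. The sketch below assumes base_hasher_type lives in the sshash namespace and that include/util.hpp can be included directly, which may not hold exactly on every version.

```cpp
// check_hash_width.cpp (diagnostic sketch, not part of sshash)
#include <cstdio>

#include "include/util.hpp"  // path relative to the sshash repository root

int main() {
    // Prints 64 with the default hasher and 128 after switching to the
    // 128-bit hasher (this is the value the runtime check looks at).
    std::printf("hash width = %zu bits\n",
                sizeof(sshash::base_hasher_type::hash_type) * 8);
    return 0;
}
```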

jermp commented 2 years ago

Hi @PandorasRed, any news? Otherwise I am closing this. Thank you.

PandorasRed commented 2 years ago

It seems to work now. I'll come back to you if I need further help.

jermp commented 2 years ago

Ok great. Sure, feel free to reopen this issue if you need further assistance. I'm interested in your use case. -Giulio

PandorasRed commented 2 years ago

Hello, SSHash has been running for nearly 2 weeks now and still has not finished. I think the 128-bit hash is either extremely slow or has a bug that lets it run indefinitely. I'm at the end of my contract with the institute that hosts the server I use, so I can't let it run any longer. If you want to try to reproduce it on your own, I was using this dataset https://www.mg-rast.org/mgmain.html?mgpage=project&project=mgp6377, preprocessed with Cuttlefish.

jermp commented 2 years ago

Hello @PandorasRed, I've used SSHash to index up to 18 billion k-mers (k = 31) with 128-bit hashes. The construction speed does not change whether you use 64- or 128-bit hashes. Where did it stall? At which step? Have you tried a smaller dataset, or a prefix of the current one?

PandorasRed commented 2 years ago

I used it successfully on a human genome for testing, but on the big dataset it never finished. I'm attaching the log files if you want to check what happened. Here are the logs of the 4 processes I launched, each with 500 GB of RAM and 10 CPUs:

nofiltered version simple.txt
the command line used
nofiltered version with --check and --bench.txt

filtered version simple.txt
filtered version with --check and --bench.txt

jermp commented 2 years ago

Thank you for sharing the log files. Could you also share the files prepared with Cuttlefish? A few thoughts:

  1. Actually, I see that the "filtered version with --check and --bench" run finished correctly. I do not know the difference between the two logs, "simple" and "with --check and --bench". Just to clarify: the flag --check performs a correctness check and is expected to be slow for large datasets (and you can see that no error was produced, so the built index is correct); the flag --bench gives you some performance numbers for random queries.

  2. The filtered version already contains 18.5 B k-mers! :) That's already very large. The unfiltered version contains 43 B k-mers. I do not think anyone has successfully indexed such a quantity before, but I think you can with SSHash (see the third point below).

  3. You see that essentially all the construction time is due to the skew index, which builds several MPHFs. My guess is that you're using an m that is too small: with so many k-mers, you should be using m = 20, 21, or 22 (a rough estimate of why is sketched below). See my examples in the README, here: https://github.com/jermp/sshash#large-scale-benchmark.
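(Editorial sketch of the back-of-the-envelope argument, with illustrative values of m not taken from the thread: there are at most $4^m$ distinct minimizers, so the average bucket size grows with the ratio of k-mers to minimizer space. For the 43 B k-mer unfiltered input,

$4^{13} \approx 6.7 \times 10^{7} \ll 4.3 \times 10^{10}, \qquad 4^{21} \approx 4.4 \times 10^{12} \gg 4.3 \times 10^{10},$

so a small m such as 13 forces very large buckets into the skew index, while m around 20 to 22 keeps the buckets, and hence the skew-index MPHFs, small.)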

Let me know what you think. Best, -Giulio

PandorasRed commented 2 years ago

The files are 128 GB and 81 GB, so I can't really share them. The filtered version with --check and --bench had been running for 2 weeks, the other ones for 1 week; the "simple" runs were just to see whether it actually finishes or whether it is the check and bench that take the time. I'm not working on this project anymore because my contract has ended, but I'm going to forward this issue to the person who will take over the project.