Hello @PandorasRed, if you want to use 128-bit hash codes, pull the latest commit and just uncomment this line https://github.com/jermp/sshash/blob/master/include/util.hpp#L26 (and comment the previous one, of course).
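For reference, after the change the relevant spot in include/util.hpp should look roughly like the following. This is a sketch: I'm assuming the two alternatives are PTHash's murmurhash2_64 and murmurhash2_128 hashers, so double-check against your checkout.

```cpp
/* 64-bit hash codes (the default): comment this line out. */
// typedef pthash::murmurhash2_64 base_hasher_type;

/* 128-bit hash codes: uncomment this line instead,
   then recompile before rebuilding the index. */
typedef pthash::murmurhash2_128 base_hasher_type;
```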
Thank you for this. I'll come back to you when it finishes to tell you whether it works correctly (it took 3 days last time before throwing the error).
Three days? How many k-mers do you have?
Oh sorry, I misread my log: it was the previous process that took most of the time; sshash only took 40 min. But there were a lot of k-mers. The software gave me this with the previous error: num_kmers belonging to buckets of size > 64 and <= 128: 1 873 217 528,
and the total number of k-mers is 14 285 321 462.
Ok, yeah, that makes sense. I've also tried very large inputs, as described here: https://github.com/jermp/sshash#large-scale-benchmark. Let me know!
I still have the same error even with the modification you committed.
The full log that the software output is this:
num_written_tuples = 500000000
num_written_tuples = 550000000
num_written_tuples = 600000000
num_written_tuples = 650000000
num_written_tuples = 700000000
num_written_tuples = 750000000
num_written_tuples = 800000000
num_written_tuples = 850000000
num_written_tuples = 900000000
num_written_tuples = 950000000
num_written_tuples = 1000000000
num_written_tuples = 1050000000
num_written_tuples = 1100000000
num_written_tuples = 1150000000
num_written_tuples = 1200000000
num_written_tuples = 1250000000
num_written_tuples = 1300000000
num_written_tuples = 1350000000
num_written_tuples = 1400000000
num_written_tuples = 1450000000
num_written_tuples = 1500000000
num_written_tuples = 1550000000
num_written_tuples = 1600000000
num_written_tuples = 1650000000
num_written_tuples = 1700000000
num_written_tuples = 1750000000
num_written_tuples = 1800000000
num_written_tuples = 1850000000
num_written_tuples = 1900000000
num_written_tuples = 1950000000
num_written_tuples = 2000000000
num_written_tuples = 2050000000
num_written_tuples = 2100000000
num_written_tuples = 2150000000
num_written_tuples = 2200000000
num_written_tuples = 2250000000
num_written_tuples = 2288372586
=== step 2: 'build_minimizers' 212.192 [sec] (11.4596 [ns/kmer])
bits_per_offset = ceil(log2(33071374881)) = 35
m_buffer_size 20833333
sorting buffer...
saving to file './sshash.tmp.run_1660315183221372230.bucket_pairs.0.bin'...
num_singletons 2554790/22645391 (11.2817%)
=== step 3: 'build_index' 36.8149 [sec] (1.98823 [ns/kmer])
max_num_super_kmers_in_bucket 118558
log2_max_num_super_kmers_in_bucket 17
num_buckets_in_skew_index 7255244/22645391 (32.0385%)
num_partitions 7
computing partitions...
num_kmers belonging to buckets of size > 64 and <= 128: 1873217528
terminate called after throwing an instance of 'std::runtime_error'
  what():  Using 64-bit hash codes with more than 2^30 keys can be dangerous due to collisions: use 128-bit hash codes instead.
Have you uncommented the line that I pointed you to?
Yes, I have redone a clean install to be sure, and after uncommenting the line it throws the same error.
I cannot see why. If you use a different base_hasher as I indicated, the check at
https://github.com/jermp/sshash/blob/master/include/util.hpp#L282 should not trigger, so no error should be thrown.
Can you check what the value of the expression sizeof(base_hasher_type::hash_type) * 8 is
on your end?
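If it helps, here is a minimal sketch of how you could print that value yourself. I'm assuming base_hasher_type is declared in namespace sshash, as in util.hpp; compile from the repository root so the include path resolves.

```cpp
#include <iostream>
#include "include/util.hpp"  // declares sshash::base_hasher_type (assumed)

int main() {
    // The guard at util.hpp#L282 throws when this value is 64 and the
    // number of keys exceeds 2^30. It should print 128 if the hasher
    // swap took effect, and 64 otherwise.
    std::cout << sizeof(sshash::base_hasher_type::hash_type) * 8 << std::endl;
    return 0;
}
```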
Hi @PandorasRed, any news? Otherwise I am closing this. Thank you.
It seems to work now; I'll come back to you if I need further help.
Ok, great. Sure, feel free to reopen this issue if you need my assistance further. I'm interested in your use case. -Giulio
Hello, sshash has been running for nearly 2 weeks now and still has not finished. I think the 128-bit hash is either extremely slow or has a bug that lets it run indefinitely. I'm at the end of my contract with the institute that hosts the server I use, so I can't let it run any longer. If you want to try to reproduce this on your own, I was using this dataset https://www.mg-rast.org/mgmain.html?mgpage=project&project=mgp6377 preprocessed with cuttlefish.
Hello @PandorasRed, I've used SSHash to index up to 18 billion k-mers (k = 31) with 128-bit hashes. The construction speed does not change whether you use 64- or 128-bit hashes. Where did it stall? At which step? Have you tried a smaller dataset, or a prefix of the current one?
I have used it successfully on a human genome for testing, but on the big dataset it never finished. I'm going to attach the log files if you want to check what happened. Here are the logs of the 4 processes I launched, with 500 GB of RAM and 10 CPUs:
nofiltered version simple.txt
the command line used
nofiltered version with --check and --bench.txt
filtered version simple.txt
filtered version with --check and --bench.txt
Thank you for sharing the log files. Could you also share the files prepared with cuttlefish? A couple of thoughts:
1. Actually, I see that the "filtered version with --check and --bench" finished correctly. I do not know the difference between the two logs, "simple" and "with --check and --bench". Just to clarify: the flag --check performs a correctness check and is expected to be slow for large datasets (and you can see that no error was produced, so the built index is correct); the flag --bench gives you some performance numbers for random queries.
2. The filtered version already contains 18.5 B k-mers! :) That's already very large. The unfiltered version contains 43 B k-mers. I do not think anyone has successfully indexed such a quantity, but I think you can with SSHash (see the third point below).
3. You see that essentially all the construction time is due to the skew index, which builds several MPHFs. My guess is that you're using too small an m: with so many k-mers, you should be using m = 20, 21, or 22. See my examples in the README here https://github.com/jermp/sshash#large-scale-benchmark, and the sketch of a command below.
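For instance, something along these lines. This is a sketch following the README's examples at the time of writing (positional arguments are input, k, and m; the file names are placeholders), so double-check the CLI of your version:

```
# hypothetical input file; m = 21 as suggested above
./build unitigs.fa 31 21 --check --bench -o big.index
```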
Let me know what you think. Best, -Giulio
The files are 128 GB and 81 GB, so I can't really share them. The filtered version with --check and --bench had been running for 2 weeks, the other ones for 1 week; "simple" was just to see whether it actually finished or whether it was --check and --bench that was taking the time. I'm no longer working on this project because of the end of my contract, but I'm going to forward this issue to the person who will work on the project.
Hello,
I have this error:
terminate called after throwing an instance of 'std::runtime_error'
  what():  Using 64-bit hash codes with more than 2^30 keys can be dangerous due to collisions: use 128-bit hash codes instead.
and I can't find how to use 128-bit hash codes when building.