jermp / fulgor

Fulgor is a fast and space-efficient colored de Bruijn graph index.
MIT License

Num_contigs must be less than 2^32 Aborted (core dumped) #16

Closed rickbeeloo closed 1 year ago

rickbeeloo commented 1 year ago

Hey!

I was trying to build a graph of 400K bacterial genomes and got the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  num_contigs must be less than 2^32
Aborted (core dumped)

1) I'm a bit confused by the error message mentioning num_contigs, since the number of contigs I have (99,645,800) is well below 2^32. So I guess this refers to the number of unitigs in the graph instead?

2) If yes, any ideas on how to still make it work? Could the limit be raised to 2^64, since we have sufficient memory? Any other ideas? :)

Thanks for another cool tool!

jermp commented 1 year ago

Hello @rickbeeloo, sorry for the late response but I was on vacation.

So this is not an error per se; it's just a design choice to represent contig_id in 32 bits in SSHash. This is not related to Fulgor itself, but to its kmer dictionary, SSHash.

Actually, there should be no problem defaulting to uint64_t here https://github.com/jermp/sshash/blob/master/include/util.hpp#L31. I would then have the whole lookup_result struct consist of 5 uint64_t fields. This change would also require using constants::invalid_uint64 instead of constants::invalid_uint32 wherever appropriate, i.e., in the struct constructor and here https://github.com/jermp/sshash/blob/master/include/buckets.hpp#L261.
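For reference, a minimal sketch of that widening (the field names below are illustrative, so double-check them against include/util.hpp; only the types and the sentinel values are the point):

// include/util.hpp (sketch): lookup_result with all five fields widened to 64 bits.
struct lookup_result {
    lookup_result()
        : kmer_id(constants::invalid_uint64)
        , kmer_id_in_contig(constants::invalid_uint64)  // was invalid_uint32
        , kmer_orientation(constants::invalid_uint64)   // or whatever default the original constructor uses
        , contig_id(constants::invalid_uint64)          // was invalid_uint32
        , contig_size(constants::invalid_uint64) {}     // was invalid_uint32

    uint64_t kmer_id;            // absolute kmer id
    uint64_t kmer_id_in_contig;  // kmer position within its contig
    uint64_t kmer_orientation;
    uint64_t contig_id;          // previously 32-bit: the source of the 2^32 limit
    uint64_t contig_size;
};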

So you can try to comment out this check here https://github.com/jermp/sshash/blob/master/include/builder/util.hpp#L77 and make the changes above.

So I guess this refers to the number of unitigs in the graph instead?

Yes, by "contigs" I meant the number of "tigs" in the graph. In the case of Fulgor, this is exactly the number of unitigs, as you said.

Since you're trying to build an SSHash dictionary for a very large collection, you will probably also need to switch to 128-bit hash codes: https://github.com/jermp/sshash/blob/master/include/hash_util.hpp#L51.
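That switch is essentially a one-line typedef change in hash_util.hpp. The typedef name below is just illustrative (use whatever SSHash declares); PTHash provides both a 64-bit and a 128-bit murmurhash2 hasher:

// hash_util.hpp (sketch): use 128-bit hash codes for very large key sets.
// typedef pthash::murmurhash2_64 base_hasher_type;   // default: 64-bit hash codes
typedef pthash::murmurhash2_128 base_hasher_type;     // 128-bit hash codes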

Let me know. Feel free to also share the unitigs file to index (perhaps via Box or Google Drive) so that I can better assist you.

Cheers, -Giulio

rickbeeloo commented 1 year ago

Hey @jermp, no problem, I hope you had a good holiday :) I just looked and I think I have to make some more adjustments, but I will give it a shot. Regarding sharing the unitig data, which of these files would you need:

-rw-r--r-- 1 rickb binf 477G Aug 23 22:51 fulgor_index.fa
-rw-r--r-- 1 rickb binf  147 Aug 23 16:36 fulgor_index.ggcat.colors.dat
-rw-r--r-- 1 rickb binf 949G Aug 23 20:58 fulgor_index.ggcat.fa

(or also the input genomes themselves?)

Thanks in advance, would be awesome to make this work :)

jermp commented 1 year ago

Hi @rickbeeloo,

the file fulgor_index.fa is the one containing the unitig sequences that are indexed by SSHash, created here https://github.com/jermp/fulgor/blob/main/src/index.cpp#L18.

Wow, your input collection is huge ;) I'm very curious about Fulgor's performance in this case. Can you tell me what kind of collection you are indexing? Heterogeneous genomes, or genomes from a specific species, e.g., Salmonella? (In our paper we tried both kinds of datasets, specifically pangenomes of Salmonella.)

rickbeeloo commented 1 year ago

@jermp, it's heterogeneous; that's also the reason for the large number of unitigs haha. Themisto actually crashes on it without an error message. Perhaps for the same reason but without handling the exception; I didn't check in detail since I thought that, with this many colors, Fulgor would probably be better suited for the task.

It's probably much easier for you than for me to fix the uint64s and 128-bit hashes and see if it works, but if the data is too big to share, I could clone the repo and we can add the changes.

(Btw will Fulgor detect the index or will it restart GGCAT?)

jermp commented 1 year ago

@jermp, it's heterogeneous; that's also the reason for the large number of unitigs haha.

Ok, that seems reasonable.

It's probably much easier for you than for me to fix the uint64s and 128-bit hashes and see if it works, but if the data is too big to share, I could clone the repo and we can add the changes.

To do so, I would need the files anyway. I do not have much time at the moment but I can surely assist you if you're willing to make the changes to the SSHash code.

(Btw will Fulgor detect the index or will it restart GGCAT?)

Good point, actually no -- I did not implement a check because the construction is meant to be performed once and those files deleted afterwards. For SSHash it is easy: just prevent the file fulgor_index.fa from being created again by commenting this out https://github.com/jermp/fulgor/blob/main/src/index.cpp#L18. For GGCAT, I do not know, but I'm confident there is a way.

Anyway, what I would do first is to forget about Fulgor for a second and try SSHash alone. Make the changes I indicated above to the SSHash codebase (not Fulgor) and try to index the file fulgor_index.fa with sshash. I can also do it, but I need the file first.

Thanks, -Giulio

rickbeeloo commented 1 year ago

@jermp Let's do it :) I edited it here. Should the node struct also have uint64s? That would require some more changes.

jermp commented 1 year ago

Cool!

No, whatever is inside /cover does not impact what we're trying to do.

I would then run SSHash, first on a small prefix of the whole file and check its correctness with the --check flag.

rickbeeloo commented 1 year ago

That passes on a 200MB sample, although it might still go wrong once it goes beyond 2^32 tigs. But I suppose using --check on such a big test file will take long, so I can immediately use the actual data.

Will let that index and come back here.

I also saw this line in Fulgor, so we probably also need that, and perhaps more changes to Fulgor?

jermp commented 1 year ago

Yes, testing the correctness on such data will take a long time. How many billions of distinct kmers are there?

I also saw this line in Fulgor, so we probably also need that, and perhaps more changes to Fulgor?

Ah yeah. Needs careful checking. Thanks for looking into this!

rickbeeloo commented 1 year ago

Just got an error when running sshash on the whole dataset (it works fine on a small sample):

num_kmers 280552988962
num_super_kmers 49099290483
num_pieces 7017211274 (+1.50072 [bits/kmer])
=== step 1: 'parse_file' 34362.5 [sec] (122.481 [ns/kmer])
 == files to merge = 1670
terminate called after throwing an instance of 'std::runtime_error'
  what():  cannot open file
Aborted (core dumped)

So we have around 280 billion kmers. I don't really understand which file it was unable to open; all 1670 temp minimizer files are present. Unless it's something after that step, any ideas?

jermp commented 1 year ago

Mmmh that's interesting yes. Wow, 280 B kmers is huge...and very fragmented.

Perhaps the issue is here https://github.com/jermp/sshash/blob/master/include/builder/util.hpp#L287. Never happened before, though. It should have created a file called sshash.tmp.run_[RUN-ID].minimizers.bin, where [RUN-ID] is just a large pseudo-random int. Can you check if it is present?

Are you sure you have enough disk space?

rickbeeloo commented 1 year ago

That file is indeed absent. I'm running it again with these changes to show the file paths. Then I can check whether I can manually open these handles or whether something weird is indeed going on.

Space cannot be the issue: after writing all the temp minimizer files, there is still 44TB of space left.

Will update you probably tomorrow as this will take ~24 hours.

rickbeeloo commented 1 year ago

Hey @jermp,

I added a specific log message for all "cannot open file" errors (here), so it can't be any of those raising the error. The only thing that remains is the fm_iterator, which does open file handles for all the files. This might exceed the ulimit and then crash when trying to open the write handle, where the exception is caught (just a guess). My ulimit (1024) was below the number of files to merge (1670), and you might not have experienced this for smaller datasets since you didn't have that many minimizer files. Still a guess; to be sure we have to wait another day :) - once again hahah
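If the open-files limit is indeed the culprit, one option (besides raising it in the shell with ulimit -n before launching) would be a small guard before the merge step, along these lines. Just a sketch, not something currently in SSHash; the function name is made up:

#include <sys/resource.h>  // getrlimit / setrlimit (POSIX)
#include <cstdint>
#include <stdexcept>
#include <string>

// Sketch: make sure we can keep num_files_to_merge handles open at once,
// plus a little slack for the output file, stdin/stdout/stderr, etc.
inline void ensure_open_file_limit(uint64_t num_files_to_merge) {
    rlimit rl{};
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) return;  // best effort
    rlim_t needed = static_cast<rlim_t>(num_files_to_merge) + 16;
    if (rl.rlim_cur >= needed) return;
    rl.rlim_cur = needed <= rl.rlim_max ? needed : rl.rlim_max;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0 || rl.rlim_cur < needed) {
        throw std::runtime_error("open-file limit too low: need " + std::to_string(needed) +
                                 " descriptors; raise it with ulimit -n");
    }
}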

jermp commented 1 year ago

Hi @rickbeeloo, oh I see. Good catch. It never happened to me before because I never ran it on such a huge instance. Thank you for your effort! I'm curious to see the final result. After Fulgor is done, there is also the Mac-dBG to try ;)

rickbeeloo commented 1 year ago

Hey @jermp, I can confirm the ulimit was the issue for that error. Now it has been at this for the last 2 days:

num_partitions 7
computing partitions...
num_kmers belonging to buckets of size > 64 and <= 128: 31678995297
num_kmers belonging to buckets of size > 128 and <= 256: 54984485401
num_kmers belonging to buckets of size > 256 and <= 512: 62094381955
num_kmers belonging to buckets of size > 512 and <= 1024: 48226793310
num_kmers belonging to buckets of size > 1024 and <= 2048: 30635060978
num_kmers belonging to buckets of size > 2048 and <= 4096: 17677032035
num_kmers belonging to buckets of size > 4096 and <= 769903: 13912102589
num_kmers_in_skew_index 259208851565(92.3921%)
building PTHash mphfs (with 8 threads) and positions...
lower 64; upper 128; num_bits_per_pos 7; keys_in_partition.size() 31678995297

But I guess that makes sense with this much data. Maybe we could use more than 7 threads for the hashing? Or will this also affect RAM a lot? (It's at 1.5TB of RAM at the moment.)

The Mac-dBG is looking cool :) hope I can reuse the SSHash index haha

jermp commented 1 year ago

Hi @rickbeeloo, ok perfect, good to know. We will then need to raise the limit manually before launching the construction. With 32B keys and conservative parameters, PTHash will take a long time to build, yes. It would probably make sense, in cases like this, to switch to a partitioned representation.

I'm concerned anyway about the percentage of kmers in the skew index: it's above 92%...that's too much. What minimizer length are you using? For such a huge collection, I would use m=21 or similar.

In practical uses of SSHash, the skew index component should contain only a few percent of the kmers. Here it is the opposite.

rickbeeloo commented 1 year ago

I used m=15, I could bring that up to 21. What do you think would make the most sense?

1) m=21, and go from 7 threads to 100 or so (not sure if this affects RAM; in that case we could only double or so)
2) use the Mac-dBG? If this is what you meant by "switch to a partitioned representation"
3) or any other better idea of course :)

jermp commented 1 year ago

I would use m=21 as a first try. It looks like the dataset is very fragmented. To build the Mac-dBG, we would currently need a Fulgor index anyway, so let's build Fulgor first.

For a "partitioned representation" I mean for the MPHFs used for minimizers and the skew index. This is a technical detail: currently, we are using a single monolithic function which will result in better lookup at the expense of construction time. So, instead of using pthash::single_phf here https://github.com/jermp/sshash/blob/master/include/hash_util.hpp#L56, we would use pthash::partitioned_phf and then specify a number of partitions and more threads here https://github.com/jermp/sshash/blob/master/include/minimizers.hpp#L10 and here https://github.com/jermp/sshash/blob/master/include/builder/build_skew_index.hpp#L129.

Sorry, it requires a bit of understanding of the inner data structures, but the changes are not difficult to make. You have my full support if you're willing to try. (It would be easier for me, but I would need your data. Where did you collect the data from, btw?)

rickbeeloo commented 1 year ago

Hey @jermp, I added more threads (I saw the limit was 48 for PTHash) and switched to the partitioned_phf. Is there any "rule" for selecting the optimal number of partitions? Should we adjust c for this as well?

It's public data; I will put it on Google Drive / our server and email you the link to the unitig FASTA (it might take a few days to arrange that with the sysadmin).

jermp commented 1 year ago

Hi @rickbeeloo, right, good point. I usually let a partition_size be a couple of million keys, say 3-5M, and let num_partitions = n / partition_size. I would use a value of c between 4 and 6.

It's public data; I will put it on Google Drive / our server and email you the link to the unitig FASTA (it might take a few days to arrange that with the sysadmin).

Excellent, thank you!

rickbeeloo commented 1 year ago

That probably requires some help @jermp :)

Minimizers (include/minimizers.hpp): we could use more threads here, but looking at PTHash, it computes the partitions per thread here, so the number of threads should be at most the number of partitions. I couldn't immediately find what size is here, but from PTHash that should be num_keys, so then it would be:

num_partitions = std::ceil(size / 3e6); // 3 million keys per partition
num_threads =  std::thread::hardware_concurrency() > num_partitions ? num_partitions : std::thread::hardware_concurrency();

1) Something like that?


Skew index (include/builder/build_skew_index.hpp): I could do the same as above, but there is this check, which is computed above as num_partitions = max_log2_size - min_log2_size + 1 (so 12 - 6 + 1 = 7). I guess that's why you have this thread check, std::thread::hardware_concurrency() >= 8 ? 8 : 1;, since it will have only 7 partitions anyway.

2) Should this number of partitions (or conditionally line 66) be the same as the one used for the MPHF? I assume yes, looking at this code together with this. These constants will now scale those down to that range, whereas we could probably use more threads, like for the minimizers. If so, how would we calculate num_partitions at the top?

jermp commented 1 year ago

1) I have a better strategy. Instead of fixing the partition_size, just set num_partitions as the number of threads you want to use. That would achieve almost ideal speed up as each partition is built independently by one thread. So, just do config.num_partitions = std::thread::hardware_concurrency().

2) The num_partitions of the skew index has nothing to do with the num_partitions of the MPHF config and, thus, can be completely different. So, do not touch that num_partitions; just set mphf_config.num_partitions = std::thread::hardware_concurrency() as above, and mphf_config.num_threads = std::thread::hardware_concurrency() (see the sketch below).
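Putting the two points together, the MPHF side boils down to something like this (the config fields are PTHash's build_configuration; the helper name and the exact spots in minimizers.hpp and build_skew_index.hpp where the config is filled in are for you to locate):

#include <thread>
#include "pthash.hpp"  // pthash::build_configuration (include path may differ in your tree)

// Sketch: one MPHF partition per hardware thread, so each partition is built
// independently by its own thread. The skew index's own num_partitions
// (max_log2_size - min_log2_size + 1) stays untouched.
inline pthash::build_configuration make_mphf_config() {
    pthash::build_configuration mphf_config;
    mphf_config.c = 5.0;  // keep/adjust whatever SSHash already uses
    mphf_config.num_partitions = std::thread::hardware_concurrency();
    mphf_config.num_threads = std::thread::hardware_concurrency();
    return mphf_config;
}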

rickbeeloo commented 1 year ago

Hey @jermp,

I did some tests. It works fine for the bigger datasets (also passing --check) but not for the smaller ones.

1) 100_000 sequences

k = 31, m = 21, seed = 1, l = 6, c = 3, canonical_parsing = false, weighted = false
reading file 'subset.fa'...
m_buffer_size 29411764
sorting buffer...
saving to file './sshash.tmp.run_1694273310903931243.minimizers.0.bin'...
read 50000 sequences, 1679434 bases, 179434 kmers
num_kmers 179434
num_super_kmers 71699
num_pieces 50001 (+16.7196 [bits/kmer])
=== step 1: 'parse_file' 0.072531 [sec] (404.221 [ns/kmer])
terminate called after throwing an instance of 'std::runtime_error'
  what():  average partition size is too small: use less partitions
Aborted (core dumped)

2) 100_000_000 sequences

...
num_partitions 6
computing partitions...
num_kmers belonging to buckets of size > 64 and <= 128: 760215
num_kmers belonging to buckets of size > 128 and <= 256: 313273
num_kmers belonging to buckets of size > 256 and <= 512: 111045
num_kmers belonging to buckets of size > 512 and <= 1024: 53836
num_kmers belonging to buckets of size > 1024 and <= 2048: 26894
num_kmers belonging to buckets of size > 2048 and <= 2185: 3001
num_kmers_in_skew_index 1268264(0.886757%)
building PTHash mphfs (with 96 threads) and positions...
lower 64; upper 128; num_bits_per_pos 7; keys_in_partition.size() 760215
  built mphs[0] for 760215 keys; bits/key = 2.37552
  built positions[0] for 760215 keys; bits/key = 7.00046
lower 128; upper 256; num_bits_per_pos 8; keys_in_partition.size() 313273
  built mphs[1] for 313273 keys; bits/key = 2.49898
  built positions[1] for 313273 keys; bits/key = 8.0012
lower 256; upper 512; num_bits_per_pos 9; keys_in_partition.size() 111045
  built mphs[2] for 111045 keys; bits/key = 2.58
  built positions[2] for 111045 keys; bits/key = 9.00305
lower 512; upper 1024; num_bits_per_pos 10; keys_in_partition.size() 53836
  built mphs[3] for 53836 keys; bits/key = 2.6106
  built positions[3] for 53836 keys; bits/key = 10.0061
lower 1024; upper 2048; num_bits_per_pos 11; keys_in_partition.size() 26894
  built mphs[4] for 26894 keys; bits/key = 2.85209
  built positions[4] for 26894 keys; bits/key = 11.0133
lower 2048; upper 2185; num_bits_per_pos 12; keys_in_partition.size() 3001
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Aborted (core dumped)

3) 200_000_000 sequences: works fine, passes --check.


Since it passed the check on the bigger chunk, I also tried it on the actual data, which succeeded:

num_kmers_in_skew_index 1144356992(0.407893%) <- that's much better :)
The index is 477GB

It took around 3 days; the PTHash build went MUCH faster now that it could use 112 threads on our machine.

So now the steps would be:

jermp commented 1 year ago

Hi @rickbeeloo, excellent tests! I'm commenting on each of those in order.

1) This one fails because "average partition size is too small: use less partitions". Since we are now using num_partitions = num_threads, it seems that, for very small data (a small number of distinct minimizers), num_partitions = num_threads is too much. For small data, we probably have to tune num_partitions to ensure that this check here https://github.com/jermp/pthash/blob/master/include/builders/external_memory_builder_partitioned_phf.hpp#L49 is not triggered (currently, we need at least 10,000 keys per partition on average). I'm going to make it a constant exposed by PTHash, so that we can do the check as: if (avg_partition_size < pthash::min_avg_partition_size) then adjust num_partitions (see the sketch after this list).

2) This is strange. Never seen it before...

3) Excellent!

4) Yes, the num_kmers in skew index now seems muuuch better :)
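For point 1, the adjustment could look roughly like this (min_avg_partition_size stands in for the constant proposed above, which PTHash does not expose yet; the function name is illustrative):

#include <algorithm>
#include <cstdint>
#include <thread>

// Sketch: cap the number of MPHF partitions so that the average partition
// size stays above the minimum PTHash accepts for partitioned builds.
inline uint64_t choose_num_partitions(uint64_t num_keys,
                                      uint64_t min_avg_partition_size = 10000) {
    uint64_t num_partitions = std::thread::hardware_concurrency();
    if (num_partitions == 0) num_partitions = 1;
    if (num_keys / num_partitions < min_avg_partition_size) {
        num_partitions = std::max<uint64_t>(1, num_keys / min_avg_partition_size);
    }
    return num_partitions;
}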

Some quick questions: the results above are just for SSHash, right (not the entire Fulgor pipeline)? The 477GB should be the size of the unitig file in FASTA format on disk, not the space of the SSHash index. Right? Did you save the SSHash index on a file for these tests? SSHash should take way less space than 477GB.

Figuring out how to skip both GGCAT and SSHash in Fulgor. I'm not sure if GGCAT is stochastic somewhere; otherwise it wouldn't matter, as GGCAT only takes <1 day. Previously you said we can skip SSHash here. Perhaps the easiest would be to just comment out the construction somewhere in GGCAT.hpp for now and pass the same command line args, so that it reads the existing .fa and .colors.dat from GGCAT.

Yes, I think this is entirely doable. It's just a matter of passing this option to the cmd interface and refactoring the code a little bit. But, on the other hand, it is also a bit risky because we have to ensure the client pre-built both GGCAT and SSHash in the manner Fulgor expects. So, at least for the time being, I would perhaps not do this. Instead, let's make sure that SSHash builds correctly on the whole data and launch the whole Fulgor construction. It will take ~3+1 days, but who cares at this point. I think nobody has indexed such a huge collection before, so I'm very curious to see the final result!

rickbeeloo commented 1 year ago

Some quick questions: the results above are just for SSHash, right (not the entire Fulgor pipeline)?

Correct

The 477GB should be the size of the unitig file in FASTA format on disk, not the space of the SSHash index. Right?

Well not much smaller haha:

=== step 4: 'build_skew_index' 4996.82 [sec] (17.8106 [ns/kmer])
=== total_time 186744 [sec] (665.629 [ns/kmer])
total index size: 411060 [MB]
SPACE BREAKDOWN:
  minimizers: 0.369042 [bits/kmer]
  pieces: 0.232129 [bits/kmer]
  num_super_kmers_before_bucket: 0.266354 [bits/kmer]
  offsets: 7.31289 [bits/kmer]
  strings: 3.50072 [bits/kmer]
  skew_index: 0.0402878 [bits/kmer]
  weights: 5.24678e-09 [bits/kmer]
    weight_interval_values: 9.12484e-10 [bits/kmer]
    weight_interval_lengths: 3.42181e-09 [bits/kmer]
    weight_dictionary: 9.12484e-10 [bits/kmer]
  --------------
  total: 11.7214 [bits/kmer]
2023-09-09 17:37:46: saving data structure to disk...
2023-09-09 17:48:35: DONE

The index itself is 383G, while the unitig FASTA is 477G. Using the 128-bit hashes already doubles the size, I suppose, and the partitioned MPHF? But yeah, now it takes quite some time to load the whole index into RAM haha. I queried the SSHash index with the first part of the unitigs.fa and that seems to work (at least kmer matches are found).

jermp commented 1 year ago

Ok, that makes sense (it takes 2 days more or less). Well, we are using < 12 bits per kmer, which is rather good for such a large and branching dBG. You can see that it is very branching since we spend 3.5 bits/kmer on the strings rather than 2. (Using 128-bit hash codes does not change the index size; it only affects the type of hash function we use internally.) Of course SSHash works :) Probably we could also have used a slightly smaller m, for example m=19.

Ah, one thing: Fulgor will build an SSHash index with canonical_parsing ON. So, I would do what I said in the previous comment: re-try the whole pipeline now that we have tuned SSHash to work properly on such a collection. Or do you want to try something different first?

jermp commented 1 year ago

Ok, as you said, the indexing went well for 530,777 genomes. I'm reporting the result here (just for our own reference):

total index size: 1160.33 GB
SPACE BREAKDOWN:
  K2U: 445217736630 bytes / 445.218 GB (38.3701%)
  CCs: 713961091456 bytes / 713.961 GB (61.5311%)
  Other: 1146422457 bytes / 1.14642 GB (0.0988018%)
    U2C: 1105356120 bytes / 1.10536 GB (0.0952626%)
    filenames: 41066337 bytes / 0.0410663 GB (0.00353921%)
Color id range 0..530776
Number of distinct color classes: 1729869474
Number of ints in distinct color classes: 605674142792 (9.4303 bits/int)
k: 31
m: 19 (minimizer length used in K2U)
Number of kmers in dBG: 280552988962 (12.6954 bits/kmer)
Number of unitigs in dBG: 7074278871
CCs SPACE BREAKDOWN:
num. lists of size > 0 and <= 53077: 1729777605 (99.9947%) -- integers: 598670441537 (98.8437%) -- bits/int: 9.44205 -- 98.9669% of total space
num. lists of size > 53077 and <= 106154: 89926 (0.00519843%) -- integers: 6740886925 (1.11296%) -- bits/int: 4.97429 -- 0.587061% of total space
num. lists of size > 106154 and <= 159231: 1696 (9.80421e-05%) -- integers: 206203411 (0.0340453%) -- bits/int: 4.07192 -- 0.0147005% of total space
num. lists of size > 159231 and <= 212308: 103 (5.95421e-06%) -- integers: 17889262 (0.00295361%) -- bits/int: 3.05617 -- 0.000957207% of total space
num. lists of size > 212308 and <= 265385: 83 (4.79805e-06%) -- integers: 19609726 (0.00323767%) -- bits/int: 2.24667 -- 0.000771342% of total space
num. lists of size > 265385 and <= 318462: 28 (1.61862e-06%) -- integers: 8035270 (0.00132667%) -- bits/int: 1.84966 -- 0.000260212% of total space
num. lists of size > 318462 and <= 371539: 33 (1.90766e-06%) -- integers: 11076661 (0.00182882%) -- bits/int: 1.58139 -- 0.000306679% of total space
  colors: 9.38984 bits/int
  offsets: 0.0404647 bits/int
2023-09-22 11:01:43: DONE
** building the index took 474727 seconds / 7912.12 minutes
2023-09-22 11:01:43: saving index to disk...
2023-09-22 11:35:02: DONE