muellan / metacache

memory efficient, fast & precise taxonomic classification system for metagenomic read mapping
GNU General Public License v3.0

Guidance for working with large reference data sets #37

Open ohickl opened 1 year ago

ohickl commented 1 year ago

Hi,

I am trying to build a database from RefSeq and GenBank genomes. The total size of the ~1.9 million compressed genomes is ~8.5T. Since the data set contains many genomes, some of which have extremely long chromosomes, I built MetaCache with: make MACROS="-DMC_TARGET_ID_TYPE=uint64_t -DMC_WINDOW_ID_TYPE=uint64_t -DMC_KMER_TYPE=uint64_t"
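An annotated restatement of that build command; the macro descriptions are my reading of these type parameters, so please double-check them against the MetaCache documentation:

```bash
# MC_TARGET_ID_TYPE  - integer type used to index reference sequences (targets)
# MC_WINDOW_ID_TYPE  - integer type used to index windows within a target sequence
# MC_KMER_TYPE       - storage type for kmers; 64 bit is needed for kmer lengths > 16
make MACROS="-DMC_TARGET_ID_TYPE=uint64_t -DMC_WINDOW_ID_TYPE=uint64_t -DMC_KMER_TYPE=uint64_t"
```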

  1. What will the peak memory consumption during the build be when partitioning into partitions of size x M? At the moment I am running with ${p2mc}/metacache-partition-genomes ${p2g} 1600000, since I have at most ~1.9T of memory available. Is the partition size reasonable? Does it matter for the partition size calculation whether the genomes are compressed or not?

  2. Would it be beneficial for build time and memory consumption to create more, smaller partitions instead of fewer large ones? There was a similar question in https://github.com/muellan/metacache/issues/33#issuecomment-1174742729, with advice to build fewer partitions to keep the merging time in check. Should I then try to find the maximum partition size that will fit into memory during building? Since I am partitioning anyway, do I actually need to compile with uint64_t, or could I check the sequence count of the largest partition and see whether I can get away with uint32_t?

  3. Would you expect performance differences between querying a single db, few partitions, and many partitions, using the merge functionality with the latter two?

  4. I chose -kmerlen 20 based on Figure SF1 from the publication. Would you advise against this in favor of the default value of 16, maybe to keep the computational resource demand and query speed etc. at a reasonable level? Should other sketching parameters be adjusted as well for a value of 20 or are the defaults fine?

  5. Since the reference data set is large, should some of the advanced options be set/adjusted e.g. -remove-overpopulated-features? If so, what values should be chosen, based on the reference data?

I would be very grateful for any guidance you have.

Best

Oskar

muellan commented 1 year ago

Wow that is a large data set. I'll have to think about your questions. Maybe @Funatiq can also chime in.

To point 4: one thing I would advise right away is to keep the kmer length at 16 and the kmer data type at 32 bits. The difference in classification accuracy is not that big, but the runtime and memory impact will be quite substantial.
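For context, a quick back-of-envelope (my own, not from the MetaCache docs): with the usual 2-bit nucleotide encoding, a length-16 kmer fits exactly into 32 bits (2 × 16 = 32), whereas k = 20 needs 40 bits and therefore the 64-bit kmer type, roughly doubling the storage per kmer feature.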

muellan commented 1 year ago

1) The peak memory consumption with default settings is usually around 2 times the size of the uncompressed sequences, and the database size on disk is usually about the same as the uncompressed sequence size. (A back-of-envelope calculation for your numbers is sketched after point 5.)

2) If you can limit the number of sequences (note: this is the number of individual sequences in the FASTA/FASTQ files) to fewer than 65k per partition, that would be great, as this greatly influences memory usage.

3) That is hard to say as it depends on many factors: I/O speed (the merge mode needs to read intermediate results from disk), available RAM, etc. The only advice I could give based on our experience is to keep the number of partitions low, basically try to make the largest partition barely fit into RAM.

4) You should try to use the default kmer length and data type. Larger kmers will greatly affect memory consumption and the classification accuracy gain is not that big (at least when using databases in the 10s of GB). There is, however, the danger of having too many identical kmers when inserting hundreds of gigabytes of sequences - but this is mitigated by having more partitions. So I would say that as long as a partition is not larger than 100GB, a kmer length of 16 should be fine.

5) You should use -remove-overpopulated-features. This is probably the option that affects the query runtime the most. In our experience, even with very long genomes and a total DB size below 100GB, you don't need to specify -max-locations-per-feature, meaning that the default value will be fine.
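To make points 1 and 2 concrete, here is a rough sketch based on the numbers from the question. It assumes the size argument of metacache-partition-genomes is given in MB of uncompressed sequence data, and the partition layout and file suffix in the second half are placeholders:

```bash
# Point 1: peak build memory ~= 2 x uncompressed partition size, so with
# ~1.9 TB of RAM a partition should stay below roughly 0.9 TB uncompressed,
# i.e. a size argument around 900000 rather than 1600000:
${p2mc}/metacache-partition-genomes ${p2g} 900000

# Point 2: count the individual sequences per partition (ideally < 65k).
# Adjust the glob to however your partitions are actually organized:
for part in ${p2g}/partition_*; do
    n=$(zcat "$part"/*.fna.gz | grep -c '^>')
    echo "$part: $n sequences"
done
```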

In summary: I would first try to build one single partition with k=16 (32-bit kmers) and perform a few experiments to estimate the runtime performance and accuracy based on this single partition. Once everything works satisfactorily for this single partition, I would then build all partitions.
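A minimal sketch of such a pilot run, assuming a default (32-bit kmer) build; the partition path, taxonomy directory and read file are placeholders, and the exact option spellings should be checked against MetaCache's built-in help:

```bash
# Build one test partition with the default k = 16 and prune overpopulated
# features as recommended in point 5:
./metacache build pilot_db ${p2g}/partition_0 \
    -taxonomy ${taxonomy_dir} \
    -remove-overpopulated-features

# Query a read sample against the pilot database to gauge speed and accuracy:
./metacache query pilot_db test_reads.fq -out pilot_results.txt
```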

I'll be honest, we have never built a database in the TB range.

ohickl commented 1 year ago

Thanks a lot for the detailed answer! Given the trade-off between having too many identical kmers in large partitions and wanting to create as few partitions as possible to keep the run time down, would you then still advise creating large partitions instead of many 100G ones?

muellan commented 1 year ago

You just might need to experiment a bit with all of that. I guess I would start with larger partitions (1TB or more) and reduce the partition size in case of any problem / poor performance / poor classification results.

BTW: Can you tell me a bit about your use case and the hardware specs of your system(s)?

ChillarAnand commented 1 month ago

I am working with ~500K genomes comprising about 1.5TB. Is there any way to speed up the index building?

I have created another issue related to memory mapping. https://github.com/muellan/metacache/issues/43

muellan commented 1 month ago

@ChillarAnand: If you have access to a GPU system, you can use the GPU version of MetaCache, which is able to build even very large database files within seconds to a few minutes. This, however, will also require partitioning your database so that the partitions fit into the GPU memory. This is even faster when you have access to a multi-GPU system like an NVIDIA DGX.

The produced database files are compatible with the CPU version, so you can build databases on the GPU and then query them on the CPU.
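A rough sketch of that workflow; the GPU binary name, partition size, GPU memory figure and paths below are assumptions for illustration, not taken from this thread:

```bash
# 1) Partition so that each partition's database fits into GPU memory; per the
#    rule of thumb earlier in this thread, the on-disk database is roughly the
#    size of the uncompressed input (40 GB partitions shown as an example for
#    an 80 GB GPU):
${p2mc}/metacache-partition-genomes ${p2g} 40000

# 2) Build a partition with the GPU-enabled binary (name assumed here), then
#    query it with the regular CPU binary -- the database files are compatible:
./metacache-gpu build part0_db ${p2g}/partition_0 -taxonomy ${taxonomy_dir}
./metacache query part0_db reads.fq -out results_part0.txt
```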

Regarding database loading see my comment in https://github.com/muellan/metacache/issues/43