refresh-bio / vclust

Fast and accurate tool for calculating Average Nucleotide Identity (ANI) and clustering virus genomes and metagenomes
GNU General Public License v3.0

Excessive use of memory (precompiled linux x64 binary) #8

Closed LanderDC closed 3 months ago

LanderDC commented 3 months ago

Hi again,

Opening a new issue because this is unrelated to #7.

While running the comparison for #7, I noticed that Vclust needs a lot of memory (~1.75 TB for 1 million sequences). I'm not sure which subcommand reaches peak memory usage, but my jobs already failed due to a memory issue at the prefiltering step on a machine with 256 GiB of RAM. Just letting you know because it seems related to #6 (not sure, though), as I'm using the same precompiled binary. I haven't tried compiling the software myself to see whether that would fix it.

[image]

The specifications of my machine:

aziele commented 3 months ago

Hi!

Thank you for reporting this issue, and we apologize for the inconvenience.

The excessive use of RAM might be due to the nature of your dataset of input genomes. Vclust was primarily designed to be efficient with large datasets of diverse viruses (covering a wide range of ANI). However, it struggles with large datasets of highly similar or nearly identical genomes (e.g., tens of thousands of SARS-CoV-2 genomes). This limitation arises because the tool uses sparse matrices to store ANI values between pairs of sequences. In large datasets of diverse viral genomes (e.g., 15 million sequences from IMG/VR), most possible pairs have no similarity and thus are not included in the sparse matrix, keeping RAM usage low. Conversely, in datasets with high sequence identity, the matrix becomes dense, with its size growing quadratically with the number of sequences, often exceeding memory capacity. This dense matrix also increases running time due to the higher number of pairs needing alignment and clustering. We mention this limitation in our preprint and the 'Limitations' section of this repository's README.md.
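To make the sparse-versus-dense argument concrete, here is a rough back-of-the-envelope sketch. It is not Vclust's actual storage layout: the function names and the 16 bytes per stored pair (two 4-byte sequence indices plus one 8-byte ANI value) are assumptions chosen only for illustration. It shows how the estimated size depends on the fraction of pairs that share any similarity.

```python
def n_pairs(n_sequences: int) -> int:
    """Number of unordered sequence pairs: n * (n - 1) / 2."""
    return n_sequences * (n_sequences - 1) // 2


def matrix_bytes(n_sequences: int, density: float, bytes_per_entry: int = 16) -> int:
    """Estimated bytes needed to store ANI values for a fraction of all pairs.

    density is the fraction of pairs with detectable similarity (0.0 to 1.0).
    bytes_per_entry is an ASSUMED cost per stored pair, not Vclust's real one.
    """
    stored_pairs = int(n_pairs(n_sequences) * density)
    return stored_pairs * bytes_per_entry


if __name__ == "__main__":
    n = 1_000_000  # dataset size mentioned in this issue
    # Hypothetical densities, from very diverse to nearly identical genomes.
    for density in (0.000001, 0.001, 0.1, 1.0):
        gib = matrix_bytes(n, density) / 2**30
        print(f"density={density:<9} estimated size: {gib:,.2f} GiB")
```

Under these assumptions, the estimate stays small while only a tiny fraction of pairs is stored, and it grows with the square of the number of sequences once most pairs have to be kept.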

We definitely plan to update Vclust to handle large datasets of homogeneous sequences, but we can't provide an estimated timeline at the moment.

If your issue is not due to very similar sequences, we would be happy to take a look at your dataset, if possible.

Andrzej

LanderDC commented 3 months ago

OK, I see. Thanks for this information. The input data comes from virome sequencing of mosquitoes, so I didn't expect a lot of highly similar sequences. However, there are a lot of short sequences in there (average length is about 600 bases), and the clustering yields ~500,000 cluster representatives. So I would say that around 50% of the sequences have >=95% ANI to the other half of the dataset. I guess this might already be enough to trigger the high RAM usage?

aziele commented 3 months ago

You're right, there aren't many identical sequences in the set. If you'd like (and if possible), we can analyze the set on our side to find out more and work on a solution.

LanderDC commented 3 months ago

Sure, I can share them privately. How do you prefer to receive them (the file is around 675MB)?

aziele commented 3 months ago

Thanks, I just emailed you.

aziele commented 3 months ago

The excessive RAM usage has been fixed in Vclust version 1.0.3.