torognes / vsearch

Versatile open-source tool for microbiome analysis

Optional low memory mode for --derep_fulllength #475

Closed DominikBuchner closed 1 year ago

DominikBuchner commented 2 years ago

Hello there,

would it be possible to add a low-memory mode for the --derep_fulllength command? Many metabarcoding pipelines need a global dereplication before clustering or denoising. For a large file, vsearch will consume too much memory. With datasets getting larger in the future, I'd guess that this will happen more often. Would it be possible to write the hash table to a file instead of storing all of it in memory, or to organize the algorithm in a less memory-intensive way?

For example: I'm struggling to dereplicate a 160 GB FASTA file even with 128 GB of RAM. It works in the end, but only because Windows starts to write a pagefile where the values are stored. This slows down the algorithm, but it can at least finish. For global dereplication of large datasets, which only needs to be done every now and then, it would be great to have a slower but more stable algorithm.

torognes commented 2 years ago

Thanks for the suggestion. I understand the need for this. I will consider implementing it in the future, but I do not have any concrete plans now.

torognes commented 2 years ago

Perhaps this can be performed by a two-pass scan of the input file. In the first pass we store the hash of each sequence and count the number of copies. In the second pass we write out the actual sequences and their abundances. Not storing the sequences and the headers will save a lot of memory. A disadvantage would be that the sequences will not be sorted by decreasing abundance, but that may not be crucial, and the sorting could be performed in a subsequent step.
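The two-pass idea described above can be sketched in a few lines of Python. This is only an illustration of the approach, not the vsearch implementation: the function names (`read_fasta`, `seq_digest`, `derep_two_pass`), the choice of BLAKE2b as the hash, the case-insensitive comparison, and the `;size=` suffix used to report abundance are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def seq_digest(sequence):
    # 16-byte digest of the upper-cased sequence (hash choice is an assumption).
    return hashlib.blake2b(sequence.upper().encode("ascii"), digest_size=16).digest()


def derep_two_pass(in_path, out_path):
    """Dereplicate in_path into out_path; return the number of unique sequences."""
    # Pass 1: count copies per digest. Sequences and headers are not kept.
    counts = Counter(seq_digest(seq) for _, seq in read_fasta(in_path))
    n_unique = len(counts)

    # Pass 2: write each sequence the first time its digest is seen.
    # Popping the digest marks it as written without a second table.
    with open(out_path, "w") as out:
        for header, seq in read_fasta(in_path):
            size = counts.pop(seq_digest(seq), None)
            if size is None:
                continue  # a copy of a sequence that was already written
            out.write(f">{header};size={size}\n{seq}\n")
    return n_unique
```

Only one fixed-size digest and one counter per unique sequence stay in memory, which is where the saving comes from. The output follows the order of first appearance in the input rather than decreasing abundance, and identical digests are assumed to mean identical sequences.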

torognes commented 1 year ago

I have now added the derep_smallmem command, which should use much less memory (approximately one tenth in my tests). It can read both FASTA and FASTQ but only writes FASTA (specified with --fastaout). The input must be a file, not a pipe, as it is read twice. The output is not sorted.
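For anyone scripting this, a small Python helper could guard against the pipe restriction before vsearch is launched. This is a hypothetical wrapper, not part of vsearch: the function name `derep_smallmem_args` and the `vsearch` executable name are assumptions, and only the two options named above (`--derep_smallmem` and `--fastaout`) are used.

```python
import os
import stat


def derep_smallmem_args(input_path, output_fasta, vsearch="vsearch"):
    """Build the argument list for a derep_smallmem run (nothing is executed)."""
    mode = os.stat(input_path).st_mode
    if not stat.S_ISREG(mode):
        # The input is read twice, so pipes and other streams cannot be used.
        raise ValueError(f"{input_path!r} is not a regular file")
    return [vsearch, "--derep_smallmem", input_path, "--fastaout", output_fasta]


# Possible usage, assuming vsearch 2.22.1 or later is on the PATH:
#   import subprocess
#   subprocess.run(derep_smallmem_args("reads.fasta", "derep.fasta"), check=True)
```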

Commit 79a504d3725470560d768dfbc08b9af476733b55.

It would be nice if you could test it.

torognes commented 1 year ago

The derep_smallmem command is included in version 2.22.1 just released.

torognes commented 1 year ago

Here is the text from the manual:

          --derep_smallmem filename
                   Merge  strictly  identical sequences contained in file-
                   name, as with the --derep_fulllength command, but using
                   much less memory. The output is written to a FASTA file
                   specified with the --fastaout  option.  The  output  is
                   written in the order that the sequences first appear in
                   the input, and not in descending abundance order as with
                   the  other dereplication commands. It can read, but not
                   write FASTQ files. This  command  cannot  read  from  a
                   pipe,  it  must  be a proper file, as it is read twice.
                   Dereplication is performed with a 128 bit hash function
                   and it is not verified that grouped sequences are iden-
                   tical,  however  the  probability  that  two  different
                   sequences  are  grouped  in  a dataset of 1 000 000 000
                   unique sequences is approximately 1e-21.   Multithread-
                   ing  and  the  options --topn, --uc, or --tabbedout are
                   not supported.

torognes commented 1 year ago

We could perhaps change the code to also check for hash collisions by reading the input a third time, but then more memory would be needed. Maybe a subsequent run with search_exact could be used to check that all input sequences are present.
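To put the collision risk in perspective, the figure quoted in the manual can be reproduced with the usual birthday-bound approximation. This is a back-of-the-envelope check that assumes an ideal, uniformly distributed 128-bit hash; the function name `collision_probability` is made up for the example.

```python
from fractions import Fraction


def collision_probability(n_unique, hash_bits=128):
    """Approximate chance that any two of n_unique distinct inputs share a hash.

    Uses the birthday bound: n(n-1)/2 pairs, each colliding with
    probability 2**-hash_bits. Valid while the result is much smaller than 1.
    """
    pairs = Fraction(n_unique * (n_unique - 1), 2)
    return float(pairs / 2**hash_bits)


p = collision_probability(1_000_000_000)
print(f"{p:.2e}")  # on the order of 1e-21 for one billion unique sequences
```

Because the estimate grows with the square of the number of unique sequences, a dataset ten times larger would still leave the probability around 1e-19.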