Thanks for the suggestion. I understand the need for this. I will consider implementing it in the future, but I do not have any concrete plans now.
Perhaps this can be performed by a two-pass scan of the input file. In the first pass we store the hash of the sequences and count the number of copies. In the second pass we write out the actual sequences and their abundances. Not storing the sequences and the headers will save a lot of memory. A disadvantage would be that the sequences will not be sorted by decreasing abundance, but that may not be crucial, and sorting could be performed in a subsequent step.
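For concreteness, here is a minimal Python sketch of that two-pass idea (an illustration only, not vsearch's actual implementation; the use of MD5 as the 128-bit hash and the `size=` annotation in the output headers are my assumptions):

```python
import hashlib
from collections import Counter

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

def derep_two_pass(infile, outfile):
    # Pass 1: store only a 128-bit hash per sequence and its copy number.
    counts = Counter()
    for _, seq in read_fasta(infile):
        digest = hashlib.md5(seq.upper().encode()).digest()
        counts[digest] += 1

    # Pass 2: write each sequence the first time its hash is seen,
    # annotated with the abundance collected in pass 1.
    written = set()
    with open(outfile, "w") as out:
        for header, seq in read_fasta(infile):
            digest = hashlib.md5(seq.upper().encode()).digest()
            if digest not in written:
                written.add(digest)
                out.write(f">{header};size={counts[digest]}\n{seq}\n")
```

Memory then scales with the number of unique sequences times the digest and counter size, rather than with the total length of all sequences and headers, and the output comes out in input order rather than sorted by abundance.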
I have now added the derep_smallmem command that should use much less memory (approximately one tenth in my tests). It can read both FASTA and FASTQ but only writes FASTA (specified with --fastaout). The input must be a file, not a pipe, as it is read twice. The output is not sorted.
Commit 79a504d3725470560d768dfbc08b9af476733b55.
It would be nice if you could test it.
The derep_smallmem command is included in version 2.22.1, just released.
Here is the text from the manual:
--derep_smallmem filename
Merge strictly identical sequences contained in filename, as with the --derep_fulllength command, but using much less memory. The output is written to a FASTA file specified with the --fastaout option. The output is written in the order that the sequences first appear in the input, and not in descending abundance order as with the other dereplication commands. It can read, but not write, FASTQ files. This command cannot read from a pipe; it must be a proper file, as it is read twice. Dereplication is performed with a 128-bit hash function and it is not verified that grouped sequences are identical; however, the probability that two different sequences are grouped in a dataset of 1 000 000 000 unique sequences is approximately 1e-21. Multithreading and the options --topn, --uc, and --tabbedout are not supported.
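For what it's worth, the quoted figure is consistent with the usual birthday-bound approximation for a uniform 128-bit hash (a back-of-the-envelope check of my own, not taken from the manual), with n = 10^9 unique sequences:

$$P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{128}} \approx \frac{(10^{9})^{2}}{2 \cdot 3.4 \times 10^{38}} \approx 1.5 \times 10^{-21}$$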
We could perhaps change the code to also check for hash collisions by reading the input a third time, but then more memory would be needed. Maybe a subsequent run with search_exact could be used to check that all input sequences are present.
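As a rough illustration of that kind of check (done outside vsearch, and only practical when the unique sequences fit in memory), one could compare the dereplicated output against the original input like this; the file names are placeholders:

```python
def sequences(path):
    """Yield uppercase sequences from a FASTA file."""
    seq = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if seq:
                    yield "".join(seq).upper()
                seq = []
            else:
                seq.append(line)
    if seq:
        yield "".join(seq).upper()

def count_missing(input_fasta, derep_fasta):
    """Count input sequences that do not occur verbatim in the dereplicated output."""
    dereplicated = set(sequences(derep_fasta))  # unique sequences only
    return sum(1 for seq in sequences(input_fasta) if seq not in dereplicated)

# 0 means no sequence was lost to a hash collision.
print(count_missing("reads.fasta", "derep.fasta"))
```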
Hello there,
would it be possible to add a low-memory mode for the --derep_fulllength command? Many metabarcoding pipelines need a global dereplication before clustering or denoising. For a large file, vsearch will consume too much memory, and with datasets getting larger in the future I'd guess that this will happen more often. Would it be possible to write the hash table to a file instead of storing all of it in memory, or to organize the algorithm in a less memory-intensive way?
For example: I'm struggling to dereplicate a 160 GB FASTA file even with 128 GB of RAM. It works in the end, but only because Windows starts to write a pagefile where the values are stored. This slows down the algorithm, but it can at least finish. For global dereplication of large datasets, which only needs to be done every now and then, it would be great to have a slower but more stable algorithm.