msalamon2 commented 2 months ago

Hello,

I tried to run the new version of vsearch with sintax on a computing cluster, but the processing was extremely slow despite the large amount of computing resources requested (4775 MB per core) and threading (40 cores). The input ASV fasta file is 4.6 MB for 10,710 ASVs, and the reference database is the complete Eukaryote COI BOLD database (1.7 GB, 2,216,285 sequences).

vsearch ran for 13 days, but only output a 72.7 KB one-column file, which seems to indicate that only 6236 ASVs were processed. Below is the head and tail of the output file:

ASV_7
ASV_20
ASV_16
ASV_17
ASV_10
ASV_19
ASV_12
ASV_34
ASV_35
ASV_9
...
ASV_6228
ASV_6229
ASV_6230
ASV_6231
ASV_6232
ASV_6233
ASV_6234
ASV_6235
ASV_6236

Here is the script for the .sh file used to run vsearch:

`#!/bin/bash
#SBATCH --mem-per-cpu=4775M
#SBATCH --cpus-per-task=40
#SBATCH --time=48:00:00
#SBATCH --account=def-mcristes
#SBATCH --mail-user=mathilde.salamon@mcgill.ca
#SBATCH --mail-type=ALL
module load StdEnv/2020 vsearch/2.28.1
# Run VSEARCH
vsearch --sintax ASVs_Malaise_traps_DADA2.fasta \
  --sintax_random \
  --db SINTAX_COI_v5.1.0ref.fasta \
  --tabbedout rdp_sintax_unoise3_COI.txt \
  --sintax_cutoff 0.8 \
  --strand both \
  --threads 40 \
  --log sintax_COI_MalaiseTraps_log.txt`

I am unsure why the program was so slow. Could this be due to the very large reference database?

Thank you for your help.

Best wishes, Mathilde Salamon

torognes commented 2 months ago

Hi Mathilde,

Thank you for reporting this issue.

Both the time used and the lack of any results for most of the sequences look very strange. I have therefore tried to reproduce your efforts and downloaded the SINTAX_COI_v5.1.0ref.fasta file from the https://github.com/terrimporter/CO1Classifier repository.

It seems like the problem is related to masking of the sequences in the database. By default, vsearch applies "soft masking" to the sequences in the databases. That means that all lower case letters are masked and not used during the initial stage of sequence comparison. It is described in the manual, but it is not mentioned for the sintax command, so we need to improve the documentation. Perhaps it should not even be applied by default for this command. Since the database seems to only contain lower case letters for the nucleotide symbols, all of the sequences are masked, leaving no results.
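One way to check whether a database is affected is to count how many records consist only of lower case letters. The following is a minimal sketch, not part of vsearch: the function name `count_lowercase_records` is made up for illustration, and it assumes a plain FASTA file with `>` header lines.

```shell
# count_lowercase_records FILE   (hypothetical helper, not a vsearch command)
# Prints "<all-lowercase records> <total records>" for a FASTA file.
# A record counts as all-lowercase when its sequence contains at least one
# lower case letter and no upper case letter. Multi-line records are handled.
count_lowercase_records() {
    # LC_ALL=C keeps [a-z] and [A-Z] strictly case-sensitive in every awk.
    LC_ALL=C awk '
        function close_record() {
            if (seen) {
                total++
                if (has_lower && !has_upper) masked++
            }
            has_lower = 0
            has_upper = 0
        }
        /^>/ { close_record(); seen = 1; next }
        {
            if ($0 ~ /[a-z]/) has_lower = 1
            if ($0 ~ /[A-Z]/) has_upper = 1
        }
        END { close_record(); print masked + 0, total + 0 }
    ' "$1"
}
```

Running `count_lowercase_records SINTAX_COI_v5.1.0ref.fasta` prints two numbers. If they are equal, every reference sequence is lower case and would be fully masked under soft masking.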

I am sorry that you have wasted 13 days of computation time (times 40 CPUs) with this. The good news is that this problem can be easily resolved by including the --dbmask none option on the command line. When I did this with 10710 randomly subsampled sequences from the same database, the whole run completed in under 10 minutes using 8 threads and less than 6 GB of memory on my MacBook. And the results looked reasonable.
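For reference, the original command with this option added might look like the sketch below. The `run_sintax` wrapper function is only an illustrative name. All file names and other options are copied unchanged from the job script above, and only `--dbmask none` is new.

```shell
# Hypothetical wrapper around the original vsearch call.
# The only change from the job script is the added --dbmask none,
# which turns off masking of the database sequences.
run_sintax() {
    vsearch --sintax ASVs_Malaise_traps_DADA2.fasta \
        --db SINTAX_COI_v5.1.0ref.fasta \
        --dbmask none \
        --sintax_random \
        --tabbedout rdp_sintax_unoise3_COI.txt \
        --sintax_cutoff 0.8 \
        --strand both \
        --threads 40 \
        --log sintax_COI_MalaiseTraps_log.txt
}
```

Calling `run_sintax` at the end of the job script, after the `module load` line, would then start the classification.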

torognes commented 2 months ago

For a future release of vsearch we should consider:

- Update documentation regarding soft-masking and sintax
- Should soft-masking be applied at all (by default) for the sintax command?
- Should a warning be issued when detecting fully masked sequences in the query or database, in general?

msalamon2 commented 2 months ago

Hi Torbjørn,

Thank you very much for your quick response and explanation, and for running the test; it was very insightful! I'm glad this is such an easy fix, because I was planning to use vsearch with sintax for all my databases.

Best wishes, Mathilde Salamon

torognes / vsearch

Very slow processing sintax vsearch/2.28.1 #570