Closed billzt closed 1 year ago
How can I adjust the parameters to get similar results as FLASH?
vsearch
's merging algorithm is more conservative than flash
's by design. In vsearch
, there are three options you can toggle to relax some merging parameters:
fastq_minovlen
: specify the minimum overlap between the merged reads. The default is 10. Must be at least 5.fastq_maxdiffpct
: specify the maximum percentage of non-matching nucleotides allowed in the overlap region. The default value is 100.0%.fastq_maxdiffs
: specify the maximum number of non-matching nucleotides allowed in the overlap region. That option has a strong influence on the merging success rate. The default value is 10.There are other more sophisticated rules in the merging algorithm that will discard read pairs with a high fraction of mismatches, but these rules are not controlled by user-facing variables. So, it is currently not possible for end-users to adjust the vsearch
's merging algorithm parameters to get similar results as flash
.
@billzt using your 12S data sample I tried to show the difference between the two merging approaches, and the benefit-risk of each. To do so, I use taxonomic assignment as our reference metric, with the idea that amplicons with higher similarity percentages with reference sequences are less likely to be false-positives.
To summarize, flash
may produce more false-positives, while vsearch
may produce more false-negatives. Choosing one method depends on the scientific question at hand (for instance, is assessing precise alpha-diversity values important?)
The testing dataset initially contains 6,714 pairs. Of these, 5,050 are merged in exactly the same way by both methods (75.2% overlap). vsearch
merges 8 additional pairs, but these amplicons end-up not assigned (so possible false-positives). flash
merges 1,281 additional pairs, most of them with a similarity of 95% or more with reference sequences. However, the similarity distribution of these 1,281 flash-specific amplicons is not the same as the similarity distribution of the 5,050 amplicons common to both methods. The cumulative distribution function of flash-specific amplicons is greater than the cumulative distribution function of common amplicons, with a p-value < 2.2e-16 (conditions of the Kolmogorov-Smirnov test are violated, but I am not sure what other test would work here). My interpretation is that the 1,281 flash-specific amplicons are not a random subset of the general amplicon pool, but a set skewed toward (relatively) lower similarity percentages with reference sequences, indicating a higher false-positive rate or some erroneous merging.
See code below for future reference:
cd ./issue_526_20230704/
mkdir -p data results
## versions
flash --version # FLASH v1.2.11
vsearch --version # vsearch v2.22.1
## raw data
(cd ./data/
wget \
ftp.sra.ebi.ac.uk/vol1/fastq/DRR126/DRR126155/DRR126155_1.fastq.gz \
ftp.sra.ebi.ac.uk/vol1/fastq/DRR126/DRR126155/DRR126155_2.fastq.gz
)
## run flash
(cd ./data/
flash \
--output-directory=../results/ \
DRR126155_1.fastq.gz \
DRR126155_2.fastq.gz 2>&1 | \
tee ../results/flash.log
)
(cd ./results/
vsearch \
--fastq_filter "out.extendedFrags.fastq" \
--fastq_ascii 33 \
--fasta_width 0 \
--fastaout "DRR126155_flash.fasta"
rm out.*
)
## run vsearch
(cd ./data/
vsearch \
--fastq_mergepairs DRR126155_1.fastq.gz \
--reverse DRR126155_2.fastq.gz \
--fastq_allowmergestagger \
--quiet \
--fasta_width 0 \
--fastaout ../results/DRR126155_vsearch.fasta 2>&1 | \
tee ../results/vsearch.log
)
## dereplication (intra)
(cd ./results/
FASTA_V="DRR126155_vsearch.fasta"
FASTA_F="DRR126155_flash.fasta"
(( $(grep -c "^>" "${FASTA_V}") == $(grep -v "^>" "${FASTA_V}" | sort -u | wc -l) )) && \
echo "no duplicates in vsearch results"
(( $(grep -c "^>" "${FASTA_F}") == $(grep -v "^>" "${FASTA_F}" | sort -u | wc -l) )) && \
echo "no duplicates in flash results"
)
## common (n = 5050) most entries are common!
(cd ./results/
FASTA_V="DRR126155_vsearch.fasta"
FASTA_F="DRR126155_flash.fasta"
comm -12 \
<(grep -v "^>" "${FASTA_F}" | sort) \
<(grep -v "^>" "${FASTA_V}" | sort) | \
grep \
--no-group-separator \
-F -f - \
-B 1 "${FASTA_V}" > "DRR126155_common.fasta"
)
## entries specific to vsearch (n = 8)
(cd ./results/
FASTA_V="DRR126155_vsearch.fasta"
FASTA_F="DRR126155_flash.fasta"
comm -13 \
<(grep -v "^>" "${FASTA_F}" | sort) \
<(grep -v "^>" "${FASTA_V}" | sort) | \
grep \
--no-group-separator \
-F -f - \
-B 1 "${FASTA_V}" > "DRR126155_vsearch_exclusive.fasta"
)
## entries specific to flash (n = 1281)
(cd ./results/
FASTA_V="DRR126155_vsearch.fasta"
FASTA_F="DRR126155_flash.fasta"
comm -23 \
<(grep -v "^>" "${FASTA_F}" | sort) \
<(grep -v "^>" "${FASTA_V}" | sort) | \
grep \
--no-group-separator \
-F -f - -B 1 "${FASTA_F}" > "DRR126155_flash_exclusive.fasta"
)
## taxonomic assignments with a 12S reference database from:
## https://github.com/zjgold/FishCARD
function taxonomic_assignment() {
local -ir THREADS=4
local -ir IDDEF=2
local -r IDENTITY="0.5"
local -ir MAXREJECTS=32
local -r SUBJECTS="../data/GenBank_.fasta"
local ASSIGNMENTS="${1/\.fasta/.hits}"
vsearch \
--usearch_global "${1}" \
--threads "${THREADS}" \
--dbmask none \
--qmask none \
--rowlen 0 \
--notrunclabels \
--userfields query+id${IDDEF}+target \
--maxaccepts 0 \
--maxrejects "${MAXREJECTS}" \
--top_hits_only \
--output_no_hits \
--db "${SUBJECTS}" \
--id "${IDENTITY}" \
--iddef "${IDDEF}" \
--userout "${ASSIGNMENTS}"
}
(cd ./results/
taxonomic_assignment "DRR126155_common.fasta"
taxonomic_assignment "DRR126155_vsearch_exclusive.fasta"
taxonomic_assignment "DRR126155_flash_exclusive.fasta"
)
library(tidyverse)
setwd("issue_526_20230704/results/")
colnames <- c("amplicon", "similarity", "reference")
sources <- c("common", "flash_exclusive", "vsearch_exclusive")
sources %>%
purrr::map_chr(~ str_c("DRR126155_", .x, ".hits")) %>%
purrr::map_dfr(~ read_tsv(.x,
col_names = colnames,
show_col_types = FALSE),
.id = "source") %>%
select(-reference) %>%
distinct() -> d
d %>%
ggplot(aes(x = source, y = similarity)) +
geom_violin(scale = "count") +
coord_cartesian(ylim = c(95, 100))
ks.test(d %>% filter(source == 2) %>% pull(similarity),
d %>% filter(source == 1) %>% pull(similarity),
alternative = "two.sided",
exact = NULL)
ks.test(d %>% filter(source == 2) %>% pull(similarity),
d %>% filter(source == 1) %>% pull(similarity),
alternative = "great",
exact = NULL)
## D^+ = 0.35122, p-value < 2.2e-16
## alternative hypothesis: the CDF of x lies above that of y
Thank you
Thanks for explaining and testing, @frederic-mahe !
See also issues #524 and #282.
Testing data: https://www.ncbi.nlm.nih.gov/sra/?term=DRR126155 6714 pairs
FLASH (v1.2.7), using all default parameter, i.e.
The result is:
The merging rate is very high.
vsearch (v2.22.1_linux_x86_64), command is
That is, making maximum percentage diff. bases in overlap = 25%, equal with FLASH
The result is:
The percentage of not merging is significantly higher than FLASH, due to many reads regarded as "alignment score too low, or score drop too high"
How can I adjust the parameters to get similar results as FLASH?