torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

Too many "alignment score too low, or score drop too high" during pair-end merge compared with FLASH #526

Closed billzt closed 1 year ago

billzt commented 1 year ago

Testing data: https://www.ncbi.nlm.nih.gov/sra/?term=DRR126155 6714 pairs

FLASH (v1.2.7), using all default parameter, i.e.

  -m, --min-overlap=NUM  Default: 10
 -x, --max-mismatch-density=NUM Default: 0.25.

The result is:

# [FLASH] Processed 6714 read pairs
# [FLASH]  
# [FLASH] Read combination statistics:
# [FLASH]     Total reads:      6714
# [FLASH]     Combined reads:   6331
# [FLASH]     Uncombined reads: 383
# [FLASH]     Percent combined: 94.30%

The merging rate is very high.

vsearch (v2.22.1_linux_x86_64), command is

vsearch --fastq_mergepairs DRR126155.1.fq.gz -reverse DRR126155.2.fq.gz \
--fastq_allowmergestagger --fastaout DRR126155.merge.vsearch.fa --fastq_maxdiffpct 25 \
--fastq_maxdiffs 100000

That is, making maximum percentage diff. bases in overlap = 25%, equal with FLASH

The result is:

Merging reads 100% 
      6714  Pairs
      5696  Merged (84.8%)
      1018  Not merged (15.2%)

Pairs that failed merging due to various reasons:
        83  too few kmers found on same diagonal
         3  too high percentage of differences
       932  alignment score too low, or score drop too high

Statistics of all reads:
    148.27  Mean read length

Statistics of merged reads:
    225.33  Mean fragment length
     25.95  Standard deviation of fragment length
      0.81  Mean expected error in forward sequences
      0.55  Mean expected error in reverse sequences
      0.61  Mean expected error in merged sequences
      3.50  Mean observed errors in merged region of forward sequences
      2.63  Mean observed errors in merged region of reverse sequences
      6.13  Mean observed errors in merged region

The percentage of not merging is significantly higher than FLASH, due to many reads regarded as "alignment score too low, or score drop too high"

How can I adjust the parameters to get similar results as FLASH?

frederic-mahe commented 1 year ago

How can I adjust the parameters to get similar results as FLASH?

vsearch's merging algorithm is more conservative than flash's by design. In vsearch, there are three options you can toggle to relax some merging parameters:

There are other more sophisticated rules in the merging algorithm that will discard read pairs with a high fraction of mismatches, but these rules are not controlled by user-facing variables. So, it is currently not possible for end-users to adjust the vsearch's merging algorithm parameters to get similar results as flash.

frederic-mahe commented 1 year ago

@billzt using your 12S data sample I tried to show the difference between the two merging approaches, and the benefit-risk of each. To do so, I use taxonomic assignment as our reference metric, with the idea that amplicons with higher similarity percentages with reference sequences are less likely to be false-positives.

To summarize, flash may produce more false-positives, while vsearch may produce more false-negatives. Choosing one method depends on the scientific question at hand (for instance, is assessing precise alpha-diversity values important?)

The testing dataset initially contains 6,714 pairs. Of these, 5,050 are merged in exactly the same way by both methods (75.2% overlap). vsearch merges 8 additional pairs, but these amplicons end-up not assigned (so possible false-positives). flash merges 1,281 additional pairs, most of them with a similarity of 95% or more with reference sequences. However, the similarity distribution of these 1,281 flash-specific amplicons is not the same as the similarity distribution of the 5,050 amplicons common to both methods. The cumulative distribution function of flash-specific amplicons is greater than the cumulative distribution function of common amplicons, with a p-value < 2.2e-16 (conditions of the Kolmogorov-Smirnov test are violated, but I am not sure what other test would work here). My interpretation is that the 1,281 flash-specific amplicons are not a random subset of the general amplicon pool, but a set skewed toward (relatively) lower similarity percentages with reference sequences, indicating a higher false-positive rate or some erroneous merging.

See code below for future reference:

cd ./issue_526_20230704/

mkdir -p data results

## versions
flash --version    # FLASH v1.2.11
vsearch --version  # vsearch v2.22.1

## raw data
(cd ./data/
 wget \
     ftp.sra.ebi.ac.uk/vol1/fastq/DRR126/DRR126155/DRR126155_1.fastq.gz \
     ftp.sra.ebi.ac.uk/vol1/fastq/DRR126/DRR126155/DRR126155_2.fastq.gz
)

## run flash
(cd ./data/
 flash \
     --output-directory=../results/ \
     DRR126155_1.fastq.gz \
     DRR126155_2.fastq.gz 2>&1 | \
     tee ../results/flash.log
)

(cd ./results/
 vsearch \
     --fastq_filter "out.extendedFrags.fastq" \
     --fastq_ascii 33 \
     --fasta_width 0 \
     --fastaout "DRR126155_flash.fasta"

 rm out.*
)

## run vsearch
(cd ./data/
 vsearch \
     --fastq_mergepairs DRR126155_1.fastq.gz \
     --reverse DRR126155_2.fastq.gz \
     --fastq_allowmergestagger \
     --quiet \
     --fasta_width 0 \
     --fastaout ../results/DRR126155_vsearch.fasta 2>&1 | \
     tee ../results/vsearch.log
)

## dereplication (intra)
(cd ./results/
 FASTA_V="DRR126155_vsearch.fasta"
 FASTA_F="DRR126155_flash.fasta"
 (( $(grep -c "^>" "${FASTA_V}") == $(grep -v "^>" "${FASTA_V}" | sort -u | wc -l) )) && \
     echo "no duplicates in vsearch results"
 (( $(grep -c "^>" "${FASTA_F}") == $(grep -v "^>" "${FASTA_F}" | sort -u | wc -l) )) && \
     echo "no duplicates in flash results"
)

## common (n = 5050) most entries are common!
(cd ./results/
 FASTA_V="DRR126155_vsearch.fasta"
 FASTA_F="DRR126155_flash.fasta"
 comm -12 \
      <(grep -v "^>" "${FASTA_F}" | sort) \
      <(grep -v "^>" "${FASTA_V}" | sort) | \
     grep \
         --no-group-separator \
         -F -f - \
         -B 1 "${FASTA_V}" > "DRR126155_common.fasta"
)

## entries specific to vsearch (n = 8)
(cd ./results/
 FASTA_V="DRR126155_vsearch.fasta"
 FASTA_F="DRR126155_flash.fasta"
 comm -13 \
      <(grep -v "^>" "${FASTA_F}" | sort) \
      <(grep -v "^>" "${FASTA_V}" | sort) | \
     grep \
         --no-group-separator \
         -F -f - \
         -B 1 "${FASTA_V}" > "DRR126155_vsearch_exclusive.fasta"
)

## entries specific to flash (n = 1281)
(cd ./results/
 FASTA_V="DRR126155_vsearch.fasta"
 FASTA_F="DRR126155_flash.fasta"
 comm -23 \
      <(grep -v "^>" "${FASTA_F}" | sort) \
      <(grep -v "^>" "${FASTA_V}" | sort) | \
     grep \
         --no-group-separator \
          -F -f - -B 1 "${FASTA_F}" > "DRR126155_flash_exclusive.fasta"
)

## taxonomic assignments with a 12S reference database from:
## https://github.com/zjgold/FishCARD
function taxonomic_assignment() {
    local -ir THREADS=4
    local -ir IDDEF=2
    local -r IDENTITY="0.5"
    local -ir MAXREJECTS=32
    local -r SUBJECTS="../data/GenBank_.fasta"
    local ASSIGNMENTS="${1/\.fasta/.hits}"

    vsearch \
        --usearch_global "${1}" \
        --threads "${THREADS}" \
        --dbmask none \
        --qmask none \
        --rowlen 0 \
        --notrunclabels \
        --userfields query+id${IDDEF}+target \
        --maxaccepts 0 \
        --maxrejects "${MAXREJECTS}" \
        --top_hits_only \
        --output_no_hits \
        --db "${SUBJECTS}" \
        --id "${IDENTITY}" \
        --iddef "${IDDEF}" \
        --userout "${ASSIGNMENTS}"
}

(cd ./results/
 taxonomic_assignment "DRR126155_common.fasta"
 taxonomic_assignment "DRR126155_vsearch_exclusive.fasta"
 taxonomic_assignment "DRR126155_flash_exclusive.fasta"
)
library(tidyverse)

setwd("issue_526_20230704/results/")

colnames <- c("amplicon", "similarity", "reference")
sources <- c("common", "flash_exclusive", "vsearch_exclusive")

sources %>%
    purrr::map_chr(~ str_c("DRR126155_", .x, ".hits")) %>%
    purrr::map_dfr(~ read_tsv(.x,
                              col_names = colnames,
                              show_col_types = FALSE),
                   .id = "source") %>%
    select(-reference) %>%
    distinct() -> d

d %>%  
    ggplot(aes(x = source, y = similarity)) +
    geom_violin(scale = "count") +
    coord_cartesian(ylim = c(95, 100))

ks.test(d %>% filter(source == 2) %>% pull(similarity),
        d %>% filter(source == 1) %>% pull(similarity),
        alternative = "two.sided",
        exact = NULL)

ks.test(d %>% filter(source == 2) %>% pull(similarity),
        d %>% filter(source == 1) %>% pull(similarity),
        alternative = "great",
        exact = NULL)

## D^+ = 0.35122, p-value < 2.2e-16
## alternative hypothesis: the CDF of x lies above that of y
billzt commented 1 year ago

Thank you

torognes commented 1 year ago

Thanks for explaining and testing, @frederic-mahe !

torognes commented 12 months ago

See also issues #524 and #282.