sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Improving results for nanopore #2236

Open jsgounot opened 1 year ago

jsgounot commented 1 year ago

Hi. I am exploring the possibility of using sourmash to identify the species of origin of isolates from nanopore data. Each sample is supposed to contain only one species. I know that ONT reads are not ideal for a k-mer approach but, as reported in this tracker, at least one paper has used them. I tried gather with and without trimming (even though trimming is not really appropriate here; I used trim-low-abund.py -C 3 -Z 18 -V -M 2e9). While the best hit seems concordant with what is expected, f_orig_query is very low for both raw (mean=2%) and trimmed (mean=5%) data. Have you explored other sourmash or khmer parameters to improve results with nanopore reads?
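For anyone wanting to reproduce the summary above, here is a minimal sketch of how the best-hit f_orig_query could be pulled out of gather CSV output and averaged across samples. It assumes each CSV has `name` and `f_orig_query` columns; the helper names and the example rows are made up for illustration and are not real results.

```python
import csv
import io


def best_hit_f_orig_query(gather_csv_text):
    """Return (name, f_orig_query) for the row with the largest f_orig_query.

    Returns None when the CSV has no data rows (no gather matches).
    """
    rows = list(csv.DictReader(io.StringIO(gather_csv_text)))
    if not rows:
        return None
    best = max(rows, key=lambda row: float(row["f_orig_query"]))
    return best["name"], float(best["f_orig_query"])


def mean_best_hit(per_sample_csv_texts):
    """Average the best-hit f_orig_query over samples that have a hit."""
    values = []
    for text in per_sample_csv_texts:
        hit = best_hit_f_orig_query(text)
        if hit is not None:
            values.append(hit[1])
    return sum(values) / len(values) if values else float("nan")


# Made-up rows standing in for two samples' gather output (hypothetical).
SAMPLE_1 = """name,f_orig_query,f_match
genome_A,0.021,0.35
genome_B,0.004,0.02
"""
SAMPLE_2 = """name,f_orig_query,f_match
genome_A,0.039,0.41
"""

if __name__ == "__main__":
    print(best_hit_f_orig_query(SAMPLE_1))
    print(mean_best_hit([SAMPLE_1, SAMPLE_2]))
```

In practice the text would come from the per-sample CSV files rather than inline strings.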

ctb commented 1 year ago

this came across lab slack today -

https://labs.epi2me.io/progressive-kraken2/

Luiz said:

Granularity is different (reads, not contigs/genomes), but would be fun to try
with sourmash (maybe with a s=100 db it would work with reads too?)
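A rough back-of-the-envelope sketch of why a smaller scaled value might matter at read granularity: a scaled sketch keeps about 1/scaled of the distinct k-mers, so a single read contributes very few hashes at scaled=1000. The read length and k-mer size below are assumptions chosen for illustration, and the estimate ignores repeated k-mers and sequencing errors.

```python
def expected_hashes(read_length, ksize=31, scaled=1000):
    """Approximate number of hashes a single read contributes to a scaled sketch.

    Assumes every k-mer in the read is distinct and that roughly 1/scaled
    of them are retained.
    """
    n_kmers = max(read_length - ksize + 1, 0)
    return n_kmers / scaled


if __name__ == "__main__":
    # Hypothetical 10 kb read, compared at scaled=1000 and scaled=100.
    for scaled in (1000, 100):
        print(scaled, expected_hashes(10_000, ksize=31, scaled=scaled))
```

Under these assumptions a 10 kb read yields on the order of ten hashes at scaled=1000 and about a hundred at scaled=100, before any errors are taken into account.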

ctb commented 1 year ago

hi @jsgounot this paper systematically confirms that ONT messes up sourmash -

Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets: https://www.biorxiv.org/content/10.1101/2022.01.31.478527v2

See Fig 3 in particular.

[Screenshot of Fig 3, taken 2022-11-16]

It seems pretty clear that the error profile for nanopore is terrible for sourmash :(.
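To put rough numbers on why read errors hurt exact k-mer matching, here is a minimal sketch under a simplifying assumption: errors hit each base independently at a fixed rate, so a k-mer survives only if all k of its bases are correct. The error rates below are illustrative values, not measured ONT rates, and real errors are not independent.

```python
def error_free_kmer_fraction(error_rate, ksize):
    """Probability that a k-mer contains no errors, assuming independent
    per-base errors at the given rate."""
    return (1.0 - error_rate) ** ksize


if __name__ == "__main__":
    # Illustrative per-base error rates (assumptions, not measurements).
    for error_rate in (0.001, 0.01, 0.05, 0.10):
        for ksize in (21, 31, 51):
            fraction = error_free_kmer_fraction(error_rate, ksize)
            print(f"error={error_rate:.3f} k={ksize} error-free={fraction:.3f}")
```

Under this simple model, a 5% per-base error rate leaves only about a fifth of 31-mers intact, which is at least consistent in direction with the low f_orig_query values reported above.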

@dportik, @bluegenes and I are thinking of doing a bit more exploring, but we have no simple solution to offer. thoughts welcome!

ctb commented 1 year ago

(see https://github.com/sourmash-bio/sourmash/issues/2360 for some discussion of thresholding that is not entirely irrelevant ;)

jsgounot commented 1 year ago

hi @ctb, thanks for sharing this. It looks like MEGAN-LR is the best approach for this kind of data at the moment. Do you share the same conclusion?

ctb commented 1 year ago

that's my reading as well but @bluegenes @dportik should weigh in!

dportik commented 1 year ago

Hi @jsgounot - as @ctb mentioned the error profile of ONT appears to negatively affect sourmash's performance (at least for now).

There are two good options for ONT. We found that BugSeq had the best performance; it is highly tuned to ONT. However, it is a cloud-based analysis and you have to sign up for it. If you are looking for a DIY option, I would recommend the DIAMOND & MEGAN-LR approach. That pipeline is available as a snakemake workflow at https://github.com/PacificBiosciences/pb-metagenomics-tools. If you choose to build an independent pipeline for this, be aware that there are some landmines involved in getting the DIAMOND outputs into MEGAN.