sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Long read taxonomy classifier: recommended command line? #2758

Open luizirber opened 1 year ago

luizirber commented 1 year ago

from a private message:

We are currently benchmarking sourmash as a long-read taxonomy classifier, meaning mapping reads against genomes to assign taxonomy labels. But we are struggling a bit with how best to do this. Do you have a recommended command line for this task? Sketching each read separately is a bit slow.

Posting here and opening the floor for suggestions =]

A clarifying question: is the goal to assign taxonomy to each read? Since reads are expected to be long, scaled=1000 might work, but we will probably have to go down to scaled=100, which means building new databases (the prebuilt ones in https://sourmash.readthedocs.io/en/latest/databases.html are usually scaled=1000).

We usually do taxonomic profiling, which means building one signature for the whole dataset and then running gather. But that doesn't do read-level classification...
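For concreteness, a minimal sketch of that profiling workflow; the file names are placeholders and the database name stands in for any of the prebuilt ones linked above:

# one signature for the whole dataset, not one per read
sourmash sketch dna metagenome.fq -p k=31,scaled=1000 -o metagenome.sig.zip

# profiling: find the minimum set of reference genomes covering the dataset's hashes
sourmash gather metagenome.sig.zip gtdb-rs214-k31.zip -o metagenome.gather.csv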

cc @ctb @bluegenes

ctb commented 1 year ago

soooooo... I have Thoughts. But while I think they are not wrong, they are not fully fleshed out.

When working with massive collections of genomes that have substantial redundancy (GTDB rs214, for example), starting with read-level classification will result in less precise results. This is because many reads will map equally to multiple genomes - even fairly long reads. Evolution, man! (Even discounting issues of contamination and lateral transfer of genomic regions.)

The trick that sourmash uses is to first find the best set of reference genomes, based on unique combinations of distinctive hashes. This seems to work well when strain- or genome-specific regions are available in the reference database, as they typically are in Bacteria and Archaea.

Then you can do things like map reads to those genomes. genome-grist implements this workflow for Illumina read metagenomes.
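A rough sketch of that two-step idea as bare commands (genome-grist automates the genome download and mapping; minimap2 here is just one example long-read mapper, and the file names are placeholders):

# step 1: pick the best set of reference genomes for the whole dataset
sourmash gather metagenome.sig.zip gtdb-rs214-k31.zip -o matches.csv

# step 2: download the matched genomes, then map the reads against them
minimap2 -ax map-ont matched_genome.fa reads.fq > reads_vs_match.sam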

This is why (I strongly believe, with solid but not yet published receipts ;) sourmash performs so well in Portik et al., 2022. Corollary: single-read classification on its own is DoOmED.

However, it's also clear to me that in this case sourmash is simply doing a highly specific and very sensitive database lookup, with no real hope of generalizing to unknown genomes. Maybe that's ok, or maybe not - depends on use case ;).

jaebeom-kim commented 1 year ago

Hi! I'm trying to get some read-by-read classification results with the following strategy.

## BUILD DB
# sketch every reference genome listed in files.csv
sourmash sketch fromfile files.csv -p dna,scaled=1000 \
        -o /mnt/scratch/jaebeom/gtdb_202_inclusion/databases/sourmash/inclusion_db_sigs_31_1000.zip

# index the sketches into an LCA database, with lineages from taxonomy.csv
sourmash lca index --scaled 1000 \
        taxonomy.csv \
        /mnt/scratch/jaebeom/gtdb_202_inclusion/databases/sourmash/inclusion_31_1000.lca.json \
        /mnt/scratch/jaebeom/gtdb_202_inclusion/databases/sourmash/inclusion_db_sigs_31_1000.zip

## SKETCH QUERY
# --singleton builds one signature per read instead of one for the whole file
sourmash sketch dna /fast/jaebeom/long-read/inclusion/prokaryote_inclusion_ont.fq \
        -p k=31,scaled=1000 \
        --singleton \
        -o /fast/jaebeom/long-read/results/sourmash/inclusion/inclusion_ont_reads_singleton.sig.zip

## CLASSIFY
# classify each per-read signature against the LCA database built above
sourmash lca classify \
        --scaled 1000 \
        --query /fast/jaebeom/long-read/results/sourmash/inclusion/inclusion_ont_reads_singleton.sig.zip \
        --db /mnt/scratch/jaebeom/gtdb_202_inclusion/databases/sourmash/inclusion_31_1000.lca.json \
        -o /fast/jaebeom/long-read/results/sourmash/inclusion/ont_lca_classify_31-1000.csv

With this strategy, only 3,271 out of 3,433,969 simulated ONT reads (generated with PBSIM3) were classified. The genomes used for the read simulation were also used for indexing the reference DB, so far more reads should have been classifiable.

To get better results, I'm currently testing with scaled=100 as @luizirber suggested. I'll post the results here.
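For reference, the scaled=100 variant is the same pipeline with the scaled parameter swapped out (paths shortened here to placeholders):

sourmash sketch fromfile files.csv -p dna,scaled=100 -o inclusion_db_sigs_31_100.zip
sourmash lca index --scaled 100 taxonomy.csv inclusion_31_100.lca.json inclusion_db_sigs_31_100.zip
sourmash sketch dna prokaryote_inclusion_ont.fq -p k=31,scaled=100 --singleton \
        -o inclusion_ont_reads_singleton_100.sig.zip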

If you have other opinions or advice, please share them! :)

taylorreiter commented 1 year ago

> When working with massive collections of genomes that have substantial redundancy (GTDB rs214, for example), starting with read-level classification will result in less precise results. This is because many reads will map equally to multiple genomes - even fairly long reads. Evolution, man! (Even discounting issues of contamination and lateral transfer of genomic regions.)
>
> The trick that sourmash uses is to first find the best set of reference genomes, based on unique combinations of distinctive hashes. This seems to work well when strain- or genome-specific regions are available in the reference database, as they typically are in Bacteria and Archaea.

Oooh, this is a great point and I totally agree. I recently decontaminated an Iso-Seq transcriptome by:

  1. running sourmash gather on the whole thing
  2. identifying contaminant matches
  3. downloading those genomes from genbank
  4. BLASTn-ing each transcript against the downloaded genomes
  5. filtering BLAST results for strong matches (large fraction of contig, high percent identity) and removing them as contaminants.

a la genome-grist, but with BLAST instead of read mapping, because the fragments are much longer and there are far fewer of them (rough sketch below).
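Rough command-level sketch of that loop, with placeholder file names and accession (datasets is NCBI's download CLI; the filtering thresholds in step 5 are judgment calls):

# 1 + 2: profile the transcriptome and flag contaminant matches
sourmash sketch dna transcripts.fa -p k=31,scaled=1000 -o transcripts.sig.zip
sourmash gather transcripts.sig.zip gtdb-rs214-k31.zip -o transcripts.gather.csv

# 3: download a suspected contaminant genome from GenBank (placeholder accession)
datasets download genome accession GCA_000000000.1

# 4: BLAST every transcript against the contaminant genome
makeblastdb -in contaminant.fa -dbtype nucl
blastn -query transcripts.fa -db contaminant.fa -outfmt 6 -out hits.tsv

# 5: keep hits covering a large fraction of the transcript at high identity,
#    and remove those transcripts as contaminants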

ctb commented 1 year ago

thanks taylor!

a different phrasing: I'd like to classify reads to their most likely genome, not taxon, and that has different challenges!

ctb commented 1 year ago

> To get better results, I'm currently testing with scaled=100 as @luizirber suggested. I'll post the results here.

Yes, this sounds good to me! I do think sourmash gather is going to work no worse than, and potentially much better than, sourmash lca. Plus it should be a lot faster (especially given the optimizations to gather that we're releasing over in https://github.com/sourmash-bio/pyo3_branchwater).
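A hedged sketch of the gather-based route over the per-read signatures from above (sourmash multigather runs a separate gather for each query signature and writes one result CSV per query):

# per-read gather against the reference sketches
sourmash multigather --query inclusion_ont_reads_singleton.sig.zip \
        --db inclusion_db_sigs_31_1000.zip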

ctb commented 11 months ago

recipe for multigather and tax annotation here: https://github.com/sourmash-bio/sourmash/issues/2816#issuecomment-1770755051
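For reference, the tax annotation half of that recipe looks roughly like this, assuming a gather CSV and a lineage spreadsheet like the taxonomy.csv above:

# attach lineages to each gather match; writes <gather_csv>.with-lineages.csv
sourmash tax annotate -g reads.gather.csv -t taxonomy.csv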