luizirber opened 1 year ago
soooooo... I have Thoughts. But while I think they are not wrong, they are not fully fleshed out.
When working with massive collections of genomes that have substantial redundancy (GTDB rs214, for example), starting with read-level classification will give less precise results. This is because many reads will map equally well to multiple genomes - even fairly long reads. Evolution, man! (Even discounting issues of contamination and lateral transfer of genomic regions.)
The trick that sourmash uses is to first find the best set of reference genomes, based on unique combinations of distinctive hashes. This seems to work well when strain- or genome-specific regions are available in the reference database, as they typically are in Bacteria and Archaea.
Then you can do things like map reads to those genomes. genome-grist implements this workflow for Illumina read metagenomes.
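The "best set of reference genomes" step is essentially a greedy minimum-set-cover over sketch hashes: repeatedly pick the reference whose sketch explains the most not-yet-covered query hashes. A toy sketch of that idea (illustrative only - this is not sourmash's actual implementation, and the genome names and hash values below are made up):

```python
def greedy_gather(query_hashes, ref_sketches):
    """Greedy min-set-cover: return (ref_name, newly_covered) in pick order."""
    remaining = set(query_hashes)
    picks = []
    while remaining:
        # best reference by number of still-uncovered hashes it explains
        name, covered = max(
            ((n, s & remaining) for n, s in ref_sketches.items()),
            key=lambda item: len(item[1]),
        )
        if not covered:
            break  # no reference explains any remaining hashes
        picks.append((name, len(covered)))
        remaining -= covered
    return picks

# Hypothetical example: genomes A and B overlap, but each has
# distinctive hashes; the query is a mix of both.
refs = {
    "genomeA": {1, 2, 3, 4, 5},
    "genomeB": {4, 5, 6, 7},
    "genomeC": {8, 9},
}
query = {1, 2, 3, 4, 5, 6, 7}
print(greedy_gather(query, refs))  # → [('genomeA', 5), ('genomeB', 2)]
```

Note how genomeB is credited only with the hashes genomeA didn't already cover - that's what makes the assignment of redundant references precise, where per-read lookups are ambiguous.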
This is why (I strongly believe, with solid but not-yet-published receipts ;) sourmash performs so well in Portik et al., 2022. Corollary: single-read classification on its own is DoOmED.
However, it's also clear to me that in this case sourmash is simply doing a highly specific and very sensitive database lookup, with no real hope of generalizing to unknown genomes. Maybe that's ok, or maybe not - depends on use case ;).
Hi! I'm trying to get some read-by-read classification results with the following strategy.
## BUILD DB

```shell
sourmash sketch fromfile files.csv -p dna,scaled=1000 \
    -o /mnt/scratch/jaebeom/gtdb_202_inclusion/databases/sourmash/inclusion_db_sigs_31_1000.zip

sourmash lca index --scaled 1000 \
    -f taxonomy.csv \
    inclusion_31_1000.lca.json \
    /mnt/scratch/jaebeom/gtdb_202_inclusion/databases/sourmash/inclusion_db_sigs_31_1000.zip
```

## SKETCH QUERY

```shell
sourmash sketch dna /fast/jaebeom/long-read/inclusion/prokaryote_inclusion_ont.fq \
    -p k=31,scaled=1000 \
    --singleton \
    -o /fast/jaebeom/long-read/results/sourmash/inclusion/inclusion_ont_reads_singleton.sig.zip
```

## CLASSIFY

```shell
sourmash lca classify \
    --scaled 1000 \
    --query /fast/jaebeom/long-read/results/sourmash/inclusion/inclusion_ont_reads_singleton.sig.zip \
    --db /mnt/scratch/jaebeom/gtdb_202_inclusion/databases/sourmash/inclusion.lca.js.lca.json \
    -o /fast/jaebeom/long-read/results/sourmash/inclusion/ont_lca_classify_31-1000.csv
```
With this strategy, only 3,271 out of 3,433,969 simulated ONT reads (generated with PBSIM3) could be classified. The genomes used for the read simulation were also used to build the reference DB, so many more reads would be expected to be classified.
To get a better result, I'm currently testing with `scaled=100` as @luizirber suggested. I'll post the result here.
If you have other opinions or advice, please share them! :)
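One way to see why `scaled=1000` classifies so few reads: a FracMinHash sketch keeps roughly 1/scaled of all k-mers, so a read of length L contributes about (L - k + 1)/scaled hashes on average. A back-of-the-envelope check (illustrative arithmetic only, not measured numbers):

```python
# Expected retained hashes per read under FracMinHash: a sketch keeps
# roughly 1/scaled of all k-mers, so a read of length L contributes
# about (L - k + 1) / scaled hashes on average.

def expected_hashes(read_len, k=31, scaled=1000):
    return (read_len - k + 1) / scaled

for read_len in (1_000, 5_000, 20_000):
    print(read_len,
          round(expected_hashes(read_len, scaled=1000), 2),
          round(expected_hashes(read_len, scaled=100), 2))
# read_len  scaled=1000  scaled=100
#     1000         0.97         9.7
#     5000         4.97        49.7
#    20000        19.97       199.7
```

At `scaled=1000`, a 1 kb read averages under one retained hash, so many reads produce empty sketches and can never be classified at all; dropping to `scaled=100` gives roughly 10x more hashes per read, at the cost of rebuilding the databases.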
> When working with massive collections of genomes that have substantial redundancy (GTDB rs214, for example), starting with read-level classification will give less precise results. This is because many reads will map equally well to multiple genomes - even fairly long reads. Evolution, man! (Even discounting issues of contamination and lateral transfer of genomic regions.)
>
> The trick that sourmash uses is to first find the best set of reference genomes, based on unique combinations of distinctive hashes. This seems to work well when strain- or genome-specific regions are available in the reference database, as they typically are in Bacteria and Archaea.
Oooh this is a great point and I totally agree. I recently decontaminated an isoseq transcriptome à la genome-grist, but with BLAST instead of read mapping, because the fragments are much longer and there are far fewer of them.
thanks taylor!
a different phrasing: I'd like to classify reads to their most likely genome, not taxon, and that has different challenges!
> To get a better result, I'm currently testing with `scaled=100` as @luizirber suggested. I'll post the result here.
Yes, this sounds good to me! I do think sourmash `gather` is going to work no worse than, and potentially much better than, sourmash `lca`. Plus it should be a lot faster (esp. given the optimizations to `gather` that we're releasing over in https://github.com/sourmash-bio/pyo3_branchwater).
recipe for multigather and tax annotation here: https://github.com/sourmash-bio/sourmash/issues/2816#issuecomment-1770755051
from a private message:
Posting here and opening the floor for suggestions =]
> A clarification question: is the goal to assign taxonomy to each read? Since reads are expected to be long, `scaled=1000` might work, but probably will have to go to `scaled=100`, which means building new databases (ours in https://sourmash.readthedocs.io/en/latest/databases.html are usually `scaled=1000`). We usually do taxonomic profiling, which would be one sig for the dataset and then running `gather`. But that doesn't do read-level classification...

cc @ctb @bluegenes
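To make the profiling-vs-read-classification distinction concrete, here's a toy illustration of the FracMinHash idea (hash-threshold sketching): profiling pools every read into one sketch, while read-level classification sketches each read separately. This is not sourmash's actual code, and the "reads" below are randomly generated:

```python
import hashlib
import random

def frac_minhash(seq, k=31, scaled=1000):
    """Keep roughly 1/scaled of k-mer hashes (the FracMinHash idea)."""
    threshold = 2**64 // scaled
    kept = set()
    for i in range(len(seq) - k + 1):
        h = int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(),
            "big")
        if h < threshold:
            kept.add(h)
    return kept

random.seed(42)
reads = ["".join(random.choice("ACGT") for _ in range(5000)) for _ in range(3)]

pooled = frac_minhash("".join(reads))        # one sketch for the whole dataset
per_read = [frac_minhash(r) for r in reads]  # one sketch per read

# the pooled sketch covers everything the per-read sketches contain
assert pooled >= set().union(*per_read)
print(len(pooled), [len(s) for s in per_read])
```

In the commands earlier in this thread, `--singleton` is what produces the per-read sketches; the default one-sig-per-dataset case is what `gather`-based profiling works on, and a per-read sketch with zero retained hashes is unclassifiable no matter what the database contains.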