sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

update documentation re k-mer trimming and gather vs search #2122

Open ctb opened 2 years ago

ctb commented 2 years ago

a lot of our ancillary documentation and workflows (genome-grist and spacegraphcats) suggest doing k-mer trimming of metagenomes, e.g. with trim-low-abund.

but for gather specifically, this trimming is not particularly important. we should update the documentation with a few points -

@bluegenes this sort of fits into some of the things we've been realizing with respect to jaccard vs containment - containment is much more robust with respect to erroneous k-mers, in various simple but important ways.
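To illustrate the jaccard vs containment point, here is a minimal sketch using plain Python sets as stand-ins for k-mer hash sketches. It does not use the sourmash API; the function names, the integer "hashes", and the number of erroneous k-mers are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity: shared k-mers over the union of both sets."""
    return len(a & b) / len(a | b)


def containment(query, target):
    """Fraction of the query's k-mers that are also present in the target."""
    return len(query & target) / len(query)


# Toy data: integers stand in for k-mer hashes (hypothetical values).
genome = set(range(1000))             # k-mers from a reference genome
metagenome_clean = set(genome)        # error-free metagenome containing the genome
errors = set(range(10_000, 19_000))   # 9000 erroneous k-mers from sequencing errors
metagenome_noisy = metagenome_clean | errors

# Erroneous k-mers inflate the union, so Jaccard drops sharply...
print("jaccard, clean:", jaccard(genome, metagenome_clean))
print("jaccard, noisy:", jaccard(genome, metagenome_noisy))

# ...but they do not change how much of the genome is found in the metagenome.
print("containment, clean:", containment(genome, metagenome_clean))
print("containment, noisy:", containment(genome, metagenome_noisy))
```

Note that the robustness is directional: the containment of the genome in the metagenome is unaffected by the extra k-mers, while the containment of the metagenome in the genome would still drop.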

ref trimming paragraph in https://github.com/sourmash-bio/sourmash/issues/1135:

Note that this could well be due to sequencing errors: if you don't do k-mer based error trimming (as above), and you have two communities that are very similar and have been deeply sequenced, this is the result I would expect to see. The reason is that erroneous k-mers will always be low abundance, while your true k-mers in a deeply sequenced metagenome will be high abundance.

ctb commented 1 year ago

per conversation in https://github.com/sourmash-bio/sourmash/issues/2266, I think adapter trimming and quality filtering are also generally unnecessary. Although if QC programs like FastQC tell you that you have a major problem, that might be something worth doing (or, resequencing ;).

ctb commented 1 year ago

here are some results from sourmash gather running on four different metagenomes after the report-both-weighted-and-unweighted PR was merged, https://github.com/sourmash-bio/sourmash/pull/2301:

(duplicated from https://github.com/dib-lab/genome-grist/issues/197#issuecomment-1263687946)

here are the results!

| metagenome | unweighted match | weighted match |
| --- | --- | --- |
| SRR606249 (podar) | 47.8% | 95.9% |
| SRR1976948 (hu-s1) | 21.4% | 68.0% |
| SRR12324253 (zymo mock) | 16.2% | 96.6% |
| SRR5650070 (p8808mo11/iHMP) | 34.8% | 85.6% |
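One way to see how an unweighted match can be far lower than a weighted match is a toy calculation in which many distinct low-abundance k-mers go unmatched. This is a sketch under assumed numbers, not the sourmash implementation: the `match_fractions` helper, the abundances, and the k-mer counts are hypothetical, and unmatched k-mers in real data may also come from genomes missing from the database rather than from errors.

```python
def match_fractions(abundances, reference):
    """Return (unweighted, weighted) fractions of a metagenome matched by a reference.

    abundances: dict mapping k-mer hash -> abundance in the metagenome
    reference:  set of k-mer hashes present in the reference database
    """
    matched = [h for h in abundances if h in reference]
    # Unweighted: each distinct k-mer counts once.
    unweighted = len(matched) / len(abundances)
    # Weighted: each k-mer counts in proportion to its abundance.
    weighted = sum(abundances[h] for h in matched) / sum(abundances.values())
    return unweighted, weighted


# Hypothetical metagenome: 100 true k-mers seen 20 times each,
# plus 300 low-abundance k-mers (e.g. sequencing errors) seen once each.
true_kmers = {h: 20 for h in range(100)}
error_kmers = {h: 1 for h in range(1000, 1300)}
metagenome = {**true_kmers, **error_kmers}

reference = set(range(100))  # the database matches every true k-mer

unweighted, weighted = match_fractions(metagenome, reference)
print(f"unweighted match: {unweighted:.1%}")
print(f"weighted match:   {weighted:.1%}")
```

In this toy case the unmatched k-mers make up most of the distinct k-mers but only a small share of the total abundance, so the weighted fraction stays high even without any k-mer trimming.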

a few observations -

Note this is with genome-grist v0.9.0, so adapter and quality trimmed only.