sourmash-bio / sourmash-examples

Example commands for doing sequence analysis with sourmash
https://sourmash-bio.github.io/sourmash-examples/
2 stars 0 forks source link

analyze a metagenome using 66,000 GTDB genomic representatives #14

Open ctb opened 2 years ago

ctb commented 2 years ago

This example uses the metagenome signature prepared in https://github.com/sourmash-bio/sourmash-examples/issues/12.

You'll also need to download the GTDB database as in https://github.com/sourmash-bio/sourmash-examples/issues/13.

Now, run sourmash gather:

sourmash gather SRR5950647.sig gtdb-rs207.genomic-reps.dna.k31.zip

This should take about 5 minutes.

The output should look like this:

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
383.0 kbp      2.0%    7.8%       1.6    GCF_003697165.2 Escherichia coli DSM ...
187.0 kbp      1.0%    5.0%       1.6    GCF_015074785.1 Prevotella copri stra...
142.0 kbp      0.7%    2.8%       1.4    GCF_000012825.1 Bacteroides vulgatus ...
164.0 kbp      0.3%    1.4%       1.7    GCF_019127135.1 Prevotella copri stra...
found less than 50.0 kbp in common. => exiting

found 4 matches total;
the recovered matches hit 4.0% of the abundance-weighted query

This a minimum metagenome cover for the metagenome, based on the genomes in the GTDB database: in brief, it provides a shortest list of genomes that contain all of the known content in the metagenome (in this case, about 4%).

Note: more of the metagenome might be matched if you used a larger database or a database that included eukaryotic and/or host sequence.

sapuizait commented 3 months ago

That is pretty cool, i tried it on a metagenome (very impressed with the speed btw) but i have a couple of questions...

So, based on the output below, I assume that it identified 1030126 signatures? Then out of them only 65703 were retained... Why is that? It seems awful few as 65k out of a million signatures is almost 6% ? Then when comparing to the k31 DB, 2581 gave matches, thus I assume the rest is unclassified I am wondering how representative that result is?

I made the signature file by combining the two fastq files (like you showed in the previous thread) and compared the .sig file to the k31.lca.json.gz DB

output

select query k=31 automatically. loaded query: 1030126... (k=31, DNA)

loaded 65703 total signatures from 1 locations. after selecting signatures compatible with search, 65703 remain.

Starting prefetch sweep across databases.

Found 2581 signatures via prefetch; now doing gather.

ctb commented 3 months ago

Hi @sapuizait a few quick notes -

HTH!

sapuizait commented 3 months ago

uggghhhhh - oh boy you are right I m such an idiot - I gave a random number for a name and I forgot about it.... sorry about that. OK then, next question: I see how the 50kb is a reasonable overlap BUT if you were willing to sacrifice some accuracy, how low would you go? 20kb? Thanks!

ctb commented 3 months ago

no worries ;).

in re threshold, it's entirely up to you! See discussion here: https://github.com/sourmash-bio/sourmash/issues/2360#issuecomment-2191325045

Note that we now have much faster multithreaded gather available, too; see benchmarks.

sapuizait commented 3 months ago

Excellent - thanks!