Open ctb opened 2 years ago
That is pretty cool. I tried it on a metagenome (very impressed with the speed, btw), but I have a couple of questions...
So, based on the output below, I assume that it identified 1030126 signatures? Then out of them only 65703 were retained... Why is that? It seems awfully few, as 65k out of a million signatures is only about 6%. Then when comparing to the k31 DB, 2581 gave matches, so I assume the rest is unclassified. I am wondering how representative that result is?
I made the signature file by combining the two fastq files (like you showed in the previous thread) and compared the .sig file to the k31.lca.json.gz DB:
loaded 65703 total signatures from 1 locations.
after selecting signatures compatible with search, 65703 remain.
Starting prefetch sweep across databases.
Found 2581 signatures via prefetch; now doing gather.
Hi @sapuizait a few quick notes -
sourmash sig rename <sigfile> <newname> -o <newsigfile>
--threshold-bp
) with the metagenome. This is completely dependent on the metagenome - most genomes won't be in any given metagenome.
HTH!
Uggghhhhh - oh boy, you are right, I'm such an idiot - I gave a random number for a name and I forgot about it... sorry about that. OK then, next question: I see how 50kb is a reasonable overlap, BUT if you were willing to sacrifice some accuracy, how low would you go? 20kb? Thanks!
no worries ;).
in re threshold, it's entirely up to you! See discussion here: https://github.com/sourmash-bio/sourmash/issues/2360#issuecomment-2191325045
Note that we now have much faster multithreaded gather available, too; see benchmarks.
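On the threshold question, it may help to see what a base-pair threshold means in terms of sketch hashes. The following is a minimal sketch, not sourmash's own code: it assumes a scaled value of 1000 (an assumption here; check the value used in your own signature), under which the estimated overlap in bp is the number of shared hashes multiplied by the scaled value. The helper names are hypothetical.

```python
def estimated_overlap_bp(query_hashes, match_hashes, scaled):
    """Estimate overlap in bp as (number of shared hashes) * scaled."""
    return len(set(query_hashes) & set(match_hashes)) * scaled


def min_shared_hashes(threshold_bp, scaled):
    """Smallest number of shared hashes needed to reach a bp threshold."""
    return -(-threshold_bp // scaled)  # integer ceiling division


SCALED = 1000  # assumption: the scaled value of the sketches being compared

for threshold_bp in (50_000, 20_000):
    needed = min_shared_hashes(threshold_bp, SCALED)
    print(f"{threshold_bp} bp -> at least {needed} shared hashes")
```

Under that assumption, going from 50kb to 20kb means accepting matches supported by about 20 shared hashes instead of 50, so each reported overlap rests on less evidence.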
Excellent - thanks!
This example uses the metagenome signature prepared in https://github.com/sourmash-bio/sourmash-examples/issues/12.
You'll also need to download the GTDB database as in https://github.com/sourmash-bio/sourmash-examples/issues/13.
Now, run sourmash gather. This should take about 5 minutes.
The output should look like this:
This is a minimum metagenome cover for the metagenome, based on the genomes in the GTDB database. In brief, it provides a shortest list of genomes that together contain all of the known content in the metagenome (in this case, about 4%).
Note: more of the metagenome might be matched if you used a larger database or a database that included eukaryotic and/or host sequence.
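To make the idea of a minimum metagenome cover more concrete, here is a toy sketch of a greedy set-cover loop over sets of hashes. It is an illustration under simplifying assumptions, not sourmash's implementation: the function name, the genome names, and the hash values are all made up, and real sketches hold many more hashes.

```python
def greedy_cover(query, genomes, threshold_hashes=1):
    """Greedy set cover: repeatedly take the genome that explains the
    most query hashes not yet covered by an earlier pick."""
    remaining = set(query)
    cover = []
    while remaining:
        best_name, best_overlap = None, set()
        for name in sorted(genomes):  # sorted so ties break deterministically
            overlap = remaining & genomes[name]
            if len(overlap) > len(best_overlap):
                best_name, best_overlap = name, overlap
        # Stop when nothing matches or the best match is below the threshold.
        if best_name is None or len(best_overlap) < threshold_hashes:
            break
        cover.append((best_name, len(best_overlap)))
        remaining -= best_overlap
    covered = len(query) - len(remaining)
    fraction = covered / len(query) if query else 0.0
    return cover, fraction


# Toy data: a "metagenome" of 100 hashes and three made-up genomes.
query = set(range(100))
genomes = {
    "genomeA": {0, 1, 2, 200},  # shares 3 hashes with the query
    "genomeB": {1, 2},          # fully redundant once genomeA is picked
    "genomeC": {50},            # shares 1 hash with the query
}

cover, fraction = greedy_cover(query, genomes)
print(cover)     # [('genomeA', 3), ('genomeC', 1)]
print(fraction)  # 0.04
```

In this toy case genomeB overlaps the query but never appears in the cover, because genomeA already accounts for its hashes. The covered fraction plays the role of the "about 4%" figure above.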