Closed jodyphelan closed 1 year ago
hi @jodyphelan main thing - start by using sourmash gather contig.fa.sig dengue_refs.fa.sig. gather uses largest overlap to find things, while search by default uses Jaccard similarity; and gather also post-processes to remove redundancy. See Minimum metagenome covers for far too much information ;).
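To make the search/gather difference concrete, here is a toy sketch in plain Python. It is not sourmash's implementation: the hash sets, the reference names, and the helper functions `jaccard` and `greedy_gather` are all made up for illustration, and real sketches are FracMinHash signatures rather than small integer sets. It only shows why ranking by Jaccard can favor a small reference, while picking the largest overlap and then subtracting what was already explained drops redundant hits.

```python
def jaccard(a, b):
    """Jaccard similarity: shared hashes divided by all hashes in either set."""
    return len(a & b) / len(a | b)


def greedy_gather(query, refs, min_overlap=1):
    """Repeatedly pick the reference with the largest overlap with what is
    still unexplained in the query, then remove those hashes from the query."""
    remaining = set(query)
    picks = []
    while remaining:
        name, overlap = max(
            ((n, len(remaining & hashes)) for n, hashes in refs.items()),
            key=lambda pair: pair[1],
        )
        if overlap < min_overlap:
            break
        picks.append((name, overlap))
        remaining -= refs[name]
    return picks


# Hypothetical hash sets standing in for sketches.
query = {1, 2, 3, 4, 50}
refs = {
    "ref_full": set(range(1, 13)),            # large genome containing most of the query
    "ref_small": {1, 2, 20},                  # tiny reference, small union with the query
    "ref_dup": {1, 2, 3} | set(range(30, 39)),  # redundant with ref_full
    "ref_other": {50, 51},                    # explains the leftover hash
}

best_by_jaccard = max(refs, key=lambda n: jaccard(query, refs[n]))
print("top hit by Jaccard:", best_by_jaccard)
print("greedy largest-overlap picks:", greedy_gather(query, refs))
```

In this toy data the Jaccard ranking puts the tiny reference first because its union with the query is small, while the greedy pass picks the reference with the largest overlap and never reports the redundant one.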
A few other tips -
sourmash sig cat dengue_refs.fa.sig -o dengue_refs.fa.zip
and then just use the zip file for everything
give that a try and let us know how it goes!
Note that https://github.com/dib-lab/genome-grist will do the gather and the downstream mapping for you automatically, although for now I'd just make sure gather works for you!
Thanks for your response, super helpful!
Using gather (sourmash gather contig.fa.sig dengue_refs.fa.zip --threshold-bp 1000) resulted in the right reference being returned:
overlap p_query p_match
--------- ------- -------
4.0 kbp 40.0% 50.0% MH823208.1
1.0 kbp 10.0% 12.5% JX475906.1
1.0 kbp 10.0% 12.5% KU509276.1
3.0 kbp 10.0% 9.1% MG189962.1
I'm using --threshold-bp 1000 here as the genome is roughly 11kb.
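A rough back-of-the-envelope check of why the lower threshold matters for a genome this small. This is a sketch under assumptions, not sourmash's exact logic: it assumes the sketches were made with scaled=1000, that gather's default threshold is 50 kbp, and that a bp threshold translates to roughly threshold / scaled shared hashes. The helper names are hypothetical.

```python
import math


def min_shared_hashes(threshold_bp, scaled):
    """Approximate number of shared hashes needed to reach threshold_bp."""
    return max(1, math.ceil(threshold_bp / scaled))


def approx_sketch_size(genome_bp, scaled):
    """Roughly one hash is kept per `scaled` k-mers in the genome."""
    return genome_bp // scaled


GENOME_BP = 11_000  # "roughly 11kb" from the thread
SCALED = 1000       # assumption: sketches built with scaled=1000

for threshold_bp in (50_000, 1_000):  # assumed default vs. the value used above
    needed = min_shared_hashes(threshold_bp, SCALED)
    available = approx_sketch_size(GENOME_BP, SCALED)
    reachable = "reachable" if needed <= available else "not reachable"
    print(f"threshold {threshold_bp} bp: need ~{needed} shared hashes, "
          f"genome has ~{available} ({reachable})")
```

Under these assumptions a 50 kbp threshold could never be met by an 11 kb genome, whereas 1000 bp corresponds to about one shared hash.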
Now sometimes the genome assembly approach doesn't work, so I was wondering if it is possible to use the same approach but with the raw reads?
I've used this for sketching:
sourmash sketch dna --merge test reads1.fastq.gz reads2.fastq.gz -o reads.sig
And searching in the same way produces a long list (MH823208.1 is there but a lot lower down the list):
overlap p_query p_match
--------- ------- -------
10.0 kbp 0.0% 100.0% LC436672.1
9.0 kbp 0.0% 64.3% FJ882523.1
7.0 kbp 0.0% 75.0% MN018350.1
8.0 kbp 0.0% 41.7% GQ398263.1
5.0 kbp 0.0% 50.0% KJ830750.1
.... many more ....
Is this the right approach to take?
yep! using the raw reads is fine!
it looks like you have pretty small overlaps, so if you wanted to get higher resolution matches, you could sketch everything with -p scaled=100. Files will be 10x larger, matches will be 10x more sensitive.
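To illustrate the scaled trade-off, here is a minimal FracMinHash-style sketch in plain Python. It is a stand-in, not sourmash's code: it hashes with BLAKE2b instead of sourmash's hash function, skips reverse-complement canonicalization, and uses a random 11 kb sequence rather than a real dengue genome. The function name `frac_minhash` is made up. The idea it shows is that a k-mer is kept only when its hash falls below max_hash / scaled, so a smaller scaled value keeps about proportionally more hashes.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space


def frac_minhash(seq, scaled, k=31):
    """Keep the hashes of k-mers that fall in the lowest 1/scaled of hash space."""
    cutoff = MAX_HASH // scaled
    kept = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h < cutoff:
            kept.add(h)
    return kept


random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(11_000))  # toy 11 kb genome

s1000 = frac_minhash(genome, scaled=1000)
s100 = frac_minhash(genome, scaled=100)

print("hashes at scaled=1000:", len(s1000))
print("hashes at scaled=100: ", len(s100))
print("scaled=1000 sketch is a subset of scaled=100 sketch:", s1000 <= s100)
```

With only about a dozen hashes per genome at scaled=1000, overlaps can only be reported in steps of roughly 1 kbp, which matches the coarse overlap values in the tables above; at scaled=100 the steps shrink to roughly 100 bp at the cost of larger sketches.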
Thanks that worked nicely!
I've got some dengue NGS data and I'm wondering if I can use sourmash to find the best matching reference.
Doing an assembly and blasting that works quite nicely which returns a genome with high similarity:
I've built a database from ~4800 reference sequences using the following:
Then tried to use
sourmash search contig.fa.sig dengue_refs.fa.sig
however this doesn't seem to bring up this hit.
Side note: I am getting the following warning (not sure if I should be concerned, or how I should fix this):
Obviously, I am comparing apples and oranges here in terms of the two different tools but I'm wondering if maybe I need to adjust some of the default parameters? Ideally, I would like to find the best match straight from raw reads, hence why I don't want to use blast. Let me know if you have any guidance.