Finding best reference from NGS data

jodyphelan commented 1 year ago

I've got some dengue NGS data and I'm wondering if I can use sourmash to find the best matching reference.

Doing an assembly and blasting that works quite nicely which returns a genome with high similarity:

>MH823208.1 Dengue virus type 2 isolate JMB-010, complete genome
Length=10723

 Score = 18355 bits (20356),  Expect = 0.0
 Identities = 10385/10523 (99%), Gaps = 0/10523 (0%)
 Strand=Plus/Plus

Query  1      GTCTACGTGGACCGACAAAGACAGATTCTTTGAGGAAGCTAAGCTTAACGTAGTTCTAAC  60
              |||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||
Sbjct  9      GTCTACGTGGACCGACAAAGACAGATTCTTCGAGGAAGCTAAGCTTAACGTAGTTCTAAC  68
......

I've built a database from ~4800 reference sequnces using the following:

sourmash sketch dna dengue_refs.fa

Then tried to use sourmash search contig.fa.sig dengue_refs.fa.sig however this doesn't seem to bring up this hit.

84 matches above threshold 0.080; showing first 3:
similarity   match
----------   -----
 30.8%       MK411558.1 Dengue virus type 2 isolate ID/JMB-001M/2016, ...
 28.6%       MH823208.1 Dengue virus type 2 isolate JMB-010, complete ...
 23.1%       EU179857.1 Dengue virus type 2 isolate DS31-291005, compl...

Side node: I am getting the following warning (not sure if I should be concerned, or how I should fix this):

WARNING: size estimation for at least one of these sketches may be inaccurate. ANI values will not be reported for these comparisons.
WARNING: Jaccard estimation for at least one of these comparisons is likely inaccurate. Could not estimate ANI for these comparisons.

Obviously, I am comparing apples and oranges here in terms of the two different tools but I'm wondering if maybe I need to adjust some of the default parameters? Ideally, I would like to find the best match straight from raw reads, hence why I don't want to use blast. Let me know if you have any guidance.

ctb commented 1 year ago

hi @jodyphelan main thing - start by using sourmash gather contig.fa.sig dengue_refs.fa.sig. gather uses largest overlap to find things, while search by default uses Jaccard similarity; and gather also post-processes to remove redundancy. See Minimum metagenome covers for far too much information ;).

A few other tips -

don't worry about the warning for now
life will be at least slightly better if you use .zip files for the database! You can do that with sourmash sig cat dengue_refs.fa.sig -o dengue_refs.fa.zip and then just use the zip file for everything

give that a try and let us know how it goes!

Note that https://github.com/dib-lab/genome-grist will do the gather and the downstream mapping for you automatically, although for now I'd just make sure gather works for you!

jodyphelan commented 1 year ago

Thanks for your response, super helpful!

Using gather (sourmash gather contig.fa.sig dengue_refs.fa.zip --threshold-bp 1000) resulted in the right reference being returned:

overlap     p_query p_match
---------   ------- -------
4.0 kbp       40.0%   50.0%    MH823208.1
1.0 kbp       10.0%   12.5%    JX475906.1
1.0 kbp       10.0%   12.5%    KU509276.1
3.0 kbp       10.0%    9.1%    MG189962.1

I'm using --threshold-bp 1000 here as the genome is roughly 11kb.

Now sometimes the genome assembly approach doesn't work so I was wondering if it is possible to use the same apprach but with the raw reads?

I've used this for sketching (MH823208.1 is there but a lot lower down the list):

sourmash sketch dna --merge test reads1.fastq.gz .reads2.fastq.gz -o reads.sig

And searching in the same way produces a long list:

overlap     p_query p_match
---------   ------- -------
10.0 kbp       0.0%  100.0%    LC436672.1
9.0 kbp        0.0%   64.3%    FJ882523.1
7.0 kbp        0.0%   75.0%    MN018350.1
8.0 kbp        0.0%   41.7%    GQ398263.1
5.0 kbp        0.0%   50.0%    KJ830750.1
.... many more ....

Is this the right approach to take?

ctb commented 1 year ago

yep! using the raw reads is fine!

it looks like you have pretty small overlaps, so if you wanted to get higher resolution matches, you could sketch everything with -p scaled=100. Files will be 10x larger, matches will be 10x more sensitive.

jodyphelan commented 1 year ago

Thanks that worked nicely!

sourmash-bio / sourmash

Finding best reference from NGS data #2638