Closed alexmsalmeida closed 6 years ago
Hi Alexandre,
thanks for checking in!
On Sat, Feb 17, 2018 at 11:52:19AM -0800, Alexandre Almeida wrote:
Hi,
Have been playing around with sourmash and it seems to be quite a handy tool. However, I am having trouble making sense of the output data.
...that's because it doesn't make sense!
I ran sourmash gather with a metagenome assembled genome (MAG) of 5.8 Mb against the GenBank SBT files provided as a prepared database (https://s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k31-2017.05.09.tar.gz).
Then, in my report I get:
overlap p_query p_match --------- ------- -------- 5.7 Mbp 100.0% 100.0% CTEK01000001.1 Paenibacillus sp. GM2 ... found 1 matches total; the recovered matches hit 50.1% of the query
There are two things that I did not quite understand: one is that although I get 100% p_query and p_match, it only says that I recovered matches for 50% of the query; additionally, although the overlap is 5.7 Mb (which is almost 100% of my MAG at 5.8 Mb) the reference match CTEK01000001.1 is actually one contig of 800 bp (https://www.ncbi.nlm.nih.gov/nuccore/CTEK01000001.1).
The second problem is easier to explain and is not a bug but rather a shortcoming of the way we built the database - we named the match after the first contig in the assembly. So what you are really matching to here is the assembly
https://www.ncbi.nlm.nih.gov/assembly/GCF_900069005.1/
which you can reach by clicking on the assembly link on the right.
The first question of why "50.1% of the query is an interesting one. It looks like a straight-up bug in sourmash. I think it might be one that we fixed in the last 6 months or so; could you tell me what
sourmash info
reports as far as versions, and/or communicate the .sig file for your MAG to us? Thanks!!
--titus -- C. Titus Brown, ctbrown@ucdavis.edu
Hi Titus, thanks for the quick reply!
I am using sourmash version 2.0.0a2. I actually realized that in the example I provided I was testing as query one of my reference genomes (GCF_900069005) not my MAGs, so it makes sense that it would match that same one at 100%. I am attaching the signature file if you want to test it out.
For clarification, I created this file with:
sourmash compute --scaled 1000 -k 31 -o GCF_900069005.sig GCF_900069005.fa
Thanks again
Tried it out with latest (which is now 2.0.0.a3) --
sourmash gather GCF_900069005.sig ../sourmash/genbank/genbank-k31.sbt.json
select query k=31 automatically.
loaded query: GCF_900069005.fa... (k=31, DNA)
loaded 1 databases.
overlap p_query p_match
--------- ------- --------
5.7 Mbp 50.1% 100.0% CTEK01000001.1 Paenibacillus sp. GM2 ...
found 1 matches total;
the recovered matches hit 100.0% of the query
this is equally perplexing because clearly the recovered matches don't hit 100% of the query, they hit 49.9% of it :). Sigh. More in a bit.
OK, pretty sure I figured out the problem. It looks like the p_query is wrong. Will fix. Apologies!
No worries. That's great! Glad you figured it out.
Looking forward to the fix.
Turns out p_query
is wrong if the scaled value of the query signature is lower than the scaled value of the SBT. You can get the "right" numbers by running gather
with --scaled 2000
for this case.
(This is a bug, yes.)
Cool. I'll make use of that workaround until a more permanent fix. Thanks!
Fixed in #399.
Hi,
Have been playing around with sourmash and it seems to be quite a handy tool. However, I am having trouble making sense of the output data.
I ran sourmash gather with a metagenome assembled genome (MAG) of 5.8 Mb against the GenBank SBT files provided as a prepared database (https://s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k31-2017.05.09.tar.gz).
Then, in my report I get:
There are two things that I did not quite understand: one is that although I get 100% p_query and p_match, it only says that I recovered matches for 50% of the query; additionally, although the overlap is 5.7 Mb (which is almost 100% of my MAG at 5.8 Mb) the reference match CTEK01000001.1 is actually one contig of 800 bp (https://www.ncbi.nlm.nih.gov/nuccore/CTEK01000001.1).
Would you be able to help me better understand what is going on?
Many thanks in advance, Alex