sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
467 stars 79 forks source link

gather output confusion #398

Closed alexmsalmeida closed 6 years ago

alexmsalmeida commented 6 years ago

Hi,

Have been playing around with sourmash and it seems to be quite a handy tool. However, I am having trouble making sense of the output data.

I ran sourmash gather with a metagenome assembled genome (MAG) of 5.8 Mb against the GenBank SBT files provided as a prepared database (https://s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k31-2017.05.09.tar.gz).

Then, in my report I get:

overlap     p_query p_match 
---------   ------- --------
5.7 Mbp     100.0%  100.0%      CTEK01000001.1 Paenibacillus sp. GM2 ...

found 1 matches total;
the recovered matches hit 50.1% of the query

There are two things that I did not quite understand: one is that although I get 100% p_query and p_match, it only says that I recovered matches for 50% of the query; additionally, although the overlap is 5.7 Mb (which is almost 100% of my MAG at 5.8 Mb) the reference match CTEK01000001.1 is actually one contig of 800 bp (https://www.ncbi.nlm.nih.gov/nuccore/CTEK01000001.1).

Would you be able to help me better understand what is going on?

Many thanks in advance, Alex

ctb commented 6 years ago

Hi Alexandre,

thanks for checking in!

On Sat, Feb 17, 2018 at 11:52:19AM -0800, Alexandre Almeida wrote:

Hi,

Have been playing around with sourmash and it seems to be quite a handy tool. However, I am having trouble making sense of the output data.

...that's because it doesn't make sense!

I ran sourmash gather with a metagenome assembled genome (MAG) of 5.8 Mb against the GenBank SBT files provided as a prepared database (https://s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k31-2017.05.09.tar.gz).

Then, in my report I get:

overlap     p_query p_match 
---------   ------- --------
5.7 Mbp     100.0%  100.0%      CTEK01000001.1 Paenibacillus sp. GM2 ...

found 1 matches total;
the recovered matches hit 50.1% of the query

There are two things that I did not quite understand: one is that although I get 100% p_query and p_match, it only says that I recovered matches for 50% of the query; additionally, although the overlap is 5.7 Mb (which is almost 100% of my MAG at 5.8 Mb) the reference match CTEK01000001.1 is actually one contig of 800 bp (https://www.ncbi.nlm.nih.gov/nuccore/CTEK01000001.1).

The second problem is easier to explain and is not a bug but rather a shortcoming of the way we built the database - we named the match after the first contig in the assembly. So what you are really matching to here is the assembly

https://www.ncbi.nlm.nih.gov/assembly/GCF_900069005.1/

which you can reach by clicking on the assembly link on the right.

The first question of why "50.1% of the query is an interesting one. It looks like a straight-up bug in sourmash. I think it might be one that we fixed in the last 6 months or so; could you tell me what

sourmash info

reports as far as versions, and/or communicate the .sig file for your MAG to us? Thanks!!

--titus -- C. Titus Brown, ctbrown@ucdavis.edu

alexmsalmeida commented 6 years ago

Hi Titus, thanks for the quick reply!

I am using sourmash version 2.0.0a2. I actually realized that in the example I provided I was testing as query one of my reference genomes (GCF_900069005) not my MAGs, so it makes sense that it would match that same one at 100%. I am attaching the signature file if you want to test it out.

For clarification, I created this file with:

sourmash compute --scaled 1000 -k 31 -o GCF_900069005.sig GCF_900069005.fa

GCF_900069005.sig.gz

Thanks again

ctb commented 6 years ago

Tried it out with latest (which is now 2.0.0.a3) --

sourmash gather GCF_900069005.sig ../sourmash/genbank/genbank-k31.sbt.json
select query k=31 automatically.
loaded query: GCF_900069005.fa... (k=31, DNA)
loaded 1 databases.

overlap     p_query p_match
---------   ------- --------
5.7 Mbp      50.1%  100.0%      CTEK01000001.1 Paenibacillus sp. GM2 ...

found 1 matches total;
the recovered matches hit 100.0% of the query

this is equally perplexing because clearly the recovered matches don't hit 100% of the query, they hit 49.9% of it :). Sigh. More in a bit.

ctb commented 6 years ago

OK, pretty sure I figured out the problem. It looks like the p_query is wrong. Will fix. Apologies!

alexmsalmeida commented 6 years ago

No worries. That's great! Glad you figured it out.

Looking forward to the fix.

ctb commented 6 years ago

Turns out p_query is wrong if the scaled value of the query signature is lower than the scaled value of the SBT. You can get the "right" numbers by running gather with --scaled 2000 for this case.

ctb commented 6 years ago

(This is a bug, yes.)

alexmsalmeida commented 6 years ago

Cool. I'll make use of that workaround until a more permanent fix. Thanks!

ctb commented 6 years ago

Fixed in #399.