sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Interpreting abundance #549

Closed mytluo closed 4 years ago

mytluo commented 6 years ago

Hi @ctb,

I ran --track-abundance with sourmash gather, and wasn't sure how to interpret the values. Should I be looking at average_abund?

On another note, for --output-unassigned, is it correct to interpret that file as the list of hashes that were unclassified for the "mins" array? If I took the length of that array and added it to the sum of average_abund (or the correct column), could I calculate the % abundance from that sum? Apologies if this is completely off the mark!

mytluo commented 6 years ago

Hi @luizirber, I was wondering if you could help me with this problem as well? I was pointed to f_unique_weighted as a possible equivalence to relative abundance -- the summed fraction seems to match the query coverage, so that seems right, but I just wanted to confirm before moving forward :)

additionally, may I also ask whether I could interpret the counts from lca summarize as abundance? it was suggested the counts might be the number of hashes; I tried looking for the total hashes in the signature file, but only got num: 0 after the list of mins hashes. apologies for the bother!

luizirber commented 6 years ago

so, num is a bit misleading, because it's a value we use internally in sourmash to switch between regular minhashes (like mash) and scaled minhashes. In regular minhashes len(mins) == num, but because you calculated your signatures with --scaled you will need to check len(mins) to get how many hashes were collected.
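To illustrate, here is a stdlib-only sketch of checking this in the signature JSON itself. The record layout mimics what sourmash writes (a list of records, each with a "signatures" list containing "num" and "mins"), but the values below are made up for illustration:

```python
import json

# A toy record shaped like a sourmash signature file; values are invented.
sig_file = json.loads("""
[{"signatures": [{"num": 0, "mins": [11, 42, 97, 123]}]}]
""")

for record in sig_file:
    for sig in record["signatures"]:
        # num == 0 here marks a scaled minhash, so the number of hashes
        # collected is len(mins), not num.
        n_hashes = len(sig["mins"])
        print("num:", sig["num"], "hashes collected:", n_hashes)
```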

mytluo commented 6 years ago

@luizirber, thank you! that makes sense... I got 107323 from len(mins), but when I sum up the entire count column from lca summarize, it only adds up to 67072... I would also think this is the wrong sum in general as some of the counts are redundant -- counts from E. coli would also be in Escherichia, up to the Bacteria superkingdom level. But that would lower the total even more, which is strange, given 93.7% of the query was hit. Would you be able to point me in the right direction? I apologize for all the questions, I know I've asked a lot ^^;;

mytluo commented 6 years ago

I realized I forgot the --scaled ^^;; last question (hopefully) -- in the LCA counts, would the count with no superkingdom assigned be considered unclassified reads?

ctb commented 5 years ago

While I try to find time to write more docs, please also see my STAMPS 2018 presentation at https://osf.io/8d56n/ -- thx!

Alok-Shekhar commented 5 years ago

@mytluo, @ctb I am not understanding how to interpret the sourmash gather output, especially the f_unique_weighted, average_abund, median_abund, and std_abund column values, or how to calculate a %abundance value from the run. Will you please explain?

  1. Also, how should I interpret the --scaled option? It is explained in the docs, but I am not getting it correctly. Will you please elaborate for better understanding? Thanks a lot!

ctb commented 5 years ago

hi @Alok-Shekhar thank you for your questions!

scaled

re scaled, have you seen the explanations in Using sourmash: a practical guide and modulo hash details?

re abundance, what specific question(s) do you want to answer?

here's an example (from the test code) that might help:

In the tests/test-data/ directory, I run:

% sourmash gather gather-abund/reads-s10x10-s11.sig gather-abund/genome-s{10,11,12}.fa.gz.sig
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=21 automatically.
loaded query: 1-1... (k=21, DNA)
loaded 3 signatures.

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
0.5 Mbp       91.0%  100.0%      14.5    tests/test-data/genome-s10.fa.gz
376.0 kbp      9.0%   80.0%       1.9    tests/test-data/genome-s11.fa.gz

found 2 matches total;
the recovered matches hit 100.0% of the query

here, the query reads were constructed to have 10 times as many reads from s10 as from s11. And that's what you see in p_query: 91% of the query is genome s10, 9% of the query is genome s11. In the average abundance column (avg_abund), you see approximately the same relative relationship: about 10x as much in s10 as in s11.

For this example, there is no difference between p_query and avg_abund, because the s10 and s11 genomes are the same size. However, were s10 twice as big as s11, then the p_query should change while avg_abund should not.
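The arithmetic behind that distinction can be sketched in plain Python (a toy model, not sourmash code: each genome contributes a set of query hashes, and each hash carries an abundance, i.e. its count in the read set; all numbers invented):

```python
# p_query: abundance-weighted fraction of the query assigned to a match.
def p_query(match_abunds, total_abund):
    return sum(match_abunds) / total_abund

# avg_abund: mean abundance over the hashes in the match.
def avg_abund(match_abunds):
    return sum(match_abunds) / len(match_abunds)

# Case 1: s10 and s11 the same size (100 hashes each), 10x vs 1x coverage.
s10 = [10] * 100
s11 = [1] * 100
total = sum(s10) + sum(s11)
print(p_query(s10, total), p_query(s11, total))  # ~0.91 vs ~0.09
print(avg_abund(s10), avg_abund(s11))            # 10.0 vs 1.0

# Case 2: s10 twice as big (200 hashes) at the same 10x coverage:
# p_query shifts further toward s10, but avg_abund does not change.
s10_big = [10] * 200
total2 = sum(s10_big) + sum(s11)
print(p_query(s10_big, total2), p_query(s11, total2))
print(avg_abund(s10_big), avg_abund(s11))        # still 10.0 vs 1.0
```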

Alok-Shekhar commented 5 years ago

Thanks @ctb for the explanation! Am I right that the f_unique_weighted column values can be considered the %abundance value? Please let me know. Thanks a lot!

Alok-Shekhar commented 5 years ago

@ctb I mean abundance info just like we get from a kraken run, from the "% of fragments covered" column

Thanks!

ctb commented 4 years ago

@luizirber is doing a bunch of work on this for his thesis, so we can finally lay this to rest with benchmarks and (lots of) details. soon!

cesarcardonau commented 4 years ago

Hi @ctb @luizirber, since it has been a few months, I was wondering if you guys have any more details/documentation for interpreting abundances as mentioned above.

I am struggling a bit myself, picking up these differences. For instance:

1. How does the abundance weighting for f_unique_weighted work? It doesn't seem to require that the signature was pre-computed with the --track-abundance flag, so how is it able to weight by abundance? What is different about this abundance?
2. Suppose I have two metagenomic signature files (with abundances), query1 and query2, and run sourmash gather on both against the same reference database, producing gather results gather1 and gather2 respectively. Which is the better metric to use if (i) we want to compare the relative abundance of strain X between the gather1 and gather2 results, (ii) we want to add the abundance of strain X across the gather1 and gather2 results, and (iii) we want to add the relative abundances of strains X and Y within the same gather result file?

Thanks!

ctb commented 4 years ago

hi @cesarcardonau et al., let me update you all a bit on some documentation and testing updates!

first, our classifying signatures documentation has been improved. In particular the abundance weighting section is better, and the gather algorithm discussion at the bottom of the page should be informative.

Second, we've updated lca summarize to respect abundance info and elaborated our documentation.

I don't think either of these answers the questions above fully, but they might provide context.

On to the meat of the question --

ctb commented 4 years ago

There's a lot of questions in this issue, so let me try to answer some of them first and then iterate as people ask for clarification!

The gather-abund tests added in #886 are key. See this diff, but I interpret the important bits below:

sourmash gather and abundances

tl;dr Below is a discussion of a synthetic set of test cases using three randomly generated (fake) genomes of the same size, two read data sets with even abundance (2x each), and a third read data set at 20x.

Data set prep

First, we make some synthetic data sets:

then we make signature s10-s11 with r1 and r3, i.e. 1:1 abundance, and make signature s10x10-s11 with r2 and r3, i.e. 10:1 abundance.

test_gather_abund_1_1 in test_sourmash.py

When we project r1+r3, 1:1 abundance, onto s10, s11, and s12 genomes with gather:

sourmash gather r1+r3 genome-s10.sig genome-s11.sig genome-s12.sig

we get:

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
394.0 kbp     49.6%   78.5%       1.8    genome-s10.fa.gz
376.0 kbp     50.4%   80.0%       1.9    genome-s11.fa.gz

test_gather_abund_10_1

When we project r2+r3, 10:1 abundance, onto s10, s11, and s12 genomes with gather:

sourmash gather r2+r3 genome-s10.sig genome-s11.sig genome-s12.sig

we get:

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
0.5 Mbp       91.0%  100.0%      14.5    tests/test-data/genome-s10.fa.gz
376.0 kbp      9.0%   80.0%       1.9    tests/test-data/genome-s11.fa.gz

The cause of the poor approximation for genome s10 is unclear; it could be due to low coverage, or small genome size, or something else. More investigation needed.

ctb commented 4 years ago

I think the above should answer @mytluo's first question.

answering @mytluo's second question,

for --output-unassigned, is it correct to interpret that file as the list of hashes that were unclassified for the "mins" array? If I took the length of that array and added it to the sum of average_abund (or the correct column), could I calculate the % abundance from that sum? Apologies if this is completely off the mark!

Yes, that file is the list of hashes that were not classified / assigned to a genome. The percentages and so on referred to elsewhere in this issue should take that all into account, however, so you don't need to explicitly account for them yourself!

ctb commented 4 years ago

more @mytluo answers -

I was pointed to f_unique_weighted as a possible equivalence to relative abundance -- the summed fraction seems to match the query coverage, so that seems right, but I just wanted to confirm before moving forward :)

Yeah, in partial response to this issue, I put in some tests to make sure I understood my own code :).

Per this line, f_unique_weighted is the abundance-weighted fraction of k-mers uniquely assigned to that gather match. See f_unique_to_query discussion here, which we then adjust to be weighted by the abundances.
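As a toy sketch of that semantics (assumed from the description above, not taken from sourmash internals; all names and numbers invented): f_unique_weighted for a match is the abundance-weighted fraction of query hashes uniquely assigned to that match, so summing it over all matches gives the abundance-weighted fraction of the query recovered.

```python
# query hashes -> abundances; hash 200 is left unassigned by gather.
query = {11: 10, 42: 10, 97: 1, 123: 1, 200: 5}
assigned = {"genomeA": {11, 42}, "genomeB": {97, 123}}

total = sum(query.values())  # total abundance across the query: 27
f_unique_weighted = {
    name: sum(query[h] for h in hashes) / total
    for name, hashes in assigned.items()
}
print(f_unique_weighted)  # genomeA: 20/27, genomeB: 2/27

# The sum over matches (22/27 here) is the abundance-weighted fraction
# of the query recovered; hash 200's abundance stays unassigned.
print(sum(f_unique_weighted.values()))
```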

ctb commented 4 years ago

the answer to @Alok-Shekhar question,

[how can we get] Abundance info just like we get from kraken run from "% of fragments covered " column?

is in #1022 and in the latest docs for lca summarize

ctb commented 4 years ago

ok I think I'm done for now, thanks for all the questions and please ask more!

cesarcardonau commented 4 years ago

This is very helpful, thanks so much @ctb

ctb commented 4 years ago

thank you for sticking with it, & please do ask for clarification :). we have an increasing number of good synthetic and real data sets that we can use to explore and demonstrate this stuff.

DivyaRavichandar commented 4 years ago

@ctb This is very helpful in understanding f_unique_weighted. Would it be reasonable to go from f_unique_weighted to an approximation of the reads annotated to a strain via f_unique_weighted * read_depth_in_sample?

ctb commented 4 years ago

my naive thinking is that f_unique_weighted * number of reads should be what you're looking for. but I haven't thought that through :)
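With that caveat in mind, the back-of-envelope arithmetic would look like this (all numbers invented for illustration):

```python
# Rough estimate of reads attributed to one gather match: scale the
# abundance-weighted fraction by the total read count in the sample.
f_unique_weighted = 0.033   # hypothetical value from a gather result row
total_reads = 1_000_000     # hypothetical read count for the sample

est_reads = f_unique_weighted * total_reads
print(f"~{est_reads:.0f} reads attributed to this match")
```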

ctb commented 4 years ago

closing for now - moved final unanswered question to #1079 for visibility.