sourmash-bio / wort

A database for signatures of public genomic sources
https://wort.sourmash.bio

Calculating SAC on metagenome clusters #36

Open nmb85 opened 3 years ago

nmb85 commented 3 years ago

@luizirber, one more thing for today (not intending to distract you): it would be really interesting if you could calculate the species accumulation curve (SAC) for hash sets in clusters of metagenomes in your monster wort database. For example, treating soil metagenomes as a cluster, you could build a matrix of hashes (such as here), calculate the different orders of intersection between the hash sets from the soil metagenomes, and then plot an SAC from the hashes.

This might be impossible with raw k-mers, and species tallies are distorted by incomplete annotation due to incomplete databases, but hashes might give you a chance to get an accurate SAC by incrementally adding hash sets and tracking how the intersection sets change. See equation 3 in this paper for a definitive explanation.

Then you could efficiently use all the data in the SRA and JGI databases to estimate whether the species count based on current soil metagenomes is "open" (the SAC fits a power-law function) or "closed" (the SAC fits an exponential function), that is, whether or not we've collected enough data to estimate an asymptote for the number of species (in this case using hashes as a proxy) in soil metagenomes (or some other interesting biome). Although I'm not a soil biologist, I think that's a major question in their field. Other biomes might be interesting too. I'm not sure if anyone has tried this with raw k-mers, but it seems too gargantuan a task. Hashes might make this problem tractable?
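To make the incremental-addition idea concrete, here is a minimal sketch in plain Python. It is not the inclusion-exclusion formulation from the paper mentioned above; it swaps in the simpler permutation approach, where samples are added in random order and the number of distinct hashes is averaged over many orderings. The function names and the toy hash sets are hypothetical; in practice each set would hold the hashes from one metagenome signature.

```python
import random
from typing import List, Sequence, Set


def accumulation_curve(hash_sets: Sequence[Set[int]]) -> List[int]:
    """Count distinct hashes seen after adding each sample, in the given order."""
    seen: Set[int] = set()
    curve = []
    for hashes in hash_sets:
        seen |= hashes  # union in the new sample's hashes
        curve.append(len(seen))
    return curve


def mean_accumulation_curve(
    hash_sets: Sequence[Set[int]], n_permutations: int = 100, seed: int = 0
) -> List[float]:
    """Average the accumulation curve over random sample orderings."""
    rng = random.Random(seed)
    order = list(hash_sets)
    totals = [0] * len(order)
    for _ in range(n_permutations):
        rng.shuffle(order)
        for i, count in enumerate(accumulation_curve(order)):
            totals[i] += count
    return [total / n_permutations for total in totals]


# Toy example: four hypothetical samples with small integer "hashes".
toy_samples = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {1, 5, 6}]
print(accumulation_curve(toy_samples))
print(mean_accumulation_curve(toy_samples))
```

The averaged curve is what you would then fit with a power-law or a saturating exponential model to judge whether the cluster looks "open" or "closed". Holding every hash set in memory as a Python set will not scale to the full database, so this only illustrates the calculation.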

luizirber commented 3 years ago

That is a really good idea... and a monstrous matrix :rofl:

I'll work on sharing all the sigs in a couple of weeks, but it is not something I can tackle at the moment :cry:

ctb commented 3 years ago

yes! we explored this quite a bit a while back for tara, see https://github.com/ctb/2017-sourmash-rarefy/blob/master/tara-rarefy.ipynb for an example. Haven't looked at the code in a while tho ;).

nmb85 commented 3 years ago

Have you already seen this? https://ieeexplore.ieee.org/abstract/document/9139876