waldronlab / BugSigDB

A microbial signatures database
https://bugsigdb.org

Calculate some similarity measures #43

Open lwaldron opened 4 years ago

lwaldron commented 4 years ago

We would like to be able to store relevant similarity measures between signatures, but are not sure what measures we will want in the future. These will be updated regularly in the future as new signatures are added, through the wiki API if this is possible. For now we should have:

  1. Jaccard Index
  2. Number of overlapping taxa

@lgeistlinger would you create a file of similarity indices based on the bugsigdb.org dump? I think we want to leave open the possibility of adding new similarity measures in the future. @tosfos would this be a good format?

signature1 signature2 jaccard number

These will be used to link to "similar" other signatures from the signature pages.
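As a rough illustration of the two measures and the proposed file layout, here is a minimal Python sketch. It assumes each signature is just a set of NCBI taxon IDs; the signature names, the taxon IDs, and the `pairwise_similarity` helper are made up for the example and are not taken from the bugsigdb.org dump.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_similarity(signatures):
    """Return (signature1, signature2, jaccard, number) for every unordered pair."""
    rows = []
    for id1, id2 in combinations(sorted(signatures), 2):
        taxa1, taxa2 = signatures[id1], signatures[id2]
        rows.append((id1, id2, jaccard(taxa1, taxa2), len(taxa1 & taxa2)))
    return rows

# Hypothetical signatures: names and taxon IDs are illustrative only.
signatures = {
    "sigA": {"101", "102", "103"},
    "sigB": {"101", "102", "104"},
    "sigC": {"105"},
}

rows = pairwise_similarity(signatures)
print("signature1\tsignature2\tjaccard\tnumber")
for sig1, sig2, jac, number in rows:
    print(f"{sig1}\t{sig2}\t{jac:.3f}\t{number}")
```

Adding another measure later would mean adding another column, which is why a long format (one row per pair and measure) might also be worth considering.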

lwaldron commented 4 years ago

We can also consider just picking one simple measure (like Jaccard Index) and programming this functionality into the wiki. Then anything more complicated will be outside the scope of the wiki. Good to discuss with Ike.

seandavi commented 4 years ago

When "counting" shared and non-shared taxa, what is the "rule" to use when some taxa are not at the same taxonomic level?

On Thu, Nov 19, 2020 at 10:43 AM Levi Waldron notifications@github.com wrote:

We can also consider just picking one simple measure (like Jaccard Index) and programming this functionality into the wiki. Then anything more complicated will be outside the scope of the wiki. Good to discuss with Ike.


tosfos commented 4 years ago

@tosfos would this be a good format? signature1 signature2 jaccard number

Yes. This is tricky to store with Semantic MediaWiki but we'll dream something up.

We can also consider just picking one simple measure (like Jaccard Index) and programming this functionality into the wiki. Then anything more complicated will be outside the scope of the wiki. Good to discuss with Ike.

That would be way better! And way cooler.

lwaldron commented 4 years ago

When "counting" shared and non-shared taxa, what is the "rule" to use when some taxa are not at the same taxonomic level? (@seandavi)

That is a good question - the little bit of bug set enrichment analysis I've seen just ignores taxonomy, which is obviously not correct but could be useful in this context anyways if we don't try to attach a p-value, or limits to a single taxonomic rank. I can't think of anything better that would be straightforward - thinking of things like unweighted UniFrac distance (https://en.wikipedia.org/wiki/UniFrac) which measures phylogenetic distance between two microbial communities, and Ancestral State Reconstruction to compare mixed taxonomic levels. It leaves me thinking that just for a basic purpose of showing similar signatures, which are mostly either genus or species-level, Jaccard might be good enough? We'll end up with species-level signatures (WMS) always being dissimilar to genus-level signatures (16S), but I'm not sure right now what we could do about that.

lgeistlinger commented 4 years ago

When "counting" shared and non-shared taxa, what is the "rule" to use when some taxa are not at the same taxonomic level?

Sounds to me like an "argument to the function". So far, we only considered exact matches, in the sense that they have the same NCBI ID. Of course, your similarity measure calculation could allow for, e.g., going up/down 1, 2, ... levels of the taxonomy to declare overlap.
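One way to read the "argument to the function" idea is sketched below in Python. This is only one possible rule, not something agreed on in this thread: each taxon is expanded with its ancestors up to `levels_up` steps before computing Jaccard. The `parent` mapping is a tiny hand-written stand-in for the NCBI taxonomy, taxa are keyed by name instead of NCBI ID for readability, and the function names are hypothetical.

```python
def expand(taxa, parent, levels_up=0):
    """Return the taxa plus their ancestors up to `levels_up` steps above each."""
    expanded = set()
    for taxon in taxa:
        current = taxon
        expanded.add(current)
        for _ in range(levels_up):
            current = parent.get(current)
            if current is None:  # reached the top of the toy taxonomy
                break
            expanded.add(current)
    return expanded

def jaccard_with_levels(a, b, parent, levels_up=0):
    """Jaccard index after expanding both signatures by `levels_up` levels."""
    ea = expand(a, parent, levels_up)
    eb = expand(b, parent, levels_up)
    union = ea | eb
    return len(ea & eb) / len(union) if union else 0.0

# Toy child -> parent map (assumption: a real one would come from the NCBI taxonomy).
parent = {
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Bacteroides fragilis": "Bacteroides",
    "Bacteroides": "Bacteroidaceae",
}

species_sig = {"Escherichia coli", "Bacteroides fragilis"}
genus_sig = {"Escherichia", "Bacteroides"}

exact = jaccard_with_levels(species_sig, genus_sig, parent, levels_up=0)
relaxed = jaccard_with_levels(species_sig, genus_sig, parent, levels_up=1)
print(exact, relaxed)
```

With `levels_up=0` this reduces to the exact-match Jaccard we have used so far, so the species-level and genus-level toy signatures do not overlap at all; with `levels_up=1` they share the two genera.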

lwaldron commented 4 years ago

Let's close this for now just to make space for priority issues. First priority now will be transferring curation over to bugsigdb.org.

lwaldron commented 3 years ago

We could open this issue again. The essentials have been taken care of, and this would be a great enhancement. I've tested simple Jaccard Index and it seems to produce pretty intuitive groupings. It's rather heavy to calculate for all pairwise combinations of signatures, so the values would have to be pre-computed, and then recomputed only for signatures that are added or changed. I liked it better than other simple alternatives like intersection length over minimum length, or simple intersection. But it should be designed to allow for supporting other similarity measures in the future (for example, genus-level only Jaccard index).
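A minimal Python sketch of that design, under the assumption that signatures are sets of taxon identifiers held in memory: the measures live in a small registry so new ones can be added later, and only the pairs involving an added or changed signature are recomputed. The `MEASURES` registry, `update_scores`, and the signature names are hypothetical, not existing BugSigDB code.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_min(a, b):
    """Intersection length over the length of the smaller signature."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

def intersection_size(a, b):
    return len(a & b)

# Registry of measures; supporting a new measure means adding one entry here.
MEASURES = {
    "jaccard": jaccard,
    "overlap_min": overlap_min,
    "intersection": intersection_size,
}

def update_scores(scores, signatures, changed_id):
    """Recompute only the pairs that involve `changed_id` (added, edited, or deleted)."""
    for pair in [p for p in scores if changed_id in p]:
        del scores[pair]  # drop stale values for this signature
    if changed_id not in signatures:  # the signature was deleted
        return scores
    changed = signatures[changed_id]
    for other_id, taxa in signatures.items():
        if other_id == changed_id:
            continue
        pair = tuple(sorted((changed_id, other_id)))
        scores[pair] = {name: fn(changed, taxa) for name, fn in MEASURES.items()}
    return scores

# Illustrative data only.
signatures = {"sigA": {"a", "b", "c"}, "sigB": {"a", "b"}}
scores = update_scores({}, signatures, "sigA")

signatures["sigC"] = {"c"}  # a newly curated signature
update_scores(scores, signatures, "sigC")  # touches only the pairs involving sigC
print(scores[("sigA", "sigC")])
```

Each update costs one pass over the existing signatures instead of a full pairwise recomputation.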

lgeistlinger commented 3 years ago

@tosfos @lwaldron : how do we proceed with this? Are we proceeding along these lines:

We can also consider just picking one simple measure (like Jaccard Index) and programming this functionality into the wiki. Then anything more complicated will be outside the scope of the wiki. Good to discuss with Ike.

That would be way better! And way cooler.

I am noting in this context that I recently played around with the concept of semantic similarity. Here we treat the NCBI taxonomy as an ontology and calculate pairwise similarity between signatures or groups of signatures, as implemented e.g. in the ontologySimilarity package.
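To make the idea concrete, here is a small Python sketch of one simple flavour of semantic similarity. It is not the ontologySimilarity implementation and may differ from what that package computes: two taxa are scored by the Jaccard index of their ancestor sets, and two signatures by the best-match average of those scores. The `parent` map is a toy stand-in for the NCBI taxonomy and all function names are hypothetical.

```python
def ancestors(term, parent):
    """Return the term together with all of its ancestors in the toy taxonomy."""
    result = set()
    while term is not None and term not in result:
        result.add(term)
        term = parent.get(term)
    return result

def term_similarity(t1, t2, parent):
    """Jaccard index of the two ancestor sets (each set includes the term itself)."""
    a1, a2 = ancestors(t1, parent), ancestors(t2, parent)
    return len(a1 & a2) / len(a1 | a2)

def best_match_average(sig1, sig2, parent):
    """Average, in both directions, of each taxon's best match in the other signature."""
    if not sig1 or not sig2:
        return 0.0
    def directed(src, dst):
        return sum(max(term_similarity(s, d, parent) for d in dst) for s in src) / len(src)
    return (directed(sig1, sig2) + directed(sig2, sig1)) / 2

# Toy child -> parent map (assumption: a real one would come from the NCBI taxonomy).
parent = {
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Bacteroides fragilis": "Bacteroides",
    "Bacteroides": "Bacteroidaceae",
}

related = best_match_average({"Escherichia coli"}, {"Escherichia"}, parent)
unrelated = best_match_average({"Escherichia coli"}, {"Bacteroides"}, parent)
print(related, unrelated)
```

Unlike exact-match Jaccard, a species-level signature and a genus-level signature from the same lineage get a non-zero score here, which is the mixed-taxonomy case raised earlier in the thread.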

lwaldron commented 3 years ago

Since this is an enhancement, it should be lower priority than getting the home page and all basic functionality in place. Eventually after the other key functionality is implemented, we may want to discuss whether semantic similarity is practical to implement, because it is elegant and probably will provide more relevant similarities than Jaccard for mixed-taxonomy signatures.