Raw phenotypes and semantic similarity matrix for one entity (Aim 1)

uyedaj commented 6 years ago

For Aim 1, @sergeitarasov and @uyedaj need an entity with good data coverage in Phenoscape (e.g. dorsal fin), along with all of its raw phenotypes and taxa.

We then need @balhoff to turn the raw phenotypes into a semantic similarity matrix (Jaccard) that gives us all pairwise similarities among the phenotypes we have.

wdahdul commented 6 years ago

Paula and I found that the term 'pectoral fin spine' for Siluriformes (catfishes) has broad categories of qualities associated with it, and the corresponding tree is available in OpenTree.

@balhoff can you send @sergeitarasov and @uyedaj the datasets mentioned above for 'pectoral fin spine' and 'Siluriformes'?

wdahdul commented 6 years ago

@sergeitarasov @uyedaj
For the tree, there's a ‘catfish’ collection in Open Tree with 13 trees that we uploaded a few years ago, corresponding to taxa with data annotated in Phenoscape. We need to get a customized synthetic tree for catfishes that includes the curated phylogenies in the tree collection. As I understand it, this needs to be requested from OT. Have you done this before? I’ve asked Laura for more details and can update you next week.

sergeitarasov commented 6 years ago

I have used OT, but I am not familiar with all the details yet.

wdahdul commented 6 years ago

I chatted with Laura about requesting a custom tree from OT. She confirms that this needs to be requested (there’s been mention of adding a new feature allowing users to synthesize their own custom tree, but this isn’t implemented yet). Laura mentioned several options that she specified (e.g., include all incertae sedis taxa, remove monotypic branches (for reconstruction), remove OTT IDs from taxon names, and exclude subspecies). Having said that, if you get the tree from OT, there's still the taxon reconciliation to work out, so it might be better to wait for the potential Catalog of Fishes update with OT.

pmabee commented 6 years ago

Yes. The details are as Wasila describes above. The Jackson et al. paper too. I’m scheduled to talk with Catalog of Fishes tomorrow.

Paula


balhoff commented 6 years ago

@uyedaj @sergeitarasov I've created a preliminary service for accessing similarity scores comparing pairs of character states. We don't currently have stable global identifiers for each character state, so for now I implemented the service to accept several parameters that specify two states ("left" and "right"): study, character number (1-indexed), and state symbol. If these don't work for you, we can discuss better designs for this endpoint. Here's how you can call it:

#!/bin/bash

STUDY='http%3A%2F%2Fdx.doi.org%2F10.1111%2Fj.1096-3642.2008.00407.x'
CHAR=158

curl "http://kb.phenoscape.org/api/similarity/states?leftStudy=$STUDY&leftCharacter=$CHAR&leftSymbol=0&rightStudy=$STUDY&rightCharacter=$CHAR&rightSymbol=1"

You can compare states from different studies; I just compared two from the same character here. This method will require you to make pairwise calls for all the states you want to compare. We can talk about better ways to specify the whole set at once, but this should allow you to get started.
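As a rough sketch of the pairwise approach described above, the script below enumerates every unordered pair from a list of states and prints the request URL for each. Only the endpoint and its six query parameters are taken from the call shown earlier. The `STATES` list, its `study|character|symbol` encoding, and the function names `state_pair_url` and `all_pair_urls` are assumptions made for illustration, and the character and symbol values are placeholders rather than states known to exist in the KB.

```shell
#!/bin/bash
# Sketch: list the request URL for every unordered pair of character states.
# The state list and helper names below are hypothetical.

BASE='http://kb.phenoscape.org/api/similarity/states'
STUDY='http%3A%2F%2Fdx.doi.org%2F10.1111%2Fj.1096-3642.2008.00407.x'

# One state per entry as "study|character|symbol" (illustrative values only).
STATES=(
  "$STUDY|158|0"
  "$STUDY|158|1"
  "$STUDY|159|0"
)

# Build the URL comparing two states given in "study|character|symbol" form.
state_pair_url() {
  local ls lc lsym rs rc rsym
  IFS='|' read -r ls lc lsym <<< "$1"
  IFS='|' read -r rs rc rsym <<< "$2"
  printf '%s?leftStudy=%s&leftCharacter=%s&leftSymbol=%s&rightStudy=%s&rightCharacter=%s&rightSymbol=%s\n' \
    "$BASE" "$ls" "$lc" "$lsym" "$rs" "$rc" "$rsym"
}

# Print one URL per unordered pair (i < j); n states give n*(n-1)/2 URLs.
all_pair_urls() {
  local i j n=${#STATES[@]}
  for ((i = 0; i < n; i++)); do
    for ((j = i + 1; j < n; j++)); do
      state_pair_url "${STATES[i]}" "${STATES[j]}"
    done
  done
}

all_pair_urls
# To fetch the scores instead of listing the URLs:
#   all_pair_urls | while read -r url; do curl -s "$url"; echo; done
```

Because the number of calls grows as n(n-1)/2, this is only practical for small sets of states, which is part of the motivation for a single-request matrix endpoint.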

sergeitarasov commented 6 years ago

I tried it and it works. It's definitely a good start for getting access to the similarity scores. But perhaps we could discuss how to improve it, for example by returning the entire similarity matrix from a single query.

hlapp commented 6 years ago

I would argue that what we should be returning instead is a binary-encoded matrix of each character state against each ontology class (X(i,j) == 1 if ontology class i subsumes one of the terms annotated to state j and 0 otherwise). It is very easy from there in R to compute the Jaccard similarity through simple vector algebra.
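To spell out the vector algebra, here is a short worked form. It assumes the matrix layout described above (rows are ontology classes, columns are character states); the column-vector notation $x_a$ and the four-class example are introduced here only for illustration.

```latex
% x_a is the column of X for state a; A is the set of classes i with X(i,a) = 1.
\[
J(a,b) \;=\; \frac{|A \cap B|}{|A \cup B|}
       \;=\; \frac{x_a^{\top} x_b}{x_a^{\top} x_a + x_b^{\top} x_b - x_a^{\top} x_b}
\]
% All pairs at once: with S = X^T X,
\[
J_{ab} \;=\; \frac{S_{ab}}{S_{aa} + S_{bb} - S_{ab}}
\]
% Example with four classes: x_a = (1,1,1,0), x_b = (1,1,0,1)
\[
x_a^{\top} x_b = 2, \quad x_a^{\top} x_a = 3, \quad x_b^{\top} x_b = 3
\quad\Longrightarrow\quad
J(a,b) = \frac{2}{3 + 3 - 2} = 0.5
\]
```

The ratio is undefined when both columns are entirely zero, so that case would need a convention of its own.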

balhoff commented 6 years ago

@sergeitarasov for getting a whole matrix back, what we would need to work out would be how to specify the set of states which you want to pairwise compare. For this use case would that simply be the identifier for a publication which has been annotated in the KB?

balhoff commented 6 years ago

> I would argue that what we should be returning instead is a binary-encoded matrix of each character state against each ontology class X(i,j) == 1

@hlapp this would probably be a very large payload—currently the KB contains 490903 named classes, and this will probably grow. Would the result format need to identify each class (many of which fall more into the category of implementation detail than global concept)? Or could it simply guarantee that the same index is the same class for any given KB deployment?

uyedaj commented 6 years ago

@balhoff I don't think we want to limit it only to a single publication, but that might be a good start. An example use case is that I would want to look at a specific clade of fish and a specific entity, e.g. caudal fin, and get all annotated phenotypes from the KB. Then I would want to get a matrix of their semantic similarity using this service. This may combine multiple publications.

hlapp commented 6 years ago

@balhoff I think to be reproducible the row and column ordering would need to be stable if obtained repeatedly from the same KB deployment. I think it can (and arguably should) be allowed to change from one version of the KB to the next, or one deployment to another unless the data on which the deployments were built were exactly the same.

The (0,1) matrix would be very sparse: the vast majority of cells would be expected to be zero, and in principle there are (space-)efficient representations for sparse matrices like these. In our case I think we can go further: any row (ontology class) that is zero for all columns requested can be omitted with no effect on at least the Tanimoto (and thus Jaccard) metric, and arguably on any other sensible similarity metric one would conceive with this, too (because a sensible metric would probably not take into account the ontology terms that neither of the entities in a pairwise comparison is annotated with).
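The claim about all-zero rows can be checked directly from the definition. In the short derivation below, $X_{ia}$ is the matrix entry for class $i$ and state $a$, and $Z$ (a name introduced here only for illustration) is the set of rows that are zero in every requested column.

```latex
% Jaccard written as sums over classes i (binary entries, so X_{ia}^2 = X_{ia}):
\[
J(a,b) \;=\; \frac{\sum_i X_{ia} X_{ib}}
                  {\sum_i X_{ia} + \sum_i X_{ib} - \sum_i X_{ia} X_{ib}}
\]
% For i in Z, X_{ia} = X_{ib} = 0 for every requested pair (a, b), hence
\[
\sum_i X_{ia} X_{ib} \;=\; \sum_{i \notin Z} X_{ia} X_{ib},
\qquad
\sum_i X_{ia} \;=\; \sum_{i \notin Z} X_{ia}
\]
% so numerator and denominator, and therefore J(a,b), are unchanged
% when the rows in Z are dropped.
```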

tjv commented 6 years ago

@hlapp just wanted to add an important but possibly non-obvious caveat - both the direct and subsumed annotations to a class need to be zero or else one cannot ignore that class in similarity calculations.

hlapp commented 6 years ago

> both the direct and subsumed annotations to a class need to be zero or else one cannot ignore that class in similarity calculations.

Do you mean classes annotated to a phenotype description? If you meant annotations in the other direction (from a phenotype description to an ontology class), then annotations don't subsume each other. Ontology classes do.

Either way, for any class description (whether a class or a class expression), it can be removed from the matrix if it is neither annotated to any of the phenotypes (columns) nor subsumes a class description annotated to any of the phenotypes. In other words, a class description that isn't on a path from annotated class description to root of the ontology graph for any of the phenotypes being looked at can be removed from the matrix. (This is perhaps what you meant?)

phenoscape / scate

Raw phenotypes and semantic similarity matrix for one entity (Aim 1) #7