obophenotype / upheno

The Unified Phenotype Ontology (uPheno) integrates multiple phenotype ontologies into a unified cross-species phenotype ontology.
https://obophenotype.github.io/upheno/
Creative Commons Zero v1.0 Universal

generic uPheno classes for all taxon-specific phenotype classes #729

Open mikebada opened 3 years ago

mikebada commented 3 years ago

Hi all,

I've recently been having a discussion regarding uPheno with @matentzn, who requested I write this up as a uPheno issue here.

We've put a lot of work into creating gold-standard (manual) annotations of concept mentions in biomedical journal articles for our CRAFT Corpus, relying on the classes of select OBOs, and we'd like to continue our work by analogously annotating mentions of phenotypes in these articles. We've done some preliminary work for this task and are finding it pretty difficult for multiple reasons, one of which simply is the fact that (we've found that) annotation becomes more difficult as the complexity of the concepts increases, but that's the nature of many phenotypic concepts.

We've also found that our annotation is significantly facilitated when using taxon-nonspecific classes; for example, we've already extensively annotated the corpus using the anatomical classes of Uberon and the taxon-nonspecific protein classes of the Protein Ontology. However, for uPheno, although there's substantial integration of the taxon-specific phenotype classes into a hierarchy of taxon-nonspecific grouping classes (e.g., UPHENO:'abnormal abdomen morphology'), there seem to be many taxon-specific phenotype classes that don't have analogous generic classes; for example, HP:'Ocular hypertension' and MP:'ocular hypertension' classes are integrated, but there's no generic uPheno ocular hypertension class subsuming these. It also looks like some of the specific phenotype ontologies aren't as integrated as others, and it'd be great to have integration under generic uPheno classes among a wide range of taxa (ideally among all taxa), e.g., a generic uPheno increased body size class that subsumes MP:'increased body size', FBcv:'large body', and any other corresponding classes so that a textual mention of increased body size can be straightforwardly annotated with a generic uPheno class, without having to figure out the taxon. That being said, I'm sure that all of the uPheno integration already accomplished constitutes an enormous amount of work, so I don't at all wish to come across as critical.

So, I've been discussing with Nico the possibility of the creation of generic uPheno classes that subsume all corresponding existing taxon-specific phenotype classes, which he said is a reasonable request--to my surprise, as it seems to me like another huge amount of work! We wouldn't need the generic uPheno phenotype classes to have logical definitions for the near-term task of text annotation, if that would help (though it'd be useful to eventually have them where possible, as we believe they may be useful for downstream automated methods). We could also eventually compile a list of phenotypes appearing in our corpus that we'd like to conceptually annotate and for which there are taxon-specific classes already in uPheno, if it would help to have some kind of prioritized set.

Please let me know if there are any questions or if discussion would be helpful, and thanks for your consideration!

Mike

matentzn commented 3 years ago

This is great, thank you for the detailed ticket!

Would you be up for helping a bit with the solution? I brought your case up last week at our phenotypes call, and there is one critical limitation/feature of uPheno2 that does not allow blindly creating new taxon-_UN_specific terms - we need to be able to assign the phenotype to a pattern. You can browse the list of currently supported patterns here to get a picture: https://github.com/obophenotype/upheno/tree/master/src/patterns/dosdp-patterns

What I would need from you is basically this, if you agree: Whenever you want to annotate a term for which no species-agnostic term exists, add it to a Google Sheet (I will show you which). Every week, we will assign a few curators to "patternise" your terms: for example, you add HP:Ocular hypertension; then we will have someone create a suitable pattern; once the pattern exists, the term will show up in uPheno a few weeks later.

In your annotation practice, you could simply use the MP/HP/DPO or whatever term you found that is analogous to your use case; and once the term is in uPheno, you can run a big replace-all script that replaces those IDs with uPheno IDs.
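A minimal sketch of what such a replace-all step could look like, assuming the annotations are stored as plain text containing CURIE-style IDs and that a mapping from taxon-specific IDs to uPheno IDs is available as a dictionary. The function name `replace_ids`, the mapping, and every ID shown are hypothetical placeholders, not real uPheno assignments.

```python
import re

# Hypothetical mapping from taxon-specific IDs to generic uPheno IDs.
# All IDs here are placeholders for illustration only.
EXAMPLE_MAPPING = {
    "HP:0000001": "UPHENO:9000001",
    "MP:0000002": "UPHENO:9000001",
}

# Matches CURIE-style IDs: a letter-only prefix, a colon, then digits.
CURIE_PATTERN = re.compile(r"\b([A-Za-z]+):(\d+)\b")


def replace_ids(text, mapping):
    """Replace every mapped ID in text; leave unmapped IDs untouched."""

    def substitute(match):
        curie = match.group(0)
        # Fall back to the original ID when no generic class exists yet.
        return mapping.get(curie, curie)

    return CURIE_PATTERN.sub(substitute, text)


if __name__ == "__main__":
    line = "T1\tHP:0000001\tocular hypertension"
    print(replace_ids(line, EXAMPLE_MAPPING))
```

Matching whole IDs with a pattern, rather than chaining plain string replacements, avoids rewriting an ID that merely starts with the same digits as a mapped one.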

Would that work for you?

mikebada commented 3 years ago

We would certainly be able to help with this. (@nicolevasilevsky will likely be doing most of the annotation work.) Btw, just so I understand, did you mean to write that the critical limitation/feature doesn't allow blindly creating taxon-nonspecific terms without patterns?

We've so far done only some preliminary annotation work just to show the Translator folks that we'll be able to annotate phenotype mentions in our corpus. We won't really get going until sometime next year (as we have other annotation work that should be finished first), so there's no rush right now, but for now it'd be really useful for us to at least develop a more thorough plan.

With regard to annotation of our corpus, it sounds like between your and our efforts we'd be able to create pretty much all of the generic uPheno classes we'd need (which I estimate would likely be on the order of hundreds of classes, perhaps even low hundreds). So, along with the temporary use of taxon-specific classes and subsequent substitution with generic classes as you suggested, I'm pretty confident we'd be able to successfully annotate the corpus, which is great. (There might be some issues for classes for which there are currently no logical definitions, but we could deal with that later.)

What I'm much more concerned about is what we'd use later on for automatic annotation of phenotype concepts in text. In fact, our plan is to mine as much of the biomedical literature as we can obtain for phenotypes (as well as for concepts of other OBOs we've already used to annotate our corpus). For this, we'd of course need to use all of uPheno, in which case there'd need to be a generic uPheno class for every phenotype for which there's at least one taxon-specific class. (Again, the idea would be a generic phenotype ontology analogous to Uberon.) I realize that this would be a huge undertaking and thus would understand resistance to it based on that alone. However, even if we could successfully use uPheno with the aforementioned methodology to annotate our corpus, I think it may not be the way to go if there won't be a complete taxon-nonspecific phenotype ontology eventually. Do you have any thoughts/intuitions as to whether or not this might happen eventually?

Thanks again for your consideration!

matentzn commented 3 years ago

> blindly creating taxon-nonspecific terms

Oh yes, sorry, corrected that in the text.

> There might be some issues for classes for which there are currently no logical definitions, but we could deal with that later.

Let us worry about that - we have an insanely productive team here that will be able to solve this, if it is solvable (it sometimes just is not, but in that case we will still solve it, with a technical solution).

Your concerns about what will happen when the automatic pipeline hits are right. However, I would say this is more or less an extension of the manual issue from before: if you start early with this process, we can measure how many new classes are actually needed. So you basically use all of uPheno for matching, and when you match on a species-specific class for which there is no species-non-specific class, we work on fiddling it in.

To be honest, due to a mistake in the framework, we do have many more species-independent classes than we should (if there is 1 taxon-specific class with a logical definition, there will be a non-specific class as well). This will now accidentally benefit your case. Would you be open to just looking at how many non-specific classes we would be missing? We have a ton of resources to work on stuff like that, so I would just say 'let's try' :D
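One rough way to get that count is sketched below. It assumes the subclass relations have already been exported into a dictionary from class ID to its direct parents, that generic classes can be recognised by a `UPHENO:` prefix, and that the analogous generic class, where one exists, is a direct parent of the taxon-specific class. The function name `missing_generic` and all IDs are hypothetical placeholders.

```python
def missing_generic(matched_ids, parents, generic_prefix="UPHENO:"):
    """Return the matched taxon-specific IDs with no generic direct parent.

    matched_ids: IDs that the matching step hit in text.
    parents: dict mapping a class ID to a list of its direct parent IDs
             (assumed to be exported from the ontology beforehand).
    """
    missing = []
    for class_id in sorted(set(matched_ids)):
        if class_id.startswith(generic_prefix):
            continue  # already a generic class, nothing to request
        direct_parents = parents.get(class_id, ())
        if not any(p.startswith(generic_prefix) for p in direct_parents):
            missing.append(class_id)
    return missing


# Placeholder hierarchy for illustration only; not real ontology content.
EXAMPLE_PARENTS = {
    "HP:0000001": ["UPHENO:9000001"],
    "MP:0000002": ["MP:0000010", "UPHENO:9000001"],
    "FBcv:0000003": ["FBcv:0000020"],
}

if __name__ == "__main__":
    hits = ["HP:0000001", "FBcv:0000003", "FBcv:0000003", "MP:0000002"]
    gaps = missing_generic(hits, EXAMPLE_PARENTS)
    print(len(gaps), gaps)
```

The resulting list could double as the prioritized set of terms to hand over for patternising.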