renaud / neuroNER

named entity recognizer for neuronal cells, based on UIMA Ruta rules
GNU Lesser General Public License v3.0

Leverage the work done in OBO community to build interoperable ontologies #48

Open cmungall opened 8 years ago

cmungall commented 8 years ago

Apologies if I am misconstruing something here, but it seems there is a lot of effort going on here to duplicate ontologies that exist in the OBO library -- ontologies that are widely in use for major OMICs projects like ENCODE, FANTOM5 and LINCS, as well as projects like Virtual Fly Brain.

Specifically, for cell types, the OBO Cell ontology (http://obofoundry.org/ontology/cl.html) has many neuronal and glial cell types, and has a web based interface to register additional ones. It is well-maintained, and includes a rich array of synonyms for NER purposes. We are planning to align this with NIF-Cell and neurolex neuron subsets (which does not seem to be developed any more).

I notice that your neuron_registry.csv file has many neurons that came originally from http://obofoundry.org/ontology/fbbt.html (the drosophila anatomy ontology, which includes many fly neurons, and is the ontology that drives Virtual Fly Brain). I suspect that this came to you by way of neurolex. Unfortunately there were errors in the import, and it was stale a while ago. I recommend you get fly neurons direct from the latest official fbbt.

I would also point you at uberon.org, the brain subset of which is the successor to NIF-GrossAnatomy, but includes many more brain regions, in mammals and other organisms, with taxonomic scope well-indicated. It is aligned with resources such as the FMA, medical terminologies like SNOMED, as well as 5 different Allen atlases. The alignment was performed very carefully, using OWL-based reasoning to detect inconsistencies and false-positive matches due to homonyms. It also incorporates all the terms from these, to enhance NER. We have a draft page on how to use Uberon for NER: https://github.com/obophenotype/uberon/wiki/Using-uberon-for-text-mining. Uberon is used by many major OMICS efforts, and is the cornerstone of gene expression databases such as BgeeDb -- http://bgee.unil.ch/

Of course, CL and Uberon cover more than nervous systems, but we have technology for extracting slices of the ontology that cover specific anatomical systems as well as specific taxa, and in general for making application ontologies geared towards purposes such as yours.
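
To make the idea of an ontology "slice" concrete, here is a minimal sketch, not the OBO tooling itself. It assumes the ontology is available as child-to-parent edges and keeps everything that sits below a chosen root via `is_a` or `part_of`. The edge table and labels are invented for illustration.

```python
from collections import deque

# Toy edge table (child -> list of (relation, parent)); labels are
# illustrative only and are not real Uberon or CL terms.
EDGES = {
    "neocortex": [("part_of", "brain")],
    "hippocampus": [("part_of", "brain")],
    "pyramidal layer": [("part_of", "hippocampus")],
    "brain": [("part_of", "nervous system")],
    "liver": [("part_of", "digestive system")],
}


def extract_slice(root, edges, relations=("is_a", "part_of")):
    """Return the root plus every term below it via the given relations."""
    # Invert the child -> parent edges so we can walk downward from the root.
    children = {}
    for child, links in edges.items():
        for relation, parent in links:
            if relation in relations:
                children.setdefault(parent, []).append(child)

    kept = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in kept:
                kept.add(child)
                queue.append(child)
    return kept


brain_slice = extract_slice("brain", EDGES)
```

A real slice would also have to decide what to do with axioms that point outside the kept set; this sketch simply ignores them.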

Some of the other ontologies also seem duplicative, for example hbp_developmental_ontology.obo ==> http://obofoundry.org/ontology/hsapdv.html (an ontology of human developmental stages, co-developed with BgeeDb). Also proteins, genes, chemicals etc.

I also note you're using hand-edited obo files. I am one of the people responsible for obo format, so I understand precisely the appeal here. However, I can state from bitter experience this is not a good long term path for you.

I realize that I may be misunderstanding requirements and I may be catching a project in initial exploratory phases. Nevertheless, I have seen a number of these projects come and go, with honest efforts gone to waste where increased collaboration and broader understanding of the resource landscape could have helped all involved.

Apologies in advance if this sounds in any way critical, and if it seems like I am meddling, I offer this purely in a spirit of collaboration and to offer help in any way I can (and to offer this help on behalf of the many domain experts and ontology editors in the OBO community, who would be thrilled to help). You should feel free to close this ticket regardless, sorry if this is not the best way to reach out.

mellybelly commented 8 years ago

I would concur on the offer of assistance. I also would like to point out that the practices that Chris describes also ensure proper attribution and provenance for the contributions - based on many lessons learned over the years. Much of the windy path amongst Fbbt, NIF-Cell, neuron registry, Neurolex, and neuroNER means that the person who originally contributed the content and its original contextual meaning (available if proper provenance is available) are lost. Please please work with the community to help us all ensure proper attribution and provenance, as funding for standards development, reproducibility, and contribution measures all depend upon it.

renaud commented 8 years ago

Thanks @cmungall and @mellybelly for your inputs.

To be honest, @stripathy and myself thought quite a bit about how to best structure our ontological resources. Still, we are not really satisfied and would welcome some help.

What pragmatic approach would you recommend here?

Background: we plan to integrate the neuroNER resources with the upcoming new version of neuroLEX. (therefore, adding @cathzwah and @tgbugs to this conversation)

Thanks again, Renaud

fbastian commented 8 years ago

Hi @renaud, would you be interested in meeting directly in Lausanne? (I work at UNIL, and often collaborate with @cmungall and @mellybelly) That would be really great to have you on board.

tgbugs commented 8 years ago

Hi. Wanted to quickly address some points you made @cmungall. @renaud correct me if I say anything out of line. My plan is more or less as follows, any input would be welcomed.

  1. we want to move away from obo to ttl
  2. wherever these entities ultimately live we need a way to handle multiple (order of 10s) property based classifications (ephys, morphology, gene expression, etc), we will reuse existing identifiers when we can find them (eg brain area based classifications will use uberon identifiers).
  3. we will likely want to materialize many cell types that have multiple property based classifications so that we can add synonyms
  4. if there are existing identifiers for cell types we will reuse them

cmungall commented 8 years ago

@tgbugs -

  1. (sorry if I'm getting into the weeds a little) - is this owl layered over ttl? I'm thinking specifically about more complex axioms, some of which are tempting to code as direct ttl edges, but ultimately buys you less than a more complex encoding. and do you intend to hand edit the ttl, use Protege or topbraid or... there are
  2. absolutely! we're all about multiple axes of classification, and inferring these automatically
  3. ideally we'd find a way to share these synonyms
  4. great! when you say share I take it you mean using directly as the class IRIs/CURIEs (rather than xrefs, which have weaker semantics, get stale, etc... but are better than nothing)
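
As a small illustration of using an identifier directly as a class IRI/CURIE, here is a minimal sketch of converting between the two forms. The only thing taken from this thread is the PURL pattern visible in the FBbt link further down (`http://purl.obolibrary.org/obo/FBbt_00001491`); the function names are hypothetical, and prefixes that themselves contain an underscore are not handled.

```python
# Base of the OBO PURLs, as seen in the FBbt link quoted in this thread.
OBO_PURL = "http://purl.obolibrary.org/obo/"


def curie_to_iri(curie: str) -> str:
    """Expand an OBO-style CURIE such as 'FBbt:00001491' into its PURL."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    return f"{OBO_PURL}{prefix}_{local_id}"


def iri_to_curie(iri: str) -> str:
    """Contract an OBO PURL back into 'PREFIX:LOCAL_ID' form."""
    if not iri.startswith(OBO_PURL):
        raise ValueError(f"not an OBO PURL: {iri!r}")
    prefix, sep, local_id = iri[len(OBO_PURL):].partition("_")
    if not sep or not prefix or not local_id:
        raise ValueError(f"unexpected PURL shape: {iri!r}")
    return f"{prefix}:{local_id}"


example_iri = curie_to_iri("FBbt:00001491")
```

Because the class identifier is the shared IRI itself, there is no separate mapping table to go stale, which is the advantage over xrefs mentioned in point 4.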

@renaud -

thanks for your swift response. I can give you some generic answers to your question about pragmatic next steps, it would be good to get the requirements documented and all that good stuff, just so we don't go down any blind alleys. We could also try a few things like making a fork and pull requests where we swap out chunks of ontologies here and there, perhaps adding something to the repo to allow for syncing the appropriate chunks (I don't know enough of your dependencies to start messing around here). We could try this on something that's less essential for you like dev stages at first (neurons unsurprisingly turn out to be the hardest).

If you haven't already, I'd recommend opening fbbt in Protege and exploring some DL queries, the paper is a good guide here (and doing some things simultaneously on the VFB site). Protege isn't brilliant for exploring lexical aspects like synonyms, or for visual exploration. Try http://www.ebi.ac.uk/ols/beta/search?ontology=fbbt for that.

Even if your interest isn't flies, I think fbbt is a great 'model system' for neuron ontologies, and we are following the patterns used here for mammalian and other vertebrate neurons in CL (we just haven't done anywhere near the same level of work on them yet). @dosumis can provide additional pointers.

The main thing is that you're open to exploring this further, which is great!

cmungall commented 8 years ago

@fbastian is indeed local, and can answer all your questions on Uberon and the wider OBO space

tgbugs commented 8 years ago

@cmungall

  1. owl serialized to ttl if that answers your question, edited by hand or using protege but serialized deterministically using owlapi 4.0+ (see my crappy code here)
  2. direct reuse of identifiers where even remotely possible for the reasons you give and because it makes attribution and prov much easier to track (worst case we would resort to equivalence class assertions like those that live in some of the old nifstd bridge files)

stripathy commented 8 years ago

Thanks for the comments, and especially the offer to provide help and ontology expertise, @cmungall and @mellybelly!

@renaud and I just had a long skype call today discussing the issues you raised and we're putting together a specification document to define the scope of what we're doing. Part of this will include a benchmark dataset of hand-curated spans corresponding to different neuron types: https://github.com/renaud/neuroNER/issues/51. This'll help describe how what we're doing is different from previous approaches and (we think) makes a meaningful contribution to this difficult problem. Maybe after we share the document around, we could have a skype call where interested parties could discuss further.

dosumis commented 8 years ago

Chris wrote:

Even if your interest isn't flies, I think fbbt is a great 'model system' for neuron ontologies, and we are following the patterns used here for mammalian and other vertebrate neurons in CL (we just haven't done anywhere near the same level of work on them yet). @dosumis can provide additional pointers.

Happy to discuss if you're interested.

stripathy commented 8 years ago

Hello all,

I wrote up a short response to a lot of the issues discussed above. Your comments and feedback are appreciated.

I noticed there were two issues discussed above:

  1. Defining the goal and scope of @renaud's and my project, which is to represent neuron types mentioned in the literature at a very fine-grained resolution
  2. Re-use and potential duplication of existing ontologies

Project Goal

Our goal was to build a tool that can capture the fine-grained differences amongst different mammalian neuron types described in the neuroscience literature. For example, it should distinguish between “Barrel cortex layer 5 corticothalamic cell”, “Neocortex layer 6 Ntsr1-expressing corticothalamic cell”, and “barrel cortex layer 5 somatostatin expressing interneuron” (these are all actual instances from the literature). Our approach to this problem is entirely compositional – we try to separately identify and normalize each of the defining components, like morphology, marker gene/protein expression, projection patterns, brain regions, etc (you can read more about our approach here: https://www.dropbox.com/s/idq66ksggyrr5a8/standalone_neuroNER.pdf?dl=0). As far as I can tell, it's identical to the "property-based" classification scheme defined by @tgbugs above.

Thus two neuron mentions can share similar brain regions and morphologies but differ in their other features, like their projection patterns or which genes they express. Also, we’re not trying to embed any extra semantics, above and beyond identifying and normalizing the features defining each neuron instance. For example, while it’s the case that most/all “fast spiking cells” are also “parvalbumin-expressing cells”, we treat these as separate domains (i.e., we don’t use one as a synonym for the other, as is commonly done in NeuroLEX). What’s nice about this approach is that it can deal with new neuron types very easily. For example, with no extra work, we can represent each of the new neuron types defined by the recent Allen Institute Cortical Cell Types paper (http://www.nature.com/neuro/journal/vaop/ncurrent/full/nn.4216.html) because they’re implicitly defined through the conjunction of marker gene expression, brain region, and cell layer, and we already know how to normalize each domain independently.
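
A minimal sketch of this compositional representation, under stated assumptions: each mention is reduced to a set of (domain, value) pairs, so two mentions can be compared domain by domain. The domain names and normalized values below are placeholders invented for illustration; they are not neuroNER's actual identifiers or output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeuronMention:
    """A neuron mention reduced to its normalized defining features."""
    text: str
    features: frozenset  # set of (domain, normalized_value) pairs


def make_mention(text, **domains):
    """Build a mention from keyword arguments like region=['barrel cortex']."""
    pairs = frozenset(
        (domain, value) for domain, values in domains.items() for value in values
    )
    return NeuronMention(text, pairs)


def shared(a, b):
    """Features both mentions have in common."""
    return a.features & b.features


def differing(a, b):
    """Features present in exactly one of the two mentions."""
    return a.features ^ b.features


# Placeholder values; a real system would use ontology identifiers here.
corticothalamic = make_mention(
    "Barrel cortex layer 5 corticothalamic cell",
    region=["barrel cortex"], layer=["5"], projection=["corticothalamic"],
)
sst_interneuron = make_mention(
    "barrel cortex layer 5 somatostatin expressing interneuron",
    region=["barrel cortex"], layer=["5"], marker=["somatostatin"],
    morphology=["interneuron"],
)

common = shared(corticothalamic, sst_interneuron)
delta = differing(corticothalamic, sst_interneuron)
```

Keeping every domain as its own key is what lets "fast spiking" and "parvalbumin-expressing" stay separate features instead of being merged as synonyms.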

@mellybelly and @cmungall, how can we reconcile our approach with the Cell Ontology? Would it require adding a class for each unique instance that we can define (it’d probably be >1K instances)?

On reuse and duplication of ontologies

For each type of feature used to define neuron types (like morphology, brain region, etc), we need to have a list of concepts and appropriate synonyms for each. We’re mostly using the .obo resource files listed here (https://github.com/renaud/neuroNER/tree/master/resources/bluima/neuroner) for this. We initially made each of these resources in a very quick, ad hoc manner (you can read about it here: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7363910&tag=1). But in some cases, we did make a conscious effort to reuse existing identifiers, like using NCBI identifiers (and synonyms from UniProt) for mouse genes and Allen Brain Atlas identifiers for brain regions. For other domains, like development stages or neurotransmitters, for the sake of making progress quickly, we made new terms and identifiers and manually added synonyms as needed. However, our eventual plan was to reconcile these with existing ontologies’ identifiers once we were happy with the terms and synonym lists.
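
A minimal sketch of how the `name:` and `synonym:` lines of an .obo file could be turned into a label-to-identifier lookup for NER. This is not the neuroNER loader: the sample stanzas and `EX:` identifiers are invented, and only a small subset of OBO format is handled (trailing `!` comments are not stripped and escape sequences are kept as-is).

```python
import re
from collections import defaultdict

# Matches lines such as: synonym: "basket neuron" EXACT []
SYNONYM_RE = re.compile(r'^synonym:\s*"((?:[^"\\]|\\.)*)"\s+(\w+)')

# Invented sample data; the EX: identifiers are hypothetical.
SAMPLE = """
format-version: 1.2

[Term]
id: EX:0000001
name: basket cell
synonym: "basket neuron" EXACT []

[Term]
id: EX:0000002
name: fast spiking cell
synonym: "FS cell" RELATED []

[Typedef]
id: part_of
name: part of
"""


def parse_obo_terms(text):
    """Collect name and synonyms for every [Term] stanza."""
    terms = {}
    current = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current = None
            continue
        if not in_term or not line or line.startswith("!"):
            continue
        if line.startswith("id:"):
            current = {"name": None, "synonyms": []}
            terms[line[3:].strip()] = current
        elif current is not None and line.startswith("name:"):
            current["name"] = line[5:].strip()
        elif current is not None:
            match = SYNONYM_RE.match(line)
            if match:
                current["synonyms"].append((match.group(1), match.group(2)))
    return terms


def build_lexicon(terms):
    """Map each lower-cased label or synonym to the term ids that use it."""
    lexicon = defaultdict(set)
    for term_id, term in terms.items():
        labels = [term["name"]] if term["name"] else []
        labels += [synonym for synonym, _scope in term["synonyms"]]
        for label in labels:
            lexicon[label.lower()].add(term_id)
    return dict(lexicon)


lexicon = build_lexicon(parse_obo_terms(SAMPLE))
```

Mapping each label to a set of identifiers rather than a single one keeps ambiguous synonyms visible instead of silently overwriting them.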

After the discussion with you all, it’s clear that we should be using the Uberon identifiers for brain regions. And we’ll reuse the mentioned ontologies for the other domains as well.

A comment about workflows: a few of you said that the text file .obo approach is not really scalable in the long run. We’re open to alternatives and would be happy to switch to something else, so long as it doesn’t completely derail our current workflow.

mellybelly commented 8 years ago

Sorry it's taking me a while to reply, more to come. Let's schedule a call.

First, I don't see where any of your requirements are not representable using standard OWL classification strategies. That's good - it means you can easily adopt existing best practices to get done what you need to do. And, most of what you need already exists.

For example:

..."neuron identity as compositional, or that neuron types are defined through conjunctions of modifying statements which span various domains, like morphology, electrophysiology or neurotransmitter released. For example, a "Neostriatum cholinergic cell", is a neuron that expresses "acetylcholine" and is located in the "Neostriatum". Such a neuron is semantically equivalent to "cholinergic neurons in the neostriatum"."

=> This is standard OWL-based classification and has been used extensively throughout all of the biological ontologies, and in particular for neuroscience and cell ontologies. From CL: 'neocortex basket cell' Equivalent To: 'basket cell and (part of some neocortex)'

In this way you can define using OWL axioms any class you like from other classes. This is at the heart of semantic interoperability and why ontologies are so useful for data integration and analysis. See for example some of the fly neuron classifications:
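
To show how such an equivalence definition classifies things, here is a toy sketch in plain Python. It is not an OWL reasoner and it covers only the single pattern "genus and (part of some X)". The `is_a` and `part_of` tables are invented for illustration; only the 'neocortex basket cell' definition mirrors the CL example above.

```python
from dataclasses import dataclass

# Toy hierarchies (child -> parent); illustrative labels, not real CL/Uberon data.
IS_A = {"basket cell": "interneuron", "interneuron": "neuron"}
PART_OF = {"somatosensory cortex": "neocortex", "neocortex": "cerebral cortex"}


def ancestors(term, edges):
    """Return the term itself plus everything reachable by following edges upward."""
    chain = [term]
    while term in edges:
        term = edges[term]
        chain.append(term)
    return chain


@dataclass(frozen=True)
class DefinedClass:
    """A class defined as: genus and (part of some location)."""
    label: str
    genus: str
    part_of: str


def satisfies(cell_type, location, definition):
    """True if a cell of this type in this location falls under the definition."""
    return (
        definition.genus in ancestors(cell_type, IS_A)
        and definition.part_of in ancestors(location, PART_OF)
    )


NEOCORTEX_BASKET_CELL = DefinedClass(
    label="neocortex basket cell", genus="basket cell", part_of="neocortex"
)

in_s1 = satisfies("basket cell", "somatosensory cortex", NEOCORTEX_BASKET_CELL)
```

A basket cell located in a part of the neocortex is classified under the defined class without anyone asserting that link by hand, which is the behaviour a reasoner gives you at scale.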

http://www.ontobee.org/ontology/FBBT?iri=http://purl.obolibrary.org/obo/FBbt_00001491

(better to open in Protege but faster to look in Ontobee)

Related to how to classify anatomy and cell types: @dosumis fly anatomy ontology paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4015547/ and neurorelations paper: http://bioinformatics.oxfordjournals.org/content/28/9/1262.long

use of expression for cell classification: http://www.ncbi.nlm.nih.gov/pubmed/24004649

The uberon doc wiki: https://github.com/obophenotype/uberon/wiki/Manual

Related to NER and use of ontologies: https://github.com/obophenotype/uberon/wiki/Using-uberon-for-text-mining http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-125 http://blog.phenoscape.org/2013/03/30/report-from-tucson-characters-to-annotations-text-mining/

Kyle's thesis on NER from the literature using the neuron registry: http://digitalcommons.ohsu.edu/etd/896/ http://www.sciencedirect.com/science/article/pii/B978012388408400006X (though note neither he nor Aaron are ontologists, but informative anyway)

A few additional readings: The OWL primer: https://www.w3.org/TR/owl2-primer/ Fundamentally you need to gain a better understanding of how OWL works. There is an upcoming Manchester tutorial that would be very helpful, or a tutorial at ICBO. Or I can send materials.

mellybelly commented 8 years ago

p.s. it's fine to add 1K terms, as long as the logic is quality ;-)