hammer opened this issue 5 years ago
For the integration w/ external ontologies, I was surprised to learn the iX paper used their own "ontology" of cytokines and their receptors. It's available via ImmPort's Cytokine Registry, in case you haven't seen it yet.
A couple other possibilities:
Oh yeah, I did see that, but not until I had already gone further down the MyGene path. I should probably have used that registry instead. Looking at CL, though, I can't seem to find newer/rarer cell types like tissue-resident or stem memory T cells, so maybe it's always a good idea to assume a project-specific identifier space and then map to an external ontology where possible?
Store mentions by section of paper a la CancerMine to make it possible to segregate new and old findings
Good point on partitioning by section of paper! Open Targets does the same thing. They use a horrifying Perl script called SectionTagger detailed in the paper Section level search functionality in Europe PMC (2015).
We can also use paper metadata to distinguish e.g. reviews from new research results.
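For concreteness, the kind of per-mention record I'm imagining (field names are hypothetical, not from CancerMine or Open Targets):

```python
# Hypothetical mention record: keep the section and publication type with each
# extracted mention so downstream queries can segregate e.g. review-article
# restatements from the results sections of primary research papers.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Mention:
    pmid: str                 # source paper
    section: str              # e.g. "introduction", "results", "discussion"
    publication_type: str     # e.g. "review" vs "research-article", from paper metadata
    sentence: str             # the sentence containing the mention
    entity_id: Optional[str]  # mapped ontology ID, if any
```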
> Looking at CL, though, I can't seem to find newer/rarer cell types like tissue-resident or stem memory T cells, so maybe it's always a good idea to assume a project-specific identifier space and then map to an external ontology where possible?
Yeah, did you read the sections in the methods and supplementary notes of the iX paper where they explain how they use "seed phrases" and incremental expansion of those phrases to identify cell types and then map them back to CL? CL definitely seems like a pretty poor ontology relative to GO and others.
Oh wow that's a gnarly script.
Re: cell type matching -- I didn't see that when I was just hoping to find synonyms to pull in, but after a closer look I think they're essentially saying they matched candidates against the CL names + synonyms in an order-insensitive way, with a preference for the most specific concepts (and if the candidate contains a string not in any name/synonym it is ignored). They explain a scoring system for the candidate matches in Supplementary Note 3, but at the end of it they say they ultimately only take matches with a perfect score of 1. Given the definition of the score, that would barely differ from searching for the name/synonym strings directly. Presumably precision drops off a cliff with any lower threshold. The seed + typed dependencies idea is cool though; maybe there's a way to build on that.
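If I'm reading it right, the perfect-score case collapses to an exact token-set lookup. A rough sketch (the two CL entries below are just an illustrative stand-in for the full name + synonym table):

```python
# Rough sketch of order-insensitive matching of a candidate phrase against CL
# names + synonyms; with the "perfect score of 1" threshold this reduces to an
# exact token-set lookup, which is why it barely differs from direct string search.
CL_LABELS = {  # illustrative stand-in for the real CL name + synonym table
    "CL:0000084": ["T cell", "T lymphocyte"],
    "CL:0000624": ["CD4-positive, alpha-beta T cell"],
}

def token_set(s):
    return frozenset(s.lower().replace(",", " ").split())

LOOKUP = {token_set(label): cl_id
          for cl_id, labels in CL_LABELS.items()
          for label in labels}

def match(candidate):
    # A candidate containing any token not in a name/synonym simply fails the lookup.
    return LOOKUP.get(token_set(candidate))

print(match("alpha-beta CD4-positive, T cell"))  # -> 'CL:0000624'
```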
It could also be interesting to look at coreference resolution and relation extraction across sentence boundaries, e.g. Inter-sentence Relation Extraction for Associating Biological Context with Events in Biomedical Texts. Obviously this will likely be very hard, so probably not something to explore now; just filing away for later.
I was doing some thinking about our upcoming call w/ @ajratner after our discussion yesterday. Two topics that we discussed: are there any ways to automate label function generation, and could we make use of science-corpus-specific training data in the models that feed the label functions and the relation extractor?
For the first topic, I found two papers from Paroma Varma, another Chris Ré student, that may have some ideas for us. I think both papers describe the same system, called "Snuba" in one and "Reef" in the other. Code is at https://github.com/HazyResearch/reef. Paroma only lists the Snuba paper on her website, so maybe just read that one. We should also bring up w/ Alex the idea of making use of structures used by QA systems to inform candidate heuristic generation (in the language of Reef/Snuba).
For the second topic, I reread the ScispaCy and SciBERT papers last night and feel like we have to get some lift from making use of these pretrained models, even if they don't have precisely the entity types we need for NER. Table 8 in the ScispaCy paper shows that their custom rule-based tokenizer and domain-specific sentence segmenter massively improve the basic task of sentence segmentation. I wish they gave more details on the `en_ner_jnlpba_md` training process (e.g. do they fine-tune one of their `en_core_sci` models? Edit: looks like the code is in `train_specialised_ner.py`), as it would be interesting to see if we can use the same procedure to make an `en_ner_tcellrel` model for spaCy. Finally, Table 3 from the SciBERT paper shows a huge lift for the same ChemProt relation extraction task that Snorkel uses as a demo/example, so it seems to me that if we replace the Snorkel biLSTM w/ a model that uses SciBERT embeddings we should do a lot better for free.
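To make that last point concrete, a rough sketch of pulling sentence-level SciBERT features via Hugging Face transformers (the model name is the public allenai release on the Hub; this isn't code from either paper, just how I'd imagine wiring the embeddings into whatever replaces the biLSTM):

```python
# Minimal sketch: sentence-level SciBERT features that a downstream relation
# classifier could consume in place of the Snorkel demo's biLSTM inputs.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

sentence = "IL-6 and TGF-beta drive differentiation of naive CD4+ T cells into Th17 cells."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape (1, n_tokens, 768)
features = hidden[:, 0, :]  # [CLS] vector as a simple sentence representation
```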
One more project that may be interesting to discuss w/ Alex: Babble Labble. Perhaps writing label functions would be less laborious if you were authoring them in natural language? More interesting than that, though, is whatever data structure represents the parsed natural language that is used to generate the label function. That intermediate representation could be the right target for candidate heuristic generation.
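For reference, a single hand-written label function looks roughly like this (sketched against the snorkel.labeling decorator API; the candidate fields and label constants are made-up placeholders for our schema). The Babble Labble idea would be to compile a natural-language explanation into a function of this shape, and that intermediate parse is the representation I mean:

```python
# Hand-written label function sketch (placeholder candidate schema). Babble
# Labble's pitch is to compile an explanation like "label INDUCES because a word
# like 'induces' appears between the cytokine and the cell type" into this.
from snorkel.labeling import labeling_function

ABSTAIN, INDUCES = -1, 1

@labeling_function()
def lf_induction_verb(cand):
    # cand.text and the span attributes are hypothetical fields on our candidates.
    between = cand.text[cand.cytokine_span.end:cand.cell_type_span.start].lower()
    return INDUCES if any(v in between for v in ("induce", "drive", "promote")) else ABSTAIN
```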
Doing a bit more reading on fine-grained entity recognition. The foundational paper in this field seems to be Fine-grained entity recognition (2012) by Xiao Ling and Daniel Weld. What's interesting for us is that they do fine-grained NER in support of relation extraction and show a significant gain in performance on RE.
FWIW I decided to look into the GO terms that correspond to the relations we're learning and I think these map to secretion and differentiation, though I don't know if there's a way to be more specific about cytokine-induced differentiation versus TF-induced differentiation...
Oh I should also dump the Python libraries I've seen for working w/ ontologies
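As one example, here's what loading CL and pulling names + synonyms looks like with pronto (an OBO parser; the sketch assumes cl.obo has been downloaded locally):

```python
# Small sketch: load the Cell Ontology with pronto and collect names + synonyms,
# i.e. the table the matching heuristics above would be built from.
# Assumes cl.obo has been downloaded from the OBO Foundry to the working directory.
import pronto

cl = pronto.Ontology("cl.obo")
labels = {}
for term in cl.terms():
    names = [term.name] + [syn.description for syn in term.synonyms]
    labels[term.id] = [n for n in names if n]

print(len(labels), "terms loaded")
```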
Musing about a few things we could do for the next manuscript update