hammer opened this issue 5 years ago
For the integration w/ external ontologies, I was surprised to learn the iX paper used their own "ontology" of cytokines and their receptors. It's available via ImmPort's Cytokine Registry, in case you haven't seen it yet.
A couple other possibilities:
Oh yeah, I did see that, but not until I had already gone further down the MyGene path. I should probably have used that registry instead. Looking at CL, though, I can't seem to find newer/rarer cell types like tissue-resident or stem memory T cells, so maybe it's always a good idea to assume a project-specific identifier space and then map to an external ontology where possible?
Store mentions by section of paper a la CancerMine to make it possible to segregate new and old findings
Good point on partitioning by section of paper! Open Targets does the same thing. They use a horrifying Perl script called SectionTagger detailed in the paper Section level search functionality in Europe PMC (2015).
We can also use paper metadata to distinguish e.g. reviews from new research results.
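For concreteness, the kind of per-mention record I'm imagining (field names are hypothetical, not from CancerMine or Open Targets):

```python
# Hypothetical mention record: keep the section and publication type with each
# extracted mention so downstream queries can segregate e.g. review-article
# restatements from the results sections of primary research papers.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Mention:
    pmid: str                 # source paper
    section: str              # e.g. "introduction", "results", "discussion"
    publication_type: str     # e.g. "review" vs "research-article", from paper metadata
    sentence: str             # the sentence containing the mention
    entity_id: Optional[str]  # mapped ontology ID, if any
```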
> Looking at CL, though, I can't seem to find newer/rarer cell types like tissue-resident or stem memory T cells, so maybe it's always a good idea to assume a project-specific identifier space and then map to an external ontology where possible?
Yeah, did you read the sections in the methods and supplementary notes of the iX paper where they explain how they use "seed phrases" and incremental expansion of those phrases to identify cell types and then map them back to CL? CL definitely seems like a pretty poor ontology relative to GO and others.
Oh wow that's a gnarly script.
Re: cell type matching -- I didn't see that when I was just hoping to find synonyms to pull in, but after a closer look I think they're essentially saying they matched candidates against the CL names + synonyms in an order-insensitive way, with a preference for the most specific concepts (and if the candidate contains a string not in any name/synonym it is ignored). They explain a scoring system for the candidate matches in Supplementary Note 3, but at the end of it they say they ultimately only take matches with a perfect score of 1. Given the definition of the score, that would barely differ from searching for the name/synonym strings directly. Presumably precision drops off a cliff with any lower threshold. The seed + typed dependencies idea is cool though; maybe there's a way to build on that.
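If I'm reading it right, the perfect-score case collapses to an exact token-set lookup. A rough sketch (the two CL entries below are just an illustrative stand-in for the full name + synonym table):

```python
# Rough sketch of order-insensitive matching of a candidate phrase against CL
# names + synonyms; with the "perfect score of 1" threshold this reduces to an
# exact token-set lookup, which is why it barely differs from direct string search.
CL_LABELS = {  # illustrative stand-in for the real CL name + synonym table
    "CL:0000084": ["T cell", "T lymphocyte"],
    "CL:0000624": ["CD4-positive, alpha-beta T cell"],
}

def token_set(s):
    return frozenset(s.lower().replace(",", " ").split())

LOOKUP = {token_set(label): cl_id
          for cl_id, labels in CL_LABELS.items()
          for label in labels}

def match(candidate):
    # A candidate containing any token not in a name/synonym simply fails the lookup.
    return LOOKUP.get(token_set(candidate))

print(match("alpha-beta CD4-positive, T cell"))  # -> 'CL:0000624'
```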
It could also be interesting to look at coreference resolution and relation extraction across sentence boundaries, e.g. Inter-sentence Relation Extraction for Associating Biological Context with Events in Biomedical Texts. Obviously this will likely be very hard, so probably not something to explore now; just filing away for later.
I was doing some thinking about our upcoming call w/ @ajratner after our discussion yesterday. Two topics that we discussed: are there any ways to automate label function generation, and could we make use of science-corpus-specific training data in the models that feed the label functions and the relation extractor?
For the first topic, I found two papers from Paroma Varma, another Chris Ré student, that may have some ideas for us. I think both papers describe the same system, called "Snuba" in one and "Reef" in the other. Code is at https://github.com/HazyResearch/reef. Paroma only lists the Snuba paper on her website, so maybe just read that one. We should also bring up w/ Alex the idea of making use of structures used by QA systems to inform candidate heuristic generation (in the language of Reef/Snuba).
For the second topic, I reread the ScispaCy and SciBERT papers last night and feel like we have to get some lift from making use of these pretrained models, even if they don't have precisely the entity types we need for NER. Table 8 in the ScispaCy paper shows that their custom rule-based tokenizer and domain-specific sentence segmenter massively improve the basic task of sentence segmentation. I wish they gave more details on the `en_ner_jnlpba_md` training process (e.g. do they fine-tune one of their `en_core_sci` models? Edit: looks like the code is in `train_specialised_ner.py`), as it would be interesting to see if we can use the same procedure to make an `en_ner_tcellrel` model for spaCy. Finally, Table 3 from the SciBERT paper shows a huge lift for the same ChemProt relation extraction task that Snorkel uses as a demo/example, so it seems to me that if we replace the Snorkel biLSTM w/ a model that uses SciBERT embeddings we should do a lot better for free.
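To make that last point concrete, a rough sketch of pulling sentence-level SciBERT features via Hugging Face transformers (the model name is the public allenai release on the Hub; this isn't code from either paper, just how I'd imagine wiring the embeddings into whatever replaces the biLSTM):

```python
# Minimal sketch: sentence-level SciBERT features that a downstream relation
# classifier could consume in place of the Snorkel demo's biLSTM inputs.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

sentence = "IL-6 and TGF-beta drive differentiation of naive CD4+ T cells into Th17 cells."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape (1, n_tokens, 768)
features = hidden[:, 0, :]  # [CLS] vector as a simple sentence representation
```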
One more project that may be interesting to discuss w/ Alex: Babble Labble. Perhaps writing label functions would be less laborious if you were authoring them in natural language? More interesting than that, though, is whatever data structure represents the parsed natural language that is used to generate the label function. That intermediate representation could be the right target for candidate heuristic generation.
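For reference, a single hand-written label function looks roughly like this (sketched against the snorkel.labeling decorator API; the candidate fields and label constants are made-up placeholders for our schema). The Babble Labble idea would be to compile a natural-language explanation into a function of this shape, and that intermediate parse is the representation I mean:

```python
# Hand-written label function sketch (placeholder candidate schema). Babble
# Labble's pitch is to compile an explanation like "label INDUCES because a word
# like 'induces' appears between the cytokine and the cell type" into this.
from snorkel.labeling import labeling_function

ABSTAIN, INDUCES = -1, 1

@labeling_function()
def lf_induction_verb(cand):
    # cand.text and the span attributes are hypothetical fields on our candidates.
    between = cand.text[cand.cytokine_span.end:cand.cell_type_span.start].lower()
    return INDUCES if any(v in between for v in ("induce", "drive", "promote")) else ABSTAIN
```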
Doing a bit more reading on fine-grained entity recognition. The foundational paper in this field seems to be Fine-grained entity recognition (2012) by Xiao Ling and Daniel Weld. What's interesting for us is that they do fine-grained NER in support of relation extraction and show a significant gain in performance on RE.
FWIW I decided to look into the GO terms that correspond to the relations we're learning and I think these map to secretion and differentiation, though I don't know if there's a way to be more specific about cytokine-induced differentiation versus TF-induced differentiation...
Oh I should also dump the Python libraries I've seen for working w/ ontologies
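As one example, here's what loading CL and pulling names + synonyms looks like with pronto (an OBO parser; the sketch assumes cl.obo has been downloaded locally):

```python
# Small sketch: load the Cell Ontology with pronto and collect names + synonyms,
# i.e. the table the matching heuristics above would be built from.
# Assumes cl.obo has been downloaded from the OBO Foundry to the working directory.
import pronto

cl = pronto.Ontology("cl.obo")
labels = {}
for term in cl.terms():
    names = [term.name] + [syn.description for syn in term.synonyms]
    labels[term.id] = [n for n in names if n]

print(len(labels), "terms loaded")
```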
Musing about a few things we could do for the next manuscript update