hammer opened this issue 5 years ago:

- For surface marker token matching, should we just extract proteins that have the pattern `CD[0-9]+` instead of all that start w/ CD? Maybe this doesn't matter much.
- `In [632]` is a lot of fun to dig into, particularly for markers w/ a balanced number of positive and negative mentions for a particular cell type (e.g. CD27 for gd T cells).
- I'm not sure I understand the tl;dr of the Naming Modality Frequencies section. Given the motivating example of using surface markers + T cell vs. surface markers + Treg cell, does your assertion mean that 66% of the time we find the former phrasing?
- It would also be interesting to take the `In [632]` plot to the level of CD8+ vs. CD4+ T cells or even T cells in general to see which markers are used the most. We should also be able to compute the markers that best distinguish cell types from this data and compare to the marker lists you found in the OMIP Flow Repository entries.
- What's going on in the Dev set supervised modeling section? How does it relate to the Discriminative model section?
- As discussed a bit over Slack, we probably need some attempt to quantify the performance of our novel tokenization strategy in `ptkn`.
As discussed in person yesterday, it's probably best to quantify how our "tokenization + discarding non-surface-marker tokens" strategy works by comparing it to the performance of the NormCo-inspired "embed all tokens and add embedded vectors" strategy for mapping entity mentions to CL terms. A good chance to make use of the cool work in https://github.com/hammerlab/t-cell-relation-extraction/issues/2#issue-454734546!
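For what it's worth, a minimal sketch of what that comparison could look like on toy data is below; the embeddings, CL entries, cosine similarity, and function names are all placeholders for illustration, not the actual implementation:

```python
import numpy as np

# Toy token embeddings and a toy slice of the Cell Ontology (placeholders).
TOKEN_VECTORS = {
    "cd4":        np.array([1.0, 0.0, 0.2]),
    "regulatory": np.array([0.1, 1.0, 0.0]),
    "t":          np.array([0.2, 0.2, 1.0]),
    "cell":       np.array([0.1, 0.1, 0.9]),
    "cells":      np.array([0.1, 0.1, 0.9]),
}
CL_TERMS = {
    "CL:0000815": "regulatory t cell",
    "CL:0000624": "cd4 t cell",
}

def embed(phrase, keep=lambda tok: True):
    """NormCo-style phrase vector: add up the vectors of the tokens we keep.
    Passing a different `keep` lets us compare 'embed all tokens' vs.
    'discard non-surface-marker tokens'."""
    vecs = [TOKEN_VECTORS[t] for t in phrase.lower().split()
            if t in TOKEN_VECTORS and keep(t)]
    return np.sum(vecs, axis=0) if vecs else None

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def link(mention, keep=lambda tok: True):
    """Map a mention to the CL term with the most similar summed embedding."""
    v = embed(mention, keep)
    if v is None:
        return None
    return max(CL_TERMS, key=lambda cid: cosine(embed(CL_TERMS[cid]), v))

print(link("CD4 regulatory T cells"))  # whichever CL id wins under these toy vectors
```

The comparison itself would then just be running both `keep` strategies over the hand-labeled mentions and scoring against the expected CL ids.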
To your comments (anything omitted went on to a TODO list in the summary report as-is):
> For the dev corpus, how many documents are returned by the search query in total?

It returns 124,720 results (as of ~March). My hope was that sorting by relevance through the Entrez API would give me the top ~20k results requiring the least amount of filtering to find docs worth annotating.
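For context, pulling a relevance-sorted ID list from E-utilities looks roughly like this (a sketch using Biopython; the query and `retmax` are placeholders, and anything much larger would want the history server / paging rather than a single call):

```python
from Bio import Entrez

Entrez.email = "you@example.org"  # required by NCBI; placeholder address

query = "(human) AND ((t cell) OR (t lymphocyte)) AND (cytokine)"  # placeholder query

# sort="relevance" asks PubMed to rank hits by its relevance score instead of
# publication date; retmax caps how many PMIDs come back in this request.
handle = Entrez.esearch(db="pubmed", term=query, sort="relevance", retmax=1000)
result = Entrez.read(handle)
handle.close()

print(result["Count"])    # total number of matching documents
pmids = result["IdList"]  # the top `retmax` PMIDs by relevance
```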
> Does it make sense to use MeSH to limit the primary corpus? I seem to recall you using this approach initially. For example, there's a T-Lymphocyte Subsets subject heading.

I was doing that initially but wasn't able to get very large result sets. I added some of my queries + result count experiments to the summary but I think this difference demonstrates the issue I was having:
- `Humans AND T-cells AND cytokines AND differentiation AND induction`
- `(human) AND ( (t cell) OR (t lymphocyte) ) AND (cytokine) AND ((differentiate) OR (differentiation) OR (differentiated)) AND ((polarization) OR (polarize) OR (induce) OR (induction))`
- `"humans"[MeSH Terms] AND ("t lymphocytes"[MeSH Terms] OR "t lymphocyte subsets"[MeSH Terms]) AND "cytokines"[MeSH Terms] AND "cell differentiation"[MeSH Terms] AND "transcriptional activation"[MeSH Terms]`

The queries I added to the summary show what happens as you make the query less specific, but even at the `"humans"[MeSH Terms] AND ("t lymphocytes"[MeSH Terms] OR "t lymphocyte subsets"[MeSH Terms])` level you still only get ~48k results. I am probably missing some similar MeSH terms that would increase the number of matching docs, but even so I thought it made sense to prioritize recall over precision for building the corpora.
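A quick way to reproduce that comparison is to ask esearch for counts only, e.g. (same Biopython assumption as above):

```python
from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder address

# The free-text and MeSH formulations above, copied from the summary.
QUERIES = {
    "free text": '(human) AND ((t cell) OR (t lymphocyte)) AND (cytokine) '
                 'AND ((differentiate) OR (differentiation) OR (differentiated)) '
                 'AND ((polarization) OR (polarize) OR (induce) OR (induction))',
    "MeSH":      '"humans"[MeSH Terms] AND ("t lymphocytes"[MeSH Terms] OR '
                 '"t lymphocyte subsets"[MeSH Terms]) AND "cytokines"[MeSH Terms] AND '
                 '"cell differentiation"[MeSH Terms] AND "transcriptional activation"[MeSH Terms]',
}

# retmax=0 returns only the hit count, which is all we need to see how
# restrictive each formulation is.
for name, term in QUERIES.items():
    handle = Entrez.esearch(db="pubmed", term=term, retmax=0)
    print(name, Entrez.read(handle)["Count"])
    handle.close()
```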
> For surface marker token matching, should we just extract proteins that have the pattern `CD[0-9]+` instead of all that start w/ CD? Maybe this doesn't matter much.

There are at least a few surface forms like "CDw198" (CCR8) that I think make sense to catch. Is there some set of CD* molecules you think would be problematic to include?
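To make the tradeoff concrete, here is one way the two patterns could be compared; the broader pattern is just an illustration of a rule that keeps forms like CDw198 and CD45RA while still excluding CDC42/CDK-style names, not necessarily the rule we'd actually use:

```python
import re

STRICT  = re.compile(r"^CD[0-9]+$")                 # CD4, CD25, CD127, ...
BROADER = re.compile(r"^CDw?[0-9]+[A-Za-z]{0,2}$")  # also CDw198, CD45RA, CD8a, ...

# Quick check of what each pattern accepts.
for name in ["CD4", "CD25", "CDw198", "CD45RA", "CDC42", "CDK2"]:
    print(name, bool(STRICT.match(name)), bool(BROADER.match(name)))
```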
> Also for surface marker token matching, does the PRO not annotate proteins w/ surface expression status? It would be nice to just filter on some field rather than having to define our own rule.

I think the best way they would capture that is with GO annotations, and for example CD14 (chicken) has the cell surface GO annotation, yet CD4 (human) (and its parent CD4 molecule) and a few others I checked for humans do not, despite having the string "surface" in several of their synonyms. I also don't see any other GO annotations common to all of them, which suggests that filtering by annotation wouldn't work very well.
> I'm not sure I understand the tl;dr of the Naming Modality Frequencies section. Given the motivating example of using surface markers + T cell vs. surface markers + Treg cell, does your assertion mean that 66% of the time we find the former phrasing?

A better way to break it down, out of all of the entity mentions from the JNLPBA tagger: by 66% I mean 5.9% / (5.9% + 3.6%) ≈ 2/3, implying that the "CD4+CD25+Foxp3+ T cells" phrasing is the more common of the two once you remove the entity mentions that are either impossible to classify or very easy to classify.
> Some data on how many surface forms/entity mentions/noun phrases we are able to map back to CL terms with high confidence would be useful in this section. My mental model for what we're doing is mapping a giant list of noun phrases back to a small hierarchy of CL terms (+ synonyms), and some subset of noun phrases will not map with high confidence onto any CL terms. Our next task after this section will then be to determine if we need to either add a synonym to an existing CL term, make a new CL term, or throw out the noun phrase. Does that sound right to you?

That sounds right to me, and I think the analysis I did will at least go a long way toward knowing what to throw out. I'm not sure what's more likely to constitute a new CL term though. Do you think they're more likely to look like `CD182+ CD194- CD33+ T cells` (i.e. an unheard-of marker combination) or more like `Th99 cells` (i.e. an unheard-of acronym/shorthand name)?
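As a rough illustration of that mapping step (purely a sketch: the CL entries, normalization, and threshold here are made up for the example):

```python
from difflib import SequenceMatcher

# Toy slice of the ontology: CL id -> primary name + synonyms.
CL_TERMS = {
    "CL:0000815": ["regulatory T cell", "Treg", "CD4+CD25+ regulatory T cell"],
    "CL:0000625": ["CD8-positive, alpha-beta T cell", "CD8+ T cell"],
}

def normalize(s):
    return " ".join(s.lower().replace("-", " ").split())

def link(phrase, threshold=0.85):
    """Return (CL id, score) for the best-matching synonym, or None if nothing
    clears the confidence threshold (i.e. a candidate new term/synonym)."""
    best = None
    for term_id, names in CL_TERMS.items():
        for name in names:
            score = SequenceMatcher(None, normalize(phrase), normalize(name)).ratio()
            if best is None or score > best[1]:
                best = (term_id, score)
    return best if best and best[1] >= threshold else None

print(link("Tregs"))       # close to "Treg" -> likely a hit
print(link("Th99 cells"))  # no good match -> None
```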
Does "Candidate Generation" mean the identification of sentences that could contain relations?
Yeah, more or less. I thought it would be worth explaining how tags and sentences are combined to create candidates. The process is pretty simple: every pairwise combination of tags (of the appropriate types) within a single sentence results in a new candidate. It took me a bit to confirm as much in the Snorkel code, though.
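In other words, something like the following (a simplified stand-in for what Snorkel does internally, not its actual API):

```python
from itertools import product

# One tagged sentence: spans grouped by entity type (placeholder example).
tags = {
    "cell_type": ["Th17 cells", "naive CD4+ T cells"],
    "cytokine":  ["IL-6", "TGF-beta"],
}

# Every (cell_type, cytokine) pair in the sentence becomes one relation candidate.
candidates = list(product(tags["cell_type"], tags["cytokine"]))
print(candidates)  # 2 cell types x 2 cytokines -> 4 candidates for this sentence
```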
> It's almost like a labeling function development environment (LFDE) is needed.

I agree! Particularly when it comes to building a lot of regex-based patterns from synonym lists, there seems to be a lot of room for tooling like that.
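For example, a lot of the LFs end up looking like this kind of synonym-driven regex (a sketch; the synonym list, label values, and candidate structure are placeholders rather than the actual LFs in this repo):

```python
import re

POSITIVE, ABSTAIN = 1, 0  # Snorkel-style label conventions (placeholder values)

INDUCE_SYNONYMS = ["induce", "induces", "induced", "drive", "drives", "promote", "promotes"]
INDUCE_RE = re.compile(r"\b(" + "|".join(INDUCE_SYNONYMS) + r")\b", re.IGNORECASE)

def lf_inducing_verb_between(candidate):
    """Label a (cytokine, cell type) candidate positive if an 'inducing'
    verb appears in the text between the two entity spans."""
    between = candidate["text"][candidate["span1_end"]:candidate["span2_start"]]
    return POSITIVE if INDUCE_RE.search(between) else ABSTAIN

# Hypothetical candidate structure, just to exercise the LF.
cand = {"text": "IL-6 and TGF-beta induce differentiation of Th17 cells",
        "span1_end": 17, "span2_start": 44}
print(lf_inducing_verb_between(cand))  # -> 1
```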
> What's going on in the Dev set supervised modeling section? How does it relate to the Discriminative model section?

I'm working on building classifiers as labeling functions using only the manually annotated dev data. I think they'll make for decent labeling functions and, more importantly, a good baseline, since the "discriminative model" (the one trained on the probabilistic labels from their generative model) should be able to perform better when evaluated on the same validation and test datasets I created, which are also hand-labeled.
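Roughly, the baseline looks like this (a scikit-learn sketch; the feature choice and the tiny placeholder datasets are just for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Hand-annotated dev candidates (placeholders).
dev_texts  = ["IL-6 and TGF-beta induce differentiation of Th17 cells",
              "CD4 expression was measured in all samples"]
dev_labels = [1, 0]

# A simple supervised model trained only on the dev set; it can double as a
# labeling function and as a reference point for the discriminative model
# trained on the probabilistic (weak) labels.
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
baseline.fit(dev_texts, dev_labels)

# Hand-labeled validation split (placeholder reuse of the dev data).
val_texts, val_labels = dev_texts, dev_labels
print(f1_score(val_labels, baseline.predict(val_texts)))
```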
> Do you think they're more likely to look like `CD182+ CD194- CD33+ T cells` (i.e. an unheard-of marker combination) or more like `Th99 cells` (i.e. an unheard-of acronym/shorthand name)?

Both! I guess it depends on the frequency of the entity mention. Figuring out where to position it within the CL ontology will be fun, as will determining whether any of these new types are synonyms of one another. If we wanted to get fancy we could also weight by credibility using things like the quality of the journal, the citation count of the paper, and the track record of the authors reporting the new type.
I suspect we'll need to do some manual inspection of the entity mentions that we have left over following the entity linking step, similar to the manual inspection you did following NormCo-style embedding.
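One cheap way to triage those leftovers during manual inspection might be a heuristic along these lines (illustration only; the patterns are assumptions based on the two examples above):

```python
import re

MARKER_COMBO = re.compile(r"^(CD\w+[+-]\s*)+\w+.*cells?$", re.IGNORECASE)
SHORTHAND    = re.compile(r"^(Th|Tc|Tr|Tfh)\w*\s+cells?$", re.IGNORECASE)

def triage(mention):
    """Rough bucket for an unmapped mention, to speed up manual review."""
    if MARKER_COMBO.match(mention):
        return "marker combination"
    if SHORTHAND.match(mention):
        return "acronym/shorthand"
    return "needs manual review"

print(triage("CD182+ CD194- CD33+ T cells"))  # marker combination
print(triage("Th99 cells"))                   # acronym/shorthand
```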
> for example CD14 (chicken) has the cell surface GO annotation, yet CD4 (human) (and its parent CD4 molecule) and a few others I checked for humans do not, despite having the string "surface" in several of their synonyms.

Well that's disappointing. It's also disappointing that the GO Annotations team only accepts suggested improvements via email. I wish they had a discussion forum! The EMBL-EBI needs a single Discourse instance for all of their open data projects...
Hey @hammer, I updated the summary and tied it to a release that I think is a solid improvement on my first loop through the whole process where I had no real evaluation criteria other than correlation with iX.
I added "Strong Supervision" and "Weak Supervision" sections to the summary that include some performance numbers on RE for each task now that I have separate validation and test sets.
My takeaways from this iteration are:
My next steps before trying to run another iteration of this loop will be:
An alternative I see to this is that, if that performance is at least close to good enough (F1 between 60 and 70), I can let ontology matching remain an orthogonal task and collect just enough new evaluation data to be comfortable picking and applying a final model. Results from applying that final model could then easily be merged with any progress made on the ontology matching front, without needing to re-run the very expensive modeling loop. I think I could manage doing that and writing it up by mid-August, but the full list above would likely take longer.
What do you think?
Comments on commit https://github.com/hammerlab/t-cell-relation-extraction/commit/0f6f5d8669f66b7b0e6176a9cbf72b6e324385a2