hammer opened this issue 5 years ago:

- For surface marker token matching, should we just extract proteins that have the pattern `CD[0-9]+` instead of all that start w/ CD? Maybe this doesn't matter much.
- `In [632]` is a lot of fun to dig into, particularly for markers w/ a balanced number of positive and negative mentions for a particular cell type (e.g. CD27 for gd T cells).
- I'm not sure I understand the tl;dr of the Naming Modality Frequencies section. Given the motivating example of using surface markers + T cell vs. surface markers + Treg cell, does your assertion mean that 66% of the time we find the former phrasing?
- It would also be interesting to take the `In [632]` plot to the level of CD8+ vs. CD4+ T cells or even T cells in general to see which markers are used the most. We should also be able to compute the markers that best distinguish cell types from this data and compare to the marker lists you found in the OMIP Flow Repository entries.
- What's going on in the Dev set supervised modeling section? How does it relate to the Discriminative model section?
- As discussed a bit over Slack, we probably need some attempt to quantify the performance of our novel tokenization strategy in `ptkn`.
As discussed in person yesterday, it's probably best to quantify how our "tokenization + discarding non-surface-marker tokens" strategy works by comparing it to the performance of the NormCo-inspired "embed all tokens and add embedded vectors" strategy for mapping entity mentions to CL terms. A good chance to make use of the cool work in https://github.com/hammerlab/t-cell-relation-extraction/issues/2#issue-454734546!
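For what it's worth, a minimal sketch of what that comparison could look like on toy data is below; the embeddings, CL entries, cosine similarity, and function names are all placeholders for illustration, not the actual implementation:

```python
import numpy as np

# Toy token embeddings and a toy slice of the Cell Ontology (placeholders).
TOKEN_VECTORS = {
    "cd4":        np.array([1.0, 0.0, 0.2]),
    "regulatory": np.array([0.1, 1.0, 0.0]),
    "t":          np.array([0.2, 0.2, 1.0]),
    "cell":       np.array([0.1, 0.1, 0.9]),
    "cells":      np.array([0.1, 0.1, 0.9]),
}
CL_TERMS = {
    "CL:0000815": "regulatory t cell",
    "CL:0000624": "cd4 t cell",
}

def embed(phrase, keep=lambda tok: True):
    """NormCo-style phrase vector: add up the vectors of the tokens we keep.
    Passing a different `keep` lets us compare 'embed all tokens' vs.
    'discard non-surface-marker tokens'."""
    vecs = [TOKEN_VECTORS[t] for t in phrase.lower().split()
            if t in TOKEN_VECTORS and keep(t)]
    return np.sum(vecs, axis=0) if vecs else None

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def link(mention, keep=lambda tok: True):
    """Map a mention to the CL term with the most similar summed embedding."""
    v = embed(mention, keep)
    if v is None:
        return None
    return max(CL_TERMS, key=lambda cid: cosine(embed(CL_TERMS[cid]), v))

print(link("CD4 regulatory T cells"))  # whichever CL id wins under these toy vectors
```

The comparison itself would then just be running both `keep` strategies over the hand-labeled mentions and scoring against the expected CL ids.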
To your comments (anything omitted went on to a TODO list in the summary report as-is):
> For the dev corpus, how many documents are returned by the search query in total?

It returns 124,720 results (as of ~March). My hope was that sorting by relevance through the Entrez API would give me the top ~20k results requiring the least amount of filtering to find docs worth annotating.
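For context, pulling a relevance-sorted ID list from E-utilities looks roughly like this (a sketch using Biopython; the query and `retmax` are placeholders, and anything much larger would want the history server / paging rather than a single call):

```python
from Bio import Entrez

Entrez.email = "you@example.org"  # required by NCBI; placeholder address

query = "(human) AND ((t cell) OR (t lymphocyte)) AND (cytokine)"  # placeholder query

# sort="relevance" asks PubMed to rank hits by its relevance score instead of
# publication date; retmax caps how many PMIDs come back in this request.
handle = Entrez.esearch(db="pubmed", term=query, sort="relevance", retmax=1000)
result = Entrez.read(handle)
handle.close()

print(result["Count"])    # total number of matching documents
pmids = result["IdList"]  # the top `retmax` PMIDs by relevance
```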
> Does it make sense to use MeSH to limit the primary corpus? I seem to recall you using this approach initially. For example, there's a T-Lymphocyte Subsets subject heading.

I was doing that initially but wasn't able to get very large result sets. I added some of my queries + result count experiments to the summary but I think this difference demonstrates the issue I was having:
- `Humans AND T-cells AND cytokines AND differentiation AND induction`
- `(human) AND ( (t cell) OR (t lymphocyte) ) AND (cytokine) AND ((differentiate) OR (differentiation) OR (differentiated)) AND ((polarization) OR (polarize) OR (induce) OR (induction))`
- `"humans"[MeSH Terms] AND ("t lymphocytes"[MeSH Terms] OR "t lymphocyte subsets"[MeSH Terms]) AND "cytokines"[MeSH Terms] AND "cell differentiation"[MeSH Terms] AND "transcriptional activation"[MeSH Terms]`

The queries I added to the summary show what happens as you make the query less specific, but even at the `"humans"[MeSH Terms] AND ("t lymphocytes"[MeSH Terms] OR "t lymphocyte subsets"[MeSH Terms])` level you still only get ~48k results. I am probably missing some similar MeSH terms that would increase the number of matching docs, but even so I thought it made sense to prioritize recall over precision for building the corpora.
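A quick way to reproduce that comparison is to ask esearch for counts only, e.g. (same Biopython assumption as above):

```python
from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder address

# The free-text and MeSH formulations above, copied from the summary.
QUERIES = {
    "free text": '(human) AND ((t cell) OR (t lymphocyte)) AND (cytokine) '
                 'AND ((differentiate) OR (differentiation) OR (differentiated)) '
                 'AND ((polarization) OR (polarize) OR (induce) OR (induction))',
    "MeSH":      '"humans"[MeSH Terms] AND ("t lymphocytes"[MeSH Terms] OR '
                 '"t lymphocyte subsets"[MeSH Terms]) AND "cytokines"[MeSH Terms] AND '
                 '"cell differentiation"[MeSH Terms] AND "transcriptional activation"[MeSH Terms]',
}

# retmax=0 returns only the hit count, which is all we need to see how
# restrictive each formulation is.
for name, term in QUERIES.items():
    handle = Entrez.esearch(db="pubmed", term=term, retmax=0)
    print(name, Entrez.read(handle)["Count"])
    handle.close()
```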
> For surface marker token matching, should we just extract proteins that have the pattern `CD[0-9]+` instead of all that start w/ CD? Maybe this doesn't matter much.

There are at least a few surface forms like "CDw198" (CCR8) that I think make sense to catch. Is there some set of CD* molecules you think would be problematic to include?
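To make the tradeoff concrete, here is one way the two patterns could be compared; the broader pattern is just an illustration of a rule that keeps forms like CDw198 and CD45RA while still excluding CDC42/CDK-style names, not necessarily the rule we'd actually use:

```python
import re

STRICT  = re.compile(r"^CD[0-9]+$")                 # CD4, CD25, CD127, ...
BROADER = re.compile(r"^CDw?[0-9]+[A-Za-z]{0,2}$")  # also CDw198, CD45RA, CD8a, ...

# Quick check of what each pattern accepts.
for name in ["CD4", "CD25", "CDw198", "CD45RA", "CDC42", "CDK2"]:
    print(name, bool(STRICT.match(name)), bool(BROADER.match(name)))
```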
> Also for surface marker token matching, does the PRO not annotate proteins w/ surface expression status? It would be nice to just filter on some field rather than having to define our own rule.

I think the best way they would capture that is with GO annotations, and for example CD14 (chicken) has the cell surface GO annotation, yet CD4 (human) (and its parent CD4 molecule) and a few others I checked for humans do not, despite having the string "surface" in several of their synonyms. I also don't see any other GO annotations common to all of them, which suggests that filtering by annotation wouldn't work very well.
> I'm not sure I understand the tl;dr of the Naming Modality Frequencies section. Given the motivating example of using surface markers + T cell vs. surface markers + Treg cell, does your assertion mean that 66% of the time we find the former phrasing?

A better way to break it down, out of all of the entity mentions from the JNLPBA tagger: by 66% I mean 5.9% / (5.9% + 3.6%) ≈ 2/3, implying that the "CD4+CD25+Foxp3+ T cells" phrasing is the more common of the two once you remove the entity mentions that are either impossible to classify or very easy to classify.
> Some data on how many surface forms/entity mentions/noun phrases we are able to map back to CL terms with high confidence would be useful in this section. My mental model for what we're doing is mapping a giant list of noun phrases back to a small hierarchy of CL terms (+ synonyms), and some subset of noun phrases will not map with high confidence onto any CL terms. Our next task after this section will then be to determine if we need to either add a synonym to an existing CL term, make a new CL term, or throw out the noun phrase. Does that sound right to you?

That sounds right to me, and I think the analysis I did will at least go a long way toward knowing what to throw out. I'm not sure what's more likely to constitute a new CL term though. Do you think they're more likely to look like `CD182+ CD194- CD33+ T cells` (i.e. an unheard-of marker combination) or more like `Th99 cells` (i.e. an unheard-of acronym/shorthand name)?
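As a rough illustration of that mapping step (purely a sketch: the CL entries, normalization, and threshold here are made up for the example):

```python
from difflib import SequenceMatcher

# Toy slice of the ontology: CL id -> primary name + synonyms.
CL_TERMS = {
    "CL:0000815": ["regulatory T cell", "Treg", "CD4+CD25+ regulatory T cell"],
    "CL:0000625": ["CD8-positive, alpha-beta T cell", "CD8+ T cell"],
}

def normalize(s):
    return " ".join(s.lower().replace("-", " ").split())

def link(phrase, threshold=0.85):
    """Return (CL id, score) for the best-matching synonym, or None if nothing
    clears the confidence threshold (i.e. a candidate new term/synonym)."""
    best = None
    for term_id, names in CL_TERMS.items():
        for name in names:
            score = SequenceMatcher(None, normalize(phrase), normalize(name)).ratio()
            if best is None or score > best[1]:
                best = (term_id, score)
    return best if best and best[1] >= threshold else None

print(link("Tregs"))       # close to "Treg" -> likely a hit
print(link("Th99 cells"))  # no good match -> None
```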
Does "Candidate Generation" mean the identification of sentences that could contain relations?
Yeah, more or less. I thought it would be worth explaining how tags and sentences are combined to create candidates. The process is pretty simple: every pairwise combination of tags (of the appropriate types) within a single sentence results in a new candidate. It took me a bit to confirm as much in the Snorkel code, though.
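In other words, something like the following (a simplified stand-in for what Snorkel does internally, not its actual API):

```python
from itertools import product

# One tagged sentence: spans grouped by entity type (placeholder example).
tags = {
    "cell_type": ["Th17 cells", "naive CD4+ T cells"],
    "cytokine":  ["IL-6", "TGF-beta"],
}

# Every (cell_type, cytokine) pair in the sentence becomes one relation candidate.
candidates = list(product(tags["cell_type"], tags["cytokine"]))
print(candidates)  # 2 cell types x 2 cytokines -> 4 candidates for this sentence
```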
> It's almost like a labeling function development environment (LFDE) is needed.

I agree! Particularly when it comes to building a lot of regex-based patterns from synonym lists, there seems to be a lot of room for tooling like that.
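For example, a lot of the LFs end up looking like this kind of synonym-driven regex (a sketch; the synonym list, label values, and candidate structure are placeholders rather than the actual LFs in this repo):

```python
import re

POSITIVE, ABSTAIN = 1, 0  # Snorkel-style label conventions (placeholder values)

INDUCE_SYNONYMS = ["induce", "induces", "induced", "drive", "drives", "promote", "promotes"]
INDUCE_RE = re.compile(r"\b(" + "|".join(INDUCE_SYNONYMS) + r")\b", re.IGNORECASE)

def lf_inducing_verb_between(candidate):
    """Label a (cytokine, cell type) candidate positive if an 'inducing'
    verb appears in the text between the two entity spans."""
    between = candidate["text"][candidate["span1_end"]:candidate["span2_start"]]
    return POSITIVE if INDUCE_RE.search(between) else ABSTAIN

# Hypothetical candidate structure, just to exercise the LF.
cand = {"text": "IL-6 and TGF-beta induce differentiation of Th17 cells",
        "span1_end": 17, "span2_start": 44}
print(lf_inducing_verb_between(cand))  # -> 1
```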
> What's going on in the Dev set supervised modeling section? How does it relate to the Discriminative model section?

I'm working on building classifiers as labeling functions using only the manually annotated dev data. I think they'll make for decent labeling functions and, more importantly, a good baseline, since the "discriminative model" (the one trained on the probabilistic labels from their generative model) should be able to perform better when evaluated on the same validation and test datasets I created, which are also hand-labeled.
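Roughly, the baseline looks like this (a scikit-learn sketch; the feature choice and the tiny placeholder datasets are just for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Hand-annotated dev candidates (placeholders).
dev_texts  = ["IL-6 and TGF-beta induce differentiation of Th17 cells",
              "CD4 expression was measured in all samples"]
dev_labels = [1, 0]

# A simple supervised model trained only on the dev set; it can double as a
# labeling function and as a reference point for the discriminative model
# trained on the probabilistic (weak) labels.
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
baseline.fit(dev_texts, dev_labels)

# Hand-labeled validation split (placeholder reuse of the dev data).
val_texts, val_labels = dev_texts, dev_labels
print(f1_score(val_labels, baseline.predict(val_texts)))
```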
> Do you think they're more likely to look like `CD182+ CD194- CD33+ T cells` (i.e. an unheard-of marker combination) or more like `Th99 cells` (i.e. an unheard-of acronym/shorthand name)?

Both! I guess it depends on the frequency of the entity mention. Figuring out where to position it within the CL ontology will be fun, as will determining whether any of these new types are synonyms of one another. If we wanted to get fancy we could also weight by credibility using things like the quality of the journal, the citation count of the paper, and the track record of the authors reporting the new type.
I suspect we'll need to do some manual inspection of the entity mentions that we have left over following the entity linking step, similar to the manual inspection you did following NormCo-style embedding.
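One cheap way to triage those leftovers during manual inspection might be a heuristic along these lines (illustration only; the patterns are assumptions based on the two examples above):

```python
import re

MARKER_COMBO = re.compile(r"^(CD\w+[+-]\s*)+\w+.*cells?$", re.IGNORECASE)
SHORTHAND    = re.compile(r"^(Th|Tc|Tr|Tfh)\w*\s+cells?$", re.IGNORECASE)

def triage(mention):
    """Rough bucket for an unmapped mention, to speed up manual review."""
    if MARKER_COMBO.match(mention):
        return "marker combination"
    if SHORTHAND.match(mention):
        return "acronym/shorthand"
    return "needs manual review"

print(triage("CD182+ CD194- CD33+ T cells"))  # marker combination
print(triage("Th99 cells"))                   # acronym/shorthand
```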
> for example CD14 (chicken) has the cell surface GO annotation, yet CD4 (human) (and its parent CD4 molecule) and a few others I checked for humans do not, despite having the string "surface" in several of their synonyms.

Well that's disappointing. It's also disappointing that the GO Annotations team only accepts suggested improvements via email. I wish they had a discussion forum! The EMBL-EBI needs a single Discourse instance for all of their open data projects...
Hey @hammer, I updated the summary and tied it to a release that I think is a solid improvement on my first loop through the whole process where I had no real evaluation criteria other than correlation with iX.
I added "Strong Supervision" and "Weak Supervision" sections to the summary that include some performance numbers on RE for each task now that I have separate validation and test sets.
My takeaways from this iteration are:
My next steps before trying to run another iteration of this loop will be:
An alternative I see to this is that, if that performance is at least close to good enough (F1 between 60 and 70), I can let ontology matching remain an orthogonal task and collect just enough new evaluation data to be comfortable picking and applying a final model. Results from applying that final model could then easily be merged with any progress made on the ontology matching front, without needing to re-run the very expensive modeling loop. I think I could manage doing that and writing it up by mid-August, but the full list above would likely take longer.
What do you think?
Comments on commit https://github.com/hammerlab/t-cell-relation-extraction/commit/0f6f5d8669f66b7b0e6176a9cbf72b6e324385a2