hammerlab / t-cell-relation-extraction

Literature mining for T cell relations

Fine-grained T cell typing #2

Status: Open · opened by eric-czech 5 years ago

eric-czech commented 5 years ago

@hammer, I put together a notebook to start exploring how well embeddings might work to infer dimensions of T cell typing, beyond protein expression and general phenotypic qualifiers (exhausted, activated, antigen-specific, etc.).

To get a basic sense of that variety, I followed the NormCo approach: summing the token embedding vectors of each noun phrase, using word2vec vectors trained on PMC/PubMed. The embedding projection here shows some interesting clustering:

PMC/PubMed T Cell Embedding Projection
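
For reference, roughly what that summation looks like (a minimal sketch using gensim and the bio.nlplab.org PubMed/PMC vectors; the file path, phrases, and lowercasing are just placeholders):

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path to the PubMed/PMC word2vec binary from http://bio.nlplab.org
wv = KeyedVectors.load_word2vec_format("PubMed-and-PMC-w2v.bin", binary=True)

def phrase_vector(phrase):
    """NormCo-style phrase embedding: sum the vectors of the in-vocabulary tokens."""
    tokens = [t for t in phrase.lower().split() if t in wv]
    return np.sum([wv[t] for t in tokens], axis=0) if tokens else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = phrase_vector("exhausted CD8 T cells")
v2 = phrase_vector("activated CD4 T cells")
if v1 is not None and v2 is not None:
    print(cosine(v1, v2))
```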

Zooming in on the part I mapped out a bit with the annotations shows fairly broad categorizations like:

[Screenshot: annotated region of the embedding projection showing broad categorizations]

My take after hovering over a bunch of those groups is that these seem to be common dimensions for the descriptions:

Do you think any of those make for useful characterizations we should keep in mind before trying to map the types to Cell Ontology or something like it?

hammer commented 5 years ago

These are really interesting categories! To better understand what's going on, is there a notebook that constructs tags.csv? The output of the JNLPA NER tagger looks funny, w/ the most common Th1 and Th2 cell types being those containing a "/". I think it may just be because you didn't sort in descending order.

Also you got your word vectors from http://bio.nlplab.org, yes?

Overall I think the hard part for us won't be mapping to Cell Ontology but rather determining which of these modifiers deserves to be included in an extension of Cell Ontology that we construct ourselves. I need to do some reading on how Cell Ontology wants to be extended. For mapping to Cell Ontology it seems like it may be useful to distinguish between modifiers that don't alter the underlying cell type (e.g. tissue, protocol) and those that do (e.g. expression markers).

I also need to think more about the simplistic vector addition strategy used by NormCo. The geometry of embedding space is pretty tricky. It would be fun to try some dumb vector math though like "human CD8+ T cell" - "human" + "mouse" or something.
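
Something like this, assuming the same bio.nlplab.org vectors (the file path and token casing are guesses, so treat it as a sketch):

```python
from gensim.models import KeyedVectors

# Placeholder path to the PubMed/PMC word2vec binary
wv = KeyedVectors.load_word2vec_format("PubMed-and-PMC-w2v.bin", binary=True)

# "human CD8+ T cell" - "human" + "mouse", approximating the phrase by its summed tokens.
# Adjust token casing to whatever the vocabulary actually contains.
print(wv.most_similar(positive=["CD8", "T", "cell", "mouse"], negative=["human"], topn=10))
```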

Another thought: disease normalization is maybe not the best analogy for what we're trying to do. We have a pretty small existing ontology and a huge, indeterminate set of strings that we want to distill down to terms and then create a hierarchy among those terms. In disease normalization the existing ontology is enormous and you don't want to extend it; you just want to map to the term that's closest to the string you have in hand.

One last thought: do any ontology specification languages accommodate traits rather than the "is-a" hierarchical relation? Will look into it. Seems like a better fit for this particular domain.

hammer commented 5 years ago

Perhaps a more useful analogy for our task: HiExpan: Task-Guided Taxonomy Construction by Hierarchical Tree Expansion (2018). I haven't read it yet but it sounds promising from the abstract and Jiawei Han is a leading thinker in this space.

hammer commented 5 years ago

Another paper with some interesting related work: User-Centric Ontology Population (2018).

hammer commented 5 years ago

Last thought for the night: this giant data frame of strings looks like a great use case for Vaex. Have you ever tried it out? It uses Apache Arrow under the hood, which I'm excited about.
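
Something like this is probably all it takes to poke at that data frame out-of-core (file and column names are guesses):

```python
import vaex

# Placeholder file/column names: assumes tags.csv holds one mention string per row in "text"
df = vaex.from_csv("tags.csv", convert=True)  # converts to a memory-mappable format for out-of-core work
cd8 = df[df.text.str.contains("CD8")]
print(len(cd8), "mentions contain CD8")
```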

eric-czech commented 5 years ago

Touching on a few of those:

hammer commented 5 years ago

> I won't focus on trying to infer missing types

Sorry, I didn't mean to assert that we should not try to map mentions to CL types. We should definitely do that, and build synonym lists for the various CL types that already exist. At some point, however, we need to determine when we don't believe a mapping is possible. At that point, we can extend CL to include unmapped mentions that we believe correspond to a fine-grained T cell type.

I read through the HiExpan paper last night and their approach looks reasonable and they have code available at https://github.com/mickeystroller/HiExpan. One note on the code: they make use of a bunch of their own projects, which is always a little suspicious. I suspect we could implement their strategy with more widely adopted tools if we really like their approach.

Anyways, they take as input a collection of documents and a "seed taxonomy" (CL in our case), then they do:

  1. Key phrase extraction using AutoPhrase. I think we can just use your white list + JNLPA NER mention detector.
  2. "Width expansion" with SetExpan. The idea is that they "horizontally" expand each level of the hierarchy with terms that belong in the same set as the other terms on that level. Explictly inspired by Google Sets, which you may have tried out back in the day!
  3. "Depth expansion" for terms w/o children. Their approach here is pretty clever: they look at siblings of the term and the children of those siblings and calculate the vector in embedding space that points from sibling to sibling's child. They can then add this vector to the term and look for the 3 mentions most similar to the computed vector in embedding space. They use REPEL to get the embeddings.
  4. Conflict resolution/hierarchy adjustment: if a term appears at two places in the taxonomy, they compute a confidence score and only keep the most confident location.
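
My rough reading of the depth-expansion step in code (plain numpy over generic embeddings rather than REPEL; all names here are hypothetical):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def depth_expand(term, siblings_with_children, embeddings, candidate_mentions, topn=3):
    """Shift a childless term by the average sibling -> sibling-child offset in embedding
    space, then return the candidate mentions nearest to the shifted vector."""
    offsets = [embeddings[child] - embeddings[sib]
               for sib, children in siblings_with_children.items()
               for child in children]
    target = embeddings[term] + np.mean(offsets, axis=0)
    scored = [(m, cosine(embeddings[m], target)) for m in candidate_mentions if m != term]
    return sorted(scored, key=lambda x: -x[1])[:topn]
```
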
hammer commented 5 years ago

One note on terminology: I prefer to use the word "term" when talking about a single node in an ontology. In the NormCo paper they say "concept", which I don't like for reasons best explained in the BFO book. Also in this literature the term "entity type" is used, e.g. https://github.com/shimaokasonse/NFGEC.

So, in the context of this discussion, "term" == "concept" == "entity type", and I prefer "term".

hammer commented 5 years ago

One last comment for the night: the distantly supervised training data extracted from BioASQ fits the format of the bag-of-sentences w/ bag-level labels model from the AI2 paper https://github.com/allenai/comb_dist_direct_relex. I recall us finding that formulation somewhat strange in the AI2 paper so it's amusing to see it in the NormCo paper in a slightly different context.
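
Roughly, one training instance in that formulation looks like this (a made-up example just to pin down the format):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Bag:
    """A bag-of-sentences instance with a single bag-level label: every sentence that
    pairs the mention with the candidate term shares the one label."""
    mention: str
    candidate_term: str
    sentences: List[str]
    label: bool

example = Bag(
    mention="Th17 cells",
    candidate_term="T-helper 17 cell",
    sentences=[
        "Th17 cells are a distinct lineage of CD4+ T helper cells.",
        "IL-23 promotes the expansion of Th17 cells in vivo.",
    ],
    label=True,
)
```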

eric-czech commented 5 years ago

Alright then, I'm imagining a process like this to try out HiExpan:

  1. Start with JNLPA and white list terms
  2. Figure out how to tokenize protein expression strings (e.g. Thy1.1+OT-1+CD8+)
  3. Remove tokens in terms relating to unwanted dimensions of characterization (protocol, treatment, specificity, etc.) using either:
    • positive selection - Try to keep only tokens that are protein names or type names (Th17, mucosal-associated invariant, etc.); this seems like the easiest way for sure, but at the expense of ignoring T cell type names we don't know about
    • negative selection - Try to remove all tokens like CMV-specific, TIL, autologous, etc. that have little to do with the underlying cell type
  4. Featurization:
    • Use skip-gram features as they have implemented already
    • Convert multi-gram terms (e.g. "CD4-CD8-CD3+ (double-negative, DN) T cells") to entity tokens (e.g. ENTITY1) and retrain word2vec on all corpus sentences to get entity embedding features (see the sketch after this list)
      • The HiExpan code only references word2vec (which seems strange since they don't emphasize it much in the paper), but I suppose we could figure out how to use REPEL if that fails
    • Not sure what to use as entity type features, or how to get something like a list of entity type probabilities for each term, as they get from Probase/MS Concept Graph. Do you not see a use for that since you're saying that "term" == "entity type" in this discussion?
  5. Seed with CL and let the rest of the pipeline run as is
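
For the entity-token piece of step 4, I'm picturing something like this (gensim ≥4 argument names; the term map and corpus file are placeholders):

```python
from gensim.models import Word2Vec

# Placeholder mapping from multi-token terms to single entity tokens
term_to_token = {
    "t helper 17 cells": "ENTITY_TH17",
    "mucosal-associated invariant t cells": "ENTITY_MAIT",
}

def entityize(sentence):
    """Collapse known multi-token terms into single tokens before retraining word2vec."""
    s = sentence.lower()
    for term, token in term_to_token.items():
        s = s.replace(term, token)
    return s.split()

# Placeholder corpus file with one sentence per line
corpus = [entityize(line) for line in open("corpus_sentences.txt")]
model = Word2Vec(sentences=corpus, vector_size=200, window=5, min_count=2, workers=4)
print(model.wv.most_similar("ENTITY_TH17", topn=5))
```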

The biggest hole I can see in that plan is that there will be a ton of synonymous terms on the same level like "T-helper (Th)17" == "T helper 17" == "CD4+CD161+CD196+ T cells" since HiExpan doesn't make any attempt to resolve them. I could resolve them easily if the whole process starts only with the white list terms, but I don't see a way to make it work with the JNLPA terms without first doing the kind of term collation I was alluding to before. And by that I don't mean entity linking -- I'm thinking more like an unsupervised clustering -- is there a word for that in NLP?

If nothing else, tokenizing those protein expression strings seems pretty critical for this domain since I can see now that neither ScispaCy nor the PMC word2vec tokenizer really handles it, and it'd have to be done if we wanted to match to CL using anything beyond the cell type string names/aliases. Perhaps that would be a good place to start as a standalone project compatible with any kind of spaCy pipeline?
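
As a strawman, even a small regex gets surprisingly far on those strings (illustrative only, not a real spaCy component yet):

```python
import re

# A marker name (letters/digits with optional "." or "-") followed by a +/-/lo/hi qualifier
MARKER_RE = re.compile(r"([A-Za-z][A-Za-z0-9.\-]*?)\s*(\+|-|lo|hi|high|low)(?=[A-Z(]|$)")

def split_markers(s):
    """Split a no-whitespace marker string into (marker, level) pairs."""
    return MARKER_RE.findall(s)

print(split_markers("Thy1.1+OT-1+CD8+"))  # [('Thy1.1', '+'), ('OT-1', '+'), ('CD8', '+')]
print(split_markers("CD4-CD8-CD3+"))      # [('CD4', '-'), ('CD8', '-'), ('CD3', '+')]
```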

One last thought on tokenization: I did notice that the WordPiece tokenization used in sciBERT actually does a good job of chunking up those no-whitespace strings, where CD4+CD8- becomes ['CD', '##4', '+', 'CD', '##8', '-'] rather than one single token. Do any thoughts come to mind as to how we could exploit that?
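
For reference, that's easy to reproduce with the transformers library and the scibert_scivocab_cased checkpoint on the Hugging Face hub (exact pieces may differ slightly by vocab):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
print(tok.tokenize("CD4+CD8- T cells"))
# expecting something like ['CD', '##4', '+', 'CD', '##8', '-', 'T', 'cells']
```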

hammer commented 5 years ago

> Figure out how to tokenize protein expression strings

It looks like in the notebook you're trying str.split() and the en_core_sci_md tokenizer. It sounds like there are two more to consider, the PMC word2vec and WordPiece tokenizers. Given that the plan is to get embeddings for each token and then sum those embeddings in order to map to CL terms, it does seem like we should do a pretty thorough job of benchmarking the tokens we embed.

> Do any thoughts come to mind as to how we could exploit that?

One thought is that there are containment relationships implied by the combination of markers which could help place terms in the hierarchy. I also wonder if we can make use of gene name synonym lists to canonicalize these marker lists? One synonym I see a lot: CD137 <--> 4-1BB. We can then use mappings back to the Protein Ontology, and the hierarchical structures that have been built on it (e.g. families of cytokine receptors), as additional structure to compare against the embedding results.
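
A toy version of that canonicalization; the synonym table would come from HGNC or the Protein Ontology rather than being hand-written:

```python
# Toy synonym map; in practice this would be generated from gene-name synonym lists
MARKER_SYNONYMS = {
    "4-1BB": "CD137",
    "TNFRSF9": "CD137",
    "PDCD1": "PD-1",
}

def canonicalize_markers(markers):
    """Map each (marker, level) pair onto a canonical marker name before comparing terms."""
    return [(MARKER_SYNONYMS.get(name, name), level) for name, level in markers]

print(canonicalize_markers([("4-1BB", "+"), ("CD8", "+")]))  # [('CD137', '+'), ('CD8', '+')]
```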

> Remove tokens in terms relating to unwanted dimensions of characterization

I wonder if there is a way to do this pruning in a way that's not totally manual. We're also getting into "noun phrase internal" relation extraction territory a bit when we consider complex noun phrases like "IL17-producing CD4+ T cells". There's also a lot of structured metadata we could collect from these phrases, as you have outlined in your first comment on this thread. It may be that we are collecting attributes of an entity as well as inferring the entity type.

One thing I'm struggling with for our problem: the distinction between entities and entity types. These papers rarely if ever discuss individual entities (i.e. cells); they're almost always talking about properties of a collection of cells distinguished by their entity type.

I found the distinction between "universals" and "defined classes" in the BFO book to be useful in this context. I think an "entity type" corresponds to a "universal", while entity attributes can be used to name "defined classes". What we're trying to figure out is when a token or phrase distinguishes a novel universal versus when it's just an attribute that can be used to organize "defined classes".

> The biggest hole I can see in that plan is that there will be a ton of synonymous terms on the same level

That's a great observation. It does seem like their method needs a second form of "conflict resolution" that involves finding terms on the same level that are actually just synonyms; they can then remove all but one of the terms in that round. Does CL provide synonym lists for its terms? It seems like it should.
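
If it does, something like obonet should be able to pull them straight out of the OBO release (the term ID and field names here are assumptions to double-check):

```python
import obonet

# Standard OBO Foundry location for the Cell Ontology
graph = obonet.read_obo("http://purl.obolibrary.org/obo/cl.obo")

term_id = "CL:0000899"  # assumed to be "T-helper 17 cell"; verify against the current release
data = graph.nodes[term_id]
print(data.get("name"))
print(data.get("synonym", []))  # list of raw OBO synonym strings, e.g. '"..." EXACT []'
```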