snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Candidate spans expand to match full tokens, causing goldset to candidate mapping failure + semantic inconsistencies #932

Closed: littlewine closed this issue 5 years ago

littlewine commented 6 years ago

Hello, I am trying to do relation mining on the BioCreative 6 chemical-protein interaction task, and I have come across the following problem. I am using the pretagged candidate extractor, passing for each document a list of entities including their types and start/end offsets. However, while trying to map my goldset onto the candidates, I noticed the following behaviour: when the given offsets do not exactly match the start and end offsets of the tokens, the candidate span expands to include the whole word(s), causing the mapping of gold relations to candidates to fail and, in many cases, introducing semantic inconsistencies. This happens frequently, especially in the presence of special characters such as ), -, and ;. For instance, when using SpaCy on this dataset I was able to map only about 8300/10000 relations, and far fewer when using CoreNLP.

To make it more concrete, consider the following examples (I am only considering one of the two entities):

Increased hepatic glycogen reduced the percent of glucose taken up by the liver that was deposited in glycogen (74 ± 3 vs. 53 ± 5% in Gly+INS and SCGly+INS, respectively, and 72 ± 3 vs. 50 ± 6% in Gly+PoG and SCGly+PoG, respectively).

Here, INS (which appears twice) is an entity whose offsets are given in the goldset, but because of SpaCy's tokenization the final candidate spans mistakenly become Gly+INS and SCGly+INS.

In contrast, menthol- and icilin-activated TRPM8 currents were suppressed by low pH.

Here, menthol and icilin are the entities. In the second case, SpaCy breaks icilin and activated into different tokens, but in the first case (probably because no word follows the hyphen) the whole token is menthol-, shifting the span by one character and ruining the mapping of the goldset to the candidate.

The accumulation of platinum by both cultured rat DRG neurons and HEK/rCtr1 cells, during oxaliplatin exposure...

HEK and rCtr1 are separate candidates, but because they are treated as a single token here, the same candidate would be extracted twice and the relationship X INHIBITS HEK/rCtr1 would be extracted instead.
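To make the failure reproducible outside of Snorkel, here is a minimal sketch of the alignment problem on the second example, assuming a stock en_core_web_sm model (exact token boundaries vary with the SpaCy version and model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("menthol- and icilin-activated TRPM8 currents were suppressed by low pH.")
print([t.text for t in doc])
# e.g. ['menthol-', 'and', 'icilin', '-', 'activated', ...]: the medial
# hyphen is split by the default infix rules, the trailing one is not.

# The gold offsets for "menthol" (characters 0-7) therefore cut through
# the token 'menthol-', and SpaCy itself refuses to align them:
print(doc.char_span(0, 7))  # None when the offsets fall inside a token
```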

Given that tokenization is usually very tricky (especially in biomedical text), I am not sure whether Snorkel's behaviour of expanding candidate spans to match token offsets is optimal, but I am also not sure there is an easy fix. A possible solution would be to override SpaCy's tokenizer and force splits on all special characters, but I am afraid this would cause problems further down the pipeline (for instance, when using word embeddings for the LSTM).
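For reference, a rough sketch of what that override could look like, using SpaCy's infix customization (again with en_core_web_sm, and an illustrative extra pattern that forces splits on +, /, and -):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# Add +, /, and - as unconditional infix split points on top of the
# default English rules.
infixes = list(nlp.Defaults.infixes) + [r"[+/\-]"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("SCGly+INS and HEK/rCtr1 cells")])
# e.g. ['SCGly', '+', 'INS', 'and', 'HEK', '/', 'rCtr1', 'cells']
```

This makes the gold offsets land on token boundaries for the examples above, but every downstream consumer of the tokens would then see the more aggressive segmentation.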

stephenbach commented 6 years ago

Hi Antonios, thanks for raising this issue! I've also grappled with mapping offset-based annotations to tokenizer output. I agree that the behavior should be stricter than expanding text spans. In non-Snorkel information extraction applications I've worked on, I've treated annotations that do not map cleanly to the pre-processed input as automatic false negatives. I believe this is the right thing to do in terms of measuring predictive performance.

Do you think it would be an improvement to raise an exception or produce some error for spans that don't map, which could be caught and accounted for?
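Something along these lines is what I have in mind, sketched with a hypothetical helper on top of SpaCy (char_span returns None when character offsets do not land on token boundaries):

```python
import warnings

def check_gold_alignment(doc, gold_spans):
    """Hypothetical helper: partition gold (start, end) character
    offsets into aligned spans and automatic false negatives."""
    aligned, false_negatives = [], []
    for start, end in gold_spans:
        span = doc.char_span(start, end)  # None if offsets cut a token
        if span is None:
            warnings.warn(
                f"gold span ({start}, {end}) = {doc.text[start:end]!r} "
                "does not align with token boundaries; treating it as "
                "a false negative"
            )
            false_negatives.append((start, end))
        else:
            aligned.append(span)
    return aligned, false_negatives
```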

It would also be great if you could share some of the code you're using to actually do the mapping. It would help me think about how to improve this process. Thanks!

littlewine commented 6 years ago

Hey Stephen, thanks for getting back to me. The solution of tagging unmapped input as false negatives sounds more "correct", although maybe too strict, at least in my use case. I think it eventually comes down to what you are trying to do with Snorkel. If you are trying to build (and benchmark) a relationship extraction/classification system, you probably want to treat the named entity recognition part as a separate component and isolate its effects (to the extent possible) from the rest of your system. For instance, I am using the BioCreative 6 chemical-protein interaction goldset, which includes both the entities and the relationships themselves for evaluation, but in order to take advantage of Snorkel's denoising process on the unlabelled set, I am using the preprocessed dump of NER tags given by the PubMed API. Another reason I believe this makes sense is that NER in the biomedical domain can be quite tricky and could basically ruin any results you might have, no matter how good your classifier is (the recall of the protein NER in the GNormPlus tool that the PubMed API uses is around 42% on the BioCreative goldset). Of course, if you are building an end-to-end system for extracting entities and their relationships, this solution makes absolute sense.

Raising a warning would definitely give some more insight and allow people to debug these cases more easily. For the use case of isolating the NER component from the rest of the system, though, I think it would be really useful if you could somehow indicate the offsets of the named entities to SpaCy or CoreNLP, so that the rest of the text could be tokenized "correctly". I recall that there is an option to pass a dictionary/lexicon of tokens to the parser (I think the SpaCy tokenizer has such an option), so you could theoretically preprocess the text, extract the entities, and pass them as a vocabulary. However, I did not try that, as I believe it would likely cause other problems (consider passing O as a chemical entity corresponding to oxygen: the tokenizer would then split every token containing an O).
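For concreteness, SpaCy's add_special_case is the built-in hook I have in mind. A sketch of the idea (I have not verified that it avoids the problems mentioned above):

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")

# Force the known entity chunk to tokenize as entity / separator /
# entity; the ORTH pieces must concatenate back to the original text.
nlp.tokenizer.add_special_case(
    "HEK/rCtr1", [{ORTH: "HEK"}, {ORTH: "/"}, {ORTH: "rCtr1"}]
)

print([t.text for t in nlp("HEK/rCtr1 cells were exposed.")])
# e.g. ['HEK', '/', 'rCtr1', 'cells', 'were', 'exposed', '.']
```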

Regarding the code I am using: I have adapted (with some minor changes) the load_external_labels function from the intro tutorial, after preprocessing my gold labels into the appropriate TSV format. Another important detail is that although there was an option to mark all candidates absent from the goldset as negative, this would create false negative labels in many cases. To work around this, I preprocessed the TSV, adding every None relationship (entity pair with no relation) found within a document as an explicit negative. This eventually resulted in losing about ~15% of my positive candidates when using SpaCy as the tokenizer (I cannot find an appropriate way to calculate how many of the negative ones I am losing, but that is not a major concern at the moment given the class imbalance between negative and positive candidates).
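Concretely, the negative-generation step looks roughly like this (hypothetical names: doc_entities and gold_positives stand in for my actual preprocessing, and the labels follow the ±1 convention from the tutorial):

```python
from itertools import product

def add_document_negatives(doc_entities, gold_positives):
    """Hypothetical sketch: label every chemical-protein pair
    co-occurring in a document, negative unless the goldset marks
    it positive."""
    rows = []
    for chem, prot in product(doc_entities["chemical"], doc_entities["protein"]):
        label = 1 if (chem, prot) in gold_positives else -1
        rows.append((chem, prot, label))
    return rows
```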

If you have any more questions or would like to talk more about those issues, please don't hesitate to get in touch with me. Many thanks for your help!