syuoni / eznlp

Easy Natural Language Processing
Apache License 2.0

Potential label leakage for over-long document #27

Closed yhcc closed 2 years ago

yhcc commented 2 years ago

Interesting framework for NLP. However, the processing of over-long documents may leak label information in https://github.com/syuoni/eznlp/blob/5d7b7c2cd7131842ef99b5d0b5dacf53e530900c/eznlp/model/bert_like.py#L415: if the document is too long, this line forces the truncation to happen right before an entity (for the NER task). Although this will not affect many samples, I believe we should not use any information from the labels during preprocessing.
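To make the concern concrete, here is a minimal sketch of the leakage pattern (the function names are hypothetical, not the actual eznlp code): if the cut point is moved so that it never falls inside a gold entity span, the preprocessing depends on the labels, whereas a label-agnostic cut does not.

```python
# Hypothetical illustration of label-dependent vs. label-agnostic truncation.
# Spans are (start, end) token offsets of gold entities.

def cut_point_leaky(max_len, entity_spans):
    """Pick a truncation point <= max_len, but move it back so it does
    not split any gold entity -- this uses label information."""
    cut = max_len
    for start, end in entity_spans:
        if start < cut < end:   # the cut would split this entity
            cut = start         # move the cut to just before the entity
    return cut

def cut_point_safe(max_len, entity_spans):
    """Label-agnostic alternative: always cut at max_len, regardless of
    where the gold entities fall."""
    return max_len
```

With an entity spanning tokens 500–520 and `max_len=512`, the leaky variant cuts at 500 while the safe variant cuts at 512, so the boundary itself reveals where an entity begins.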

syuoni commented 2 years ago

Thanks for pointing this out.

I've just checked this issue. It turns out that the number of affected entities (in the test set) is 0 for CoNLL 2003 and 25 for OntoNotes 5; ACE04 and ACE05 do not rely on this code.

Hence, in the worst case (assuming none of the affected entities is recalled), this causes at most a 0.2% decrease in recall, and an even smaller decrease in the F1 score, on OntoNotes 5. It therefore does not affect the conclusions in the paper.

Best, Enwei

syuoni commented 2 years ago

I will fix this code and report the new experimental results on OntoNotes 5.

Thanks again.

yhcc commented 2 years ago

Never mind. It is not a big issue. Thank you for your reply.

syuoni commented 2 years ago

In case anyone may be interested, here are the "new" experimental results:

|             | CoNLL 2003 | OntoNotes 5 |
|-------------|-----------:|------------:|
| Reported F1 | 93.65      | 91.74       |
| New F1      | 93.61      | 91.69       |

In the "new" processing code, we use the original sentences as the reference points when concatenating sentences into document-level inputs. More specifically, we concatenate consecutive sentences such that each resulting sequence does not exceed the maximum length allowed by the PLM, i.e., 512 tokens.
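The concatenation rule above can be sketched as follows (a minimal, hypothetical version, not the actual eznlp implementation): sentences are greedily packed, in order, into chunks whose total length stays within the PLM limit, using only sentence lengths and never entity positions.

```python
# Sketch of label-agnostic document chunking. Each chunk is a (start, end)
# range of sentence indices; chunk boundaries depend only on lengths.

def build_chunks(sent_lens, max_len=512):
    """Greedily group consecutive sentences into chunks whose total
    token length does not exceed max_len (a single over-long sentence
    still forms its own chunk)."""
    chunks = []
    start, total = 0, 0
    for i, n in enumerate(sent_lens):
        if total + n > max_len and total > 0:
            chunks.append((start, i))  # close the current chunk
            start, total = i, 0        # open a new one at sentence i
        total += n
    chunks.append((start, len(sent_lens)))
    return chunks
```

For example, three sentences of 200 tokens each with `max_len=512` yield two chunks: sentences 0–1 (400 tokens) and sentence 2.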

The code will be updated in the next version.