huggingface / neuralcoref

✨Fast Coreference Resolution in spaCy with Neural Networks
https://huggingface.co/coref/
MIT License
2.83k stars, 474 forks

Wrong Mention Type one-hot vectors during training due to a small bug in dataset.py #340

Open valedica opened 2 years ago

valedica commented 2 years ago

I think there is a small bug in dataset.py that affects how the Mention Type one-hot vectors of antecedent mentions are built in the pair features during training. Because the first dimension is sliced with a colon, the assignment sets entire columns: every column whose index appears in the 1-D array ant_features_raw[:, 0] (which contains the mention type of each antecedent mention, expressed as an integer) is set to 1 in all rows. I think the expected behaviour was to set a single bit to 1 for each row/antecedent mention, indexed by that row's entry in the 1-D array, as is done for the main mention.

https://github.com/huggingface/neuralcoref/blob/60338df6f9b0a44a6728b442193b7c66653b0731/neuralcoref/train/dataset.py#L230-L231

This causes a mismatch between the training features and the inference features: in neuralcoref.pyx, the mention type is correctly encoded as a one-hot vector for each mention and then copied into the pair features for the antecedent mentions.

This is a simple example with numpy comparing actual vs expected results:

[Screenshot 2022-03-11 at 13 09 25: numpy example comparing actual vs expected results]
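Since the screenshot does not survive as text, here is a minimal numpy sketch of the same comparison. The toy values and the names `ant_types`, `n_types`, `actual` and `expected` are made up for illustration and are not taken from dataset.py; only the indexing pattern (a colon in the first dimension vs one index per row) reflects the behaviour described above.

```python
import numpy as np

# Hypothetical toy data: the integer mention type of three antecedent
# mentions, standing in for ant_features_raw[:, 0] (4 possible types assumed).
ant_types = np.array([0, 2, 2])
n_types = 4

# Actual behaviour: the colon selects every row, so each column listed in
# ant_types is set to 1 for all antecedents.
actual = np.zeros((len(ant_types), n_types), dtype=int)
actual[:, ant_types] = 1

# Expected behaviour: pair each row index with its own mention type, so
# exactly one bit is set per antecedent.
expected = np.zeros((len(ant_types), n_types), dtype=int)
expected[np.arange(len(ant_types)), ant_types] = 1

print(actual)
# [[1 0 1 0]
#  [1 0 1 0]
#  [1 0 1 0]]
print(expected)
# [[1 0 0 0]
#  [0 0 1 0]
#  [0 0 1 0]]
```

With the colon, every row ends up with the union of all mention types seen among the antecedents, so the rows are no longer one-hot. A fix along these lines would presumably replace the colon with a per-row index such as `np.arange(...)`, but I have not tested that against the training pipeline.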