
Hi,

I have run neuralcoref on the CoNLL-2012 English test set, but I am not sure whether the results are correct. The evaluation results are well below those that Clark and Manning (2016, Deep Reinforcement Learning for Mention-Ranking Coreference Models) report for their approach, and I wonder where my error is. To test, I concatenate all input tokens with tabs and reconfigure spaCy's tokenizer to split at tabs. Once coreferences are predicted, I output the result as a CoNLL-2012 file, align the sentences to the original input (to ensure that the tokenisation of the system output and the gold data is identical), and run the scorer.pl evaluation script (v8.01). Using the MUC metric, I get an F1 of 50% for coreference and 59 for mention identification. Does anybody have an idea where the problem could be?
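For reference, here is a minimal sketch of how the coreference column of a CoNLL-2012 file can be built from predicted clusters, since a bracketing mistake in this step would lower the scores. It is an illustration only, not the code I actually ran: the function name `coref_column` and the span representation (token indices within one sentence, end inclusive) are assumptions made for the example.

```python
from collections import defaultdict


def coref_column(n_tokens, clusters):
    """Build the coreference column for one sentence.

    clusters: list of clusters; each cluster is a list of (start, end)
    token spans with an inclusive end index (hypothetical representation).
    Returns one string per token: "-" when no mention boundary falls on the
    token, otherwise "|"-joined markers such as "(3", "3)" or "(3)".
    """
    starts = defaultdict(list)   # token index -> "(id" markers
    singles = defaultdict(list)  # token index -> "(id)" markers
    ends = defaultdict(list)     # token index -> "id)" markers

    for cluster_id, mentions in enumerate(clusters):
        for start, end in mentions:
            if start == end:
                # A single-token mention opens and closes on the same token.
                singles[start].append(f"({cluster_id})")
            else:
                starts[start].append(f"({cluster_id}")
                ends[end].append(f"{cluster_id})")

    column = []
    for i in range(n_tokens):
        parts = starts[i] + singles[i] + ends[i]
        column.append("|".join(parts) if parts else "-")
    return column


tokens = ["John", "said", "he", "saw", "his", "old", "dog"]
# Cluster 0: John / he / his; cluster 1: "his old dog" (tokens 4 to 6).
clusters = [[(0, 0), (2, 2), (4, 4)], [(4, 6)]]
column = coref_column(len(tokens), clusters)
for token, cell in zip(tokens, column):
    print(f"{token}\t{cell}")
```

Every opening marker needs a matching closing marker with the same cluster id inside the same sentence, so spans that are off by one token or that cross a sentence boundary after re-alignment would count as wrong mentions.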

huggingface / neuralcoref

Evaluation #270