huggingface / neuralcoref

✨Fast Coreference Resolution in spaCy with Neural Networks
https://huggingface.co/coref/
MIT License
2.84k stars 474 forks

How to fix the output? [Found too many repeated mentions (> 10) in the response] #286

Open Mak-Ta-Reque opened 3 years ago

Mak-Ta-Reque commented 3 years ago

🌋 Computing score

Error during the scoring

Command '['perl', '/Users/mak/PycharmProjects/tradr_language_tool/neuralcoref/neuralcoref/train/scorer_wrapper.pl', 'muc', '/Users/mak/PycharmProjects/tradr_language_tool/neuralcoref/neuralcoref/train/data/key.txt', '/Users/mak/PycharmProjects/tradr_language_tool/neuralcoref/neuralcoref/train/test_mentions.txt']' returned non-zero exit status 1.

Found too many repeated mentions (> 10) in the response, so refusing to score. Please fix the output.

version: 8.01 /Users/mak/PycharmProjects/tradr_language_tool/neuralcoref/neuralcoref/train/scorer/lib/CorScorer.pm

Repeated mention in the response: 116, 121 1818

Repeated mention in the response: 1065, 1066 136136

Repeated mention in the response: 825, 825 152152

Repeated mention in the response: 92, 94 3333

Repeated mention in the response: 169, 169 4747

Repeated mention in the response: 26, 26 4242

Repeated mention in the response: 26, 26 4242

Repeated mention in the response: 26, 26 4242

Repeated mention in the response: 66, 68 1717

Repeated mention in the response: 254, 254 8888

Repeated mention in the response: 268, 268 9090
Traceback (most recent call last):

  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main

    "__main__", mod_spec)

  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code

    exec(code, run_globals)

  File "/Users/mak/PycharmProjects/tradr_language_tool/neuralcoref/neuralcoref/train/learn.py", line 565, in <module>

    run_model(args)

  File "/Users/mak/PycharmProjects/tradr_language_tool/neuralcoref/neuralcoref/train/learn.py", line 175, in run_model

    eval_evaluator.test_model()

  File "/Users/mak/PycharmProjects/tradr_language_tool/neuralcoref/neuralcoref/train/evaluator.py", line 180, in test_model

    self.get_score(file_path=ALL_MENTIONS_PATH)

  File "/Users/mak/PycharmProjects/tradr_language_tool/neuralcoref/neuralcoref/train/evaluator.py", line 292, in get_score

    encoding="utf-8",

  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/subprocess.py", line 395, in check_output

    **kwargs).stdout

  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/subprocess.py", line 487, in run

    output=stdout, stderr=stderr)

subprocess.CalledProcessError: Command '['perl', '/Users/mak/PycharmProjects/tradr_language_tool/neuralcoref/neuralcoref/train/scorer_wrapper.pl', 'muc', '/Users/mak/PycharmProjects/tradr_language_tool/neuralcoref/neuralcoref/train/data/key.txt', '/Users/mak/PycharmProjects/tradr_language_tool/neuralcoref/neuralcoref/train/test_mentions.txt']' returned non-zero exit status 1.
LuxuriantHuang commented 3 years ago

I have a similar issue. Have you found a way to solve the problem?

csgomezg0 commented 2 years ago

Did you find a solution?

Mak-Ta-Reque commented 2 years ago

No

On Mon, Aug 2, 2021, 9:27 PM csgomezg0 @.***> wrote:

Did you find some solution?


csgomezg0 commented 2 years ago

Maybe this can help: in the train folder there is a scorer folder containing a lib folder. In the file CorScorer.pm, at line 384, change the number 10 to a bigger number, maybe 1000000 or similar. This solution may not be correct, but it works for the training. If someone knows another solution, please correct me. Thanks.

Pantalaymon commented 2 years ago

Which language are you trying to train your model on?

I had this issue while trying to make a model for French, and I realised that it came from bad tokenization. The tokenization produced by spaCy didn't match the existing tokenization of the dev corpus.
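One way to check for this kind of mismatch is to diff the corpus tokens against the tokens your pipeline produces for the same sentence. The following is only a minimal sketch using Python's standard difflib; the function name `tokenization_mismatches` and the example tokens are hypothetical, and the token lists are written by hand rather than taken from spaCy or from the CoNLL files.

```python
from difflib import SequenceMatcher


def tokenization_mismatches(gold_tokens, predicted_tokens):
    """Return (gold_chunk, predicted_chunk) pairs where the two token lists differ."""
    matcher = SequenceMatcher(a=gold_tokens, b=predicted_tokens, autojunk=False)
    return [
        (gold_tokens[i1:i2], predicted_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the regions that are not identical
    ]


# Hypothetical example: the corpus keeps "l'homme" as one token,
# while the other tokenizer splits it in two.
gold = ["l'homme", "dort", "."]
predicted = ["l'", "homme", "dort", "."]
mismatches = tokenization_mismatches(gold, predicted)
print(mismatches)
```

An empty result means the two tokenizations agree for that sentence; any returned pair points at a place where mention boundaries may not line up.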

As a result, many single tokens were treated as multiple tokens, and the model ran several predictions on those single tokens. Those tokens then ended up grouped into several identical mention spans (hence the "repeated mention" messages).
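The effect described above can be illustrated with a small sketch. It is not neuralcoref's code: the helpers `char_offsets` and `project_span` are hypothetical, and the tokens are written by hand. It assumes mentions are predicted on a finer tokenization and then mapped back to the corpus tokens by character offsets, which is one plausible way distinct spans can collapse into the same span.

```python
from collections import Counter


def char_offsets(tokens, text):
    """Return the (start, end) character offsets of each token in text."""
    offsets, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        end = start + len(tok)
        offsets.append((start, end))
        pos = end
    return offsets


def project_span(span, src_offsets, gold_offsets):
    """Map a (first, last) token span onto the gold token indices it overlaps."""
    start_char = src_offsets[span[0]][0]
    end_char = src_offsets[span[1]][1]
    covered = [
        i for i, (s, e) in enumerate(gold_offsets)
        if s < end_char and e > start_char
    ]
    return covered[0], covered[-1]


text = "l'homme dort"
gold_tokens = ["l'homme", "dort"]       # tokenization of the corpus
fine_tokens = ["l'", "homme", "dort"]   # finer, mismatching tokenization

gold_offsets = char_offsets(gold_tokens, text)
fine_offsets = char_offsets(fine_tokens, text)

# Three distinct mentions in the finer tokenization: "l'", "homme", "l'homme".
predicted_mentions = [(0, 0), (1, 1), (0, 1)]
projected = [project_span(m, fine_offsets, gold_offsets) for m in predicted_mentions]

# All three collapse onto the single gold token, i.e. a repeated mention.
repeated = {span: n for span, n in Counter(projected).items() if n > 1}
print(repeated)
```

Under these assumptions, every extra split inside a gold token can add another copy of the same span to the response file, and the scorer refuses to score once it sees more than 10 such repeats.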

csgomezg0 commented 2 years ago

Hi @Pantalaymon, I tried to train a neuralcoref model for Spanish, but it didn't work for me. Maybe I made a lot of errors, I don't know. So I am now trying another library, coreferee.

Pantalaymon commented 2 years ago

> Hi @Pantalaymon, I tried to train a neuralcoref model for Spanish, but it didn't work for me. Maybe I made a lot of errors, I don't know. So I am now trying another library, coreferee.

Oh, I didn't know that library. I see that it is pretty new. Is it easier to train on a new language than neuralcoref? I might try it as well to compare.

sanaullahaq commented 2 years ago

@Mak-Ta-Reque I'm facing the same problem. Did you find any solution? Where did you download the dataset from? I got mine from this repo: https://github.com/clab/att-coref/tree/master/data/conll-2012. I don't know whether there is a problem with my downloaded dataset.

Pantalaymon commented 2 years ago

> @Mak-Ta-Reque I'm facing the same problem. Did you find any solution? Where did you download the dataset from? I got mine from this repo: https://github.com/clab/att-coref/tree/master/data/conll-2012. I don't know whether there is a problem with my downloaded dataset.

Hi sanaullahaq. As I mentioned, it's not a problem with the dataset. The problem comes from the fact that spaCy's tokenization does not match the tokenization in the CoNLL file. As a consequence, some mention boundaries that span different tokens for spaCy end up spanning the same tokens in the CoNLL output. To fix this you'll need to either:

But honestly, neuralcoref is not really meant to be extensible to other datasets. Depending on your use case, as suggested above, I would look at coreferee, with which I successfully trained a French model.

sanaullahaq commented 2 years ago

Alas! By the way, I appreciate your response. Could you give me any clue about where I can find pre-tokenized data?