huggingface / neuralcoref

✨Fast Coreference Resolution in spaCy with Neural Networks
https://huggingface.co/coref/
MIT License

Some extracted mentions in the coreference clusters have bad format #277

Closed 85405115 closed 2 years ago

85405115 commented 4 years ago

Hi, I have a problem when working with neuralcoref. I want to run coreference resolution on my doc and then split the resolved version into sentences. I expect the original doc and its resolved version to have the same number of sentences, but the resolved version has fewer. I checked the sentences and found the reason. In the resolved version, each mention in the text is replaced with the most representative entity (call it the MRE) of its coreference cluster. But this MRE may be a mention taken from the middle of a sentence, so its first word is in lower case, or it may be located at the end of a sentence, so it has a dot at the end.

In the first case, a sentence starts with a lower-case word, so the NLTK sent_tokenizer does not treat it as a new sentence. In the second case, there is a spurious dot in the middle of a sentence, so the NLTK sent_tokenizer splits that sentence in two.

I think neuralcoref should capitalize the MRE when it replaces the first word of a sentence. neuralcoref should also drop the dot at the end of the mentions in the coreference clusters.
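The post-processing described above can be sketched roughly as follows. This is only an illustration on plain Python data, not neuralcoref's own behavior or API: the helper names `clean_mention` and `resolve_text`, the `(start, end, main_text)` replacement tuples, and the set of sentence-start token indices are all hypothetical, and in practice they would have to be built from the clusters and sentence boundaries of the parsed doc.

```python
from typing import List, Set, Tuple

# Hypothetical format: (start_token, end_token_exclusive, main_mention_text).
Replacement = Tuple[int, int, str]


def clean_mention(main_text: str, at_sentence_start: bool) -> str:
    """Drop trailing sentence punctuation; capitalize if the mention opens a sentence."""
    cleaned = main_text.rstrip(" .!?")
    if at_sentence_start and cleaned:
        cleaned = cleaned[0].upper() + cleaned[1:]
    return cleaned


def resolve_text(
    tokens_ws: List[str],
    sent_starts: Set[int],
    replacements: List[Replacement],
) -> str:
    """Rebuild the text, swapping each mention span for its cleaned main mention.

    tokens_ws    -- token strings that keep their trailing whitespace
    sent_starts  -- indices of tokens that begin a sentence
    replacements -- non-overlapping spans to replace (assumed, not validated)
    """
    by_start = {start: (end, main) for start, end, main in replacements}
    out = []
    i = 0
    while i < len(tokens_ws):
        if i in by_start:
            end, main = by_start[i]
            # Preserve the whitespace that followed the replaced span.
            last = tokens_ws[end - 1]
            trailing_ws = last[len(last.rstrip()):]
            out.append(clean_mention(main, i in sent_starts) + trailing_ws)
            i = end
        else:
            out.append(tokens_ws[i])
            i += 1
    return "".join(out)


# "She" refers to a main mention that was extracted as "my sister." (with the dot).
tokens = ["I ", "called ", "my ", "sister", ". ", "She ", "was ", "happy", "."]
resolved = resolve_text(tokens, {0, 5}, [(5, 6, "my sister.")])
print(resolved)
```

With this cleanup the resolved text keeps two well-formed sentences, so a sentence splitter should see the same boundaries as in the original. The sketch does not handle a replaced span that itself ends in sentence punctuation.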

Can I configure neuralcoref to make these changes?

Thanks in advance.

svlandeg commented 3 years ago

Could you provide a minimal code snippet, showing how you're using neuralcoref, an example sentence with a substitution, the output, and (in contrast) the preferred output you would want to obtain? I often find it easier to reason over code than over descriptions of functionality ;-)

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.