huggingface / neuralcoref

✨Fast Coreference Resolution in spaCy with Neural Networks
https://huggingface.co/coref/
MIT License

difference in sentence tokenization when using neuralcoref #312

Closed sagnik closed 3 years ago

sagnik commented 3 years ago

I have neuralcoref (4.1.0) installed from GitHub with spaCy 2.3.5. There seems to be a difference in sentence tokenization when neuralcoref is in the pipeline versus when a spaCy Sentencizer is used.

In [179]: print(neuralcoref.__version__)
4.1.0

In [180]: print(spacy.__version__)
2.3.5
In [181]: content
Out[181]: "Atletico Madrid striker Mario Mandzukic is having a medical in Turin on Monday ahead his proposed switch to Serie A champions Juventus. The Croatia international arrived in Italy on Sunday and was pictured outside the Clinica Fornaca di Sessant the following day. There had been much speculation over the Croat's future following an ongoing dispute with current manager Diego Simeone who claimed the striker 'annoyed him easily.' Manchester United were linked with a possible move as they continue to scour the transfer market to find a suitable replacement for Radamel Falcao who failed to earn an extended stay. However, the Italian side took to Twitter on Sunday night to confirm the former Bayern Munich striker had arrived in Italy - posting a picture alongside the caption 'Welcome, Mario.'\n@highlight\nMario Mandzukic scored 20 goals for Atletico Madrid last season\n@highlight\nHe is now set to join Italian side Juventus in a deal worth\xa0€18million (£13m)\n@highlight\nThe forward is undergoing a medical in Turin on Monday\n@highlight\nJuventus confirmed the Croatian striker had arrived in Italy on Sunday\n@highlight\nThe club's boss Massimiliano Allegri confirmed the deal earlier this week"

With neuralcoref in the pipeline:

In [185]: coref_nlp = spacy.load('en')

In [186]: coref = neuralcoref.NeuralCoref(coref_nlp.vocab)

In [187]: coref_nlp.add_pipe(coref, name='neuralcoref')

In [188]: for index, x in enumerate(coref_nlp(content).sents):
     ...:     print(index, x)
     ...: 
0 Atletico Madrid striker Mario Mandzukic is having a medical in Turin on Monday ahead his proposed switch to Serie A champions Juventus.
1 The Croatia international arrived in Italy on Sunday and was pictured outside the Clinica Fornaca di Sessant the following day.
2 There had been much speculation over the Croat's future following an ongoing dispute with current manager Diego Simeone who claimed the striker 'annoyed him easily.'
3 Manchester United were linked with a possible move as they continue to scour the transfer market to find a suitable replacement for Radamel Falcao who failed to earn an extended stay.
4 However, the Italian side took to Twitter on Sunday night to confirm the former Bayern Munich striker had arrived in Italy - posting a picture alongside the caption 'Welcome, Mario.'

5 @highlight

6 Mario Mandzukic scored 20 goals for Atletico Madrid last season

7 @highlight

8 He is now set to join Italian side Juventus in a deal worth 
9 €18million (£13m)

10 @highlight

11 The forward is undergoing a medical in Turin on Monday

12 @highlight

13 Juventus confirmed the Croatian striker had arrived in Italy on Sunday

14 @highlight

15 The club's boss Massimiliano Allegri confirmed the deal earlier this week

With the Sentencizer in the pipeline:

In [189]: from spacy.pipeline import Sentencizer

In [190]: normal_nlp = spacy.load('en')

In [191]: normal_nlp.add_pipe(Sentencizer(), last=True)

In [192]: for index, x in enumerate(normal_nlp(content).sents):
     ...:     print(index, x)
     ...: 
0 Atletico Madrid striker Mario Mandzukic is having a medical in Turin on Monday ahead his proposed switch to Serie A champions Juventus.
1 The Croatia international arrived in Italy on Sunday and was pictured outside the Clinica Fornaca di Sessant the following day.
2 There had been much speculation over the Croat's future following an ongoing dispute with current manager Diego Simeone who claimed the striker 'annoyed him easily.'
3 Manchester United were linked with a possible move as they continue to scour the transfer market to find a suitable replacement for Radamel Falcao who failed to earn an extended stay.
4 However, the Italian side took to Twitter on Sunday night to confirm the former Bayern Munich striker had arrived in Italy - posting a picture alongside the caption 'Welcome, Mario.'
5 

6 @highlight

7 Mario Mandzukic scored 20 goals for Atletico Madrid last season

8 @highlight

9 He is now set to join Italian side Juventus in a deal worth 
10 €18million (£13m)

11 @highlight

12 The forward is undergoing a medical in Turin on Monday

13 @highlight

14 Juventus confirmed the Croatian striker had arrived in Italy on Sunday

15 @highlight

16 The club's boss Massimiliano Allegri confirmed the deal earlier this week

In [193]: 

As you can see, the outputs differ: the second has one more sentence than the first (a whitespace-only sentence at index 5). Any idea why this might be happening? Does neuralcoref not use the spaCy Sentencizer internally?
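To pin down exactly where two segmentations diverge, rather than comparing the printed lists by eye, something like the following could help. It is only a sketch: `diff_sentences` is a hypothetical helper name, it uses the standard-library `difflib` module, and the two lists are shortened stand-ins for the outputs above, not the full data.

```python
from difflib import SequenceMatcher


def diff_sentences(a, b):
    """Return the non-matching regions between two lists of sentence strings."""
    # Compare on stripped text so trailing newlines do not hide real differences.
    a_norm = [s.strip() for s in a]
    b_norm = [s.strip() for s in b]
    matcher = SequenceMatcher(a=a_norm, b=b_norm, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Report the original (unstripped) sentences for each differing region.
            changes.append((tag, a[i1:i2], b[j1:j2]))
    return changes


# Shortened stand-ins for the two outputs above (hypothetical sample data).
with_coref = [
    "Welcome, Mario.'\n",
    "@highlight\n\n",
    "Mario Mandzukic scored 20 goals\n\n",
]
with_sentencizer = [
    "Welcome, Mario.'",
    "\n",
    "@highlight\n\n",
    "Mario Mandzukic scored 20 goals\n\n",
]

for tag, left, right in diff_sentences(with_coref, with_sentencizer):
    print(tag, repr(left), repr(right))
```

With real documents, the two lists could be built as `[s.text for s in doc.sents]` from each pipeline.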

sagnik commented 3 years ago

With spaCy v3.0.6, the tokenization is different again:

>>> print(spacy.__version__)
3.0.6
>>> normal_nlp = spacy.load('en_core_web_sm')
>>> sentencizer = normal_nlp.add_pipe("sentencizer")
>>> sentencizer
<spacy.pipeline.sentencizer.Sentencizer object at 0x7fb251fb0040>
>>> for index, x in enumerate(normal_nlp(content).sents):
...     print(index, x)
... 
0 Atletico Madrid striker Mario Mandzukic is having a medical in Turin on Monday ahead his proposed switch to Serie A champions Juventus.
1 The Croatia international arrived in Italy on Sunday and was pictured outside the Clinica Fornaca di Sessant the following day.
2 There had been much speculation over the Croat's future following an ongoing dispute with current manager Diego Simeone who claimed the striker 'annoyed him easily.'
3 Manchester United were linked with a possible move as they continue to scour the transfer market to find a suitable replacement for Radamel Falcao who failed to earn an extended stay.
4 However, the Italian side took to Twitter on Sunday night to confirm the former Bayern Munich striker had arrived in Italy - posting a picture alongside the caption 'Welcome, Mario.'
5 

6 @highlight
Mario Mandzukic scored 20 goals for Atletico Madrid last season

7 @highlight

8 He is now set to join Italian side Juventus in a deal worth €18million (£13m)

9 @highlight

10 The forward is undergoing a medical in Turin on Monday

11 @highlight
Juventus confirmed the Croatian striker had arrived in Italy on Sunday

12 @highlight
13 

14 The club's boss Massimiliano Allegri confirmed the deal earlier this week
>>> 

sagnik commented 3 years ago

This gets solved if I add a Sentencizer to the pipeline before the neuralcoref module.
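A minimal sketch of that ordering, assuming spaCy 2.x and neuralcoref 4.x as used earlier in this thread. The function name `build_coref_pipeline` and its `model_name` parameter are made up for illustration; the spaCy and neuralcoref calls are the same ones shown in the sessions above.

```python
def build_coref_pipeline(model_name="en"):
    """Load a spaCy 2.x model, add a Sentencizer, then add neuralcoref after it."""
    # Imports are local so that defining this function does not require the packages.
    import spacy
    import neuralcoref
    from spacy.pipeline import Sentencizer

    nlp = spacy.load(model_name)
    # add_pipe appends by default, so the Sentencizer is placed ahead of neuralcoref.
    nlp.add_pipe(Sentencizer())
    coref = neuralcoref.NeuralCoref(nlp.vocab)
    nlp.add_pipe(coref, name="neuralcoref")
    return nlp
```

Calling `nlp = build_coref_pipeline()` and printing `nlp.pipe_names` should confirm that the Sentencizer appears before `neuralcoref` in the pipeline.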