Closed (sagnik closed 3 years ago)
With spaCy v3.0.6, the sentence tokenization is different again:
>>> print(spacy.__version__)
3.0.6
>>> normal_nlp = spacy.load('en_core_web_sm')
>>> sentencizer = normal_nlp.add_pipe("sentencizer")
>>> sentencizer
<spacy.pipeline.sentencizer.Sentencizer object at 0x7fb251fb0040>
>>> for index, x in enumerate(normal_nlp(content).sents):
...     print(index, x)
...
0 Atletico Madrid striker Mario Mandzukic is having a medical in Turin on Monday ahead his proposed switch to Serie A champions Juventus.
1 The Croatia international arrived in Italy on Sunday and was pictured outside the Clinica Fornaca di Sessant the following day.
2 There had been much speculation over the Croat's future following an ongoing dispute with current manager Diego Simeone who claimed the striker 'annoyed him easily.'
3 Manchester United were linked with a possible move as they continue to scour the transfer market to find a suitable replacement for Radamel Falcao who failed to earn an extended stay.
4 However, the Italian side took to Twitter on Sunday night to confirm the former Bayern Munich striker had arrived in Italy - posting a picture alongside the caption 'Welcome, Mario.'
5
6 @highlight
Mario Mandzukic scored 20 goals for Atletico Madrid last season
7 @highlight
8 He is now set to join Italian side Juventus in a deal worth €18million (£13m)
9 @highlight
10 The forward is undergoing a medical in Turin on Monday
11 @highlight
Juventus confirmed the Croatian striker had arrived in Italy on Sunday
12 @highlight
13
14 The club's boss Massimiliano Allegri confirmed the deal earlier this week
>>>
This is resolved if I add a Sentencizer to the pipeline before the neuralcoref module.
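For anyone landing here, a minimal sketch of that ordering, assuming the spaCy 2.x API (`nlp.create_pipe` / `nlp.add_pipe`) that neuralcoref 4.1.0 is used with in this thread. The helper names `add_sentencizer_then_coref` and `build_pipeline` are made up for illustration, and putting the sentencizer first in the pipeline is only one reading of "before neuralcoref", not something confirmed above.

```python
def add_sentencizer_then_coref(nlp, add_coref):
    """Add a rule-based sentencizer, then the coreference component.

    nlp: a spaCy 2.x Language object (assumed API: create_pipe / add_pipe).
    add_coref: a callable that appends the coref component,
               e.g. neuralcoref.add_to_pipe.
    """
    # Only add the sentencizer once, so the helper is safe to call repeatedly.
    if "sentencizer" not in nlp.pipe_names:
        sentencizer = nlp.create_pipe("sentencizer")
        # Assumption: the sentencizer goes at the very start of the pipeline.
        nlp.add_pipe(sentencizer, first=True)
    # The coref component is added afterwards, so it runs after the sentencizer.
    add_coref(nlp)
    return nlp


def build_pipeline(model_name="en_core_web_sm"):
    """Hypothetical convenience wrapper; needs spaCy 2.x and neuralcoref installed."""
    # Local imports so the helper above can be read and tested without spaCy.
    import spacy
    import neuralcoref

    nlp = spacy.load(model_name)
    return add_sentencizer_then_coref(nlp, neuralcoref.add_to_pipe)
```

With both packages installed, `nlp = build_pipeline()` followed by `list(nlp(content).sents)` should let you check whether the sentence count now matches the sentencizer-only run.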
I have neuralcoref (4.1.0) installed from GitHub with spaCy 2.3.5. There seems to be a difference in sentence tokenization when neuralcoref is used in a pipeline versus when a spaCy sentencizer is used.
With neuralcoref in the pipeline:
With sentencizer in the pipeline:
As you can see, there's a difference in the output: the second one has one more sentence than the first. Any idea why this might be happening? Does neuralcoref not use the spaCy sentencizer internally?
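To see exactly where two segmentations diverge, rather than only that the counts differ, one option is to diff the two sentence lists. This is a rough sketch using only the standard library; `diff_segmentations` is a hypothetical helper, and it assumes you have already collected each pipeline's sentences as plain strings (for example `[s.text for s in doc.sents]`).

```python
import difflib


def diff_segmentations(sents_a, sents_b):
    """Return the regions where two lists of sentence strings disagree.

    Each result is (tag, sentences_from_a, sentences_from_b), where tag is
    one of 'replace', 'delete', or 'insert' as reported by difflib.
    """
    # Strip surrounding whitespace so trailing newlines do not cause mismatches.
    a = [s.strip() for s in sents_a]
    b = [s.strip() for s in sents_b]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Toy input: the second segmentation splits one sentence into two.
coref_sents = ["A.", "B. C.", "D."]
sentencizer_sents = ["A.", "B.", "C.", "D."]
for tag, left, right in diff_segmentations(coref_sents, sentencizer_sents):
    print(tag, left, right)
```

Running this on the two outputs from the issue should point at the span where the extra sentence boundary appears.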