By default, spaCy's sentence segmentation relies on the dependency parse to determine where the boundaries of sentences are. When you disable the parser component of the pipeline, it can no longer do this segmentation. If you really don't want the parser (perhaps for speed), you can implement a custom sentence splitter in spaCy.
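For example, a minimal sketch of such a custom splitter (assuming spaCy 2.x, where a plain function can be added as a pipeline component; the boundary rule here is only illustrative):

from spacy.lang.en import English

def set_custom_boundaries(doc):
    # mark the token following sentence-final punctuation as a sentence start
    for token in doc[:-1]:
        if token.text in ('.', '!', '?'):
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = English()
nlp.add_pipe(set_custom_boundaries)

doc = nlp('Hello, world. Here are two sentences.')
print([sent.text for sent in doc.sents])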
Incidentally, I'm not sure whether the coref will work with the vectors disabled since they're the features that are used to do the coref.
As @ahalterman says, you can enable sentence boundary detection by including the parser. You can also add just the sentence boundary detection like so:

import spacy

nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser', 'tagger'])
nlp.add_pipe(nlp.create_pipe('sentencizer'))
However, it is not clear to me whether this will generate the same sentence boundaries as the case where the parser is included.
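One quick way to check is to run both pipelines over the same text and compare the spans; this is just a sketch and assumes the en_core_web_sm model is installed:

import spacy
from spacy.lang.en import English

text = 'Hello, world. Here are two sentences.'

# parser-based boundaries
nlp_parser = spacy.load('en_core_web_sm')
parser_sents = [sent.text for sent in nlp_parser(text).sents]

# rule-based sentencizer boundaries
nlp_rules = English()
nlp_rules.add_pipe(nlp_rules.create_pipe('sentencizer'))
rule_sents = [sent.text for sent in nlp_rules(text).sents]

print(parser_sents)
print(rule_sents)

On simple, well-punctuated text the two usually agree, but the parser can place different boundaries on noisier input.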
Hello, I have this error, what should I do? Below is my code:

from nltk.corpus import stopwords  # assuming NLTK's stopword list
from spacy.lang.en import English
from tqdm import tqdm
import string

STOP_WORDS = stopwords.words('english')
nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))

def normalize(text):  # process the text and return a list whose elements are strings
    text = text.lower().strip()
    doc = nlp(text)  # doc now exposes the pipeline's attributes and methods
    filtered_sentences = []
    for sentence in tqdm(doc.sents):  ################## The error is here
        filtered_tokens = list()
        for i, w in enumerate(sentence):
            s = w.string.strip()
            if len(s) == 0 or s in string.punctuation and i < len(doc) - 1:  # string.punctuation holds all punctuation characters
                continue
            if s not in STOP_WORDS:
                s = s.replace(',', '.')
                filtered_tokens.append(s)
        filtered_sentences.append(' '.join(filtered_tokens))
@DomHudson @ahalterman
This should also be fixed in the new release (4.0) and spaCy 2.1+. Please open a new issue if there is still a problem.
Hi,
I am using the sentencizer from spaCy to split a document into sentences. The default delimiters in the sentencizer are '.', '!' and '?'. But if I give it a sentence like:
"A fawn was racing in the forest!He was ahead of the rabbit?He was ahead of the elephant."
it is not split into 3 sentences. Can anyone help with this?
Thanks in advance.
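The example text has no whitespace after the '!' and '?', which may be why the boundaries are being missed. One possible workaround (a sketch, not built-in spaCy behaviour) is to insert a space after sentence-final punctuation before running the pipeline:

import re
from spacy.lang.en import English

raw_text = 'A fawn was racing in the forest!He was ahead of the rabbit?He was ahead of the elephant.'
# insert a space after '.', '!' or '?' when a letter follows directly
fixed_text = re.sub(r'([.!?])(?=[A-Za-z])', r'\1 ', raw_text)

nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))
print([sent.text.strip() for sent in nlp(fixed_text).sents])

Note that this simple regex will also split abbreviations such as 'U.S.A.', so treat it as a starting point rather than a general fix.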
from spacy.lang.en import English
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
print(sentences)
The above should work.
Thanks @mk9440, adding the "nlp.add_pipe(nlp.create_pipe('sentencizer'))" line worked!
import spacy

spacy_model_name = "en_coref_md"
disable = ['vectors', 'textcat', 'tagger', 'parser', 'ner']
model = spacy.load(spacy_model_name, disable=disable)
doc = model('My sister has a dog, she loves him.')

and I get:

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

I do not know how to add the sentencizer.
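A minimal sketch of what the error message suggests, assuming the en_coref_md model is installed, is to add the sentencizer so it runs before the coref component:

import spacy

disable = ['vectors', 'textcat', 'tagger', 'parser', 'ner']
model = spacy.load('en_coref_md', disable=disable)
model.add_pipe(model.create_pipe('sentencizer'), first=True)

doc = model('My sister has a dog, she loves him.')

As noted earlier in the thread, though, the coref may still need the vectors enabled to produce useful results.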