By default, spaCy's sentence segmentation relies on the dependency parse to determine where the boundaries of sentences are. When you disable the parser component of the pipeline, it can no longer do this segmentation. If you really don't want the parser (perhaps for speed), you can implement a custom sentence splitter in spaCy.
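For example, a minimal sketch of such a custom splitter (assuming spaCy 2.x, where a plain function can be added as a pipeline component; the boundary rule here is only illustrative):

from spacy.lang.en import English

def set_custom_boundaries(doc):
    # mark the token following sentence-final punctuation as a sentence start
    for token in doc[:-1]:
        if token.text in ('.', '!', '?'):
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = English()
nlp.add_pipe(set_custom_boundaries)

doc = nlp('Hello, world. Here are two sentences.')
print([sent.text for sent in doc.sents])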
Incidentally, I'm not sure whether the coref will work with the vectors disabled since they're the features that are used to do the coref.
As @ahalterman says, you can enable sentence boundary detection by including the parser. You can also add just the sentence boundary detection like so:

import spacy

nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser', 'tagger'])
nlp.add_pipe(nlp.create_pipe('sentencizer'))
However, it is not clear to me whether this will generate the same sentence boundaries as the case where the parser is included.
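One quick way to check is to run both pipelines over the same text and compare the spans; this is just a sketch and assumes the en_core_web_sm model is installed:

import spacy
from spacy.lang.en import English

text = 'Hello, world. Here are two sentences.'

# parser-based boundaries
nlp_parser = spacy.load('en_core_web_sm')
parser_sents = [sent.text for sent in nlp_parser(text).sents]

# rule-based sentencizer boundaries
nlp_rules = English()
nlp_rules.add_pipe(nlp_rules.create_pipe('sentencizer'))
rule_sents = [sent.text for sent in nlp_rules(text).sents]

print(parser_sents)
print(rule_sents)

On simple, well-punctuated text the two usually agree, but the parser can place different boundaries on noisier input.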
Hello, I have this error, what should I do? Below is my code:

from nltk.corpus import stopwords  # assuming NLTK's stopword list
from spacy.lang.en import English
from tqdm import tqdm
import string

STOP_WORDS = stopwords.words('english')
nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))

def normalize(text):  # process the text and return a list whose elements are strings
    text = text.lower().strip()
    doc = nlp(text)  # doc now exposes the pipeline's attributes and methods
    filtered_sentences = []
    for sentence in tqdm(doc.sents):  ################## The error is here
        filtered_tokens = list()
        for i, w in enumerate(sentence):
            s = w.string.strip()
            if len(s) == 0 or s in string.punctuation and i < len(doc) - 1:  # string.punctuation holds all punctuation characters
                continue
            if s not in STOP_WORDS:
                s = s.replace(',', '.')
                filtered_tokens.append(s)
        filtered_sentences.append(' '.join(filtered_tokens))
@DomHudson @ahalterman
This should also be fixed in the new release (4.0) and spaCy 2.1+. Please open a new issue if there is still a problem.
Hi,
I am using the sentencizer from spaCy to split a document into sentences. The default delimiters in the sentencizer are '.', '!' and '?'. But if I give it a sentence like:
"A fawn was racing in the forest!He was ahead of the rabbit?He was ahead of the elephant."
it is not split into 3 sentences. Can anyone help with this?
Thanks in advance.
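The example text has no whitespace after the '!' and '?', which may be why the boundaries are being missed. One possible workaround (a sketch, not built-in spaCy behaviour) is to insert a space after sentence-final punctuation before running the pipeline:

import re
from spacy.lang.en import English

raw_text = 'A fawn was racing in the forest!He was ahead of the rabbit?He was ahead of the elephant.'
# insert a space after '.', '!' or '?' when a letter follows directly
fixed_text = re.sub(r'([.!?])(?=[A-Za-z])', r'\1 ', raw_text)

nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))
print([sent.text.strip() for sent in nlp(fixed_text).sents])

Note that this simple regex will also split abbreviations such as 'U.S.A.', so treat it as a starting point rather than a general fix.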
from spacy.lang.en import English
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
print(sentences)
The above should work.
Thanks @mk9440, adding the "nlp.add_pipe(nlp.create_pipe('sentencizer'))" line worked!
import spacy

spacy_model_name = "en_coref_md"
disable = ['vectors', 'textcat', 'tagger', 'parser', 'ner']
model = spacy.load(spacy_model_name, disable=disable)
doc = model('My sister has a dog, she loves him.')

and I get:

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

I do not know how to add the sentencizer.
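A minimal sketch of what the error message suggests, assuming the en_coref_md model is installed, is to add the sentencizer so it runs before the coref component:

import spacy

disable = ['vectors', 'textcat', 'tagger', 'parser', 'ner']
model = spacy.load('en_coref_md', disable=disable)
model.add_pipe(model.create_pipe('sentencizer'), first=True)

doc = model('My sister has a dog, she loves him.')

As noted earlier in the thread, though, the coref may still need the vectors enabled to produce useful results.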