nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.
MIT License

Spacy integration example is broken #96

Open iibrahimli opened 3 years ago

iibrahimli commented 3 years ago

Describe the bug: The example script is broken with the latest versions of spacy and pysbd. Adding the pipe to a spaCy model raises an exception. Moreover, sentences are not split when extra spaces are present before or after them.

To Reproduce: Run the examples/pysbd_as_spacy_component.py script. It raises the exception quoted under "Additional context". If the pipeline addition is fixed so that the code proceeds further, the sentence segmentation still does not match the output of the pysbd module used without spaCy.

Expected behavior: The code should not raise an exception, and the output should match what pysbd produces on its own.

Example: Input text - "Hello world. My name is Mr. Smith. I work for the U.S. Government and I live in the U.S. I live in New York."

Expected output - ["Hello world.", "My name is Mr. Smith.", "I work for the U.S. Government and I live in the U.S.", "I live in New York."]

Actual output - ["Hello world. My name is Mr. Smith. I work for the U.S. Government and I live in the U.S.", "I live in New York."]

Additional context: Versions tested: spacy==3.0.6, pysbd==0.3.4

The error thrown when trying to add the pipe to the spaCy model:

ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <function pysbd_sentence_boundaries at 0x7f0498c62158> (name: 'None').
- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.
- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.
- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.