mediacloud / sentence-splitter

Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder.

_regex_core.error: unterminated character set at position 91 #2

Closed kercos closed 5 years ago

kercos commented 5 years ago

Hi, great work!

I'm running into a conflict when using spaCy and sentence_splitter together.

The following code:

import spacy
from sentence_splitter import SentenceSplitter

nlp = spacy.load('en_core_web_sm')
doc = nlp("This is a sample sentence.")

splitter = SentenceSplitter(language='en')
splitter.split("This is a sample sentence.")

gives the following error:

Traceback (most recent call last):
  File "src/bug.py", line 14, in <module>
    splitter.split("This is a sample sentence.")
  File "env/lib/python3.7/site-packages/sentence_splitter/__init__.py", line 98, in split
    flags=regex.UNICODE
  File "env/lib/python3.7/site-packages/regex.py", line 275, in sub
    return _compile(pattern, flags, kwargs).sub(repl, string, count, pos,
  File "env/lib/python3.7/site-packages/regex.py", line 507, in _compile
    caught_exception.pos)
_regex_core.error: unterminated character set at position 91

kercos commented 5 years ago

Hi @pypt, have you had a chance to look into this?

pypt commented 5 years ago

Hey @kercos, sorry for the late reply, busy times on my end!

I have no proof, but I think spaCy messes with regex library's internals in some way, esp. given that they have pinned a specific regex version in their requirements.txt.

In a recent PR, https://github.com/explosion/spaCy/pull/3218 (merged to develop), they got rid of the regex dependency altogether, so I'd suggest installing a build that includes it. The nightly from https://github.com/explosion/spaCy/tree/develop works for me.

Feel free to reopen the issue if you still encounter the problem.

svlandeg commented 5 years ago

> I have no proof, but I think spaCy messes with regex library's internals in some way, esp. given that they have pinned a specific regex version in their requirements.txt.

Yep, spaCy used to change the regex settings globally: https://github.com/svlandeg/spaCy/blob/master/spacy/lang/char_classes.py#L6

But indeed on develop, the regex library was removed entirely.
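
For anyone who cannot upgrade spaCy yet, one possible stopgap is to override the global setting only around the call that breaks, then restore it. The sketch below is a generic helper and has not been tested against sentence_splitter. It assumes the setting involved is the module-level `regex.DEFAULT_VERSION`; the helper name `temporary_attr` and the `fake_regex` stand-in are hypothetical, and the stand-in only exists so the example runs without `regex` installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_attr(obj, name, value):
    """Set obj.<name> to value inside the with-block, then restore it."""
    previous = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Restore the previous value even if the body raises.
        setattr(obj, name, previous)


# Hypothetical stand-in for the real `regex` module (assumption: another
# library has already switched the default to VERSION1 at import time).
fake_regex = SimpleNamespace(VERSION0="V0", VERSION1="V1", DEFAULT_VERSION="V1")

with temporary_attr(fake_regex, "DEFAULT_VERSION", fake_regex.VERSION0):
    inside = fake_regex.DEFAULT_VERSION  # overridden inside the block

after = fake_regex.DEFAULT_VERSION  # restored afterwards
```

With the real library, the equivalent would presumably be `with temporary_attr(regex, "DEFAULT_VERSION", regex.VERSION0): splitter.split(...)`. Whether that is enough to avoid the error here is an assumption; moving to a spaCy build without the regex dependency remains the fix suggested above.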

kercos commented 5 years ago

Thanks @pypt and @svlandeg for looking into this. I'll give it a try ASAP and let you know.