nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.
MIT License
782 stars, 82 forks

Regexp issues #56

Closed mollerhoj closed 4 years ago

mollerhoj commented 4 years ago

I'm getting errors because the regexp engine is interpreting parentheses as pattern syntax: "unterminated subpattern" and "unbalanced parenthesis".

I'm analysing very large amounts of text, so I'm not sure which inputs triggered these.
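
One way to narrow this down on a large corpus is to catch `re.error` per input and record which texts fail. The sketch below is only an illustration: `find_failing_inputs` and `naive_segment` are hypothetical names, and `naive_segment` is a stand-in that deliberately uses raw tokens as patterns so the example runs without pySBD installed.

```python
import re


def find_failing_inputs(texts, segment):
    """Return (index, text, error) for every input on which `segment` raises re.error."""
    failures = []
    for index, text in enumerate(texts):
        try:
            segment(text)
        except re.error as exc:
            failures.append((index, text, exc))
    return failures


def naive_segment(text):
    # Hypothetical stand-in for a real segmenter: it uses each token as a raw
    # regex pattern, so tokens with unbalanced parentheses fail to compile.
    for token in text.split():
        re.sub(token, token, text)
    return [text]


corpus = ["Plain text.", "See (e.g. this", "More text."]
for index, text, exc in find_failing_inputs(corpus, naive_segment):
    print(index, repr(text), exc)
```

In practice the `segment` argument would be whatever callable does the real work, for example the `segment` method of a pySBD segmenter, and the recorded texts can then be posted as minimal reproductions.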

mollerhoj commented 4 years ago

File "/home/mollerhoj/.local/lib/python3.5/site-packages/pysbd/segmenter.py", line 42, in segment
    segments = processor.process()
  File "/home/mollerhoj/.local/lib/python3.5/site-packages/pysbd/processor.py", line 44, in process
    self.text = AbbreviationReplacer(self.text).replace()
  File "/home/mollerhoj/.local/lib/python3.5/site-packages/pysbd/abbreviation_replacer.py", line 61, in replace
    self.text = self.search_for_abbreviations_in_string()
  File "/home/mollerhoj/.local/lib/python3.5/site-packages/pysbd/abbreviation_replacer.py", line 96, in search_for_abbreviations_in_string
    self.text, match, ind, char_array
  File "/home/mollerhoj/.local/lib/python3.5/site-packages/pysbd/abbreviation_replacer.py", line 114, in scan_for_replacements
    txt = replace_period_of_abbr(txt, am)
  File "/home/mollerhoj/.local/lib/python3.5/site-packages/pysbd/abbreviation_replacer.py", line 36, in replace_period_of_abbr
    txt,
  File "/usr/lib/python3.5/re.py", line 182, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python3.5/re.py", line 293, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.5/sre_compile.py", line 536, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.5/sre_parse.py", line 834, in parse
    raise source.error("unbalanced parenthesis")

mollerhoj commented 4 years ago

  File "/home/mollerhoj/.local/lib/python3.5/site-packages/pysbd/abbreviation_replacer.py", line 61, in replace
    self.text = self.search_for_abbreviations_in_string()
  File "/home/mollerhoj/.local/lib/python3.5/site-packages/pysbd/abbreviation_replacer.py", line 96, in search_for_abbreviations_in_string
    self.text, match, ind, char_array
  File "/home/mollerhoj/.local/lib/python3.5/site-packages/pysbd/abbreviation_replacer.py", line 114, in scan_for_replacements
    txt = replace_period_of_abbr(txt, am)
  File "/home/mollerhoj/.local/lib/python3.5/site-packages/pysbd/abbreviation_replacer.py", line 36, in replace_period_of_abbr
    txt,
  File "/usr/lib/python3.5/re.py", line 182, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python3.5/re.py", line 293, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.5/sre_compile.py", line 536, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.5/sre_parse.py", line 829, in parse
    p = _parse_sub(source, pattern, 0)
  File "/usr/lib/python3.5/sre_parse.py", line 437, in _parse_sub
    itemsappend(_parse(source, state))
  File "/usr/lib/python3.5/sre_parse.py", line 722, in _parse
    source.tell() - start)
sre_constants.error: missing ), unterminated subpattern at position 0
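
Both failures can be reproduced with the standard `re` module alone, by compiling a pattern that contains a stray parenthesis. This is a minimal sketch: the fragments `"(e.g"` and `"e.g)"` are made-up examples, not text from the actual corpus, and `compile_error` is a hypothetical helper. The exact wording of the messages can differ between Python versions.

```python
import re


def compile_error(pattern):
    """Return the re.error message for `pattern`, or None if it compiles cleanly."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


# An opening parenthesis with no match starts a group that is never closed.
print(compile_error("(e.g"))
# A closing parenthesis with no match has no group to close.
print(compile_error("e.g)"))
# Without parentheses the same fragment compiles (the dot matches any character).
print(compile_error("e.g"))
```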

nipunsadvilkar commented 4 years ago

@mollerhoj If you can provide an example, that would help me debug the issue. I most likely need to use re.escape in the replace_period_of_abbr function for those kinds of edge cases.
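
A rough sketch of what the `re.escape` approach could look like is below. It is not the actual `replace_period_of_abbr` implementation: the function name `protect_abbreviation_period`, the pattern, and the `<DOT>` placeholder are all assumptions made for illustration. The point is only that the abbreviation is escaped before it is interpolated into the pattern.

```python
import re


def protect_abbreviation_period(text, abbr, placeholder="<DOT>"):
    """Replace the period that follows `abbr` with `placeholder`.

    Hypothetical helper: `abbr` is passed through re.escape, so characters
    such as "(" or ")" in it are matched literally instead of being parsed
    as regex syntax.
    """
    # Escaped abbreviation, then a literal period that is followed by whitespace.
    pattern = "(" + re.escape(abbr) + r")\.(?=\s)"
    # A function replacement avoids backslash handling in the placeholder.
    return re.sub(pattern, lambda m: m.group(1) + placeholder, text)


print(protect_abbreviation_period("See (fig. 2 for details.", "(fig"))
```

Without the `re.escape` call, an abbreviation such as `"(fig"` would be compiled as the start of an unclosed group, which is the kind of input that produces the errors reported above.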

nipunsadvilkar commented 4 years ago

Closing. Feel free to reopen with more info.