nipunsadvilkar / pySBD

🐍💯 pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out-of-the-box.
MIT License

Control characters break German segmentation #121

Open edloginova opened 1 year ago

edloginova commented 1 year ago

Describe the bug: Control characters such as \x1f and \x1d make German sentence segmentation fail with a ValueError in the format_numbered_list_with_periods step.

To Reproduce: Input text: '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'

Code:

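import pysbd

# Numbered list items separated by control characters
# (\x1f = unit separator, \x1d = group separator)
example_text = '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'
segmenter = pysbd.Segmenter(language="de", clean=False, char_span=True)
# Raises ValueError instead of returning the sentence spans
sents_char_spans = segmenter.segment(example_text)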

Expected behavior: The expected output is ['1.\x1f\x1fApfel\x1d', '2.\x1f\x1fBanana']

Additional context: pysbd version 0.3.4, Python 3.8.10. Reproduced on both Windows and Linux.

╭────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>                                                                                      │
│                                                                                                  │
│   1 segmenter = pysbd.Segmenter(language="de", clean=False, char_span=True)                      │
│ ❱ 2 sents_char_spans = segmenter.segment(example_text)                                           │
│   3                                                                                              │
│                                                                                                  │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\segme │
│ nter.py:87 in segment                                                                            │
│                                                                                                  │
│   84 │   │   if self.clean or self.doc_type == 'pdf':                                            │
│   85 │   │   │   text = self.cleaner(text).clean()                                               │
│   86 │   │                                                                                       │
│ ❱ 87 │   │   postprocessed_sents = self.processor(text).process()                                │
│   88 │   │   sentence_w_char_spans = self.sentences_with_char_spans(postprocessed_sents)         │
│   89 │   │   if self.char_span:                                                                  │
│   90 │   │   │   return sentence_w_char_spans                                                    │
│                                                                                                  │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\proce │
│ ssor.py:33 in process                                                                            │
│                                                                                                  │
│    30 │   │   │   return self.text                                                               │
│    31 │   │   self.text = self.text.replace('\n', '\r')                                          │
│    32 │   │   li = ListItemReplacer(self.text)                                                   │
│ ❱  33 │   │   self.text = li.add_line_break()                                                    │
│    34 │   │   self.replace_abbreviations()                                                       │
│    35 │   │   self.replace_numbers()                                                             │
│    36 │   │   self.replace_continuous_punctuation()                                              │
│                                                                                                  │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists │
│ _item_replacer.py:61 in add_line_break                                                           │
│                                                                                                  │
│    58 │   def add_line_break(self):                                                              │
│    59 │   │   self.format_alphabetical_lists()                                                   │
│    60 │   │   self.format_roman_numeral_lists()                                                  │
│ ❱  61 │   │   self.format_numbered_list_with_periods()                                           │
│    62 │   │   self.format_numbered_list_with_parens()                                            │
│    63 │   │   return self.text                                                                   │
│    64                                                                                            │
│                                                                                                  │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists │
│ _item_replacer.py:80 in format_numbered_list_with_periods                                        │
│                                                                                                  │
│    77 │   │   │   │   │   │   '♨', strip=True)                                                   │
│    78 │                                                                                          │
│    79 │   def format_numbered_list_with_periods(self):                                           │
│ ❱  80 │   │   self.replace_periods_in_numbered_list()                                            │
│    81 │   │   self.add_line_breaks_for_numbered_list_with_periods()                              │
│    82 │   │   self.text = Text(self.text).apply(self.SubstituteListPeriodRule)                   │
│    83                                                                                            │
│                                                                                                  │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists │
│ _item_replacer.py:76 in replace_periods_in_numbered_list                                         │
│                                                                                                  │
│    73 │   │   self.text = Text(self.text).apply(self.ListMarkerRule)                             │
│    74 │                                                                                          │
│    75 │   def replace_periods_in_numbered_list(self):                                            │
│ ❱  76 │   │   self.scan_lists(self.NUMBERED_LIST_REGEX_1, self.NUMBERED_LIST_REGEX_2,            │
│    77 │   │   │   │   │   │   '♨', strip=True)                                                   │
│    78 │                                                                                          │
│    79 │   def format_numbered_list_with_periods(self):                                           │
│                                                                                                  │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists │
│ _item_replacer.py:114 in scan_lists                                                              │
│                                                                                                  │
│   111 │                                                                                          │
│   112 │   def scan_lists(self, regex1, regex2, replacement, strip=False):                        │
│   113 │   │   list_array = re.findall(regex1, self.text)                                         │
│ ❱ 114 │   │   list_array = list(map(int, list_array))                                            │
│   115 │   │   for ind, item in enumerate(list_array):                                            │
│   116 │   │   │   # to avoid IndexError                                                          │
│   117 │   │   │   # ruby returns nil if index is out of range                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: invalid literal for int() with base 10: '\x1d2'
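
The final frame suggests a likely cause: re.findall returns the list marker together with the control character in front of it ('\x1d2'), and int() on Python 3.8.10 rejects that string. Below is a minimal sketch of that mechanism using only the standard library. The pattern SIMPLIFIED_LIST_MARKER is a reduced stand-in, not pysbd's actual NUMBERED_LIST_REGEX_1, and parse_marker and replace_separators are hypothetical helpers, not part of pysbd. Whether either idea fully fixes segmentation in pysbd has not been verified here.

```python
import re

# Simplified stand-in (an assumption, not pysbd's real regex):
# one whitespace character, 1-2 digits, followed by a period and whitespace.
SIMPLIFIED_LIST_MARKER = re.compile(r'\s\d{1,2}(?=\.\s)')

text = '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'

# For str patterns, \s follows str.isspace(), which treats the ASCII
# separators 0x1c-0x1f as whitespace, so the match keeps the leading 0x1d.
matches = SIMPLIFIED_LIST_MARKER.findall(text)


def parse_marker(raw):
    # Hypothetical helper: str.strip() removes the same characters that
    # \s matched, so int() only ever sees the digits.
    return int(raw.strip())


def replace_separators(value):
    # Hypothetical pre-processing step: swap each separator for one space.
    # The replacement is one-for-one, so character offsets are unchanged.
    return re.sub(r'[\x1c-\x1f]', ' ', value)


for raw in matches:
    try:
        int(raw)  # the call that raised ValueError in the traceback above
        outcome = 'accepted by int()'
    except ValueError:
        outcome = 'rejected by int()'
    print(repr(raw), outcome, '| stripped first:', parse_marker(raw))

print(repr(replace_separators(text)))
```

Stripping the match before calling int() would address the crash inside scan_lists, while the one-for-one replacement is something a caller could try before segment() without invalidating char_span offsets against the original text.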