Open edloginova opened 1 year ago
Describe the bug
Control characters like \x1f break German sentence segmentation at the format_numbered_list_with_periods step.
To Reproduce
Steps to reproduce the behavior:
Input text: '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'
Code:
import pysbd

example_text = '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'
segmenter = pysbd.Segmenter(language="de", clean=False, char_span=True)
sents_char_spans = segmenter.segment(example_text)
Expected behavior
Expected output: ['1.\x1f\x1fApfel\x1d', '2.\x1f\x1fBanana']
Additional context
pysbd version: '0.3.4'
Python 3.8.10
Tried on both Windows and Linux
Traceback (most recent call last)

in <module>
    1 segmenter = pysbd.Segmenter(language="de", clean=False, char_span=True)
  ❱ 2 sents_char_spans = segmenter.segment(example_text)
    3

C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\segmenter.py:87 in segment
    84         if self.clean or self.doc_type == 'pdf':
    85             text = self.cleaner(text).clean()
    86
  ❱ 87         postprocessed_sents = self.processor(text).process()
    88         sentence_w_char_spans = self.sentences_with_char_spans(postprocessed_sents)
    89         if self.char_span:
    90             return sentence_w_char_spans

C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\processor.py:33 in process
    30             return self.text
    31         self.text = self.text.replace('\n', '\r')
    32         li = ListItemReplacer(self.text)
  ❱ 33         self.text = li.add_line_break()
    34         self.replace_abbreviations()
    35         self.replace_numbers()
    36         self.replace_continuous_punctuation()

C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists_item_replacer.py:61 in add_line_break
    58     def add_line_break(self):
    59         self.format_alphabetical_lists()
    60         self.format_roman_numeral_lists()
  ❱ 61         self.format_numbered_list_with_periods()
    62         self.format_numbered_list_with_parens()
    63         return self.text
    64

C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists_item_replacer.py:80 in format_numbered_list_with_periods
    77                         '♨', strip=True)
    78
    79     def format_numbered_list_with_periods(self):
  ❱ 80         self.replace_periods_in_numbered_list()
    81         self.add_line_breaks_for_numbered_list_with_periods()
    82         self.text = Text(self.text).apply(self.SubstituteListPeriodRule)
    83

C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists_item_replacer.py:76 in replace_periods_in_numbered_list
    73         self.text = Text(self.text).apply(self.ListMarkerRule)
    74
    75     def replace_periods_in_numbered_list(self):
  ❱ 76         self.scan_lists(self.NUMBERED_LIST_REGEX_1, self.NUMBERED_LIST_REGEX_2,
    77                         '♨', strip=True)
    78
    79     def format_numbered_list_with_periods(self):

C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists_item_replacer.py:114 in scan_lists
    111
    112     def scan_lists(self, regex1, regex2, replacement, strip=False):
    113         list_array = re.findall(regex1, self.text)
  ❱ 114         list_array = list(map(int, list_array))
    115         for ind, item in enumerate(list_array):
    116             # to avoid IndexError
    117             # ruby returns nil if index is out of range

ValueError: invalid literal for int() with base 10: '\x1d2'
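The last frame suggests a likely cause: the list regex uses `\s`, which in Python 3 also matches the separator control characters `\x1c` to `\x1f`, so `re.findall` returns `'\x1d2'`, and `int()` then rejects it. Below is a minimal sketch of that interaction using only the standard library. `SIMPLIFIED_LIST_REGEX` is a simplified stand-in for `NUMBERED_LIST_REGEX_1` (an assumption, not the pattern from pysbd), and `to_ints` is a hypothetical helper showing one possible mitigation, not the maintainers' fix.

```python
import re

# Assumption: a simplified stand-in for pysbd's NUMBERED_LIST_REGEX_1,
# keeping only the two alternatives relevant to this input.
SIMPLIFIED_LIST_REGEX = re.compile(r'\s\d{1,2}(?=\.\s)|^\d{1,2}(?=\.\s)')

example_text = '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'

# In Python 3, \s on str patterns matches Unicode whitespace, which
# includes \x1d and \x1f, so the second match carries a leading \x1d.
matches = SIMPLIFIED_LIST_REGEX.findall(example_text)
print(matches)

# Reproduce the failing conversion from scan_lists without crashing.
try:
    list(map(int, matches))
except ValueError as exc:
    print('int() failed:', exc)


def to_ints(items):
    """Hypothetical mitigation: strip surrounding whitespace before int().

    str.strip() with no arguments removes every character for which
    str.isspace() is true, including \x1c to \x1f.
    """
    return [int(item.strip()) for item in items]


print(to_ints(matches))
```

Whether stripping inside `scan_lists` is the right fix for pysbd is a separate question, since the later steps also rely on the matched positions; the sketch only isolates why the conversion fails on this input.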