mideind / Tokenizer

A tokenizer for Icelandic text

correct_spaces incorrectly inserts spaces into abbreviations #44

Open atlijas opened 1 year ago

atlijas commented 1 year ago

With the newest version of Tokenizer (3.4.2), correct_spaces splits the abbreviations t.d. and m.a. by inserting a space after the first period:

>>> from tokenizer import correct_spaces
>>> correct_spaces('Þarna voru t.d. tveir hundar , m.a. hundurinn hans Jóns .')
# Expected output: 'Þarna voru t.d. tveir hundar, m.a. hundurinn hans Jóns.'
# Actual output:   'Þarna voru t. d. tveir hundar, m. a. hundurinn hans Jóns.'
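
Until this is fixed, a possible stopgap is to post-process the returned string and rejoin the abbreviations that were split. The sketch below is not part of Tokenizer: `rejoin_abbreviations` and `KNOWN_ABBREVIATIONS` are hypothetical names, and the abbreviation list is only an assumption covering the two cases in this report.

```python
import re

# Hypothetical list for illustration; it is NOT taken from Tokenizer's
# own abbreviation data and only covers the cases in this report.
KNOWN_ABBREVIATIONS = ("t.d.", "m.a.")


def rejoin_abbreviations(text, abbreviations=KNOWN_ABBREVIATIONS):
    """Collapse 't. d.' back into 't.d.' for each listed abbreviation."""
    for abbrev in abbreviations:
        # "t.d." -> ["t.", "d."]
        parts = [part + "." for part in abbrev.split(".") if part]
        # Match the parts separated by whitespace, not preceded by a word
        # character, so the tail of a longer word is never rewritten.
        pattern = r"(?<!\w)" + r"\s+".join(re.escape(part) for part in parts)
        # A function replacement avoids any escape handling in the result.
        text = re.sub(pattern, lambda _match, a=abbrev: a, text)
    return text


broken = "Þarna voru t. d. tveir hundar, m. a. hundurinn hans Jóns."
fixed = rejoin_abbreviations(broken)
```

Here `broken` is the actual output reported above, and `fixed` should match the expected output. This only masks the symptom for the listed abbreviations; it does not address why correct_spaces splits them.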

HaukurPall commented 1 year ago

Thanks for reporting this. We will look into it.