mideind / Tokenizer

A tokenizer for Icelandic text

added handling for abbreviations #47

Closed by thorunna 11 months ago

thorunna commented 11 months ago

This pull request improves correct_spaces() so that it handles abbreviations correctly. Previously it incorrectly split abbreviations, e.g. turning 't.d.' into 't. d.'; it now leaves them intact.
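To illustrate the behavior described above, here is a minimal, self-contained sketch. It is not the code in this pull request and not how correct_spaces() is implemented: the function names, the regular expression, and the tiny abbreviation set are all assumptions made up for the example. It only shows why a naive "add a space after every period" rule breaks 't.d.' and how checking against known abbreviations avoids that.

```python
import re

# Hypothetical, deliberately tiny abbreviation set for illustration only;
# the real tokenizer has its own abbreviation data.
ABBREVIATIONS = {"t.d.", "þ.e.", "m.a.", "o.s.frv."}

# A period that is immediately followed by a letter (not a digit or underscore).
PERIOD_BEFORE_LETTER = re.compile(r"\.(?=[^\W\d_])")


def naive_space_after_periods(text: str) -> str:
    """Insert a space after every period that is directly followed by a letter."""
    return PERIOD_BEFORE_LETTER.sub(". ", text)


def space_after_periods(text: str) -> str:
    """Same rule, but whitespace-delimited chunks that are known
    abbreviations are passed through unchanged."""
    fixed = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            fixed.append(chunk)  # leave 't.d.' and friends intact
        else:
            fixed.append(PERIOD_BEFORE_LETTER.sub(". ", chunk))
    return " ".join(fixed)


if __name__ == "__main__":
    print(naive_space_after_periods("t.d."))
    print(space_after_periods("Hann kom.Hún fór t.d. heim."))
```

The sketch only recognizes an abbreviation when it stands alone as a chunk, so an abbreviation followed directly by a comma or closing bracket would still be split; a real implementation would need to handle such cases.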

thorunna commented 11 months ago

Apart from the formatting changes, which I can't seem to reverse, the only change is in lines 3059 and 3060.

vthorsteinsson commented 11 months ago

Looks really good! But it would be great to add tests for this to the test suite.

vthorsteinsson commented 11 months ago

The formatting changes are due to Black being set to a line length of 120 instead of 88, which was the original default. I'm not sure that it's a good idea to change the line length.

thorunna commented 11 months ago

Ok to merge?

vthorsteinsson commented 11 months ago

Yes, looks good!