nytud / quntoken

Hungarian tokenizer.
https://pypi.org/project/quntoken/
GNU General Public License v3.0

1908-1914 is not split #41

Open DavidNemeskey opened 3 years ago

DavidNemeskey commented 3 years ago

I am not sure about words separated by hyphens (e.g. Fekete-tenger vs. Budapest-Székesfehérvár-Siófok), but I think that numbers should definitely be split along hyphens. The example in the title should properly use an en dash rather than a hyphen, but because of keyboard and layout limitations, people do use hyphens to mean from-to. So the correct token sequence should be "1908", "-", "1914", whereas at the moment it is the single token "1908-1914".
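
To make the expected behavior concrete, here is a minimal Python sketch. It is not quntoken's implementation and does not use its API; `split_numeric_range` is a hypothetical helper that only illustrates the rule "split a hyphen (or en dash) that sits between digits, leave hyphenated words alone".

```python
import re

# Hypothetical illustration, not part of quntoken.
# Matches a hyphen or en dash (U+2013) with a digit on both sides.
_NUM_RANGE = re.compile(r"(?<=\d)([-\u2013])(?=\d)")


def split_numeric_range(token):
    """Split `token` at hyphens that sit between two digits.

    The capturing group makes re.split keep the separator as its own item.
    """
    return [part for part in _NUM_RANGE.split(token) if part]


print(split_numeric_range("1908-1914"))      # ['1908', '-', '1914']
print(split_numeric_range("Fekete-tenger"))  # ['Fekete-tenger']
```

Note that this rule would also split ISO-style dates such as 2020-01-15, so a real fix may need to be narrower.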

DavidNemeskey commented 3 years ago

Same with '/'; for instance the text Bob Marley/Lee Perry should be tokenized into 5 tokens, not 3.
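
A similar sketch for the slash case, again only an illustration and not quntoken code. The names `split_on_slash` and `tokenize` are hypothetical, and splitting on whitespace first is an assumption made to keep the example short.

```python
import re

# Hypothetical illustration, not part of quntoken.
# The capturing group keeps each "/" as a separate item in the result.
_SLASH = re.compile(r"(/)")


def split_on_slash(chunk):
    """Split a whitespace-free chunk around every '/'."""
    return [part for part in _SLASH.split(chunk) if part]


def tokenize(text):
    """Whitespace-split `text`, then split each chunk on '/'."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_on_slash(chunk))
    return tokens


print(tokenize("Bob Marley/Lee Perry"))  # ['Bob', 'Marley', '/', 'Lee', 'Perry']
```

Applied unconditionally, this would also break up URLs and fractions such as 1/2, so those would need separate handling.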

dlazesz commented 3 years ago

Is #31 related/duplicate?

dlazesz commented 2 years ago

#23 is also related/duplicate.