ufal / morphodita

MorphoDiTa: Morphologic Dictionary and Tagger
Mozilla Public License 2.0

Tokenizer: Don't split extended grapheme clusters #19

Closed dlukes closed 2 years ago

dlukes commented 2 years ago

Currently, MorphoDiTa tokenizers split e.g. 🇵🇱 into 🇵 and 🇱. It would be nice if they didn't :)

(Cf. e.g. here for some background.)
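
For illustration, a minimal Python sketch of why the split happens: the flag is one user-perceived character but two regional indicator codepoints. The `split_codepoints` helper is a hypothetical stand-in for a tokenizer that emits one token per symbol codepoint; it is not MorphoDiTa's actual code.

```python
import unicodedata

# Regional indicators P + L; fonts render the pair as the flag of Poland.
flag = "\U0001F1F5\U0001F1F1"

# One user-perceived character, but two codepoints.
print(len(flag))  # 2
for ch in flag:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")


def split_codepoints(text):
    """Hypothetical stand-in: one token per codepoint (not MorphoDiTa's code)."""
    return list(text)


# The flag falls apart into its two halves.
halves = split_codepoints(flag)
```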

dlukes commented 2 years ago

Dammit, and GitHub butchers 🇵🇱 for some inexplicable reason when displaying it (why do I get the string "poland" when I try to copy it from the comment?) :/ To be clear, this is what I meant: https://emojiguide.org/flag-poland

foxik commented 2 years ago

This is definitely a valid point. However, I do not think I will update MorphoDiTa this way. The change would need to be opt-in, of course, and the tokenizer has many other problems: it is a bit too eager right now, splitting digits from letters and splitting every non-alphanumeric codepoint into a separate token; it also handles various Unicode characters incorrectly, the URL detection is sometimes not optimal, ... The reason is that I want development to move to UDPipe 3, where it will be possible to load morphological dictionaries; we also plan morphological generation without a dictionary for other languages, and better tokenization, including preserving non-textual data (as when processing HTML/XML). Regarding tokenizers, the plan is to implement exactly the extended grapheme cluster (EGC) splitting, word splitting, and sentence splitting from UAX #29, and to let a machine learning model learn the exceptions (when considering all ~70 non-low-resource UD languages); for Czech, we might also still have a rule-based tokenizer.
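
As a rough illustration of what EGC-aware splitting would change for the flag case, here is a deliberately simplified Python sketch. It is not the MorphoDiTa, UDPipe, or LinPipe implementation and covers only a small part of UAX #29: it pairs up regional indicators (rules GB12/GB13) and attaches nonspacing and enclosing marks to the preceding character (an approximation of part of GB9). ZWJ emoji sequences, Hangul syllables, CRLF, and the rest of the rules are left out. The function names are hypothetical.

```python
import unicodedata

RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF  # regional indicator symbols A..Z


def is_regional_indicator(ch):
    return RI_FIRST <= ord(ch) <= RI_LAST


def is_combining_mark(ch):
    # Approximation of Grapheme_Cluster_Break=Extend (an assumption, not the
    # full property): nonspacing and enclosing marks only.
    return unicodedata.category(ch) in ("Mn", "Me")


def simple_clusters(text):
    """Group codepoints into clusters using a small subset of UAX #29."""
    clusters = []
    pending_ri = False  # True if the current cluster is a single, unpaired RI
    for ch in text:
        if clusters and is_combining_mark(ch):
            clusters[-1] += ch      # attach the mark to the previous cluster
            pending_ri = False
        elif pending_ri and is_regional_indicator(ch):
            clusters[-1] += ch      # complete the regional indicator pair
            pending_ri = False      # a third RI starts a new cluster
        else:
            clusters.append(ch)
            pending_ri = is_regional_indicator(ch)
    return clusters


poland = "\U0001F1F5\U0001F1F1"
czechia = "\U0001F1E8\U0001F1FF"
result = simple_clusters("a" + poland + czechia + "e\u0301")
```

With this grouping, each flag stays in one cluster, so a tokenizer working on clusters instead of codepoints would no longer separate the two halves.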

dlukes commented 2 years ago

Understood, makes sense! Thank you for the roadmap sketch :) As far as I'm concerned, an improved tokenizer as part of UDPipe 3 is a good alternative. Feel free to close this as wontfix then.

foxik commented 2 years ago

Closing, given that a new "planning" issue was created in the LinPipe repository (in the end, the new tool will not be UDPipe 3 but a newly created project, LinPipe, which will contain, among other things, UDPipe 3 models).

dlukes commented 2 years ago

Thank you for the update! I'll follow the developments in the new repo then :)