User Defined Tokenization Refinement

unfoldingWord / translationCore

Repository for the desktop application translationCore

https://www.translationcore.com

User Defined Tokenization Refinement #4486

Open benjore opened 6 years ago

benjore commented 6 years ago

User Story

As a speaker of a minority language in the Philippines that uses a hyphen (-) as a letter, I want to be able to customize the tokenization of tC so that many of the words in my language are not inaccurately split apart.

(Note Helpdesk issue 633)

klappy commented 6 years ago

Do we know what the context of this was? Where is our tokenizer causing a problem for minority languages? If I remember correctly, it is only used in alignment, which was designed for the GLs. The selection tool was supposed to be tokenizer-free because of this issue.

Can we find out if this letter is ever used at the beginning or end of a word or always in the middle?

In the long term, we can’t assume that even all GLs will behave the same way, so we will have to address custom tokenizers soon.

klappy commented 6 years ago

If the rule is that every occurrence of - is a word character and never punctuation, we can more easily allow a configurable list of extra word characters that overrides the punctuation classification.
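To make the idea of a configurable list concrete, here is a minimal sketch in Python. It is not tC's tokenizer: the names `is_word_char`, `tokenize`, and `extra_word_chars` are hypothetical, and the sketch assumes that a character is a word character when its Unicode category is a letter, combining mark, or number, unless the user-supplied list says otherwise. The sample string is only an illustration.

```python
import unicodedata


def is_word_char(ch, extra_word_chars=frozenset()):
    """Return True if ch should stay inside a word token."""
    # User-defined override: these characters are never treated as punctuation.
    if ch in extra_word_chars:
        return True
    # Assumed default: letters (L*), combining marks (M*), and numbers (N*).
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def tokenize(text, extra_word_chars=""):
    """Split text into word tokens and single-character punctuation tokens."""
    extras = frozenset(extra_word_chars)
    tokens = []
    current = []
    for ch in text:
        if is_word_char(ch, extras):
            current.append(ch)
            continue
        # A non-word character ends the current word, if there is one.
        if current:
            tokens.append("".join(current))
            current = []
        # Whitespace is dropped; any other character becomes its own token.
        if not ch.isspace():
            tokens.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# Default behavior splits at the hyphen.
print(tokenize("mag-aral tayo."))
# With the hyphen configured as a word character, the word stays whole.
print(tokenize("mag-aral tayo.", extra_word_chars="-"))
```

One consequence of an unconditional override like this is that a hyphen standing alone between spaces would also be classified as a word rather than as punctuation, which is why it matters whether the character only ever appears inside words.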

benjore commented 6 years ago

@klappy This is all I have:

"I'm playing with our language from the Philippines, which (like many there), uses a hyphen to mark a glottal stop (a consonant), but tC breaks words at hyphens. Or did I miss a setting?"

But I know that Hindi had some issues with tokenization and likely other languages will as well. I'm not so much looking for a one-story-to-fix-all-issues, but rather I'm looking for a story that takes us the next step closer.