Open benjore opened 6 years ago
Do we know what the context of this was? Where is our tokenizer causing a problem for minority languages? If I remember correctly, it is only used in alignment, which was resigned for the GLs. The selection tool was supposed to be tokenizer-free because of this issue.
Can we find out if this letter is ever used at the beginning or end of a word or always in the middle?
Long term we can’t assume that even all GLs will behave the same way so we will have to address custom tokenizers soon.
If this is a rule where all occurrences of - will be a character and not punctuation, we can more easily allow a configurable list of extra word characters for overriding the punctuation classification.
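To make the configurable-list idea concrete, here is a minimal sketch in Python. It is not tC's actual tokenizer; the function name `tokenize` and the parameter `extra_word_chars` are hypothetical, and the sample word is only an illustration of a hyphenated word. The point is that the extra characters are folded into the word pattern, so they stop being classified as punctuation.

```python
import re


def tokenize(text, extra_word_chars=""):
    """Split text into word and punctuation tokens.

    extra_word_chars is a hypothetical per-language setting: every
    character in it is treated as part of a word, not as punctuation.
    """
    # Escape the extra characters so that "-" is safe inside a character class.
    extra = re.escape(extra_word_chars)
    # A word is a run of Unicode word characters plus the configured extras.
    word = rf"[\w{extra}]+"
    # Anything else that is not whitespace is a single punctuation token.
    pattern = re.compile(rf"{word}|[^\w\s]")
    return pattern.findall(text)


# Default behavior: the hyphen splits the word.
print(tokenize("mag-aral, ngayon."))
# ['mag', '-', 'aral', ',', 'ngayon', '.']

# With the hyphen configured as a word character, the word stays whole.
print(tokenize("mag-aral, ngayon.", extra_word_chars="-"))
# ['mag-aral', ',', 'ngayon', '.']
```

Note that this sketch treats every occurrence of a configured character as a letter, including at the start or end of a word, so it only fits if that rule really holds for the language.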
@klappy This is all I have:
"I'm playing with our language from the Philippines, which (like many there), uses a hyphen to mark a glottal stop (a consonant), but tC breaks words at hyphens. Or did I miss a setting?"
But I know that Hindi had some issues with tokenization and likely other languages will as well. I'm not so much looking for a one-story-to-fix-all-issues, but rather I'm looking for a story that takes us the next step closer.
User Story
As a speaker of a minority language in the Philippines that uses a hyphen (-) as a letter, I want to be able to customize the tokenization of tC so that many of the words in my language are not inaccurately split apart. (Note Helpdesk issue 633)