Open niyumard opened 4 years ago
Hi @niyumard ! Thanks for logging the issue. Right now you're absolutely right. Not only that, but the way I'm breaking up word-forms is specifically English based. I don't have the knowledge to re-implement that in a way that would support other languages, let alone RTL ones.
That being said, it's about time I look into finding a way to allow for localization to be submitted. If I can make it easy for others to contribute their own language parts we can start tackling this.
While I haven't really made progress on Persian, I have added some basic locale support to Stutter. More work will be required to handle RTL, but if you have any LTR languages you want to work with, all you need to do is modify the JSON object at the top of parts.js. I found a list of common prefixes and suffixes for Spanish, and left all other word-splitting behavior the same as English. Hopefully others will manage to PR in other languages!
I suggest that you let users pick their font of choice; I think that might help.
I also tried changing "__stutter_right" to "__stutter_left" and it helps! There's still a problem, though, because Persian/Arabic script doesn't use block letters but is cursive by nature.
You may be able to solve the cursive problem by using this character: "ـ" https://en.wikipedia.org/wiki/Kashida For example, adding it to س gives سـ, which is perfect for the start or middle of a word, while س by itself is the form used at the end of a word.
Ahh, so I'll need to make my word divide character a configurable value in the JSON object as well. That's very good to know.
Other than the display being in the wrong direction, is Stutter reading through Persian text in the correct direction so that each word is in the correct order? If so, I think the steps needed to add support would be:
Can you think of anything else?
Is Stutter reading through Persian text in the correct direction so that each word is in the correct order?
Yes the order is right.
can display the characters properly
The characters are displayed properly, but I'd rather see them in another font; this one is too ugly for Persian text. It may not be much of a priority, but if you can make it happen it'd be great.
Can you think of anything else?
Not really, though I'm not sure how Stutter divides words.
I've added more information to the README regarding localization. I moved the locales content to its own JSON file as well. I'll need to add more features for Persian than are currently available, but if you'd like to start creating a "fa" entry that would be helpful. I assume the first regular expression will still work since it's just splitting on whitespace. The second one, which splits on "." or ",", will probably need to be changed. Finally, the presub section will need a lot of love.
That stuff collectively is the 4th item in the checklist above. I'll have to do 1-3 myself.
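As a rough illustration of the two splitting steps mentioned above, here is a minimal sketch. The names `WHITESPACE`, `FA_PUNCTUATION`, `splitOnWhitespace`, and `splitOnPunctuation` are hypothetical and are not taken from Stutter's locale file, whose actual schema and patterns may differ. It assumes the Persian pattern should also cover the Arabic comma (U+060C), semicolon (U+061B), and question mark (U+061F).

```javascript
// Hypothetical patterns for illustration only; not Stutter's actual locale entries.

// JavaScript's \s does not match the zero-width non-joiner (U+200C),
// so compound Persian words written with it stay in one piece.
const WHITESPACE = /\s+/;

// ASCII punctuation plus the Arabic comma, semicolon, and question mark.
const FA_PUNCTUATION = /[.,!?\u060C\u061B\u061F]/u;

// Split a sentence into whitespace-separated tokens, dropping empties.
function splitOnWhitespace(text) {
  return text.split(WHITESPACE).filter((token) => token.length > 0);
}

// Split a single token on punctuation, dropping empties.
function splitOnPunctuation(token) {
  return token.split(FA_PUNCTUATION).filter((part) => part.length > 0);
}
```

If this holds for the real patterns, only the punctuation expression would need a Persian-specific variant; the whitespace one could be shared with English.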
Well, it seems I can't master regex at the moment. How about I write down the Persian alphabet and common prefixes here?
Let's start with that and see how it goes. :)
Here is the Persian alphabet:
ا ب پ ت ث ج چ ح خ د ذ ر ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ه ی
Some people may also use these characters: ك ء ة آ إ ي ئ ؤ
Compound words may be separated in two ways. The correct way is with a zero-width non-joiner, but some people separate the parts of a word with a space, and some don't use any separator at all. For example, the correct form for a prefix:
میخواهم
but people also use:
می خواهم
and
میخواهم
correct form for a suffix:
کتابها
but people also use:
کتاب ها
or
کتابها
so here are some prefixes:
می
and here are some suffixes:
ها های تر ترین کده گان گانه گر وار ستان
Any time there's a zero-width non-joiner, you can safely separate the word into two parts at that point, although they obviously belong together. For example:
کممحبت = کم + محبت
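The affix lists and the zero-width non-joiner rule above could be sketched roughly as follows. This is only an illustration under assumptions: `mergeAffixes` and `splitOnZwnj` are hypothetical helper names, not part of Stutter, and the merge step is a naive heuristic that would wrongly attach any of these strings that also occur as standalone words.

```javascript
// Hypothetical helpers for illustration; not part of Stutter.
const ZWNJ = "\u200C"; // zero-width non-joiner

// Affixes taken from the lists in this thread.
const PREFIXES = new Set(["می"]);
const SUFFIXES = new Set([
  "ها", "های", "تر", "ترین", "کده", "گان", "گانه", "گر", "وار", "ستان",
]);

// Re-attach affixes that were typed as separate, space-delimited tokens,
// joining them to their neighbor with a ZWNJ (the "correct" written form).
function mergeAffixes(tokens) {
  const out = [];
  let pendingPrefix = null;
  for (const token of tokens) {
    if (pendingPrefix !== null) {
      out.push(pendingPrefix + ZWNJ + token);
      pendingPrefix = null;
    } else if (PREFIXES.has(token)) {
      pendingPrefix = token; // wait for the word it belongs to
    } else if (SUFFIXES.has(token) && out.length > 0) {
      out[out.length - 1] += ZWNJ + token;
    } else {
      out.push(token);
    }
  }
  if (pendingPrefix !== null) out.push(pendingPrefix); // trailing prefix with no word
  return out;
}

// Split one word into its parts wherever a ZWNJ occurs.
function splitOnZwnj(word) {
  return word.split(ZWNJ).filter((part) => part.length > 0);
}
```

For instance, `splitOnZwnj` applied to the compound above would yield its two halves, and `mergeAffixes(["کتاب", "ها"])` would produce a single token joined by a ZWNJ. The form written with no separator at all is not handled here; it would need a separate prefix/suffix match against the word itself.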
I hope that it helps!
I think I've found the main problem. It seems that separating words down to letters (or a group of letters) isn't a good idea for cursive scripts in which the letters change shape according to their position in the word. When the extension tries to make one letter red, it does so by separating that single letter, so it gets separated and is shown in the wrong way. In Persian and languages with Arabic script in general, the letters change shape according to their adjacent letters.
For example in the word کتاب, the letter ت becomes ـتـ when it's medial and surrounded with certain other letters. The same thing goes for other letters as well. They change shape according to their position in the word as mentioned in the wiki.
What we need is to introduce kashida in Stutter. Take the letter ت: if it's isolated, it doesn't need any kashida. If it's the first letter, it needs to be connected to the next letter, and as long as the word is kept whole the browser itself handles that correctly. If you copy تا and remove the first character, you can see what happens. If we want to separate it, though, we need one kashida: تـا. And if it's in the middle, it needs two kashida characters, one before and one after it: ـتـ
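The kashida rule described above might be sketched like this. It is a simplified illustration, not Stutter's code: `splitAroundPivot` and the two sets are hypothetical names, the input is assumed to be a single word made only of Arabic-script letters, and the list of letters that do not connect to the following letter covers only the characters mentioned in this thread.

```javascript
// Hypothetical sketch; names and letter sets are assumptions, not Stutter's API.
const TATWEEL = "\u0640"; // the kashida character "ـ"
const ZWNJ = "\u200C";

// Letters that never connect to the letter that follows them.
const NO_FORWARD_JOIN = new Set([
  "ا", "آ", "أ", "إ", "ؤ", "د", "ذ", "ر", "ز", "ژ", "و", "ة", "ء", ZWNJ,
]);
// Characters that never connect to the letter before them.
const NO_BACKWARD_JOIN = new Set(["ء", ZWNJ]);

// Split a word into before / pivot / after, adding a kashida on each side
// of every cut where the two neighboring letters would normally be joined.
function splitAroundPivot(word, pivotIndex) {
  const chars = Array.from(word);
  const prev = chars[pivotIndex - 1];
  const cur = chars[pivotIndex];
  const next = chars[pivotIndex + 1];

  const joinsPrev =
    prev !== undefined && !NO_FORWARD_JOIN.has(prev) && !NO_BACKWARD_JOIN.has(cur);
  const joinsNext =
    next !== undefined && !NO_FORWARD_JOIN.has(cur) && !NO_BACKWARD_JOIN.has(next);

  return {
    before: chars.slice(0, pivotIndex).join("") + (joinsPrev ? TATWEEL : ""),
    pivot: (joinsPrev ? TATWEEL : "") + cur + (joinsNext ? TATWEEL : ""),
    after: (joinsNext ? TATWEEL : "") + chars.slice(pivotIndex + 1).join(""),
  };
}
```

With the word کتاب and the pivot on ت (index 1), the pivot comes back with a kashida on both sides, matching the medial form described above. With the pivot on ا (index 2), only the leading kashida is added, because ا does not connect to the letter after it.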
Hi, I see Persian texts like this using stutter:
It seems stutter doesn't support RTL languages and doesn't use a suitable font for them.