Support for cursive scripts

thesnarky1 commented 10 years ago

ROT works nicely for languages that are written left-to-right in a non-connecting script. It handles Unicode just fine (as long as the browser does), so that covers a lot of ground. However, two additions that I believe would open this up to a much broader global audience are native right-to-left rendering and the ability to nicely display characters from languages that use cursive scripts (such as Arabic, Farsi, Hindi, etc.). This ticket covers cursive scripts only, so that the two issues can be addressed and pulled separately.

Arabic is my example below, but the same holds for any cursive script in which letters change shape based on position (for instance, an 'a' at the start of a word appears differently than an 'a' at the end).

Issue

The crux of this issue is that while a browser correctly knows how to display a string of Unicode characters in the context of that string, pulling each character out to stand on its own breaks the connectivity. Currently ROT.Display.drawText prints each character on its own, so only the isolated form of each character is printed, regardless of the context it was pulled from.

To illustrate the point, in the code below you can see how the characters join together:

//For the sake of these examples, Test is a new Rect Display of size 30x10
Test._drawText(1, 1, "Hello World");
Test.display.draw(10, 3, 'مرحبا يا عالم'); //Included hack to show the difference, see below
Test._drawText(5, 5, 'مرحبا يا عالم');

However, after running the code, we see this: [image: cursive_non_connecting]

The middle row is how they'd look connected; the bottom row shows the issue, because it used drawText. This occurs because the Unicode characters being used, which display nicely in a browser, are from the "Arabic" Unicode range (0600–06FF), and the browser correctly interprets them in context. Because ROT.Display.drawText grabs each character separately, one needs to replace each character with the proper contextual character from the Arabic Presentation Forms-B range (FE70–FEFF).
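To make the two ranges concrete, here is a small sketch that looks at the letter beh: its base code point sits in the Arabic block, while its four positional variants sit in Presentation Forms-B. The helper names (inRange, isBaseArabic, isPresentationFormB) are made up for illustration and are not part of ROT.

```javascript
// Beh in the base Arabic block, plus its four positional variants
// from the Arabic Presentation Forms-B block.
const base = '\u0628'.codePointAt(0); // 1576 (0x0628)
const forms = {
  isolated: 0xFE8F,
  final: 0xFE90,
  initial: 0xFE91,
  medial: 0xFE92,
};

// Hypothetical helpers, named for this example only.
function inRange(cp, lo, hi) {
  return cp >= lo && cp <= hi;
}
const isBaseArabic = cp => inRange(cp, 0x0600, 0x06FF);
const isPresentationFormB = cp => inRange(cp, 0xFE70, 0xFEFF);

console.log(base, isBaseArabic(base), isPresentationFormB(base));
for (const [name, cp] of Object.entries(forms)) {
  // Each variant is a separate code point that renders in a fixed shape.
  console.log(name, cp, String.fromCodePoint(cp), isPresentationFormB(cp));
}
```

Drawn one at a time, the Forms-B code points keep their positional shape, which is why substituting them works even when each character is rendered in its own cell.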

As an example of how to fix this, I wrote up a quick Arabic lookup table [https://gist.github.com/thesnarky1/10012004] that takes standard Unicode and checks the context to replace the character. In this case getRealCharCodes checks the string for anything it can translate, then replaces those characters based on the context it sees them in.

//For the sake of these examples, Test is a new Rect Display of size 30x10
Test._drawText(1, 1, "Hello World");
Test.display.draw(10, 3, 'مرحبا يا عالم'); //Included hack to show the difference, see below
Test._drawText(5, 5, 'مرحبا يا عالم');
Test._drawText(5, 6, getRealCharCodes('مرحبا يا عالم'));

[image: cursive_connecting]

Even without understanding the Arabic alphabet, you can see that the characters on the bottom line are now changed to connect to something (even if right-to-left is still not working).

Proposed Solution

One hack to bypass this is to call ROT.Display.draw() and provide a string instead of a character. This prints the entire string as one unit which correctly connects the letters. Unfortunately this also negates the ability to do any sort of intelligent padding because spacing is not considered. This is really not a solution, just a work-around.

This is something that could be left up to each developer to figure out; however, I disagree with that approach because I believe one's native language should present a low barrier to entry to programming.

I would propose allowing for a community-built approach wherein ROT provides a basic framework to do this character substitution, provided the community provides the appropriate translations. Essentially it would amount to adding two new variables in ROT.Text (for instance ROT.Text.cursiveCharactersToTranslate and ROT.Text.cursiveCharactersWhichDontConnect), adding a method to ROT.Text that would perform the lookup, and then allowing anyone to put language packs into the addon directory.

Each language pack would consist of additions to the translation table, indexed by the base Unicode letter (so there should be no collisions), as well as additions to the list of characters that don't connect. For example, Arabic could be a small JavaScript file containing:

ROT.Text.cursiveCharactersToTranslate[1575] = [1575, 65165, 65166, 65166, 65165];
ROT.Text.cursiveCharactersToTranslate[1576] = [1576, 65169, 65170, 65168, 65167];
//...
ROT.Text.cursiveCharactersToTranslate[1610] = [1610, 65267, 65268, 65266, 65265];
//Pass the codes as separate arguments so the list stays flat (pushing an array would nest it)
ROT.Text.cursiveCharactersWhichDontConnect.push(1575, 1583, 1584, 1585, 1586, 1608);

This would keep the burden of maintenance of specific languages on those who actually know the language, while providing a very key improvement to ROT overall. It would also stay small because if someone did not use any special languages, the only additional size to their ROT would be the function in ROT.Text.
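As a rough sketch of what the lookup method could look like: the code below uses plain local variables as stand-ins for the two proposed ROT.Text variables, and the function name shapeCursive is hypothetical. It assumes each table entry is laid out as [base, initial, medial, final, isolated], which is inferred from the three sample entries above. Only those three letters are included, so anything else passes through untouched.

```javascript
// Stand-ins for the proposed ROT.Text.cursiveCharactersToTranslate and
// ROT.Text.cursiveCharactersWhichDontConnect (names from the proposal;
// nothing here exists in ROT today).
// Assumed entry layout: [base, initial, medial, final, isolated].
const cursiveCharactersToTranslate = {
  1575: [1575, 65165, 65166, 65166, 65165], // alef
  1576: [1576, 65169, 65170, 65168, 65167], // beh
  1610: [1610, 65267, 65268, 65266, 65265], // yeh
};
const cursiveCharactersWhichDontConnect = [1575, 1583, 1584, 1585, 1586, 1608];

const INITIAL = 1, MEDIAL = 2, FINAL = 3, ISOLATED = 4;

// Hypothetical lookup: replace each known letter with its contextual form.
function shapeCursive(text) {
  const codes = Array.from(text, ch => ch.codePointAt(0));
  const known = code => code !== undefined && code in cursiveCharactersToTranslate;

  const shaped = codes.map((code, i) => {
    if (!known(code)) return code; // not in any language pack: leave as-is

    const prev = codes[i - 1];
    const next = codes[i + 1];
    // The previous letter joins forward only if it is not in the
    // "doesn't connect" list. Whether THIS letter joins forward is
    // already baked into its table entry (see alef above).
    const joinsPrev = known(prev) && !cursiveCharactersWhichDontConnect.includes(prev);
    const joinsNext = known(next);

    let form;
    if (joinsPrev && joinsNext) form = MEDIAL;
    else if (joinsPrev) form = FINAL;
    else if (joinsNext) form = INITIAL;
    else form = ISOLATED;
    return cursiveCharactersToTranslate[code][form];
  });

  return String.fromCodePoint(...shaped);
}

// beh + alef + beh: initial beh, final alef, isolated beh
const result = shapeCursive('\u0628\u0627\u0628');
console.log(Array.from(result, ch => ch.codePointAt(0))); // 65169, 65166, 65167
```

This ignores things a real shaper has to deal with, such as combining marks sitting between letters and the lam-alef ligatures, so treat it as a starting point for discussion only.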

I'm more than happy to draft up a pull request for this, but wanted to check interest first. I also wanted to debate whether this is better as part of ROT.Text or as part of ROT.Display, added to ROT.Display.drawText as an additional boolean parameter.

Impact

As for why I believe this is important, two of the top five most spoken languages in the world (Arabic and Hindi) are cursive (an estimated 700 million people; see http://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers). Allowing for this support would enable much closer to native development (outside of JavaScript using English syntax) for a swath of the world stretching from Morocco to India, along with expat communities world-wide that desire to hold on to their heritage.


ondras commented 10 years ago

How does this relate to line breaking? I suppose that contextually-replaced characters only apply to the case when two (or more?) related glyphs are not broken-within, right?

thesnarky1 commented 10 years ago

I can see two answers. The basic answer is "yes, it only replaces characters within strings of the same glyphs". So if you were to call this and had a newline inserted in the middle of a word, the last letter before the newline would be in FINAL form and the first letter after the newline would be in INITIAL form.

The other way to answer this is that I hadn't considered what happens if drawText is used with a max width after contextually replacing the characters. That would end up in a situation where you (potentially) have a MEDIAL form letter at the end of the line and a MEDIAL form letter starting the next line. It would be ugly. However, that would also make it good and clear that a linebreak occurred, instead of having a completely new (really short) word appearing later.
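The two orderings can be compared with a toy example. The sketch below uses a cut-down shaper with only two letters (beh and yeh, both of which connect on either side) and a naive fixed-width chunk function. Both shape and chunk are hypothetical helpers and do not reflect how ROT actually wraps text; the table layout [base, initial, medial, final, isolated] is again an assumption.

```javascript
// Hypothetical two-letter table, layout [base, initial, medial, final, isolated].
const table = {
  1576: [1576, 65169, 65170, 65168, 65167], // beh
  1610: [1610, 65267, 65268, 65266, 65265], // yeh
};

// Simplified shaper: both letters connect on both sides, so no
// "doesn't connect" list is needed for this demonstration.
function shape(text) {
  const codes = Array.from(text, ch => ch.codePointAt(0));
  return String.fromCodePoint(...codes.map((code, i) => {
    if (!(code in table)) return code;
    const joinsPrev = codes[i - 1] in table;
    const joinsNext = codes[i + 1] in table;
    const form = joinsPrev && joinsNext ? 2 : joinsPrev ? 3 : joinsNext ? 1 : 4;
    return table[code][form];
  }));
}

// Naive stand-in for a max-width break: cut every `width` characters.
function chunk(text, width) {
  const chars = Array.from(text);
  const lines = [];
  for (let i = 0; i < chars.length; i += width) {
    lines.push(chars.slice(i, i + width).join(''));
  }
  return lines;
}

const toCodes = line => Array.from(line, ch => ch.codePointAt(0));
const word = '\u0628\u064A\u0628\u064A'; // beh yeh beh yeh

// Break first, then shape each line: FINAL before the break, INITIAL after.
const breakThenShape = chunk(word, 2).map(shape).map(toCodes);
// Shape first, then break: MEDIAL forms end up on both sides of the break.
const shapeThenBreak = chunk(shape(word), 2).map(toCodes);

console.log(breakThenShape); // lines: 65169, 65266 and 65169, 65266
console.log(shapeThenBreak); // lines: 65169, 65268 and 65170, 65266
```

Which ordering is preferable is exactly the open question here: the first looks like two short words, the second looks broken but makes the line break obvious.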

ondras / rot.js