globalizejs / globalize

A JavaScript library for internationalization and localization that leverages the official Unicode CLDR JSON data
https://globalizejs.com
MIT License

UBA (Unicode Bidi Algorithm) compliant Bidi engine #542

Closed tomerm closed 8 years ago

tomerm commented 9 years ago

Displaying text in general, and bidirectional text in particular (text that includes characters from Arabic, Hebrew, Urdu and similar languages), is the responsibility of a Bidi engine compliant with the UBA (http://unicode.org/reports/tr9/). Such an engine is usually provided out of the box by the standard rendering mechanism of the OS. For example, any Win32 API based app on Windows (e.g. Notepad) gets this support for free.

However, some applications do not use the Bidi engine provided by the operating system. For example:

  1. FF provides its own Bidi engine (especially in input fields).
  2. Acrobat Reader has its own logic for rendering Bidi text and does not use any Bidi engine at all.

Why do we need a Bidi engine in jQuery?

There are numerous cases in which ensuring proper rendering of Bidi text in a web page (according to the UBA) might require UBA computations. For example:

  1. A web application implements its own rendering logic. Even though HTML / web browsers are the final rendering technology (and reordering works fine at the single-word level), the UBA is required to display a sentence or paragraph properly.
  2. A web application implements its own cursor movement logic. Both logical and visual cursor movement are possible, and not all browsers support both or support them consistently.
  3. A web application implements browser-independent numeric shaping (relevant for Arabic script only).
  4. A web application uses an additional or different rendering technology (e.g. SVG) in which the UBA works differently, if at all, compared to plain HTML in web browsers.
  5. A web application retrieves data from a legacy IBM system (e.g. AS400 or mainframe), where it is stored differently than on a modern OS (e.g. Android, Windows).

PS. We obviously don't need a Bidi engine to display "Hello world"-like text in a web app; we get that for free from the web browser. However, for more complex cases (several are listed above) it is mandatory. Bidi engines are available in many programming languages (Windows itself has 3 different ones :-)). For example, C (from ICU4C), Java (from ICU4J), C#, Visual Basic, JS (from Dojo). All of those engines are compliant with the UBA (http://unicode.org/reports/tr9/). But that does not mean they implement all of the UBA, or the latest UBA, or that they are 100% consistent with each other.

rxaviers commented 9 years ago

I'd be happy to accept a PR addressing this functionality.

Having said that, a couple of questions:

  1. Does an implementation use CLDR data? (I guess the RTL data mentioned in #423.)
  2. Is it locale-dependent? I mean, does the behavior of this engine change for different locales? (I guess it performs differently for LTR and RTL.)
  3. Re: "Not all browsers support both of them or support them consistently". How does browser support differ?

A consideration:

tomerm commented 9 years ago

It is absolutely an optional module. If jQuery does not use SVG for any of its visualization, then out of the box none of jQuery's code will have any need for, or dependency on, a Bidi engine.

The Bidi engine leverages Unicode data, not CLDR. By Unicode data I mean, for example, the Bidi directionality property, which Unicode defines for each character (for all languages).

The Bidi engine is NOT locale-dependent. It works the same way for all locales. It does produce different results for different text, but those results depend on the content of the text, not on an external locale setting. For example:

  1. Displaying "hello world !!!" with LTR direction results in "hello world !!!", while displaying it with RTL direction results in "!!! hello world".
  2. Displaying "HELLO WORLD !!!" with LTR direction results in "DLROW OLLEH !!!", while displaying it with RTL direction results in "!!! DLROW OLLEH".
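
The two examples can be reproduced with a heavily simplified sketch of the reordering idea. This is not a UBA implementation: it handles only strong and neutral characters, ignores numbers, explicit embeddings and mirroring, and assumes the convention used in this thread that upper-case Latin letters stand in for RTL letters. `toyBidiClass` and `toyReorder` are hypothetical names, not part of Globalize.

```javascript
// Toy classification (an assumption for illustration, following the thread's
// convention): A-Z stand in for RTL letters, a-z are LTR, the rest is neutral.
function toyBidiClass(ch) {
  if (ch >= "A" && ch <= "Z") return "R";
  if (ch >= "a" && ch <= "z") return "L";
  return "N";
}

// Hypothetical helper: returns the display order of `text` for a base
// direction of "ltr" or "rtl". Covers only strong and neutral characters.
function toyReorder(text, baseDir) {
  const base = baseDir === "rtl" ? "R" : "L";
  const chars = Array.from(text);
  const classes = chars.map(toyBidiClass);

  // Resolve each run of neutrals: it takes the direction of its neighbours
  // when both sides agree, otherwise the base direction.
  const resolved = classes.slice();
  let i = 0;
  while (i < classes.length) {
    if (classes[i] !== "N") { i += 1; continue; }
    let j = i;
    while (j < classes.length && classes[j] === "N") j += 1;
    const before = i > 0 ? classes[i - 1] : base;
    const after = j < classes.length ? classes[j] : base;
    resolved.fill(before === after ? before : base, i, j);
    i = j;
  }

  // Assign levels: even levels are LTR, odd levels are RTL.
  const baseLevel = base === "R" ? 1 : 0;
  const levels = resolved.map((c) =>
    baseLevel === 0 ? (c === "R" ? 1 : 0) : (c === "L" ? 2 : 1)
  );

  // From the highest level down to 1, reverse every contiguous run of
  // characters whose level is at least the current one.
  const order = chars.map((_, index) => index);
  const maxLevel = levels.reduce((a, b) => Math.max(a, b), 0);
  for (let level = maxLevel; level >= 1; level -= 1) {
    let start = 0;
    while (start < order.length) {
      if (levels[order[start]] < level) { start += 1; continue; }
      let end = start;
      while (end < order.length && levels[order[end]] >= level) end += 1;
      const run = order.slice(start, end).reverse();
      order.splice(start, run.length, ...run);
      start = end;
    }
  }
  return order.map((index) => chars[index]).join("");
}

console.log(toyReorder("HELLO WORLD !!!", "ltr")); // DLROW OLLEH !!!
console.log(toyReorder("hello world !!!", "rtl")); // !!! hello world
```

Note that the function takes only the text and a base direction; no locale is involved, which matches the point that the result depends on the content alone.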

FF uses its own Bidi engine while IE leverages the Microsoft OS engine. In terms of consistency, out of the box FF provides visual cursor movement while IE provides logical cursor movement (the difference is in the behavior of the cursor, not necessarily in the Bidi engine functionality that underpins cursor movement). There are some edge cases (specific text patterns) for which different Bidi engines can produce different results (a different display of the same text on the screen). However, by now those are marginal, at least for the leading Bidi engines provided by OSs.

rxaviers commented 9 years ago

> 2. display of "HELLO WORLD !!!" with LTR direction will result in: "DLROW OLLEH !!!" while display with RTL direction will result in "!!! DLROW OLLEH".

Assuming the upper-case text stands for Arabic (or another RTL script), I guess? :)

tomerm commented 9 years ago

Correct :).

rxaviers commented 9 years ago

Ok, thanks.

Feel free to submit a draft implementation in a new PR, so we get a better feel for it. Please let me know if there's anything I can help you with on the project.

Other team members feel free to weigh in.

PS:

> FF uses its own Bidi engine while IE leverages Microsoft OS engine. In terms of consistency, out of the box FF provides visual cursor movement while IE provides logical cursor movement (the difference is in the behavior of the cursor, not necessarily in the Bidi engine functionality which underpins cursor movement). There are some edge cases (specific text patterns / examples) for which different Bidi engines can produce different results (different display of the same text on the screen). However by now those are marginal at least for main leading Bidi engines provided by OSs

Ok, I believe it's important that this goes into a document with precise details of feature support across all browsers (plus different JavaScript environments like Node.js). After all, this is the motivation for all the ongoing work here.

> Bidi engine leverages Unicode data not CLDR. By Unicode data I refer for example to Bidi directionality property well defined in Unicode for each character (for all languages).

How often does this data get updated? What happens when new data arrives? Consider making the Unicode data a peer dependency.

tomerm commented 9 years ago

Unicode data is updated maybe several times a year, but that has marginal importance here, if any. The information we are going to leverage does not change: a-z and A-Z from the English alphabet will remain strong LTR characters forever. The only kind of update that can affect us is the addition of a new language / letter to Unicode. Obviously this happens very rarely, if at all.

Unicode data is published publicly in a text format: ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt . The specific data relevant to a Bidi engine is available here: http://www.unicode.org/Public/UNIDATA/extracted/DerivedBidiClass.txt . Usually there is a script that extracts the relevant portion of this data. I presume CLDR JSON is processed the same way (there is a special package that extracts the necessary information). Something similar can be done for Unicode data.
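
As a rough sketch of what such an extraction script could look like, the function below parses text in the layout used by DerivedBidiClass.txt (a code point or `start..end` range, a semicolon, the Bidi class, then a `#` comment). `parseBidiClassRanges` is a hypothetical name, and the sample lines are a short illustrative excerpt in that layout rather than the full file.

```javascript
// Hypothetical helper: parses text in the DerivedBidiClass.txt layout
// ("XXXX..YYYY ; Class # comment") into [start, end, class] triples,
// sorted by starting code point.
function parseBidiClassRanges(text) {
  const ranges = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    if (line === "") continue; // skip blank and comment-only lines
    const [codePoints, bidiClass] = line.split(";").map((part) => part.trim());
    const [first, last = first] = codePoints.split(".."); // single code point or range
    ranges.push([parseInt(first, 16), parseInt(last, 16), bidiClass]);
  }
  return ranges.sort((a, b) => a[0] - b[0]);
}

// Illustrative sample lines following the file's layout.
const sample = [
  "# Sample lines in the layout of DerivedBidiClass.txt",
  "0041..005A    ; L # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
  "05BE          ; R # Pd       HEBREW PUNCTUATION MAQAF",
  "0020          ; WS # Zs       SPACE",
].join("\n");

console.log(parseBidiClassRanges(sample));
```

The resulting triples could then be serialized into a JS module once per Unicode release.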

ashensis commented 8 years ago

Rafael, before approaching this task I would like to clarify a few issues.

  1. According to what has been said above, it is desirable to include the Bidi engine as a separate module that may be optionally included/used. To my understanding it should be a part of Globalize, correct? In my local mock-up I created it in the root: scr\bidi-engine.js
  2. The discussion above mentioned the Unicode data that can be found, for example, at the following public location (and a few others in HTML format): http://www.unicode.org/Public/UNIDATA/extracted/DerivedBidiClass.txt This data contains the character classification to be used by the Bidi engine. This data isn't supposed to change, at least from the Bidi perspective, i.e. from the perspective of character classification. My question is how, in your opinion, this data should be integrated/preprocessed. I was thinking of preprocessing it once, outside the git build tree, into ready-to-use character classification tables in JS format that would be integrated (again, once) into the Bidi engine JS module (or its dependencies). What do you think about this approach?

rxaviers commented 8 years ago

> 1. According to what has been said above, it is desirable to include Bidi engine as separate module that may be optionally included/used.

Yep, it should be optional (similar to other modules, e.g., date module, number module, currency module, etc).

> To my understanding it should be a part of Globalize, correct?

It definitely could. The PR you are considering sending, plus any additional motivation, will help make things clear and help get this new feature into the project. To be clear, there are motivations highlighted above, but they could be improved. For example: what are the inconsistencies between browsers? Are there any examples demonstrating them? And so on.

> In my locale mock-up I created it in the root: scr\bidi-engine.js

Seems good.

> 2. The discussion above mentioned the Unicode data that may be found, for example under following public domain (and few others in html format): http://www.unicode.org/Public/UNIDATA/extracted/DerivedBidiClass.txt This data contains character classification to be used by Bidi engine. This data isn't supposed to be changed ever, at least from Bidi related perspective i. e. from the perspective of character classification. My question is how, in your opinion, this data should be integrated/preprocessed. I contemplated it just having been preprocessed once outside git build tree in order to ready-to-use character classification tables in JS format that would be integrated (again once) into Bidi engine JS module (or its dependencies). What do you think about this approach?

Given it will rarely change, let's see how big this data is. Is it locale-dependent? I mean, do you need a different set of data for different locales? If not, you may have noticed we have a similar case in our project: the currency formatter needs to match the [:^S:] regexp (Unicode category S), which JavaScript doesn't support, so we have https://github.com/jquery/globalize/blob/master/src/util/regexp/not-s.js to support it. Note, there's an automated process documented in the source comments. Ideally, there should be an automated process (e.g., an npm or grunt task to update it for newer Unicode releases). I suggest we start with something similar.
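
One shape the pregenerated data could take, as a sketch only: a sorted range table embedded in a module and searched with a binary search. The table below is a tiny hand-written excerpt for illustration, and `BIDI_RANGES` and `bidiClassOf` are hypothetical names, not code from Globalize or from the PR.

```javascript
// Tiny hand-written excerpt for illustration; a real table would be generated
// from DerivedBidiClass.txt. Entries are [start, end, class], sorted by start.
const BIDI_RANGES = [
  [0x0020, 0x0020, "WS"], // SPACE
  [0x0041, 0x005a, "L"],  // Latin capital letters
  [0x0061, 0x007a, "L"],  // Latin small letters
  [0x05d0, 0x05ea, "R"],  // Hebrew letters alef..tav
  [0x0627, 0x064a, "AL"], // Arabic letters alef..yeh
];

// Binary search over the sorted, non-overlapping ranges. Returns undefined
// for code points that this excerpt does not cover.
function bidiClassOf(codePoint) {
  let low = 0;
  let high = BIDI_RANGES.length - 1;
  while (low <= high) {
    const mid = (low + high) >> 1;
    const [start, end, bidiClass] = BIDI_RANGES[mid];
    if (codePoint < start) high = mid - 1;
    else if (codePoint > end) low = mid + 1;
    else return bidiClass;
  }
  return undefined;
}

console.log(bidiClassOf("a".codePointAt(0))); // L
console.log(bidiClassOf(0x05d0));             // R
```

A range table like this is locale-independent, so a single copy would serve every locale.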

ashensis commented 8 years ago

Thank you, Rafael, for pointing me to this, but I still don't understand the process. I see that some Globalize modules were generated from unicode-7.0.0 using regenerate, but they seem to be the product of a separate build/generation process based on the unicode-7.0.0 project, and it is unclear to me how these modules found their way into Globalize. For my case, I could indeed glean the relevant Unicode character categories in some usable form (be it a regexp or an array of code points), but I can't see how this fits into the Globalize build process. My guess is that one just has to take the result of whatever one cooks up with unicode-7.0.0 and transfer it to Globalize. If so, what is the point of doing this if the process is performed only once, or at arbitrary voluntary intervals (when?)

ashensis commented 8 years ago

I see no 'Automation' whatsoever in this approach, but apparently I am missing something.

rxaviers commented 8 years ago

@ashensis sorry for the confusion. The source comment of src/util/regexp/not-s.js describes a process for generating the regexp systematically, but it's not automated. Ideally, we should automate this process using an npm or grunt task. I suggest you use a similar process for now. Later we could automate both the existing regexp/not-s.js and the one you are about to create. I'm assuming they are similar things...

ashensis commented 8 years ago

Thank you so much for the comprehensive guidance. I have submitted the respective pull request: https://github.com/jquery/globalize/pull/570

tomerm commented 8 years ago

@rxaviers did you have a chance to review the PR? Thx.

rxaviers commented 8 years ago

Hi, not yet, sorry for the delay. I will do so no later than this coming week.

rxaviers commented 8 years ago

I've added comments on the PR.

rxaviers commented 8 years ago

> Ok, I believe it's important that this go into a document with precise details of all features support vs. all browsers (plus different JavaScript environments like Node.js). After all, this is the motivation / reason for all ongoing work made here.

@tomerm please, did you have a chance to elaborate on this? Throughout this issue you cite differences between browsers as the motivation for this feature. Could you please list them so a newbie can understand? We don't need to go deep into a matrix of every browser version, but a list of all the cross-browser compatibility problems you found, with one example of each, would be good.

Thanks

tomerm commented 8 years ago

@rxaviers some examples:

  1. Not all browsers are fully compliant with the UBA (especially given that it keeps changing). Thus there are edge cases (examples with very specific text patterns) which can be used to illustrate those differences.
  2. Cursor movement: in some browsers it is logical (IE), in others it is visual (e.g. FF); in some it can be configured (FF), in others it is hardcoded (IE).
  3. Numeric shaping: in some browsers it is configurable (FF), in others it is hardcoded (IE).

A Bidi engine can be used to handle all of the use cases mentioned above.
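
To make the numeric shaping case concrete, here is a minimal sketch that unconditionally maps European digits to Arabic-Indic digits (U+0660 to U+0669). `shapeDigits` is a hypothetical name; real numeric shaping is often contextual (depending on the surrounding text), which this sketch ignores.

```javascript
// Hypothetical helper: replaces each European digit 0-9 with the
// corresponding Arabic-Indic digit (U+0660..U+0669), regardless of context.
function shapeDigits(text) {
  return text.replace(/[0-9]/g, (digit) =>
    String.fromCharCode(0x0660 + Number(digit))
  );
}

console.log(shapeDigits("2016")); // ٢٠١٦
```

Doing this in script gives the same result in every browser, independent of any browser-level shaping setting.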

rxaviers commented 8 years ago

Closed by https://github.com/jquery/globalize/pull/570#issuecomment-233611645