RTL languages can have backwards formulas

liammulh commented 3 years ago

Got an email into phethelp:

I translated the "Build a Molecule" simulation into Persian but I just saved and didn't submit it because I saw it will have some problems with chemical formulas. Enclosed you can find the problem. I think it is due to that Persian is a right to left language like Arabic. Recently, I saw that someone has submitted the translation with that error. Please check this out.

Attached screenshot: rtl-issue

Assigning @ariel-phet for prioritization.

jbphet commented 3 years ago

@arouinfar should also consult on this, since she speaks Persian and has done some translations of PhET sims into this language.

arouinfar commented 3 years ago

Those chemical formulas are absolutely wrong. CO2 should always be CO2, etc. Chemical formulas and chemical symbols are a universal standard, and should never be localized.

ariel-phet commented 3 years ago

@muedli given the popularity of this sim, it would be great to fix this issue. Can you determine if this is a general rosetta problem or if this is sim specific?

liammulh commented 3 years ago

This is a sim-specific problem. Here's a screenshot from Balancing Chemical Equations in Farsi:

I wasn't sure about this when I made this issue, but I'm now fairly certain that Rosetta doesn't have control over the content of formulas. Rosetta can be used to change the order in which the formula is seen, but it can't change the formula itself.

The dev responsible for BAM needs to make sure chemical formulas can't be internationalized.

liammulh commented 3 years ago

Moving this issue to the BAM repo.

liammulh commented 3 years ago

@ariel-phet had this on high priority in the Rosetta repo, so I'm going to put it on high here.

jonathanolson commented 3 years ago

This... looks fixed on master? Running with ?locale=fa locally:

@arouinfar thoughts on what to do?

arouinfar commented 3 years ago

@jonathanolson the subscripts and ordering of elements in a chemical formula look ok in master. However, there are still some serious issues. Coefficients are not being properly handled, and a string like Goal: 4H2 is being rearranged oddly into 4: goal H2. The primary learning goal of the Multiple screen if for students to figure out the meaning of coefficients and how they differ from subscripts. The current localization makes that impossible. The location of the parentheses is also really messed up, but I think that's a recurrent (and annoying) issue with RTL localization.

en	fa

jonathanolson commented 3 years ago

Both of those issues (the separation and the parenthesis) should both be very solvable issues, I'll look into it!

jonathanolson commented 3 years ago

It looks like a combination of two ways embedding marks were missing:

The normal RTL embedding marks were outright missing around the RTL strings. @jbphet I was under the impression that rosetta added RTL marks by default around RTL language strings. Is this incorrect now? I'd like to see how this was possible to end up in our system. @arouinfar if there was anything atypical with the creation of the strings for https://github.com/phetsims/babel/commit/8805b1cc7d0ef64915b088b5471faabe79017851, it would help to know
Interior embedding marks to switch the molecular content to RTL were missing. This could be handled in the simulation (by adding RTL marks around the embedded content), and that SHOULD in general be how we do things. It would warrant more discussion about how we include string information.

So basically:

I've fixed the embedding marks for 3 strings in the above commit, but in a branch. @jbphet can these be moved to master in babel? Should they have "edit" JSON structures associated with them?
Maybe sometime we (@arouinfar and @jbphet) could discuss the current strategy for handling RTL marks.

After fixes:

jonathanolson commented 3 years ago

For reference, a single string change was:

Before: After:

202a: Start LTR mark 202b: Start RTL mark 202c: End (one of the above) mark

I'm lazily using a tool I made at https://jonathanolson.net/shaping/examples/bidi-test.html (where I can copy-paste in any string to see the embedding marks and logical structure easily)

jbphet commented 3 years ago

@jbphet I was under the impression that rosetta added RTL marks by default around RTL language strings. Is this incorrect now?

It is not incorrect that I know of. Rosetta is definitely designed to add the RTL embedding marks. There has been some work on the related file recently, done by @muedli, so I'll ask him to investigate if any of the changes he made may have broken this functionality.

@jbphet can these be moved to master in babel? Should they have "edit" JSON structures associated with them?

You should be able to safely move them to babel, and you do NOT want to have any sort of an edit trail with your name on it, otherwise you'll end up being credited as a translator on the website. I would suggest just making the change to the existing file with no edit log entry. Note that this will not trigger a rebuild of the translated version. If you'd like to try manually triggering a build, there are instructions in https://github.com/phetsims/rosetta/blob/master/doc/admin-guide.md, or you can just ask me or @muedli to do it once you've committed your changes to master.

Maybe sometime we (@arouinfar and @jbphet) could discuss the current strategy for handling RTL marks.

I'd be up for it. I haven't done anything on RTL translation for quite some time, and we haven't had any other reports of problems, so I'd like to better understand why it turned out to be an issue for this particular sim.

I'm assigning this back to @jonathanolson to see my responses continue the dialog if needed, and to @muedli to investigate whether some of the recent changes to translate-sim.js may have inadvertently affected the addition of RTL embedding marks.

arouinfar commented 3 years ago

Looks awesome @jonathanolson!

If there's something that needs more discussion, I'd be happy to join in.

jonathanolson commented 3 years ago

Pushed to master in babel, can we trigger a translation build @muedli?

Additionally, I think it's most likely necessary to wrap embedded LTR values (like the molecular formulas) with embedding marks as a general development practice. It might be good to discuss this briefly sometime @jbphet to get your opinion.

liammulh commented 3 years ago

I need to push a small fix to master before I can use localhost to trigger a build.

Note to Self: https://github.com/phetsims/rosetta/commit/eadd764b7196f7c5925deed68393b06e52101e9a is the commit you're looking for.

liammulh commented 2 years ago

I've been going through old GitHub issues and I realized this one fell off my radar. I triggered two different build requests for BAM in the fa locale around 10 AM. (Shouldn't the new builds be up by now?) The build server logs seem to indicate the builds succeeded. I'm still seeing:

Screenshot from 2022-06-10 10-58-32

when I go to https://phet.colorado.edu/sims/html/build-a-molecule/latest/build-a-molecule_fa.html.

I switched browsers and used a private window, so I'm fairly certain it's not a caching issue.

liammulh commented 2 years ago

@jbphet and I were working on Rosetta 2.0 today and we discovered that Rosetta 1.0 only adds embedding marks when a translation is tested.

Rosetta 1.0 does not add embedding marks when a translation is submitted, to the best of my and @jbphet's knowledge. Perhaps that explains what I was seeing in https://github.com/phetsims/build-a-molecule/issues/220#issuecomment-1152653660?

liammulh commented 2 years ago

@jbphet is going to look into this when he gets a chance.

jbphet commented 2 years ago

I've done some investigation on this and, while I don't have a solution yet, I have some information that should help us work towards one. Here are some notes:

On master and in the published versions of this sim, the Persian (fa) and Saudi Arabian Arabic translations (ar_SA) have backwards chemical formulas, while other RTL translations such as Hebrew (iw) and general Arabic (ar) do not.
One notable difference between the translations where the formulas are reversed and the ones where they are not is that the pattern used to format the string is translated. In the non-problematic translations, no translation was submitted for the collectionSinglePattern string.
As an experiment, I added some debug code where the title of the top collection box was being created and output the strings in hex, and found that in the cases where the formulas are backwards, the pattern string starts with an RTL embedding character (U+202b) and in the cases where they are not reversed, the strings start with a LTR embedding character. There are some more details on this below.
When I look in the babel string translation files, the pattern string collectionSinglePattern does not seem to include the RTL embedding marks, but when I log the string in hex from the running code, it does, but only for RTL languages that include a translation of this string pattern. This makes me think that the embedding marks are being added when the strings are generated, which doesn't seem quite right.
I looked at Rosetta's code and the translated files, and I don't believe that Rosetta is adding RTL embedding marks.

For reference, here is the debug code that I added to the BAM class SingleCollectionNodeBox:

``` Index: js/common/view/SingleCollectionBoxNode.js IDEA additional info: Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP <+>UTF-8 =================================================================== diff --git a/js/common/view/SingleCollectionBoxNode.js b/js/common/view/SingleCollectionBoxNode.js --- a/js/common/view/SingleCollectionBoxNode.js (revision ebad6c0d6a2cb7ff329f5114c76e57cd1fcf5f61) +++ b/js/common/view/SingleCollectionBoxNode.js (date 1667341979541) @@ -25,6 +25,27 @@ constructor( box, toModelBounds, showDialogCallback ) { super( box, toModelBounds, showDialogCallback ); assert && assert( box.capacity === 1 ); + if ( box.moleculeType.molecularFormula === 'H2O' ) { + console.log( '------------------' ); + console.log( `box.moleculeType.commonName = ${box.moleculeType.commonName}` ); + console.log( `box.moleculeType.molecularFormula = ${box.moleculeType.molecularFormula}` ); + const titleString = StringUtils.fillIn( BuildAMoleculeStrings.collectionSinglePattern, { + general: box.moleculeType.getGeneralFormulaFragment(), + display: box.moleculeType.getDisplayName() + } ); + console.log( `titleString = ${titleString}` ); + const convertToHex = str => { + let hex = ''; + for ( let i = 0; i < str.length; i++ ) { + hex += '' + str.charCodeAt( i ).toString( 16 ) + ' '; + } + return hex; + } + console.log( `convertToHex( box.moleculeType.getGeneralFormulaFragment() ) = ${convertToHex( box.moleculeType.getGeneralFormulaFragment() )}` ); + console.log( `convertToHex( BuildAMoleculeStrings.collectionSinglePattern ) = ${convertToHex( BuildAMoleculeStrings.collectionSinglePattern )}` ); + console.log( `convertToHex( box.moleculeType.getDisplayName() ) = ${convertToHex( box.moleculeType.getDisplayName() )}` ); + console.log( `convertToHex( titleString ) = ${convertToHex( titleString )}` ); + } this.insertChild( 0, new RichText( StringUtils.fillIn( BuildAMoleculeStrings.collectionSinglePattern, { general: box.moleculeType.getGeneralFormulaFragment(), display: box.moleculeType.getDisplayName() ```

And here is the output of the debug code when run on the fa translation, which exhibits the problem:

------------------
SingleCollectionBoxNode.js? [sm]:30 box.moleculeType.commonName = water
SingleCollectionBoxNode.js? [sm]:31 box.moleculeType.molecularFormula = H2O
SingleCollectionBoxNode.js? [sm]:36 titleString = ‫H<sub>2</sub>O (‫آب‬)‬
SingleCollectionBoxNode.js? [sm]:44 convertToHex( box.moleculeType.getGeneralFormulaFragment() ) = 48 3c 73 75 62 3e 32 3c 2f 73 75 62 3e 4f 
SingleCollectionBoxNode.js? [sm]:45 convertToHex( BuildAMoleculeStrings.collectionSinglePattern ) = 202b 7b 7b 67 65 6e 65 72 61 6c 7d 7d a0 28 7b 7b 64 69 73 70 6c 61 79 7d 7d 29 202c 
SingleCollectionBoxNode.js? [sm]:46 convertToHex( box.moleculeType.getDisplayName() ) = 202b 622 628 202c 
SingleCollectionBoxNode.js? [sm]:47 convertToHex( titleString ) = 202b 48 3c 73 75 62 3e 32 3c 2f 73 75 62 3e 4f a0 28 202b 622 628 202c 29 202c

Here is the output from the iw version, which does not have the problem.jj

------------------
SingleCollectionBoxNode.js? [sm]:30 box.moleculeType.commonName = water
SingleCollectionBoxNode.js? [sm]:31 box.moleculeType.molecularFormula = H2O
SingleCollectionBoxNode.js? [sm]:36 titleString = ‪H<sub>2</sub>O (‫מים‬)‬
SingleCollectionBoxNode.js? [sm]:44 convertToHex( box.moleculeType.getGeneralFormulaFragment() ) = 48 3c 73 75 62 3e 32 3c 2f 73 75 62 3e 4f 
SingleCollectionBoxNode.js? [sm]:45 convertToHex( BuildAMoleculeStrings.collectionSinglePattern ) = 202a 7b 7b 67 65 6e 65 72 61 6c 7d 7d 20 28 7b 7b 64 69 73 70 6c 61 79 7d 7d 29 202c 
SingleCollectionBoxNode.js? [sm]:46 convertToHex( box.moleculeType.getDisplayName() ) = 202b 5de 5d9 5dd 202c 
SingleCollectionBoxNode.js? [sm]:47 convertToHex( titleString ) = 202a 48 3c 73 75 62 3e 32 3c 2f 73 75 62 3e 4f 20 28 202b 5de 5d9 5dd 202c 29 202c

As noted above, it looks like the difference is in the RTL embedding marks for the pattern.

jbphet commented 2 years ago

@jonathanolson - Are RTL embedding marks being added in the string generation stage? If so, do you think that not adding them to patterns might fix this issue? It seems to me that it is creating a bit of a "double negative" sort of situation.

Once you've reviewed the notes above and commented, let me know if you'd like to meet to discuss. I'm happy to do so. This is marked as high priority because we'd really like to get it worked out before Rosetta 2.0 is deployed, which is imminent.

arouinfar commented 4 months ago

@JacquiHayes reported that this problem exists in the Dari fa_DA locale, but not the similar Pashto ps locale over in https://github.com/phetsims/joist/issues/973.

jonathanolson commented 4 months ago

Looks like we have a lot of issues related to this. Patched in https://github.com/phetsims/chipper/issues/1355.

I believe that is resolving it the "correct" way (putting explicit LTR marks around embedded LTR content).

jbphet commented 3 months ago

I just looked at the currently published Dari (Persian) version, and it appears that it is still broken. I looked at https://github.com/phetsims/build-a-molecule/issues/220 and found that KP had approved the maintenance release, but hadn't assigned the issue back to @jonathanolson, so I think the decision was missed. I just checked in with @jonathanolson over Slack, and he said he'll move forward with the MR now.

jonathanolson commented 3 months ago

QA test in https://github.com/phetsims/qa/issues/1113.

jonathanolson commented 3 months ago

@arouinfar it looks like other RTL translations aren't ordering things correctly... should I try to fix those up to be ordered like the fa locale?

https://phet-dev.colorado.edu/html/build-a-molecule/1.0.22-rc.2/phet/build-a-molecule_all_phet.html is the link to test (or phettest, I just fixed the second screen also).

arouinfar commented 3 months ago

@arouinfar it looks like other RTL translations aren't ordering things correctly... should I try to fix those up to be ordered like the fa locale?

Yes, I think we should fix this in all RTL languages. The problem is with the coefficients used on the Multiple Screen, not the chemical formulas themselves.

Things are rendered correctly in fa

I tested a few other RTL locales, and things are messed up in different ways.

Pashto, locale=ps

Hebrew, locale=iw

Arabic, locale=ar Looks like the translator left the string in English to avoid layout issues.

jonathanolson commented 3 months ago

I pushed up the MR to production, applied some patches to the strings, and kicked off a translation rebuild.

Locales that should be fixed up (extra query parameter to bypass cache):

However, ar_MA and ps locales have translation strings that seem to "reverse" the first-screen formulas (with the items in parentheses):

@arouinfar is that also something I should patch up? Also, can you check out the above links to see how they look?

arouinfar commented 3 months ago

@jonathanolson I spot-checked all of the links in https://github.com/phetsims/build-a-molecule/issues/220#issuecomment-2231480001 and the chemical formulas and coefficients in front of the formulas all look good.

However, ar_MA and ps locales have translation strings that seem to "reverse" the first-screen formulas (with the items in parentheses):

This looks acceptable to me. If I'm not mistaken the string in parentheses is the name of the chemical, in this case, water. It seems more like the translator's stylistic choice than a bug. I don't think we need to make any further changes.

jonathanolson commented 3 months ago

Sounds good, I think that is everything for this issue, thanks!

phetsims / build-a-molecule

RTL languages can have backwards formulas #220