translate / pootle

Online translation tool
http://pootle.translatehouse.org
GNU General Public License v3.0
1.49k stars 287 forks source link

Escaping backslash #3941

Closed khaledhosny closed 8 years ago

khaledhosny commented 9 years ago

It seems there is a problem with escaping backslash in translations, when I enter \ submit, then re-open the unit it gets converted into \\. I’m not sure if this is the intended behaviour, but if yes then it is confusing and I don’t think it is a good idea to expose escaping to the translators, it should be handled transparently.

iafan commented 8 years ago

@dwaynebailey the demo Julen provided wasn't RTL-aware initially. I guess @khaledhosny just needs to re-test this again. My read-only demo was RTL-aware, and I checked that RTL rendering looks ok, but still it would be good to confirm that with Khaled.

So I believe there are no problems with this approach.

We need to use PUA simply because there's no way to provide glyphs for special characters like newlines or characters to control directionality.

dwaynebailey commented 8 years ago

@iafan RTL-aware how? If you mean by adding directionality I think you are missing Khaled's point about "character properties". Our PUA chars do not have the same directionality as the ones they are replacing therefore the Bidi algorithm does not work as expected.

At least that is my interpretation of @khaledhosny's comment.

iafan commented 8 years ago

It just had the RTL directionality set on a control, and in my tests I didn't see problems with the Arabic tests when spaces were replaced with dots and with "LF" characters being visible. I'd like to hear the input on the tests from https://github.com/translate/pootle/issues/3941#issuecomment-165334843 before we consider something as not working properly.

khaledhosny commented 8 years ago

The Arabic text with dots instead of spaces in test files from https://github.com/translate/pootle/issues/3941#issuecomment-176130738 is completely broken when the direction is LTR. The RTL looks OK, but the tests are very simple (no numbers, essentially no punctuation, etc.) and since it is static I can’t try different test.

Also the forced RTL and LTR tests are completely unintelligible, and I don’t see what they would be good for.

As far as @julen demo goes, the raised points are still applicable. The RLO case is partially fixed, in the since that it has its effect on the text, but the visible symbols are drawn on the wrong side of the text.

iafan commented 8 years ago

@khaledhosny thanks for running the tests.

The RTL looks OK, but the tests are very simple (no numbers, essentially no punctuation, etc.)

Can you provide me some text with numbers/punctuation you would like to test this on? Preferably with the screenshot of the same text so that I could visually compare the results.

Also the forced RTL and LTR tests are completely unintelligible, and I don’t see what they would be good for.

Forced LTR will be used for the "raw" mode. The reason why this should strictly be LTR is because the raw mode is meant to allow everybody preview HTML tags and placeholder names in their unaltered form, and these are LTR by their nature. This mode is not meant to provide readable RTL text.

julen commented 8 years ago

Did some extensive reading on the topic... :book:

As already mentioned by others, PUA characters are defined as BC:L, however it seems this is purely informative. If you read through that thread it's unclear whether the BC for PUA characters is taken into account by OS/rendering engines — in case they do, this will be of type L. So there's the risk that PUA characters could mess things up in RTL texts.

julen commented 8 years ago

@khaledhosny:

Using PUA for BiDi neutral characters like spaces is likely to break BiDi badly as all PUA regions AFAIK are classified as strong LTR characters. But may be the bidi neutral effect can be faked using BiDi isolates, though I’m not sure how widely supported they are in browsers since it is a rather recent addition to the BiDi algorithm.

So if I understood correctly, this would be a matter of wrapping our PUA characters in FSI and PDI (\U2068\UF0xx\U2069) to somehow cancel the L class of the PUA characters so they could be treated as neutral.

As you comment though, this looks to be poorly supported by browsers at the moment. Note the support information is from March 2015, and things might have slightly changed since then. Nevertheless it seems like we are out of luck and I don't see how we could implement the font approach properly for the time being.

iafan commented 8 years ago

@julen I experimented with wrapping PUA characters to force RTL, but @khaledhosny says the result is bad. But even if the rendering was good, adding these extra symbols would break the natural editing flow: caret would be stuck on these unprintable characters, which is not good.

As per my previous comments, in regular mode there are only few characters that need to be visible: hanging spaces, line feeds, and tabs. For these, we can probably additionally hijack some rarely used characters in the Unicode range with R class. But before we do that, I'd like to have some tests that clearly show the problem, and I hope @khaledhosny can help me with that (see https://github.com/translate/pootle/issues/3941#issuecomment-177563481).

Meanwhile, you can simply implement this together with disabling any character replacement for the normal RTL mode, if everybody agrees that this is an option.

julen commented 8 years ago

I experimented with wrapping PUA characters to force RTL, but @khaledhosny says the result is bad. But even if the rendering was good, adding these extra symbols would break the natural editing flow: caret would be stuck on these unprintable characters, which is not good.

I'm not sure I follow, the test files you provided have no isolate wrapping around the PUA characters. Besides, such wrapping doesn't have to make the caret stuck — the implementation is no different from adding, say, CR, where besides the symbol the actual character also needs to be preserved so the symbol actually works.

In any case, as already mentioned these isolates are not widely supported in browsers so it's out of the question for the time being.

While I really like the editing experience with this font in LTR mode, I'm afraid not having proper RTL+BiDi support doesn't make this a viable option unless otherwise demonstrated.

iafan commented 8 years ago

@khaledhosny ping (see https://github.com/translate/pootle/issues/3941#issuecomment-177563481)

julen commented 8 years ago

An update on this: the PR #4171 has been rebased and updated against current master, and support for copying to the clipboard has been fixed too.

From what I can tell, we are left with a couple of issues:

The first one is on my radar at the moment, unfortunately much cannot be ascertained about the second one though. Also, given that the PR makes important changes to how the editing area works, seeing the overlapping with ongoing work on #2596 and taking into account the benefits brought by the changes here (improved reusability), I'd suggest we try to move forward with this now.

khaledhosny commented 8 years ago

I think @julen’s last comment just nailed it, modifying the underlying characters is just incompatible with how bidi works. The kind of thing done here needs to be done in a layer above character processing, but I don’t think the web allows for this (short of doing the text layout yourself and painting everything on canvas or something similar).

julen commented 8 years ago

@khaledhosny thank you very much for your time to reply. Forgive me if my understanding is incorrect, but it remains a bit unclear to me whether you have also considered @iafan's comment: truth is in regular mode only a few characters are depicted by symbols (new line, hanging+non-breaking space, tab; note the rest are displayed in raw mode, which forces LTR) and perhaps some R class characters can be used in RTL for this purpose. It'd be great to shed some light on this, thanks again!

julen commented 8 years ago

I rebased and updated the PR with fixes for the caret position when dealing with CR symbols. The special font mapping has also been adjusted to use an Ogham spacing character (\u1680); this being BiDi neutral should potentially work fine in RTL/BiDi scenarios.

I'm looking forward to hear more on issues, although overall I feel this is in pretty solid condition.

dwaynebailey commented 8 years ago

Did some clicky testing on this (not looking at RTL though)

RTL comment, since the Ogham spacing character is ON (Other Neutral) I suspect it's still not going to do what you want since the Bidi algorithms are basing their actions on the depth of bidi types. Likely ON will not be the same as BN, WS or CS which range from Neutral to Weak strength.

Clicky testing comments:

  1. When you first arrive at a unit the cursor is placed at the start of the line. Which is what we do now, it just seemed rather odd. But fine I guess as we're doing the same thing.
  2. "Three\r\n" misses CR in the source text. Target text renders fine. Edit this is when unit has editor focus, when in the context list it renders fine.
  3. "Three\r\nFour" as above CR is missing in source text, target renders fine.
  4. [SP][NBSP] and [NBSP][SP] the dotted space is missing or unclear
  5. Cursor seems to be different vertical length from the normal text area. Not critical but will be different from the platform and I guess unexpected. Edit realise that this is within text, so I guess it's what we're doing with single space, but I think in this case we should be treating it the same as we do double space.
iafan commented 8 years ago

@dwaynebailey 'OGHAM SPACE MARK' is actually a whitespace character (BiDi class: WS), see e.g. http://www.fileformat.info/info/unicode/char/1680/index.htm so we expect it to behave exactly as a regular space character.

This is it's definition in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt: 1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;

iafan commented 8 years ago

Regarding cursor height: this is already fixed by Julen (but not included in the PR yet). You can try going to https://translate.stage.evernote.com/ and experimenting with the editor there — Julen has patched it to include the fix for the caret.

khaledhosny commented 8 years ago

Testing on https://translate.stage.evernote.com/, the issues I reported before seem to be still there, most importantly:

Also tab is rendered as a long (EM?) dash.

dwaynebailey commented 8 years ago

@iafan thanks that's a clearer unicode link. And Ogham is old so we're unlikely to hit issues with someone actually translating into Ogham.

@iafan yes, the caret look good on the staging server.

@khaledhosny I think the idea of raw mode is that things will be broken in the RTL text. But you will be able to see things like and $variables to check that they are indeed correctly written.

@khaledhosny I think it might help if you are able to give a simple example text and how it should look, I fear without RTL reading ability most of us can't 'see' those types of issues.

iafan commented 8 years ago

@khaledhosny can you please confirm that the trailing/leading whitespace or multiple spaces in the middle of the string, which are rendered as visible dots, have no effect on rendering?

As for the raw mode, it is not meant to edit RTL text. It is meant to preview mixed LTR+RTL content the way it is stored, and to edit tags and placeholders (which are LTR). So this is by design.

We'll see what we can do with the carriage return symbols. Probably will find some similar unused WS symbol.

khaledhosny commented 8 years ago

The current raw mode is usable for Arabic, but the new one is completely unintelligible and I don’t see what much value this uninhabitability brings. And having the text shaped while being forced LTR just adds salt to the injury. If it were unshaped it would have been at least clear that it is bring broken intentionally.

iafan commented 8 years ago

I think for LF symbol, we can replace reqular line feed symbol: 000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;

with this one (the control character that bears the same semantics and BiDi class): 0085;<control>;Cc;0;B;;;;;N;NEXT LINE (NEL);;;;

I don't believe it is widely used (especially in localizable strings).

@dwaynebailey @khaledhosny thoughs?

khaledhosny commented 8 years ago

I already gave sample text few comments up.

khaledhosny commented 8 years ago

Leading and trailing spaces are rendered as dashes, as well as more than one space on the middle of text. Single space in the middle is fine.

iafan commented 8 years ago

And having the text shaped

Do you mean here that it is rendered in monospace font?

iafan commented 8 years ago

Leading and trailing spaces are rendered as dashes

Can you please provide a screenshot?

khaledhosny commented 8 years ago

Here is a screenshot of the spaces: 2016-04-27 22-50-59

And here is word in regular mode, Arabic for shoulder: 2016-04-27 22-51-51

And in raw mode it reads kill, destroy: 2016-04-27 22-52-11

If one insists on forced LTR mode (which I’d really hate), at least make the text unshaped: 2016-04-27 22-56-27

iafan commented 8 years ago

@khaledhosny Ok. So it seems that your default font does have a symbol for the Ogham space mark, which is rendered as a dash, and thus doesn't take the glyph from the Raw font that we use (which has it rendered as a dot). But otherwise this symbol (dash) doesn't affect the directionality of the adjoining text, correct?

I understand what you mean by unshaped now, and our intent for Raw mode was to do exactly that — display all raw symbols separately. Not sure how to achieve that easily, though. The only thing that comes to mind is to insert an invisible space (or thin space) between all symbols dynamically.

iafan commented 8 years ago

I also checked with our own developer here who speaks Arabic and ran a demo through him: he has no "dash instead of the dot" issue. So I guess this is specific to a particular system/browser (but I believe we will be able to resolve this specific issue). He also confirmed that this tiny dot doesn't get in the way and doesn't break the flow of the RTL text in any way.

His opinion on Raw mode is that it is not mean to be used by translators (when they are doing the actual translation of the text), and to review placeholders the raw mode is totally fine, even with symbols shaped together. Yes, in some rare circumstances you will get an unintentional meaning (like if you're writing the word "dog", switch to a raw mode and can read the word "god" between other symbols, but that is ok and is pretty rare). So he thinks that trying to break symbol shapes apart would unnecessary complicate the engineering for not much of a benefit. Again, nobody is supposed to be writing text in this mode, because there's a regular mode for that. It's only about reviewing placeholders and special non-printable BiDi control symbols.

As for the LF symbol displayed incorrectly near numbers, we could reproduce this, but he saw that as a pretty minor issue (not something really breaking the flow, and not something we should write robust directionality detection algorithms for).

Just a separate opinion of a native speaker to get the fuller picture.

khaledhosny commented 8 years ago

I don’t think my default font has a glyph for Ogham space (I just checked, it doesn’t).

khaledhosny commented 8 years ago

I think the simple fix is to not force LTR text direction, because current raw mode works just fine and I have been using it probably since it was introduced for the situations like what you just described. So far no one has explained to me why we need forced LTR here.

iafan commented 8 years ago

I don’t think my default font has a glyph for Ogham space (I just checked, it doesn’t).

What OS/browser/version do you use (just for the record)? Does the same issue appear if you use some other browser?

I think the simple fix is to not force LTR text direction, because current raw mode works just fine

It was historically in a semi-implemented state that did render Arabic correctly but not the placeholders (when you have a mix of Arabic and e.g. HTML tags or placeholders with some punctuation marks). Now it will do what it was originally supposed to do, and yes, it will intentionally break RTL flow, and you will now have a choice whether to edit/review the text itself (in the regular mode) or review placeholders (in Raw mode).

khaledhosny commented 8 years ago

Firefox on Linux, no other browsers to test with at the moment.

Do you have a text that does not currently work in raw mode, because I was just using it for that. Without being able to read the words its usability for me it reduced to the simplest of cases, I need to know which words I’m inserting the placeholders between or how can I validate it is the correct place? With no shaping at all I can train myself to read the text backwards, but with shaping this is nearly impossible (for me at least).

iafan commented 8 years ago

Do you have a text that does not currently work in raw mode

Yes, consider this example:

This is how the text is displayed in regular mode (tags appear broken): c

And this is the text is displayed in Raw mode with no forced LTR (still broken): b

This is how the text renders currently in Raw mode with forced LTR (you can verify that tags are not broken): a

With forced LTR, there's also no caret quirkiness — you can predictably select/copy/paste/move mixed LTR+RTL chunks.

This what the Raw mode is about. Switching between Raw and normal mode gives you a degree of freedom when dealing with such strings. Translators are not expected to use Raw mode by default or try to read/review text while in this mode.

khaledhosny commented 8 years ago

Oh, XML. OK I don’t edit much XML during translation to encounter such a case, and though I don’t think forced LTR is the proper fix here (treating XML tags as bidi isolates should be the fix), I understand it is the simplest solution.

To prevent the shaping you can add something like this to the CSS rules:

font-feature-settings: "isol" off, "init" off, "medi" off, "fina" off;
iafan commented 8 years ago

I agree there's a way to display both tags and RTL properly, but this requires some heavy Unicode analysis and will likely require writing a completely custom editing control, which we can't afford doing at the moment.

font-feature-settings: "isol" off, "init" off, "medi" off, "fina" off;

Thanks for the suggestion! I quickly tested this and this is what I've got in Raw mode (just to verify that this is what you would expect): d

We can definitely add this to the CSS styles.

iafan commented 8 years ago

We can also add letter-spacing: 0.1em; to visually separate symbols a bit in this mode: e

Let me know if you feel it looks better this way.

khaledhosny commented 8 years ago

I don’t think the extra spacing is needed in general, looks like Courier New have bad spacing here, bit not all Arabic monospace fonts are like that.

iafan commented 8 years ago

Ok, so we'll go only with font-feature-settings: "isol" off, "init" off, "medi" off, "fina" off; for now, then (cc @julen)

iafan commented 8 years ago

@khaledhosny if you're comfortable with the developer console in your browser, do you mind opening the CSS styles for the editor textarea in Pootle and changing font-family: sans-serif, "Raw"; to font-family: "Raw", sans-serif; and see if this changes your "dash" symbol for spaces to a tiny dot one?

khaledhosny commented 8 years ago

Swapping the order of fonts makes the small dot appear indeed. It still does not appear for single space in the middle of text (no dash either, just the regular space).

iafan commented 8 years ago

That's expected. In regular mode, we will only show dots for duplicate whitespace and leading/trailing spaces.

julen commented 8 years ago

That's expected. In regular mode, we will only show dots for duplicate whitespace and leading/trailing spaces.

To add on top of this, note the small dots replace the current red squiggles, which make it very hard to distinguish the number of spaces they represent.

julen commented 8 years ago

@khaledhosny Ok. So it seems that your default font does have a symbol for the Ogham space mark, which is rendered as a dash, and thus doesn't take the glyph from the Raw font that we use (which has it rendered as a dot).

@iafan if we want to ensure the best experience, we have to specify font families in the "Raw", sans-serif order, otherwise we have the risk of ending up with issues like @khaledhosny ran into.

Having "Raw", sans-serif means we are back with the small caret, however I checked on the topic and this can be solved in the font itself: caret height seems to be a function of font-size, and font-size is influenced by the ascent and descent of the specific font. Since the glyphs in the raw font are quite short compared to the standard sans-serif font glyphs, this results in smaller ascent/descent values, hence the smaller caret height.

So the solution would be (disclaimer: I'm not a font expert) to either adjust the font properties to have higher ascent/descent values, or introduce a glyph which would result in having higher ascent/descent. I tried this with a simple font editor and it worked, so you will want to adjust the font get an acceptable caret size. It might be a good idea to play around by mixing the raw font with different font families, too.

julen commented 8 years ago

By the way @iafan, it would be great to experiment converting the TTF font to WOFF, which from my understanding should offer a better compression gain.

julen commented 8 years ago

I just updated the PR code to include the suggested font-feature-settings properties in raw mode, and have put the font families in "Raw", sans-serif order to prevent potential issues like displaying glyphs from other fonts for our special characters.

Looking forward to the next font update @iafan.

julen commented 8 years ago

Got the requested font update via email from @iafan and the PR has been updated accordingly. Other assorted fixes have been pushed as well.

dwaynebailey commented 8 years ago

@iafan @julen I'd like to invite RTL localisers to have a look at this. But hit the following concerns with staging. We don't have a project that just highlights the special features of the PR, space behaviour, font, etc. You kind of have to find them.

So could we build a demo project that actually highlights:

  1. The workings - I shared my test
  2. Potential problem areas - variables, HTML, etc.

At the moment I fear anyone I point to staging to test, won't even know what to test or how to find good examples. And we really need them to be testing behaviour in good examples to get buyin or to pick up any potential issues.

iafan commented 8 years ago

Bringing essentials from the conversation in Gitter:

  1. We created a small test project to play with: https://translate.stage.evernote.com/projects/_test_raw_font/
  2. We currently display \r as [CR] symbol, but don't have a good support for it in terms of editing. I think we should isolate translators from intricacies of platform-dependent newline encoding, and the proper way of dealing with this is at a parser level, not at unit editing (since the type of line endings is a property of an entire file, and all line endings should be encoded consistently throughout the file). In Pootle, all newlines must always be encoded as \n. I suggest moving this into another issue, and not tie to this PR.
iafan commented 8 years ago

@julen one new regression I spotted is that in read-only units (table rows other than the unit editor) we started to visually emphasize placeables, and because of the increased cell height it now shows a vertical scrollbar. See how em dash ("—") is rendered in the rows on this page: https://translate.stage.evernote.com/ru/android_evernote/translate/#search=%E2%80%94&sfields=source,target

I think we don't need to render certain parts of the string as placeables in such rows.