translate / pootle

Online translation tool
http://pootle.translatehouse.org
GNU General Public License v3.0

Escaping backslash #3941

Closed khaledhosny closed 8 years ago

khaledhosny commented 9 years ago

It seems there is a problem with escaping backslashes in translations: when I enter \, submit, and then re-open the unit, it gets converted into \\. I’m not sure if this is the intended behaviour, but if it is, it is confusing, and I don’t think it is a good idea to expose escaping to translators; it should be handled transparently.

khaledhosny commented 9 years ago

See http://mozilla.locamotion.org/ar/firefox/translate/browser/chrome/browser/browser.properties.po#unit=13367660, for an example.

iafan commented 8 years ago

@julen, let's consider this bug a priority one. See this note for additional context.

julen commented 8 years ago

@iafan mind adding the additional info right here so the reference doesn't get lost in the future?

iafan commented 8 years ago

Ok, here you go:

This is what the string looks like in the resources (single backslashes): 1

This is what it looks like in the .po file (all backslashes are properly escaped as \\): 2

And this is what it looks like in the Pootle UI (note that backslashes are displayed as \\, while they really should be displayed as \): 3

I changed the translation to \' and submitted it, but when I open the unit, I see this (in the translation, the text is displayed as \\'): 4

julen commented 8 years ago

After investigation, I'd say we are facing two different bugs: one of data mangling (this report), and another one of incorrect display of escapes.

For the issue concerning this report: a single submitted backslash gets doubled. I've been able to track this down: it gets doubled in Django's request cycle, where the QueryDict object for request.POST is built. It calls urllib.parse_qsl, which in turn uses urllib.unquote. Unquoting doubles the backslash, and this is then saved to the unit.

To be fair, I'm not entirely sure if the front-end/back-end sides should be doing something else here; I was under the impression such cases were handled by the framework. Any thoughts on the matter?

For the issue about displaying escapes, I have filed #4165.

julen commented 8 years ago

Still didn't figure out the root cause, however here's more specific data on the current state of things:

| user input | textarea field (Python) | DB |
| --- | --- | --- |
| `\` | `\\` | `\` |
| `\\` | `\\` | `\` |
| `\\\` | `\\\\` | `\\` |
| `\\\\` | `\\\\` | `\\` |

dwaynebailey commented 8 years ago

I think with \ we've generally had a problem, which might or might not be related to the above issue. That is, when a user enters a single \ we are not sure what their intention is. I.e., should we take \ and escape it as \\, or is the \ actually the start of an escape sequence (\n, \t, etc.)? And in the case where a user only wants a single \ and they follow that with a letter that is part of a valid escape sequence, but which they don't want, what do they do?

My gut feeling would be that the DB should represent what we believe things should be and that serializing should sort things out. But I'm not sure if we get tripped up with the difference between \ and \\.

Looking at Virtaal as a use case, the main idea was to hide escaping from users. To replicate that in Pootle we shouldn't have people worrying about \ vs \\; they should just be able to type the character that they want, i.e. \.

The problem we hit when we don't escape is the potential for a user (however remote) to enter a valid escape unintentionally. The way Virtaal dealt with this was to use the ⏎ and → for return and tab so we were explicit. But I'm not sure if we can do that in a textarea without major coding changes.

I'm not sure if I'm helping to resolve this here though. Maybe the first step is to make the UI and DB consistent.

julen commented 8 years ago

You can entirely disregard my previous comments, as my understanding when debugging strings was incorrect (PEBKAC): I was interpreting the output of repr() as the actual string value.

>>> from urllib import unquote
>>> value = unquote('foo \ bar')
>>> value
'foo \\ bar'
>>> print repr(value)
'foo \\ bar'
>>> print value
foo \ bar
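
For readers on Python 3, the same check can be reproduced with `urllib.parse.unquote`. This is only a minimal sketch of the `repr()` pitfall described above, not code from Pootle: the doubled backslash exists in the debugging representation, while the string itself holds a single one.

```python
from urllib.parse import unquote

# In source code, "\\" denotes ONE backslash character.
value = unquote("foo \\ bar")

print(len(value))   # 9: the string holds a single backslash
print(repr(value))  # 'foo \\ bar' (debug view, backslash shown escaped)
print(value)        # foo \ bar (the actual text)
```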

iafan commented 8 years ago

> [...] in Pootle we shouldn't have people worrying about \ vs \\ they should just be able to type the character that they want i.e. \.

Exactly. We need to always treat user input as is, without trying to escape/unescape it in the DB. What the user sees in the translation UI is what needs to be in the DB. Of course, we need to do proper escaping/unescaping when dealing with underlying files (e.g. .po), but that part needs to be completely transparent.

> The problem we hit when we don't escape is the potential for a user (however remote) to enter a valid escape unintentionally.

This is not Pootle's problem. Ideally, when dealing with a specific file format, the converter needs to deal with escaping/unescaping. But if for some reason it is desired to translate the raw (not unescaped) string, then the translation also needs to be treated as a raw one, and translators will be following the escaping rules exposed in the source string.
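
The separation argued for here can be sketched roughly as follows. This is an illustration only: `format_escape` and `format_unescape` are hypothetical names, and the escape set is a simplified PO-like one, not the Translate Toolkit's actual implementation. The point is that the DB value stays raw, and that a literal backslash followed by `n` remains distinguishable from a real line break once escaping lives only at the file boundary.

```python
# Simplified, PO-like escape rules (an assumption for illustration).
_ESCAPES = {"\\": "\\\\", "\n": "\\n", "\t": "\\t", "\r": "\\r", '"': '\\"'}
_UNESCAPES = {"\\": "\\", "n": "\n", "t": "\t", "r": "\r", '"': '"'}


def format_escape(raw: str) -> str:
    """Raw DB/UI value -> escaped form written to the file."""
    return "".join(_ESCAPES.get(ch, ch) for ch in raw)


def format_unescape(escaped: str) -> str:
    """Escaped form read from the file -> raw DB/UI value."""
    out = []
    chars = iter(escaped)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            # Unknown sequences are kept verbatim instead of being dropped.
            out.append(_UNESCAPES.get(nxt, "\\" + nxt))
        else:
            out.append(ch)
    return "".join(out)


typed = "\\n"       # translator typed a backslash and the letter n
line_break = "\n"   # translator pressed Enter
print(format_escape(typed))       # \\n
print(format_escape(line_break))  # \n
```

Under these assumptions, the translator never types an escape: what they enter is stored unchanged, and the two inputs above serialize to different file content.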

julen commented 8 years ago

The core of the issue described here comes from the fact that \ is doubled when the text value is provided for the input textarea. This also happens for other special characters such as \n, \r, and \t.

We were discussing this briefly with @iafan and mentioned something along the lines of what he already commented:

Note currently UI highlighting filters also have their piece of cake here, as they not only highlight special markers, but also transform the output text (e.g. \n is converted to <span ...>\\n</span><br />), effectively creating an artificial gap between how translators visually see the string and how the actual text is.

We believe any transformations like these are unnecessary, and need to go away, so the UI only displays the actual DB value. (I must say I tried removing any filters before but the results were not satisfactory. I'll need to double-/triple-check that again.)

It'd be good if we can confirm we are all on the same page here. Ultimately, the goal is to make it simple for translators to input the actual text, reducing the chance of surprises or accidentally introduced errors, and as this issue shows, we are not there yet.

dwaynebailey commented 8 years ago

@julen I think we're on the same page. Though I'd like to highlight some concerns.

I think the summary is this: Format <- any format specific escaping needs -> DB <-> UI

With regard to fancy transforms, I don't agree that doing the \n<br> adds no value; it gives the translator a few more clues about how a string will flow. It worked in Virtaal to reduce confusion and add more value. But I think it adds a layer of confusion in trying to clean this up. So stripping these out and just showing the string as it is :+1:.

My concern is that the DB will have its own escaping concerns for strings, i.e. we never actually display the DB value; we unescape the real DB value. Most are not an issue and can be roundtripped quite easily. My concerns centre around \ and its potential confusion with an escape like \n, and how we handle those. But that concern doesn't impact my agreement with the general idea.

julen commented 8 years ago

The summary makes sense. Note that we currently do some escaping for the DB value; however, this causes issues like the one we are trying to address here.

I've been playing around with the code and have something that provides the expected input/output in the textarea and DB, although it presents some other caveats and questions. I'll tidy it up a bit and open an RFC PR soon to see where we can go from there. Related to this, and talking about the fancy transform, I think it's fine to visually display a new line (<br>) when there's an actual new line in the DB as well, but having a \\n as a marker, as we currently do, is just confusing. We could probably use another type of marker, such as those showcased in #3350.

unho commented 8 years ago

Perhaps related #3869.

julen commented 8 years ago

I just put up some code for testing/comments in the PR linked above — it's not intended to be merged, so just treat it as a showcase of the taken direction.

There's still plenty of work to be done:

iafan commented 8 years ago

I promised @julen to comment on displaying of non-printable characters: since we want to display the raw data from the database, this data may contain some unprintable characters (hard line breaks are the most common ones, but there can also be tabulation symbols, non-breakable spaces and something else). I think we need to display them in the source string, but in the way that doesn't prevent people from copy-pasting text from source to target.

#3350 is definitely something that comes to my mind. Here's the demo that implements the same rendering approach to non-printable symbols (since we're rendering this in the context of HTML, we don't need a special font here):

http://jsfiddle.net/iafan/w4tpu6q6/2/
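
As a rough server-side illustration of the same idea (this is not the code in the fiddle; the `render_source` name, the `special` class and the `data-label` attribute are all assumptions), each non-printable character can be wrapped in a marker element while the real character stays in the text. The visible label would then come from CSS, so the underlying text is unchanged.

```python
import html

# Hypothetical label set; a real list would be longer.
LABELS = {"\n": "LF", "\t": "TAB", "\r": "CR", "\u00a0": "NBSP"}


def render_source(raw: str) -> str:
    """Wrap non-printable characters in marker spans, keeping the raw character."""
    parts = []
    for ch in raw:
        label = LABELS.get(ch)
        if label is None:
            parts.append(html.escape(ch))
        else:
            # The label lives in an attribute; the text node is still the real character.
            parts.append(f'<span class="special" data-label="{label}">{ch}</span>')
    return "".join(parts)


print(render_source("a\tb"))
# a<span class="special" data-label="TAB">	</span>b
```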

dwaynebailey commented 8 years ago

@julen I managed to grab some time to test this. I made a PO file with some potential escapes. I have the following observations:

I didn't test any input of strings.

@iafan thanks for the jsfiddle example. It seems the easy part is displaying the source string. I'm not 100% sure I like it being textual, I'd rather have icons or unicode chars to show these markers. The issues I see are these two:

  1. You need to be able to copy the special characters from source to target
  2. You can't see the special characters in the target. With the user manually entering \t and \n at least they could actually see things. So without visual clues you don't know what character you put there or even if it is a special character e.g. nbsp.

Trust that gives some enlightenment.

iafan commented 8 years ago

> I'm not 100% sure I like it being textual, I'd rather have icons or unicode chars to show these markers.

Unfortunately, there are not so many special unprintable characters that you can represent using special Unicode characters in a way that the user will understand (Enter, non-breakable space, tab). I want to have a mechanism that would work for any kinds of unprintable characters first; if at some point we decide to have a graphical representation for certain common characters, then we can always add that on top of the existing mechanism.

> You need to be able to copy the special characters from source to target
>
> You can't see the special characters in the target.

Yes, absolutely. These characters are copied now on manual copy-paste, but this case needs to be combined with the approach from #3350 to display such characters in the editor. Also, nothing prevents us from making all unprintable characters work as 'placeables' (clickable targets that quickly insert their values into the editor).

dwaynebailey commented 8 years ago

@iafan Re getting basic right in text and adding icons later as/if needed :+1:

I couldn't replicate manual copy and paste; it didn't work for me in the jsfiddle example. But if that's known, then it's just an issue in the demo. It seems this is on your radar, so I'm good with that.

iafan commented 8 years ago

Looks like copy-pasting special symbols doesn't work in Firefox; it does in Chrome.

FWIW, this is what the visual representation of certain characters could look like: http://jsfiddle.net/iafan/w4tpu6q6/

dwaynebailey commented 8 years ago

@julen some quick observations from your last push. Some might be things not implemented or that you are already aware of. I might be repeating myself as I can't remember all I wrote in the last comment and would rather give you a fresh view.

julen commented 8 years ago

Thank you for the input @dwaynebailey — note the PR was still rough around the edges, I only made a push of my local changes the other day so I could set it up on our staging server. Now I have pushed more changes, unifying all highlighting logic in a single place (fixes #4165 basically).

Because of this latest change, the diff highlighting has currently the same behavior in TM matches and user suggestions, therefore displaying the part that was removed and the part that was added. Unless I'm checking it wrong, user suggestions previously displayed only the part that was added. Based on the feedback, I'm happy to make that behavior adjust selectively.

LF is definitely shown in source texts. In fact, and as already mentioned, all the highlighting logic is now shared. It might be that you were looking at a cached rendering of an editing unit, and were therefore getting a stale rendering for the unit.

Regarding the target text, I'm afraid we won't be able to display any markers or anything fancy as long as we use regular textarea elements. This is a whole topic on its own, and I'd leave it aside from this issue.

I have replaced the red squiggles with subtle open boxes, and multiple leading/intermediate/trailing spaces are displayed in such fashion. NBSP is highlighted as well. _(NBSP becoming SP is an optical illusion in Firefox (check bug 359303; I've been bitten by this too and spent a lot of time debugging), however NBSP ends up properly in the textarea.)_

Pending:

julen commented 8 years ago

To keep this from getting stalled, I'd love to hear your thoughts on the current state of things, @iafan @dwaynebailey.

iafan commented 8 years ago

> Figuring out how to do copying properly.

As I was already mentioning above, having #3350 (special font) in place would allow us to implement copy-pasting properly. Here's how it would work:

  1. We have a font which has certain symbols defined in the Private Use Area
  2. We get raw string from the DB and map known symbols (line feed, tab, nbsp, etc.) to the corresponding Unicode symbols our font supports. Then we display this string in both source text and in the editor, both having the support for this font. This allows us to see and copy-paste these special symbols as any other regular symbols.
  3. Before saving the translation to the database, we do the reverse conversion of the private Unicode symbols to the real ones.
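
The three steps above can be sketched roughly as a pair of translation tables. This is a sketch under assumptions: the helper names are hypothetical, and the code points are borrowed from the mapping table posted later in this thread (only a few entries are shown).

```python
# Real character -> Private Use Area symbol shown by the special font.
TO_DISPLAY = {
    0x09: 0xE009,  # tab
    0x0A: 0xE00A,  # line feed
    0x0D: 0xE00D,  # carriage return
    0xA0: 0xE0A0,  # non-breaking space
}
# Reverse direction, applied before saving.
TO_STORAGE = {shown: real for real, shown in TO_DISPLAY.items()}


def to_display(raw: str) -> str:
    """Step 2: map known invisible characters to visible PUA symbols."""
    return raw.translate(TO_DISPLAY)


def to_storage(shown: str) -> str:
    """Step 3: map PUA symbols back to the real characters."""
    return shown.translate(TO_STORAGE)


db_value = "one\ttwo\nthree"
shown = to_display(db_value)
assert to_storage(shown) == db_value  # round trip leaves the DB value intact
```

One consequence of this design is that text copied out of the editor into another application would carry PUA characters unless it is converted back on copy.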

> Inputting an actual tab using the keyboard's physical tab key.

I don't think this is necessary, as this will break navigation, and navigation is more important. Tab is rarely, if ever, used in translatable strings, and if it's there, one would be able to select, copy and paste it as a regular visible symbol if we have this custom font in place.

dwaynebailey commented 8 years ago

Quick test

dwaynebailey commented 8 years ago

@iafan

Special font approach. I think I like that and it should work for us. My only concern might be RTL and its impact there, as I'm not sure how rendering engines see the private use space area in terms of meta information like punctuation, directionality, etc. If we can match closely to the original character we're faking then it should work.

Typing tab....

I think this ignores the general issue that you can't use the keyboard to enter these characters. LF likely works for us as we can press Enter. But TAB and perhaps others are not possible. Reverting to "hey, just click" ignores the speed advantage that using a keyboard gives a translator.

I'm not sure how to solve this issue. I'm happy to take the direction of the user must click on the symbol for now but I'd like us to agree that this isn't ideal for speed.

If the editor had autocomplete/suggest we may be able to get around this by allowing \ to bring up a list of possible special escapes. But I don't want to force that issue into this one.

khaledhosny commented 8 years ago

Using PUA for BiDi neutral characters like spaces is likely to break BiDi badly, as all PUA regions AFAIK are classified as strong LTR characters. But maybe the BiDi-neutral effect can be faked using BiDi isolates, though I’m not sure how widely supported they are in browsers, since they are a rather recent addition to the BiDi algorithm.

iafan commented 8 years ago

I'll throw together some live demo of the font approach for everybody to try. RTL is definitely the biggest potential challenge with this approach.

iafan commented 8 years ago

Also: there are some symbols in PUA that are considered BiDi-neutral, see e.g. http://www.kreativekorp.com/charset/unicode.php?char=F8FF

So we might just need to carefully select mappings for the font symbols in PUA range.

unho commented 8 years ago

Can't we consider these special characters to be placeables?

iafan commented 8 years ago

> Can't we consider these special characters to be placeables?

Sure, why not. This is an independent thing, though.

iafan commented 8 years ago

I uploaded test files here: https://www.dropbox.com/s/hcrjqsa5bge7bnu/raw_font_test.zip?dl=1

The archive contains the font, the test html page and README for Firefox users (they may require tweaking config settings to make TTF load locally; this won't be an issue for the web app).

Try playing around with editing text in these textareas, copy-pasting it, etc. There are some tests for RTL rendering, but from what I can tell rendering looks kind of good by default, and trying to make all special characters treated as RTL doesn't improve things much.

Would love to hear your feedback.

dwaynebailey commented 8 years ago

@iafan had a look at this, seems to work fine. I can't say much for the RTL stuff though, I think we need @khaledhosny's input for that.

Testing

Some observations

iafan commented 8 years ago

@dwaynebailey I should have mentioned that this is just a static demo, with no JavaScript-based processing. It shows a) how the font is rendered (including RTL), and b) how you can copy-paste those symbols as a regular text.

On top of that, there needs to be some JS logic that will map invisible symbols to visible ones, and this logic should kick in on any change of the value. This way, when one presses Enter, or pastes some text into the textarea, it would properly adjust the display of such special symbols.

> Using colour in the PUA chars could help (but I realise that makes rendering harder). Just would help to deemphasise special chars.

We can do special char highlighting using syntax highlighters from CodeMirror. CodeMirror syntax highlighting can be used both in a textarea mode and to highlight static text (we can use this to highlight source).

iafan commented 8 years ago

> With things like ZWJ and such we probably want to be able to show and hide these much like in a wordprocessor, so that it is possible to read the text without confusion.

This is what the "Raw" mode is about. The idea is to display the majority of these symbols only in Raw mode. It will be only tab, cr and lf symbols that will be always visible.

iafan commented 8 years ago

> Because of this latest change, the diff highlighting has currently the same behavior in TM matches and user suggestions, therefore displaying the part that was removed and the part that was added. Unless I'm checking it wrong, user suggestions previously displayed only the part that was added. Based on the feedback, I'm happy to make that behavior adjust selectively.

I missed that one from @julen's initial comment. For user suggestions we also display the full diff (removed/added parts), so the diff rendering is consistent with similar translations.

iafan commented 8 years ago

@khaledhosny any feedback from you?

julen commented 8 years ago

We are resuming work on this, and if there's any RTL-related feedback to add on top of the previous points, we'd love to hear that @khaledhosny. Thank you!

julen commented 8 years ago

> But if for some reason it is desired to translate the raw (not unescaped) string, then the translation also needs to be treated as a raw one, and translators will be following the escaping rules exposed in the source string.

So if I didn't miss the point @iafan, this would somehow mean having two editor modes that behave differently:

iafan commented 8 years ago

@julen, this is how these modes are supposed to work:

  1. regular mode: we map only characters which are not related to directionality.
  2. "raw" mode: we map all characters, force LTR and use monospaced font.

julen commented 8 years ago

After reading @dwaynebailey's comment it seems there was some confusion, so I've put up a live demo that showcases how the font would work. @iafan this already includes the last clarifications you made.

Some caveats aside (like copying text to external apps — which can be solved), I think I like the experience so far.

iafan commented 8 years ago

The demo looks good!

One minor amendment from me: we only want to convert spaces to dots in the Raw mode.

iafan commented 8 years ago

Not sure if it's the right time to report any issues with the editing (this is a demo, after all), but:

  1. Ctrl+Space not only inserts space, but moves the caret to the end of the text, which is undesirable.
  2. Pasting any text also moves the caret to the end of the text.
  3. Copy-pasting of the LF character alone doesn't work.

khaledhosny commented 8 years ago

I just tested @julen’s live demo above and a simple two words Arabic string gets its words reordered because of the inserted bullet (word one to the left of word two, not to the right of it).

Generally, I don’t think replacing Unicode characters with other ones will work well as far as the BiDi algorithm is concerned, unless the replacement character has the exact BiDi type of the one it is replacing.
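
One way to check a candidate mapping against this concern is to compare the BiDi class of each real character with that of its replacement. The sketch below uses Python's `unicodedata.bidirectional`; the `bidi_mismatches` helper and the `CANDIDATE` sample are hypothetical, and the classes reported depend on the Unicode database bundled with the interpreter.

```python
import unicodedata


def bidi_mismatches(mapping):
    """Return (real, real_class, shown, shown_class) for pairs whose BiDi classes differ."""
    mismatches = []
    for real, shown in sorted(mapping.items()):
        real_class = unicodedata.bidirectional(chr(real))
        shown_class = unicodedata.bidirectional(chr(shown))
        if real_class != shown_class:
            mismatches.append((real, real_class, shown, shown_class))
    return mismatches


# A few pairs taken from the mapping discussed in this thread.
CANDIDATE = {0x20: 0xE020, 0xA0: 0xE0A0, 0x0A: 0xE00A}

for real, real_class, shown, shown_class in bidi_mismatches(CANDIDATE):
    print(f"U+{real:04X} ({real_class}) -> U+{shown:04X} ({shown_class or 'unknown'})")
```

An empty result would mean every replacement shares the class of the character it stands for, which is the condition stated above.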

julen commented 8 years ago

@khaledhosny the bullet should not be present in the regular editing mode, as @iafan mentioned above. Besides that, I didn't specify any dir in the demo (I was in a rush and wanted to publish it), so these points probably affect proper RTL rendering. I updated the demo page to allow switching from LTR to RTL; it would be nice if you could check it out again.

Re. BiDi, have you tested that to confirm your hunch?

Thanks again!

khaledhosny commented 8 years ago

The base direction (the dir here) should have no effect on the case above, the order of the Arabic words should be the same. Even in “raw” mode, I expect the BiDi algorithm to be applied, otherwise it will be giving misleading results.

As for BiDi, the algorithm is very sensitive to character properties. Even characters that would appear to work in an identical way (e.g. an Arabic and a Hebrew letter) can behave differently under certain circumstances (they have different BiDi types for a reason).

Here are some specific issues I found:

I’m not sure what other characters have special treatment to test.

julen commented 8 years ago

@khaledhosny some characters having no effect at all is a mistake from my side in the demo (they were being replaced with the new symbols, instead of being kept along with the symbols). I reckon this should be fixed and the characters should have their effect now.

julen commented 8 years ago

> I’m not sure what other characters have special treatment to test.

The demo implements the following mapping extracted from @iafan's test files.

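| (Unicode) Meaning | Dec | Hex | Type | Strength | <==> | (Font) Symbol | Dec | Hex | BiDi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Null Character | 0 | 0000 | BN | Weak | | NULL | 57344 | E000 | ? |
| Tabulation | 9 | 0009 | S | Neutral | | TAB | 57353 | E009 | ? |
| Line Feed | 10 | 000A | B | Neutral | | LF | 57354 | E00A | ? |
| Carriage Return | 13 | 000D | B | Neutral | | CR | 57357 | E00D | ? |
| Escape | 27 | 001B | BN | Weak | | ESC | 57371 | E01B | ? |
| Space | 32 | 0020 | WS | Neutral | | SPACE* | 57376 | E020 | ? |
| Non-Breaking Space | 160 | 00A0 | CS | Weak | | NBSP | 57504 | E0A0 | ? |
| Others | | | | | | | | | |
| Zero-Width Space | 8203 | 200B | BN | Weak | | ZWS | 61451 | F00B | ? |
| Zero-Width Non-Joiner | 8204 | 200C | BN | Weak | | ZWNJ | 61452 | F00C | ? |
| Zero-Width Joiner | 8205 | 200D | BN | Weak | | ZWJ | 61453 | F00D | ? |
| Left-to-Right Mark | 8206 | 200E | L | Strong | | LRM | 61454 | F00E | ? |
| Right-to-Left Mark | 8207 | 200F | R | Strong | | RLM | 61455 | F00F | ? |
| Left-to-Right Embedding | 8234 | 202A | LRE | | | LRE | 61482 | F02A | ? |
| Right-to-Left Embedding | 8235 | 202B | RLE | | | RLE | 61483 | F02B | ? |
| Pop-Directional Formatting | 8236 | 202C | PDF | | | PDF | 61484 | F02C | ? |
| Left-to-Right Override | 8237 | 202D | LRO | | | LRO | 61485 | F02D | ? |
| Right-to-Left Override | 8238 | 202E | RLO | | | RLO | 61486 | F02E | ? |
| Word Joiner | 8288 | 2060 | BN | Weak | | WJ | 61536 | F060 | ? |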

Note: @dwaynebailey formatted table and added directionality data.

Legend:

Also useful Wikipedia template for directionality classes

iafan commented 8 years ago

@khaledhosny did you also try the read-only rendering demo I provided some time ago? It should have RTL enabled.

iafan commented 8 years ago

> One minor amendment from me: we only want to convert spaces to dots in the Raw mode.

@julen: and yet one more amendment: it would be absolutely fabulous if you could still render spaces as dots at the beginning or the end of the string (hanging spaces).
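
A minimal sketch of the hanging-spaces rule, assuming the space symbol sits at U+E020 as in the mapping table above; `mark_hanging_spaces` is a hypothetical helper, not code from the demo. Only runs of spaces at the very start or very end of the string are replaced, and inner spaces are left alone.

```python
import re

SPACE_SYMBOL = "\ue020"  # font symbol for a visible space (dot), per the table above

# \A and \Z anchor to the whole string, so spaces around inner line breaks are untouched.
_HANGING = re.compile(r"\A +| +\Z")


def mark_hanging_spaces(text: str) -> str:
    """Replace leading/trailing spaces with the visible space symbol."""
    return _HANGING.sub(lambda m: SPACE_SYMBOL * len(m.group()), text)


print(mark_hanging_spaces("  Save as ") == "\ue020\ue020Save as\ue020")  # True
```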

dwaynebailey commented 8 years ago

So it seems that the issue @khaledhosny has raised isn't being addressed in the PUA, in that the replacements don't have the exact same bidi categories. If we had PUA matches that were in the same class as the ones we're substituting, then it could work. But it seems that they are all bidi neutral or something similar.

So some options:

  1. We see this as an LTR solution and do something else for bidi.
  2. We use a font that changes the character as is and don't use PUA. I think once again I need reminding why we needed to use the PUA?