translate / pootle

Online translation tool
http://pootle.translatehouse.org
GNU General Public License v3.0

Escaping backslash #3941

Closed khaledhosny closed 8 years ago

khaledhosny commented 9 years ago

It seems there is a problem with escaping backslashes in translations: when I enter \, submit, then re-open the unit, it gets converted into \\. I’m not sure if this is the intended behaviour, but if it is, it is confusing. I don’t think it is a good idea to expose escaping to translators; it should be handled transparently.
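
To illustrate what "handled transparently" could mean here, the following is a minimal sketch that assumes a storage format which escapes backslashes and newlines. The helper names `escapeForStorage` and `unescapeForDisplay` are hypothetical and are not Pootle's actual API; the point is only that the two steps must be exact inverses so that the translator sees what was typed.

```typescript
// Hypothetical helpers (not Pootle's API): escape when saving a unit,
// unescape when loading it back into the editor.
function escapeForStorage(text: string): string {
  // Escape backslashes first, otherwise the backslash added for "\n"
  // would itself be doubled.
  return text.replace(/\\/g, "\\\\").replace(/\n/g, "\\n");
}

function unescapeForDisplay(stored: string): string {
  // A single pass, so an escaped backslash followed by "n" is not
  // misread as an escaped newline.
  return stored.replace(/\\(\\|n)/g, (_match, c: string) =>
    c === "n" ? "\n" : "\\"
  );
}

const typed = "C:\\temp"; // the translator typed C:\temp
const reopened = unescapeForDisplay(escapeForStorage(typed));
console.log(reopened === typed); // the editor shows what was typed
```

If the unit is re-opened without the second step, the translator sees the stored form with the doubled backslash, which matches the symptom described above.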

5y commented 8 years ago

Hi, here are my results for Persian (an RTL language) and bidi:

  1. I can't use Tab.
  2. Dot (.) for space works fine. Also, can we define a sign for ZWNJ? I can't find any sign for it; it is useful for RTL languages.
  3. About NBSP, I got the following result [1].

[1] image

dwaynebailey commented 8 years ago

> We currently display \r as [CR] symbol, but don't have a good support for it in terms of editing. I think we should isolate translators from intricacies of platform-dependent newline encoding, and the proper way of dealing with this is at a parser level, not at unit editing (since the type of line endings is a property of an entire file, and all line endings should be encoded consistently throughout the file). In Pootle, all newlines must always be encoded as \n. I suggest moving this into another issue, and not tie to this PR.

@iafan, while I agree we want to isolate translators I think your assumptions are incorrect.

  1. These escapes are actually valid in some localisation formats, we can't pretend that they don't exist.
  2. The parser in some instances is not able to deal with this. When do we convert \n in Pootle to \r\n in the target format? The whole project, some files, the whole file, part of the file, a single unit?
  3. Until the parsers have this sort of contract with Pootle and Pootle is able to manage such contract we can't just drop this.
  4. Not addressing this issue here, makes Pootle undeployable. And the other route of making parsers able to manage the various escapes towards \n needs way more work and time.

iafan commented 8 years ago

@dwaynebailey yes, I understand that some files may come with Windows-specific endings, and Pootle needs to deal with them. What I want to say is that the only way to deal with them properly is at the file level, not at the unit level. Most units will not have a hard line break in the string, so you will never know what to insert in the translation (\r\n or just \n); only the file (the parser) knows it. The way it currently works in Pootle is prone to breaking the final localized file by encoding line breaks inconsistently.

Until this is properly implemented, I believe that what we currently do (display \r as a special visible [CR] box, allow copy-pasting it, save it in the unit properly) keeps the level of support where it currently is in Pootle. So let's focus on [CR] handling issues, if there are any.

iafan commented 8 years ago

@5y did you type the NBSP character yourself, or did you just copy-paste the [nbsp] symbol from the source string?

I tried to do the latter, and it seems that it works correctly this way. Can you please check?

iafan commented 8 years ago

> Also we can define sign for ZWNJ I can't find any Sign for it, usefull for RTL language.

@5y you can insert the ZWNJ symbol into your translation (there's a set of buttons for inserting such symbols beneath the textarea), but it is supposed to be visible only in Raw View. If you toggle the 'Raw View' button (the button next to the 'Show Timeline' one), you will see all your special symbols displayed there.

dwaynebailey commented 8 years ago

@iafan I'm afraid you are making the assumption that these escapes will be consistent through the file. There is no guarantee that this will be true.

I'm really not sure what you mean by Pootle encoding the break inconsistently. Its behaviour currently is very consistent.

Displaying \r as [CR] works now, but there are some problems that I've reported previously. Displaying \r\n as a single [CR][LF] seems like a more sensible solution to me, as it forces translators to be consistent with what the unit expects in terms of newlines.

iafan commented 8 years ago

But why do you assume that \r\n will always stick together as a sequence? Technically one can use \r alone as some sort of separator. Currently Pootle does allow you to use \r and \n independently. If we make an assumption that \r\n should always go together, then the new editor will not support some cases the current editor does.

Also, if line breaks are unit-specific, how does Pootle know what sequence to use if one wants to use an explicit line break in the translation? How is Pootle currently consistent about this? I can easily add any sequence of \r and \n while translating the string.

dwaynebailey commented 8 years ago

@iafan my assumption of \r\n is based on seeing this in the wild. While you are right that \r can occur anywhere, I've never seen it. If the aim is to make translators' lives easier, I would suggest that \r\n be seen as one character; that does not preclude a single \r being treated as such. But I'm not attached to this solution, it's just that it makes it very clear what line ending is required and it's a single click.

The last point is a perennial problem. And I think you miss my point that there is the real likelihood that a file has multiple newline styles, so the parsers and file formats can't always know what this should be. HTML can be snuck into JS. Windows messages into Unix PO files. So we are able to deal with the cases where the newline already exists, but we can't really deal with cases where one does not exist. And yes they're likely consistent across a file.... except when they aren't.

The bottom line for me is this, I'm not sure we can solve that, even if we do it's not part of this problem. But \r\n is a part of this problem and we need to solve it as part of this issue.
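
The mixed-newline situation described above can at least be detected. The sketch below is only an illustration with hypothetical function names (`countNewlineStyles`, `hasMixedNewlines`); it is not part of Pootle or of the Translate Toolkit parsers, and it says nothing about which style is the right one for a given unit.

```typescript
interface NewlineCounts {
  crlf: number; // \r\n pairs
  lf: number;   // lone \n
  cr: number;   // lone \r
}

function countNewlineStyles(text: string): NewlineCounts {
  const counts: NewlineCounts = { crlf: 0, lf: 0, cr: 0 };
  // \r\n is listed first in the alternation so that its two halves
  // are never counted as a lone \r plus a lone \n.
  const matches = text.match(/\r\n|\n|\r/g) || [];
  for (const m of matches) {
    if (m === "\r\n") counts.crlf++;
    else if (m === "\n") counts.lf++;
    else counts.cr++;
  }
  return counts;
}

function hasMixedNewlines(text: string): boolean {
  const { crlf, lf, cr } = countNewlineStyles(text);
  return [crlf, lf, cr].filter((n) => n > 0).length > 1;
}

// A Windows-style message followed by a Unix-style one.
console.log(hasMixedNewlines("first\r\nsecond\nthird"));
```

Running such a check per file and per unit would show how often the "except when they aren't" case occurs in real projects.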

iafan commented 8 years ago

I was proposing a pretty straightforward fix: unconditionally replace \r\n with \n. Such normalization doesn't break anything, simplifies the workflow and will protect developers from hard-to-catch issues with files having mixed line endings. This is what we've been doing internally for many years without any single issue.

We can hack the editor to do the following:

  1. Detect if there's a \r\n sequence in the source string. If there is, raise the hasCRLFEndings flag.
  2. Replace all \r\n with \n before editing. Users will see linebreaks as simple [LF] symbols (in other words, they will be isolated from the differences in line endings, and the editor will work as expected).
  3. Before submitting the translation, if the hasCRLFEndings flag is raised, convert all lone \n to \r\n.

So in the output units, their line endings will be preserved to our best knowledge, but in Pootle people will never deal with different kinds of line endings.
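
The three steps above can be sketched as follows. Only the `hasCRLFEndings` flag name comes from the proposal; the `EditableUnit` type and the two function names are assumptions made for the illustration and do not reflect the editor's real code.

```typescript
interface EditableUnit {
  text: string;            // what the translator sees and edits
  hasCRLFEndings: boolean; // step 1: remembered from the source string
}

// Steps 1 and 2: detect \r\n in the source, then hide it from the editor.
function prepareForEditing(source: string, target: string): EditableUnit {
  return {
    text: target.replace(/\r\n/g, "\n"),
    hasCRLFEndings: source.indexOf("\r\n") !== -1,
  };
}

// Step 3: restore \r\n on submit if the source used it.
function prepareForSubmit(unit: EditableUnit): string {
  if (!unit.hasCRLFEndings) {
    return unit.text;
  }
  // /\r?\n/ rewrites a lone \n to \r\n and leaves an existing \r\n as it is;
  // a lone \r (not followed by \n) is not touched.
  return unit.text.replace(/\r?\n/g, "\r\n");
}

const unit = prepareForEditing("Line 1\r\nLine 2", "Reël 1\r\nReël 2");
unit.text = unit.text + "\nReël 3"; // translator adds a plain line break
console.log(JSON.stringify(prepareForSubmit(unit)));
```

Note that the flag is derived per unit from the source string, so a unit whose source has no line break at all still gives no hint about which ending to use.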

dwaynebailey commented 8 years ago

@iafan simply replacing \r\n with \n is not a straightforward fix. Let's not confuse your specific workflow with the general Pootle user's workflow, in which the format does not know about such escaping plans and where both are valid in the format.

I'm all for unifying the info to localisers. I'm just not 100% sure why we're trying to solve this issue here and now. This PR already has enough scope, I really don't think we want to scope creep here.

I can see your proposal working for most cases I've seen. But I don't think this is the right time and place to land such a fix that could break other things.

But please, we've just asked for feedback from a number of users. This discussion shouldn't be clouding this issue here and now.

5y commented 8 years ago

@iafan yes, I just typed it. And what about the tab?

iafan commented 8 years ago

@5y thanks. We'll look into this. I don't think there's a way to insert tabs in Pootle (but if the source string contains a tabulation symbol, you can copy-paste it). We could think of an alternative shortcut to insert tabs, because the Tab key itself is better reserved for navigation between controls.

dwaynebailey commented 8 years ago

@iafan should we at least make the raw font characters clickable? Inserting things like NBSP and TAB via copy-paste doesn't seem ideal. We allow clicking on special non-ASCII characters; I wonder if we can duplicate that functionality for these?

julen commented 8 years ago

@5y I'm typing a non-breaking space with the physical keyboard, using alt+space in Mac, along with some text in Arabic script and I can properly see the NBSP symbol both in the textarea and the suggestion box. I tried in Chrome, Safari and Firefox and I cannot reproduce the problem. It'd be good to know which steps you followed so we can diagnose what happened.

julen commented 8 years ago

> @julen one new regression I spotted is that in read-only units (table rows other than the unit editor) we started to visually emphasize placeables, and because of the increased cell height it now shows a vertical scrollbar. See how em dash ("—") is rendered in the rows on this page: https://translate.stage.evernote.com/ru/android_evernote/translate/#search=%E2%80%94&sfields=source,target I think we don't need to render certain parts of the string as placeables in such rows.

This should now be fixed; you can use the previous link to validate it.

dwaynebailey commented 8 years ago

@julen I notice that the 'previous translations' are now side-by-side instead of stacked as it was previously. As in https://translate.stage.evernote.com/ru/android_evernote/translate/#search=%E2%80%94&sfields=source,target

Was that intended?

iafan commented 8 years ago

See also a related issue: https://github.com/translate/pootle/issues/4739 (undo/redo stack doesn't work properly when we insert placeables) which becomes more prominent with our new editor.

julen commented 8 years ago

@dwaynebailey it's partly intended and it's been around since the very early stages (and even validated, it seems). I was under the impression I provided some comment about it, although I see I didn't go into the details of this specific change as the commit did.

Since newline characters result in line breaks in HTML, and the previous layout displayed texts inline, the result would be garbled; that's why source/target texts need to be displayed as block-level elements.

While at it, the vertical split was chosen based on observing results for a while, considering it offered a clearer picture of the match.

Another option, of course, is to split blocks horizontally, which would vertically align texts. However, this would also leave translated texts at the bottom, farther away from the translator's eyes, even requiring scrolling in some cases, and it adds more vertical space for short texts.

Feel free to play with results by adjusting the CSS declaration for .suggestion-wrapper and adding the flex-direction: column; rule to it, so you get a feel for the horizontal split. I don't have strong feelings and I'm happy to switch this to the horizontal split if that's what we all prefer; note, however, that either way there needs to be a compromise.

dwaynebailey commented 8 years ago

@julen it's easy to miss and I fear my short string test file would have masked these anyway. Some aspects of my review are only happening now as the PR stabilizes.

I'm not attached to any of the renderings, though I've gotten used to above/below. I suspect that your comment re left/right, that "it offered a clearer picture of the match", is correct in that all the info on the left is about matches so you can parse that easily without it being clouded by the translation.

I couldn't figure out the test instructions I'm afraid. So for now assume I'm OK with left/right based on your comment.

julen commented 8 years ago

In order to test the horizontal split rendering you need to edit editor.css (either on the FS or directly in your browser) and add flex-direction: column; at the end of the .suggestion-wrapper section.

dwaynebailey commented 8 years ago

@julen thanks for the pointers, managed to change that on the staging server. I concur, two columns makes it much easier to see the differences of all the suggested units from the source text, making it easier to make a selection.

khaledhosny commented 8 years ago

I’m testing on https://translate.stage.evernote.com/ar and the first thing I noticed is that the text area is left aligned, is this intentional?

khaledhosny commented 8 years ago

It seems that not only the text is aligned left, but also the direction is ltr which breaks the order of any text with mixed directionality.

iafan commented 8 years ago

@julen, @khaledhosny is right: directionality is broken in the current PR. Looks like a regression.

julen commented 8 years ago

Thanks for the heads-up folks, the context required to retrieve the language information was lost in the last commit and this is now fixed.

khaledhosny commented 8 years ago

OK, looks good now. Doing more testing I found that <RTL><tab><numbers> is rendered as <RTL><numbers><tab>, using e.g. واحد ١٢٢٢٢ (screenshot: 2016-05-23 12-19-14).

khaledhosny commented 8 years ago

The same with NBSP as well.

julen commented 8 years ago

Thanks for testing again @khaledhosny. @iafan we will need to do the same we did with new lines: find a spare character with the same strength as the actual character being replaced and use that in our raw font for the symbols. This applies to the symbols displayed in regular mode, which at the moment are NULL (\u0000), TAB (\u0009), ESC (\u001b), NBSP (\u00a0) and LF (\u000a, already covered). For specific details, please refer to the character table in this same thread.

dwaynebailey commented 8 years ago

This site http://www.sql-und-xml.de/unicode-database/#bidi-eigenschaften might help in terms of finding Unicode characters that are in the same class.

Someone might need to remind me why we can't just draw glyphs for the real code points :/

iafan commented 8 years ago

> Someone might need to remind me why we can't just draw glyphs for the real code points :/

From my prior experiments, browsers (or browser + OS-level font rendering libraries) have strong feelings about certain characters and don't allow rendering arbitrary glyphs in place of them. Firefox is the trickiest browser in that respect: many glyphs that render properly in Chrome or IE are not displayed in FF.

julen commented 8 years ago

@iafan provided me with an updated font which remaps the control characters in regular mode to their control picture counterparts, which have the same strength as the original characters, and the code has been updated to use that. With this, as far as I can see, the issue with TAB (and NBSP etc.) reported earlier is now fixed.
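
For readers unfamiliar with control pictures: Unicode's Control Pictures block assigns U+2400 to U+241F as visible symbols for the C0 control characters U+0000 to U+001F, at a fixed offset. The remapping described above happens inside the font; the sketch below is a text-level illustration of the same code point correspondence only, with hypothetical function names. NBSP (U+00A0) lies outside the C0 range, so this sketch leaves it unchanged, and how the font handles it is not shown here.

```typescript
// C0 controls U+0000..U+001F correspond to Control Pictures U+2400..U+241F.
function toControlPicture(ch: string): string {
  const code = ch.charCodeAt(0);
  return code <= 0x1f ? String.fromCharCode(0x2400 + code) : ch;
}

// Replace every C0 control in a string with its visible picture.
function visualizeControls(text: string): string {
  return text.split("").map(toControlPicture).join("");
}

// TAB (U+0009) maps to U+2409 and LF (U+000A) maps to U+240A.
console.log(visualizeControls("one\ttwo\n"));
```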

(Note: if you are using Firefox and are experiencing problems when testing the font, you might need to restart the browser — we have observed it caches fonts very aggressively.)

dwaynebailey commented 8 years ago

Thanks @julen and @iafan

Doing some clicky testing.

@julen not sure if the undo stack is part of this? If not: doing lots of edits and then undoing makes a pretty amazing mess in RTL.

iafan commented 8 years ago

Neither CR+LF handling nor undo stack is a part of this update. Just formal symbol rendering in RTL.

julen commented 8 years ago

I pushed another update which deals with undo/redo (our staging server is up-to-date in case someone wants to give it a shot).

iafan commented 8 years ago

@julen Undo/redo seems to work correctly (including the cases when placeables are inserted).

I noticed some strange behavior with copy-pasting, though:

  1. Open https://translate.stage.evernote.com/af/_test_raw_font/translate/test.po#unit=5626935
  2. Select the first symbol of the translated text (E), copy, then paste it.

On pasting, the trailing newline will be gone.

julen commented 8 years ago

Thanks for the report @iafan, I hope to have a fix for it soon.

julen commented 8 years ago

> Thanks for the report @iafan, I hope to have a fix for it soon.

I spent part of today debugging this issue, and it turns out it happens in Chrome and Safari and only when cutting is used before. I couldn't draw more conclusions unfortunately: for some obscure reason, when the paste happens and the change event is triggered, the textarea in Chrome/Safari's DOM is missing the final \n in its value. If I add a character after \n, I cannot reproduce the issue anymore.

In the meantime, I also added support for clicking on the raw font symbols.

dwaynebailey commented 8 years ago

@julen did some undo testing and got the same positive results and the same anomaly as @iafan. Did quite a bit of changing and altering. I got some strange results in RTL, but I was using Latin in RTL, so I would prefer an RTL user to comment.

dwaynebailey commented 8 years ago

@julen picked up an issue with clickable font symbols (nice addition though)

  1. Using https://translate.stage.evernote.com/af/_test_raw_font/translate/test.po#unit=5626949&offset=0
  2. Delete and add manually including manual entry on NBSP, works
  3. Delete, type "Foo", click on NBSP to insert, type "bar"
  4. Click submit. "bar" is removed from entry, sticks on that unit, "Submit" button greys out

Hope that allows you to replicate.

julen commented 8 years ago

I believe this last issue you reported is now fixed @dwaynebailey.

dwaynebailey commented 8 years ago

@julen confirmed, the above issue is fixed.

julen commented 8 years ago

Yet another update: now \r\n is handled transparently, normalizing it internally in the client-side code to \n. Translators can still input individual \r characters by using the mouse. Our staging server, as well as the test file, have been updated accordingly.

With this in place, I believe we are close to considering the work here as finalized. As a missing bit, I still haven't figured out the pasting issue reported earlier, and I have the feeling it might be some race condition but I still need to check that again.

dwaynebailey commented 8 years ago

@julen tested on staging and this works from the user side of the equation, can't test the source file.

julen commented 8 years ago

4171 is now merged, so closing this one.