microsoft / TypeScript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
https://www.typescriptlang.org
Apache License 2.0

Template string downlevel to ES5 escapes more Unicode characters than necessary #10498

Open shytikov opened 8 years ago

shytikov commented 8 years ago

TypeScript Version: 1.8.10

TypeScript compiles template strings that contain Unicode characters incorrectly.

Code

var test1: string = 'test'; // Normal string (only English)
var test2: string = `test`; // Template string (only English)
var test3: string = 'тест'; // Normal string (Cyrillic)
var test4: string = `тест`; // Template string (Cyrillic)

Expected behavior:

There should be no significant difference in the string representation, no matter which language the text is in, since it is all Unicode anyway. The expected behavior is what you already see with standard, single-line strings; that is what I would consider normal.

var test1 = 'test'; // Normal string (only English)
var test2 = "test"; // Template string (only English)
var test3 = 'тест'; // Normal string (Cyrillic)
var test4 = "тест"; // Template string (Cyrillic)

Actual behavior:

Multiline (template) strings with non-English characters become encoded.

var test1 = 'test'; // Normal string (only English)
var test2 = "test"; // Template string (only English)
var test3 = 'тест'; // Normal string (Cyrillic)
var test4 = "\u0442\u0435\u0441\u0442"; // Template string (Cyrillic)
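For context, the escaped and unescaped forms denote the same string at runtime, so the difference is in the readability of the emitted source rather than in program behavior. A minimal check, using illustrative variable names that are not part of the original report:

```typescript
// Both literals denote the same four-character string at runtime;
// only the source text of the emitted file differs.
const escaped: string = "\u0442\u0435\u0441\u0442";
const literal: string = "тест";

console.log(escaped === literal); // true
console.log(escaped.length);      // 4
```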

Link to official playground

RyanCavanaugh commented 8 years ago

var test5 = `типcкрeпт`;

shytikov commented 8 years ago

Swedish characters will also fail:

var test6 = `också`;

The last symbol (å) will be escaped.

shytikov commented 8 years ago

I believe template string handling in general could be done differently depending on the target. For example, for es6 it's not necessary to escape newline characters in multi-line template strings, since that concept already exists in es6. But for the es5 target such escaping is a must.

DanielRosenwasser commented 8 years ago

for es6 it's not necessary to escape newline characters in multi-line template strings, since that concept already exists in es6. But for the es5 target such escaping is a must.

That's what we currently do.

The reason I didn't preserve the original text when I implemented this was to avoid re-scanning the string when performing emit. We basically take the internal textual representation and call something that's basically an augmented JSON.stringify. This is good because it replaces newlines with \n.

However, the function takes a conservative approach and uses a Unicode escape for anything that falls outside of ASCII. We take advantage of this whenever you use extended Unicode escapes in any string.

For instance, the string "\u{12345} också" will get rewritten to

"\uD808\uDF45 ocks\u00E5"

in ES5 instead of

"\uD808\uDF45 också"

So what you're noticing is that this function does a teensy bit too much.
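The difference between the two outputs above can be sketched roughly as follows. This is only an illustration of the idea, not the compiler's actual emit code: the helper names `toUnicodeEscape`, `escapeAllNonAscii`, and `escapeSurrogatesOnly` are hypothetical, and the sketch ignores quotes, backslashes, and control characters.

```typescript
// Hypothetical helpers for illustration; not taken from the TypeScript compiler.

// Format one UTF-16 code unit as a \uXXXX escape.
function toUnicodeEscape(codeUnit: number): string {
    const hex = codeUnit.toString(16).toUpperCase();
    return "\\u" + ("0000" + hex).slice(-4);
}

// Conservative approach: escape every code unit above printable ASCII.
function escapeAllNonAscii(text: string): string {
    let out = "";
    for (let i = 0; i < text.length; i++) {
        const code = text.charCodeAt(i);
        out += code > 0x7e ? toUnicodeEscape(code) : text.charAt(i);
    }
    return out;
}

// Narrower approach: escape only surrogate code units (the two halves of
// characters above U+FFFF) and leave other non-ASCII characters untouched.
function escapeSurrogatesOnly(text: string): string {
    let out = "";
    for (let i = 0; i < text.length; i++) {
        const code = text.charCodeAt(i);
        const isSurrogate = code >= 0xd800 && code <= 0xdfff;
        out += isSurrogate ? toUnicodeEscape(code) : text.charAt(i);
    }
    return out;
}

const input = "\u{12345} också";
console.log(escapeAllNonAscii(input));    // \uD808\uDF45 ocks\u00E5
console.log(escapeSurrogatesOnly(input)); // \uD808\uDF45 också
```

U+12345 lies above U+FFFF, so in UTF-16 it is stored as the surrogate pair D808 DF45, which is why both variants emit two escapes for it.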

Also, given the fact that TypeScript can _finally_ assume the existence of JSON.stringify, this fix is probably a LOT easier. :smiley:
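As a rough illustration of why plain JSON.stringify is attractive here: it escapes newlines but leaves non-ASCII characters as they are. The variable names below are made up for the example, and nothing here is meant to describe what the compiler does internally.

```typescript
// JSON.stringify escapes the newline but keeps Swedish and Cyrillic letters.
const multiLine: string = "också\nтест";
const emitted: string = JSON.stringify(multiLine);
console.log(emitted); // "också\nтест" (quoted, with a backslash followed by n)

// Caveat: U+2028 and U+2029 are passed through unescaped, and ES5 treats
// them as line terminators inside string literals, so an ES5 emitter
// would still have to escape those two characters itself.
const lineSeparator: string = JSON.stringify("\u2028");
console.log(lineSeparator.length); // 3: two quotes plus the raw character
```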

JakeGinnivan commented 6 years ago

We have just hit this issue upgrading from 2.4.1 to 2.6.1.

glamorous.h3({
    '&::after': {
        content: `"\\00BB"`
    }
})

Before the upgrade it output chevrons; after the upgrade it outputs x000BB.
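For reference, the template literal in that snippet evaluates to a seven-character string: a double quote, a backslash, 00BB, and a closing double quote. CSS reads the backslash plus hex digits as an escape for U+00BB (»). A correct downlevel emit therefore has to preserve exactly that value, which an ordinary string literal can express. The variable names below are illustrative only:

```typescript
// The template literal and the hand-written string literal hold the same value.
const fromTemplate: string = `"\\00BB"`;
const fromString: string = "\"\\00BB\"";

console.log(fromTemplate === fromString); // true
console.log(fromTemplate.length);         // 7
```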