google / closure-compiler

A JavaScript checker and optimizer.
https://developers.google.com/closure/compiler/
Apache License 2.0

Closure minification adds \uxxxx escapes into output file, increasing code size #4158

Closed juj closed 8 months ago

juj commented 8 months ago

In https://github.com/emscripten-core/emscripten/pull/21426 we are discussing ways to improve on Base64 encoding of binary WebAssembly Modules embedded inside .js code. We have observed that both gzip and brotli compress Base64 poorly.

One observation here is that the UTF-8 standard is well-specified, so we can attempt to embed bytes directly as UTF-8 code points.

Attempting to do so runs into a Closure minification problem, however.

Input: ab.zip

function binaryDecode(r) {
  for(var t=0, B=r.length, e=new Uint8Array(B); t<B; ++t) e[t]=r.charCodeAt(t)-1;
  return e;
}

// String with bytes 0x00 - 0xFF embedded in it, each stored as code point (byte value + 1), i.e. U+0001 - U+0100.
var js = '  \n\r !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀ';

var a = binaryDecode(js);
console.log(a.slice(0, 64));
console.log(a.slice(64, 128));
console.log(a.slice(128, 192));
console.log(a.slice(192, 256));

This code nicely prints out all bytes from 0x00 up to 0xFF.
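For reference, here is a minimal sketch of how such a string could be generated at runtime. `binaryEncode` is a hypothetical helper, not part of Emscripten or the attached files; it simply mirrors the `-1` in `binaryDecode` by adding 1 to every byte. It does not handle the escaping (`\n`, `\r`, `\'`, `\\`) that is needed when the string is written out as a JavaScript source literal.

```javascript
// Hypothetical inverse of binaryDecode(): maps each byte b to code point b + 1.
function binaryEncode(bytes) {
  var s = '';
  for (var i = 0; i < bytes.length; ++i) s += String.fromCharCode(bytes[i] + 1);
  return s;
}

// Same decoder as in the report above.
function binaryDecode(r) {
  for (var t = 0, B = r.length, e = new Uint8Array(B); t < B; ++t) e[t] = r.charCodeAt(t) - 1;
  return e;
}

// Round-trip all 256 byte values.
var all = new Uint8Array(256);
for (var i = 0; i < 256; ++i) all[i] = i;
var encoded = binaryEncode(all);
var decoded = binaryDecode(encoded);
console.log(encoded.length, decoded[0], decoded[255]);
```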

The input file is 689 bytes in size. However, running this file through Closure Compiler with Advanced Optimizations produces a file that is 1225 bytes in size: ab_closured.zip
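To get a feel for where the growth comes from, here is a rough size model. It is only a sketch under one assumption: that every character outside printable ASCII (0x20 to 0x7E) is written as a six-character `\uXXXX` escape. Closure's real output rules may differ, for example for `\n` or quotes, so the numbers are illustrative and are not meant to reproduce the exact file sizes above. `utf8Size` and `asciiEscapedSize` are hypothetical helper names.

```javascript
// Size of the string when stored as raw UTF-8 bytes.
function utf8Size(s) {
  return new TextEncoder().encode(s).length;
}

// ASSUMPTION: printable ASCII costs 1 byte, everything else costs 6 bytes (\uXXXX).
function asciiEscapedSize(s) {
  var n = 0;
  for (var i = 0; i < s.length; ++i) {
    var c = s.charCodeAt(i);
    n += (c > 0x1f && c < 0x7f) ? 1 : 6;
  }
  return n;
}

// Same payload as the test string: code points U+0001 through U+0100.
var payload = '';
for (var cp = 1; cp <= 256; ++cp) payload += String.fromCharCode(cp);

var rawBytes = utf8Size(payload);
var escapedBytes = asciiEscapedSize(payload);
console.log('raw UTF-8 bytes:', rawBytes, 'ASCII-escaped bytes:', escapedBytes);
```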

Online Closure link

lauraharker commented 8 months ago

The compiler defaults to outputting ASCII, but you can specify a different output charset via the --charset flag (https://github.com/google/closure-compiler/wiki/Flags-and-Options#miscellaneous).

Does --charset=UTF-8 work for you?

juj commented 8 months ago

Thanks, yeah, that works!