guybedford / es-module-lexer

Low-overhead lexer dedicated to ES module parsing for fast analysis
MIT License
933 stars 48 forks source link

Return decoded specifiers and identifiers #61

Closed evanw closed 3 years ago

evanw commented 3 years ago

Context: https://github.com/vitejs/vite/issues/2083. Import paths are strings and can therefore contain escape sequences, like this:

import test from "./\u0442\u0435\u0441\u0442";
export default test;

Right now the documentation for this module seems to encourage using this library in a way that doesn't handle any escape sequences. Taking a raw slice of the input means that the backslashes are literal backslashes in the path instead of the actual underlying escape sequences. So in other words you would get the string "./\\u0442\\u0435\\u0441\\u0442" instead of the string "./тест" and then fail to import the path.

This can of course be worked around by interpreting the escape sequences yourself. But I thought I'd file this issue because it could be nice for either this library to do that for you or for there to be a recommended way of doing this, since one reason to use this library is to extract the import paths. If that's out of scope, maybe just calling out that you need to be sure to handle escape sequences yourself when you use this library with a link to the spec?

guybedford commented 3 years ago

Thanks, yes the API of this project could be more polished to avoid these footguns. It was actually originally written for https://github.com/guybedford/es-module-shims hence why it is entirely offset based, with a heavy focus on the overall library footprint.

I've added a new section to the readme in https://github.com/guybedford/es-module-lexer#escape-sequences, I think that should cover the concerns, but further clarifications welcome.

evanw commented 3 years ago

For this reason, when handling specifiers or identifiers it can advisable to pass the full string to JSON.parse:

JSON.parse(source.substring(imports[i].s, imports[i].e));

That workaround had occurred to me too. However, it doesn't work for a few reasons:

guybedford commented 3 years ago

It's missing the quote characters, so it won't actually work at all. You'd have to do this instead:

The .s and .e offset indices include the " quote characters exactly for this reason, so you don't need additions.

This is JavaScript, not JSON. They support different escape sequences.

Both of the examples support JSON.parse in my test.

It will technically break for single-quoted strings containing a double quote

Because of the first point, this case works out fine too.

guybedford commented 3 years ago

Because of the first point, this case works out fine too.

Ah, of course this doesn't, will put some thought to this one.

evanw commented 3 years ago

Here's my test:

import lexer from 'es-module-lexer';

function tryCatch(fn) {
  console.log(`\n${fn}:`);
  try {
    return fn();
  } catch (err) {
    return err;
  }
}

let js, imp;

js = `import 'abc.js';`;
[[imp]] = await lexer.parse(js);
console.log(tryCatch(() => JSON.parse(js.substring(imp.s, imp.e))));
console.log(tryCatch(() => JSON.parse('"' + js.substring(imp.s, imp.e) + '"')));

js = `import './\\x61\\x62\\x63.js';`;
[[imp]] = await lexer.parse(js);
console.log(tryCatch(() => JSON.parse('"' + js.substring(imp.s, imp.e) + '"')));

js = `import './\\u{20204}.js';`;
[[imp]] = await lexer.parse(js);
console.log(tryCatch(() => JSON.parse('"' + js.substring(imp.s, imp.e) + '"')));

I'm running with node v15.2.0 and es-module-lexer v0.3.26 from npm. This is the output:


() => JSON.parse(js.substring(imp.s, imp.e)):
SyntaxError: Unexpected token a in JSON at position 0
    at JSON.parse (<anonymous>)
    at file:///Users/evan/Desktop/example.mjs:16:33
    at tryCatch (file:///Users/evan/Desktop/example.mjs:6:12)
    at file:///Users/evan/Desktop/example.mjs:16:13

() => JSON.parse('"' + js.substring(imp.s, imp.e) + '"'):
abc.js

() => JSON.parse('"' + js.substring(imp.s, imp.e) + '"'):
SyntaxError: Unexpected token x in JSON at position 4
    at JSON.parse (<anonymous>)
    at file:///Users/evan/Desktop/example.mjs:21:33
    at tryCatch (file:///Users/evan/Desktop/example.mjs:6:12)
    at file:///Users/evan/Desktop/example.mjs:21:13

() => JSON.parse('"' + js.substring(imp.s, imp.e) + '"'):
SyntaxError: Unexpected token { in JSON at position 5
    at JSON.parse (<anonymous>)
    at file:///Users/evan/Desktop/example.mjs:25:33
    at tryCatch (file:///Users/evan/Desktop/example.mjs:6:12)
    at file:///Users/evan/Desktop/example.mjs:25:13

Edit: I think indirect eval should work fine though:

import lexer from 'es-module-lexer';

function tryCatch(fn) {
  console.log(`\n${fn}:`);
  try {
    return fn();
  } catch (err) {
    return err;
  }
}

let js, imp;

js = `import 'abc.js';`;
[[imp]] = await lexer.parse(js);
console.log(tryCatch(() => (0, eval)(js.substring(imp.s - 1, imp.e + 1))));

js = `import './\\x61\\x62\\x63.js';`;
[[imp]] = await lexer.parse(js);
console.log(tryCatch(() => (0, eval)(js.substring(imp.s - 1, imp.e + 1))));

js = `import './\\u{20204}.js';`;
[[imp]] = await lexer.parse(js);
console.log(tryCatch(() => (0, eval)(js.substring(imp.s - 1, imp.e + 1))));

js = `import './".js';`;
[[imp]] = await lexer.parse(js);
console.log(tryCatch(() => (0, eval)(js.substring(imp.s - 1, imp.e + 1))));

The output:

() => (0, eval)(js.substring(imp.s - 1, imp.e + 1)):
abc.js

() => (0, eval)(js.substring(imp.s - 1, imp.e + 1)):
./abc.js

() => (0, eval)(js.substring(imp.s - 1, imp.e + 1)):
./𠈄.js

() => (0, eval)(js.substring(imp.s - 1, imp.e + 1)):
./".js
guybedford commented 3 years ago

Ok, I've put together a PR in https://github.com/guybedford/es-module-lexer/pull/62. Let me know if that seems like it can work here.

evanw commented 3 years ago

Yeah that approach seems good to me.

guybedford commented 3 years ago

Released in 0.4.0.

evanw commented 3 years ago

I can confirm the fix. Looks good!