guybedford / es-module-lexer

Low-overhead lexer dedicated to ES module parsing for fast analysis
MIT License
917 stars 48 forks source link

Gauging interest in upstreaming some changes #127

Closed aomarks closed 1 year ago

aomarks commented 2 years ago

I'm currently integrating es-module-lexer into a project within Google, as part of integrating https://github.com/google/playground-elements/, and in doing that there are a few changes I'm considering.

I'm wondering if you have any thoughts on these, and whether you'd be interested in receiving upstream PRs for them?

  1. Converting the lexer.js bridge to TypeScript. This was really straightforward since I was able to just integrate the typings already available in lexer.d.ts.

  2. Replace the use of eval for JavaScript string literal unescaping with a C version. eval is a security smell for obvious reasons, so even though the implementation here is likely safe because it's only passed strings that have already been broadly parsed as strings, we have some auditing that makes it difficult to use eval at all. Also it requires broadening CSP. So my thought was to extend the C lexer to deal with unescaping directly. Curious if you already thought about that, whether eval turned out to be faster, or just simpler/less code?

  3. I'm curious about the manual heap management at https://github.com/guybedford/es-module-lexer/blob/a28c3668119b538e18c459baf14e630bc27ff118/src/lexer.js#L10, and was wondering if using malloc in the C program would be a simpler/safer approach? Maybe it was that the malloc implementation from wasi inflated the code too much? Curious to hear your thoughts on that.

Thanks for the awesome library! Would like to contribute if any of these changes sound good. Otherwise we can just maintain a small internal fork.

aomarks commented 2 years ago

Oh and one more thought:

  1. parse currently either returns a result array or the promise of one, depending on whether initialization has completed. What would you think about init instead resolving to the parse function, and not exporting parse directly? That way the user can't accidentally write racey code where they sometimes unexpectedly get a promise? Plus it slightly simplifies the parse implementation, since it could assume wasm is always defined?

So usage would look like:

import * as esmLexer;
const parse = await esmLexer.init;
const [imports, exports] = parse(...);
guybedford commented 2 years ago

Hey @aomarks thanks for showing an interested in contributing and for providing such a clear contribution plan! That's a great approach to collaborating.

Would be very much open to new directions, although two of the major goals for this project are performance and size. In particular, this project was originally designed for es-module-shims. As a result, a heavy emphasis is placed on new features to entirely justify their overhead in terms of footprint, so we just need to work within these constraints.

Some feedback on your points:

  1. Converting the lexer.js bridge to TypeScript. This was really straightforward since I was able to just integrate the typings already available in lexer.d.ts.

This sounds great to me and a good way to keep typings in sync as long as we have a simple build (I'm partial to the swc chomp extension of course :P)

  1. Replace the use of eval for JavaScript string literal unescaping with a C version.

Note that we have a version of this project without eval and that is the asm.js version. The asm.js build is the CSP build that has no usage of eval already. In that build it adds some extra overhead certainly.

We could move those same techniques into the Wasm build, but it's not clear to me how that justifies the extra size since the usage of eval is "safe" as you say (we have already verified the token has not got control characters), and the main Wasm build already isn't CSP compatible. Is there a reason you can't use the asm.js build here?

  1. I'm curious about the manual heap management

Again, malloc implementations generally add many KB to a build, by avoiding malloc and retaining static heap management within carefully defined limits, we optimize for binary size

  1. parse currently either returns a result array or the promise of one, depending on whether initialization has completed. What would you think about init instead resolving to the parse function, and not exporting parse directly? That way the user can't accidentally write racey code where they sometimes unexpectedly get a promise? Plus it slightly simplifies the parse implementation, since it could assume wasm is always defined?

This sounds fine me and would be happy to post a v2 with the upgrade.

Note that if you don't have size constraints for your project there, and the restriction of size is too heavy for your use case you are more than welcome to fork the project there as well.

guybedford commented 1 year ago

Closing, but feel free to reopen discussion at any point here.