lydell / js-tokens

Tiny JavaScript tokenizer.
MIT License
503 stars 31 forks source link

Properly identify regex and division using Lookbehind assertions in regex #11

Closed 3cp closed 4 years ago

3cp commented 5 years ago

Following the documented Limitations on "Division and regex literals collision", since current oldest Nodejs LTS v8 has Lookbehind assertions in regex, please update the implementation if possible. Thanks!

lydell commented 5 years ago

See also: https://github.com/lydell/js-tokens#es2018

lydell commented 5 years ago

Current plan:

Node.js 10 supports named capture groups. Node.js 8 becomes end-of-life at the end of December 2019. I'd like to develop the next version, which will use named capture groups, unicode and lookbehind, of js-tokens during this autumn, and then release it early 2020. This way I can batch all breaking changes together.

Note to self – links on handling regex vs division better (but not 100%):

EDIT: Work is happening in the next, func and jsx branches.

3cp commented 4 years ago

I like you switched from pure regex to code+regex. It's not only more flexible, but also more readable. I guess it can also support older nodejs/browser since it relies much less on regex features.

lydell commented 4 years ago

Thanks!

I mostly switched to code+regex since I realized that lookbehinds didn’t help much after all. And then I also got infinite template interpolations for free.

I still use Unicode property escapes in the regex (/\p{L}/u), which requires Node.js 10 (see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#Browser_compatibility), so Node.js will still be the lowest supported version.

I hope to finish this this weekend.

3cp commented 4 years ago

Unfortunately it doesn't work in Firefox. Babel has a plugin to transpile it to es5, https://mothereff.in/regexpu#input=var+regex+%3D+%2F%5Cp%7BL%7D%2Fu%3B&unicodePropertyEscape=1

It is very verbose.

lydell commented 4 years ago

I’ve never tried to support browsers with this module. Are you using it in browsers? If so, in what kind of project?

3cp commented 4 years ago

I got a bundler can work in browser, also an online IDE runs bundler in browser right now. https://github.com/dumberjs https://github.com/dumberjs/dumber-gist

The tokenisation is for https://github.com/dumberjs/modify-code (which dumber uses). Right now I use babel parser to tokenise code, but it's big and slow. I cannot use current version of js-tokens because of the regex/divider issue.

3cp commented 4 years ago

Not a big issue, I can transpile (or copy-and-modify) your module and embed it in modify-code dist file through rollup I think.

lydell commented 4 years ago
  1. Note that the new version won’t be correct in all cases either. For example, it will think that if (x) /a/g.exec(s) is division when in fact it is a regex.

  2. Just out of curiosity, I’ve never heard anyone have trouble (in real life, I know it’s far from perfect) with the regex/division issue so far before. Do you have an example where it causes problems for you?

  3. What are your requirements? Need a (near) standards compliant tokenizer, but something like esprima is too large/slow?

  4. The code of this package doesn’t change very often so, yeah, I think you could easily fork it for your needs. Feel free to share if you do!

3cp commented 4 years ago
  1. It's sad to know the new version still has unhandled cases. :-( But this might not be an issue for my use case, because it's not missing any token, only creates unneeded tokens. I think you could fix the issue. That / is in context of beginning of a statement. Divider cannot start a statement I guess.

  2. I tried js-tokens before, but it gave a huge fake regex started by a divider. This is very common when deal with minified js code. The result is the fake regex missed tons of real tokens.

  3. yes a (near) standards compliant tokenizer. I use the tokens to drive source-map. I used few parsers before, esprima is not actively updated to follow current js spec. I encourtered cherow bug and now it's deprecated. Both babel parser and typescript parser can understand all syntax (including ts and jsx which are not in spec). Typescript parser is even bigger, so I settled on babel parser. Evaluating meriyah now, but that means to drop ts syntax support. The good part of js-tokens: even it doesn't support ts/jsx, it will generate tokens just fine.

  4. fork is easy. But understand and fix all the edge cases are hard 😄

3cp commented 4 years ago

BTW, I am surprised @babel/highlight uses your module instead of babel's own parser, and it indeed gives wrong color.

const highlight = require("@babel/highlight").default;
const code = `var p /= 2/g;`;
const result = highlight(code);
console.log(result);
Screen Shot 2020-03-27 at 8 53 58 am
3cp commented 4 years ago

I got your point on if (x) /a/g.exec(s) now, because js-tokens is not a full parser, it's hard to understand the context. But a full parser will be much bigger than js-tokens.

lydell commented 4 years ago

Thank you very much for all the detailed feedback! I’ll keep that in mind while finishing the new version.

In case of babel, they use @babel/highlight to colorize broken code that caused parse errors. Babel’s tokenizer can throw errors, while js-tokens never throws errors for any string input. They’d have a bit of a catch-22 situation otherwise. Anyway, var p /= 2/g should be tokenized correctly by the new version.

lydell commented 4 years ago

The latest version of the func branch now supports if (x) /a/g.exec(s) and all other edge cases I’ve been able to think of so far!

3cp commented 4 years ago

Suggest to add few test fixtures for jsx and ts code. Although js-tokens did not support them deliberately, they should work as expected.

lydell commented 4 years ago

Good idea, I’ll add some jsx and ts fixtures.

In other news, I’ve started adding a test262 parser test. If @babel/parser is bug free, js-tokens only mixes up division and regex in one case:

switch(a) { case 1: {}
/foo/ }

That’s supposed to be regex, but js-tokens thinks it’s division. If it’s easy enough to fix I’ll do it, otherwise it’s going to be a known limitation.

Edit: Note to self:

label:{}/a/g

switch(x){
  case 1:{}/a/g
}
3cp commented 4 years ago

CoffeeScript supports babel https://coffeescript.org/#transpilation

How about, besides the /index.js, create another /browser.js file transpiled by babel to support Firefox and other browsers?

Usage in nodejs environment will be unchanged, but various js bundlers will honour "browser" field in package.json, /browser.js will be used in browser environment.

  "files": [
    "index.js",
    "browser.js"
  ],
  "main": "index.js",
  "browser": "browser.js",

If this makes sense to you, I can help on the setup after you released next major version.

lydell commented 4 years ago

That might be a good idea! But before doing so I have a question:

I’ve not been super into the frontend compilation space the last half year or so, but before that I know that webpack and babel worked on being able to babel on code from node_modules, so that the ones who need it can transpile to the level of support wanted. Is that a thing now? Can you set up your build to transpile either all of node_modules or just js-tokens?

3cp commented 4 years ago

By default, bundlers skip transpiling on node_modules. But yes, there is way to transpiling node_modules (all or some) in bundler config.

However, this is not easy for end users. Besides the manual configuration, there is also kind of unpleasant troubleshooting first that users need to go through to find out what 3rd-party package breaks in browser and what level of transpiling is needed to fix that.

So it's better to ship pre-compiled dist files, so it just works in both nodejs and browser env. It looks quite easy to setup based on what I read from CoffeeScript doc.

lydell commented 4 years ago

But the typical user does not need js-tokens in their builds, right? It’s just a couple of pro users like you?

Also, if we do add a ES5 browser version – what if someone actually wants the modern and much smaller version, because their app only supports modern browsers anyway? Is that still possible, or is that causing annoying manual configuration and troubleshooting?

3cp commented 4 years ago

That is very easy once you shipped multi dist files.

In bundler config, user can force a main file for a npm package. This is much easier than setup babel transpiling on a npm package.

First, multi dist files erased manual config for at least 90% users. For the rest pro users, multi dist files are easier for manual config than figuring out transpiling.

lydell commented 4 years ago

I’ve decided not to include a version run through Babel, because my testing shows that it would increase the package size by 60%. Since this package is downloaded more than 20 million time per week (and growing), that would amount to around 130 GB per week of wasted bandwidth.

In other news, I’ve finished test262 and added support for JSX via { jsx: true } (in the jsx branch). What’s left to do is adding more JSX tests and then look everything through again.

lydell commented 4 years ago

v6.0.0 has now been released! Hope it’ll be useful for you.

3cp commented 4 years ago

Thx! will test it.

lydell commented 4 years ago

Unicode property escapes is coming in Firefox 78! https://hacks.mozilla.org/2020/06/a-new-regexp-engine-in-spidermonkey/

3cp commented 4 years ago

Cheers!