lark-parser / Lark.js

Live port of Lark's standalone parser to Javascript
MIT License
71 stars 12 forks source link

Generated JS parser fails if minified #17

Open cderossi opened 2 years ago

cderossi commented 2 years ago

I'm including a generated parser.js as part of a large web project. Webpack with the Terser plugin is configured to minify all of the javascript. The default (I believe) behavior is to rename all non-top-level classes to one or two-letter names.

But serialized data, like this:

  0: {
    name: 'NUMBER',
    pattern: {
      value:
        '(?:(?:(?:[0-9])+(?:e|E)(?:(?:\\+|\\-))?(?:[0-9])+|(?:(?:[0-9])+\\.(?:(?:[0-9])+)?|\\.(?:[0-9])+)(?:(?:e|E)(?:(?:\\+|\\-))?(?:[0-9])+)?)|(?:[0-9])+)',
      flags: [],
      _width: [1, 4294967295],
      __type__: 'PatternRE',
    },
    priority: 0,
    __type__: 'TerminalDef',
  },

includes names of classes as string literals.

When the classes are renamed, the data can no longer be deserialized and an exception is thrown.

Can a dictionary be built that maps the serialized type names to classes without relying on the class names themselves?

erezsh commented 2 years ago

Strange, it should already work like that -

In each subclass of Serialize, the __serialize_namespace__ method returns the class object, which are the provided as a namespace map for _deserialize, which then performs:

    if ("__type__" in data) {
      // Object
      class_ = namespace[data["__type__"]];
      return class_.deserialize(data, memo);

Sounds like it fails somewhere along the line, but I can't say where just from your description.

cderossi commented 2 years ago

There are a couple things happening, from what I can tell--

For example, take the serialized NUMBER object I included above.

The TerminalDef class gets renamed during webpack's optimization pass, in my case to Xr. So far, this is not so bad because

  static get __serialize_namespace__() {
    return [TerminalDef]
  }

gets turned into

{
    key: "__serialize_namespace__",
    get: function() {
        return [Xr]
}

But the constructor for class Xr also gets renamed to r. In fact, all constructors get renamed to r. Since class.name uses the name of the constructor

    namespace = Object.fromEntries(namespace.map((c) => [c.name, c]))

winds up making namespace into

{
  r : Xr
}

Since data["__type__"] == "TerminalDef" and there is no namespace.TerminalDef, deserialization can't proceed.

Even worse, since all constructors are renamed to r, if the __serialize_namespace__ has more than one entry, they collide and the resulting namespace object only winds up with a single field.

erezsh commented 2 years ago

@cderossi Thanks for telling me about this bug.

I created a PR that I think would fix it. Let me know if it works for you

(https://github.com/lark-parser/Lark.js/pull/18)

cderossi commented 2 years ago

@erezsh I tested your PR and it works great. That's an elegant solution.

Thanks for addressing this. I don't consider this your bug, since your un-minified code works and the minifier breaks it. But webpack is too common.

erezsh commented 2 years ago

Great. It's now merged, and will be included in the next release.

You're right it's not really a bug, but code lives in an ecosystem, and should try to play nice with it, if it can.

I am concerned that maybe you'll encounter more errors, because there are other parts of the code that use constructor.name, for example

  let tokens_by_type = classify(terminals, (t) => t.pattern.constructor.name);

Which is supposed to distinguish between two different classes. But if the constructors are all renamed to the same name, it might create the wrong behavior.

cderossi commented 2 years ago

I'm afraid you're right about more potential problems. I added some log statements to _create_unless and got this result:

image

I have not seen an actual error in my testing so far, though. Suggestions welcome.

erezsh commented 2 years ago

It's the mechanism that resolves collisions between keywords and variable names. It's not always easy to notice when it goes wrong.

Let me know if the new PR solves it - https://github.com/lark-parser/Lark.js/pull/20

cderossi commented 2 years ago

Looks good in the console and my parser output still looks correct:

image

My grammar is too simple to have any possible collisions between keywords and variable names, so I may not be the best test case.

Thanks again for your work!