tc39 / proposal-binary-ast

Binary AST proposal for ECMAScript
964 stars 23 forks source link

The binary format doesn't have to directly mirror normal JS. #37

Open dead-claudia opened 6 years ago

dead-claudia commented 6 years ago

I'm not convinced it's required to marry the binary format too tightly to the JS source, and there are things you could do to make it simpler. It's not like the binary format is meant to be human-readable, just machine-readable. There are also idioms in compiled-to-JS code (e.g. from Elm, CoffeeScript, and Babel) that could also stand to benefit from a binary format that that's at least aware of some of their needs.

Here's a few of the ideas I have to throw at the wall. Apologies if this comes across as a bit rambly.

Yoric commented 6 years ago

Thanks for the proposals! Let me try and sort it a bit.

Kinda language extensions

  • Adding a constant for undefined
  • Allowing synthetic, anonymous locals that aren't viewable by direct eval or with
  • Adding combined type-checking operators for typeof x === "number", etc.
  • Adding combined coersion operators for x | 0/~~x, !!x, x + "", etc.
  • Requiring fallthrough to be explicit in switch statements.
  • Encoding the bytecode as a hybrid register/stack machine.

While each of these proposals makes sense individually, I really feel that they are mostly orthogonal to Binary AST. Recall that each addition to Binary AST will make its adoption more complicated by browsers and VMs. For this reason, we wish to keep it as minimal as possible, which means sticking exactly to the JS language.

Once Binary AST is a standard, we'll be happy to rediscuss each of these proposals, though :)

Kinda optimizations

Separating the pragmas from the source/function body.

That definitely makes sense.

Separating strings from the code, instead storing debug string/name references as {int offset; int length} pairs and putting their values consecutively in a string table before any real code references.

Already the case :)

Reducing logical "and"/"or" to corresponding if/else variants: test && other ↔(tmp = test) ? tmp : other, test || other ↔ (tmp = test) ? other : test.

Encoders are, of course, free to make such transformations, especially if they are optimizations. I don't think we want to hardcode them into Binary AST, though.

Using breakable blocks with an expression-based AST instead of a statement-based one.

I don't understand that one, could you elaborate?

Declaring locals before script/module/function body.

Already the case :)

Declaring imports/exports before module data table.

I like the idea. We'll need to think more about it.

Kinda metadata

Adding built-in source map support.

Definitely an objective. Not on the top of our list yet, but we're working on it.

Adding a built-in description/metadata field.

Indeed, we're thinking of a built-in metadata field. It shouldn't be hard to add, as long as it doesn't have semantics. I suspect that most toolchains will strip it, though.

j-f1 commented 6 years ago

For switches, how about having a table at the beginning that has gotos in it, then have an array of statements after the table?

Example ```js switch (expr) { case 'Oranges': console.log('Oranges are $0.59 a pound.'); break; case 'Mangoes': case 'Papayas': console.log('Mangoes and papayas are $2.79 a pound.'); // expected output: "Mangoes and papayas are $2.79 a pound." break; default: console.log('Sorry, we are out of ' + expr + '.'); } ``` ([source](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/switch)) **Binary AST**: Header: | Value of `expr` | Index | | --------------- | ----- | | `"Oranges"` | 0 | | `"Mangoes"` | 2 | | `"Papayas"` | 2 | | [[Default]] | 4 | Body: ```js console.log('Oranges are $0.59 a pound.'); break; console.log('Mangoes and papayas are $2.79 a pound.'); break; console.log('Sorry, we are out of ' + expr + '.'); ```
Yoric commented 6 years ago

@j-f1 Why not. What's the expected benefit?

j-f1 commented 6 years ago

It would allow the engine to skip over unused code and only parse the statements used in the current condition. Additionally, the engine would be able to determine if the default case needs to be taken without having to search the entire switch body for cases first.

dead-claudia commented 6 years ago

@Yoric Of course, some of those points and thoughts were more or less very forward-looking, and not exactly what I'd expect would have a chance in making it into an MVP. (Like, for example, the synthetic anonymous locals.)

Using breakable blocks with an expression-based AST instead of a statement-based one.

I don't understand that one, could you elaborate?

Think similar to WebAssembly and how its blocks work, just optionally a little heavier (if you need to introduce locals in them) and with unconditionally only one return value. That's pretty much what I had in mind.

WebAssembly itself is expression-based, it uses blocks with (br N value)/(br_if N cond value) and is a hybrid register/stack machine. Expression-based languages are typically denser and simpler, since they don't need as much bookkeeping on what is and isn't a valid statement (almost everything is), and because it's pretty common for control flow to only assign to something once (consider why the do expression proposal exists).

To expand on my concrete example:

An engine wouldn't need to do much work to infer that the result of a particular try/catch is only used in a particular variable if you just assign the result directly into the value itself. Also, it's not like you couldn't emit the statement-like form and it still work, too. Engines can desugar that into bytecode similarly to how they already do with let x; try { x = foo() } catch (e) { ... } now (in fact, I'd expect most implementations to do just that), and Babel could compile its try/catch blocks within do expressions pretty similarly.* (For reasons that make no sense to me, [CoffeeScript](https://coffeescript.org/#try:x%20%3D%20try%0A%20%20foo()%0Acatch%20e%0A%20%20bar()) and Babel + do expressions instead generate IIFEs.)

* In reality, an expression-based syntax means we wouldn't need to support do expressions in the core proposal, since it'd be flat out useless at best. An outer breakable block would be what early returns would compile into if they couldn't be compiled into an if/else.

Yoric commented 6 years ago

@j-f1

It would allow the engine to skip over unused code and only parse the statements used in the current condition. Additionally, the engine would be able to determine if the default case needs to be taken without having to search the entire switch body for cases first.

We're thinking about something kind-of like this, but if we do it, this will complicate the specifications of switch() wrt syntax errors.

syg commented 6 years ago

@Yoric did a great job separating out the @isiahmeadows' language feature ideas from other suggestions. One of the highest, if not the highest constraints in this design space is to not bifurcate the language. To that end, I take the existence of an "obvious" bijection (modulo whitespace and whatnot) between source JS and binary AST to be a hard constraint. That is, binary AST will not add any language features. What's expressible in source JS must be expressible in binary AST.

Like @Yoric said, the biggest reason is adoption, both for developers and for web browsers. Please do not take binary AST to be our trying to design a better JS. It is very much just the JS language with a more performant transport format.

I also saw mentions to bytecode. There will never be any sort of bytecode involved with binary AST.

dead-claudia commented 6 years ago

@syg I'm in high support of keeping it bijective. My hard constraint with all the "language feature" ideas I proposed were that every single one of them had to be trivially desugared. The only exception being synthetic locals used in a scope accessible to a direct eval.

I also saw mentions to bytecode. There will never be any sort of bytecode involved with binary AST.

I did sometimes conflate the binary AST representation with the concept of "bytecode", but I know it's technically incorrect - bytecode is a sequence of instructions, not a graph of nodes. I did make one suggestion here that would've actually made it bytecode in the technical sense, but it was kind of a thing thrown at the wall:

  • Encoding the bytecode as a hybrid register/stack machine.

I was already neutral on it to start, just throwing it out there for the sake of being thrown out there. I'm also not a big fan of the idea, since bytecode doesn't always end up smaller than a basic AST, and it's not always the most intuitive. And the more I think about this, the more I'm leaning against it. With this, all of a sudden you need a decompiler to debug the code, and we shouldn't be making browsers ship a glorified decompiler. This isn't WebAssembly, and even that one has chosen to make their bytecode a borderline AST format, just one that's mostly in reverse Polish notation for the node types. (Their text format lays this pretty bare.)

bakkot commented 6 years ago

@isiahmeadows:

every single one of them had to be trivially desugared

As you correctly point out, engines already do these sorts of simple translations, often at parse time. What's the proposed benefit to having the compiler do them? Keeping in mind that there's a real cost, which is adding new node types and defining and implementing their semantics.

dead-claudia commented 6 years ago

@Yoric I forgot to clarify that the proposed global "metadata" field was supposed to not have any real semantics attached. It's for tools, not engines.

dead-claudia commented 6 years ago

@bakkot

What's the proposed benefit to having the compiler do them?

Slightly less data over the wire, and engines don't need to enlist their heuristics to figure them out. Not all of them are especially justifiable, but some of them could be.

erights commented 6 years ago

@isiahmeadows

[undefined] is already a pseudo-literal in strict mode.

In what way? Just tested:

(function(){'use strict'; const undefined = 3; return undefined; })();

returns 3.

ljharb commented 6 years ago

undefined is only a pseudo-literal in ES5+ in the global scope (in that it can't be reassigned there), but it can be shadowed anywhere else; i don't think strict mode comes into play at all.

dead-claudia commented 6 years ago

@ljharb @erights I meant it's a pseudo-literal in that it has those special semantics - it's not a simple identifier in strict mode. It evaluates to:

You could say it's a context-sensitive, non-context-free construct that could be parsed as either a keyword or identifier, depending on what variables are declared in its surrounding lexical scope. Implementations almost necessarily implement this as a keyword at parse time, since strict mode requires special semantics, and punting the check to identifier lookup will inevitably create performance problems. Nope, just a combination of non-configurable, non-writable and in the global scope...

In sloppy mode, it's just a simple global binding lookup, so it's of course not a special syntactic construct.

erights commented 6 years ago

Where is this special case in the spec?

Is this special case observable?

dead-claudia commented 6 years ago

I edited it because I thought I knew how it worked, but apparently I was mistaken on particulars. 🙂 I thought it was a special case, but it's just a combination of descriptors + etc.

For what it's worth, UglifyJS only reduces undefined to void 0 in strict mode.