AST, IR, bytecode and friends - standards and inovation

refack commented 7 years ago

Spin off from https://github.com/nodejs/CTC/issues/100#issuecomment-294159465

The premise: people in nodejs have ideas that require manipulation of the AST or of IR. The people of V8 at present have no plan to open the engine's AST as API, but are close to achieving useable bytecode modularization (nodejs/node#11842).

Discussion points:

what can we do to promote language based innovation and optimization?
can we / do we want to do bytecode caching and optimization a la python's .pyc?
should we consider embedding one of the userland transpilers?

/cc @addaleax @hashseed @MylesBorins @indutny

[update 2017-04-20] Some motivations:

@Qard (https://github.com/nodejs/CTC/issues/104#issuecomment-295557941) is using AST manipulation for instrumentation of multiple version of express
New solutions such as util.awaitable/promisify (#12 or nodejs/node#12442), could come with an easy way to for userland to apply to everything.
The need for an AOP crosscutting-concern point-cut mechanism. e.g. Cross-cutting concern such as optimization and de-deopting (nodejs/node#12456) clash with readability/maintainability/DRY (also nodejs/node#12471)
Enter the TC39 stage process as a "compatible implementation"

refack commented 7 years ago

Feel free to edit my title/desc...

refack commented 7 years ago

So I was asking, should we consider embedding babel as a frontend to complement V8 as a back end? Making us innovators not only in performance and portability, but also in the language aspect.

refack commented 7 years ago

/cc @Qard nodejs/node#12349

hashseed commented 7 years ago

Just to be clear: V8's Ignition bytecode is not a serialized AST. Its format is not designed to be public, but rather an implementation detail. The bytecode encodes data specific to V8's design choices (value representation, tagging scheme, call convention, object layout, metadata slot indices, etc.). Currently it is evolving quickly. We cannot afford to freeze the bytecode format or even specify it formally.

I agree that it is desirable to have an intermediate format to avoid some parsing, but Ignition bytecode is not that format. When you think of bytecode, you probably think of Java bytecode. Ignition bytecode is very unlike Java bytecode in that it is very much intertwined with V8's implementation details, whereas Java bytecode is a public format shared between many Java implementations.

WebAssembly is much closer, though with caveats.

MylesBorins commented 7 years ago

One thing that I'm really interested in is how we are going to be representing the DAG of an ESModule based import before handing it over to V8 to execute. This is something that @bmeck could comment on. I'm unsure if there are plans to have lifecycle hooks here, but this could potentially give us a defined DAG to crawl and also a way of dumping an intermediate representation for faster boot times (this is completely theoretical btw).

In the past I helped organize an AST summit involving stakeholders from Esprima, Acorn, and Babel... perhaps some of those folks could chime in. I'm not 100% sure if that a standard is something that should live in core, but perhaps it is something we could collaborate on.

/cc @ariya @mikesherov @thejameskyle @kittens @hzoo

bmeck commented 7 years ago

Not entirely sure I understand the DAG needs for the format besides mapping of specifier to resolved IR subgraph. I think it might be possible to create bundled DAG of ESM, but it might not always be possible (like if a dep is CommonJS) which we have to do some odd things outside of ESM source code and pass refs into ESM.

bmeurer commented 7 years ago

The V8 team has been looking into parsing speed a lot during the last 12 months, since this is highly important for page load these days. The V8 parser improved a lot during that time, and might improve even further. Since the engine cannot just blindly trust the bytecode, some verification has to happen as well, plus the external bytecode must be compiled to our internal bytecode. All of this takes time, and at this point we don't have reasons to believe that it'll be beneficial, based on the data we have.

Qard commented 7 years ago

While I'm working on ways to better support AST manipulation in core right now, I don't believe it's currently the right time to bundle the AST manipulation part itself into core. V8 is a moving target in that regard, and it would be a substantial issue for the VM-agnostic efforts to deal with. I also don't think we should be pulling larger codebases like esprima into core.

In my opinion, the best way forward is minimal changes to the require system now and evaluating whether a standard interface across VMs for AST manipulation is worth pursuing. (My gut feeling is no, but who knows?)

jamiebuilds commented 7 years ago

I'm not sure I understand the value in having Node standardize around an AST. But if I was to suggest are starting place if would be the Shift AST: https://github.com/shapesecurity/shift-spec

benjamingr commented 7 years ago

I believe this is an interesting topic, but:

The premise: people in nodejs have ideas that require manipulation of the AST or of IR.

Who? Why?

Qard commented 7 years ago

Application Performance Monitoring vendors want it for safer and better performing instrumentation than monkey-patching everything async in the environment with a bunch of closures.

It's also useful for language transpiling, applying code coverage tracking instrumentation, stubbing files for mocks in a test suite and numerous other possibilities.

bmeck commented 7 years ago

@Qard is this a lack of hooks problem, or are people actually modifying how non-exported code runs?

jamiebuilds commented 7 years ago

If the Node Foundation wants to get more involved in the JS-to-JS compiler pipeline for JavaScript I can introduce the Babel team, I know that they are looking for more support, and Babel is effectively the de-facto tool for this in the community.

benjamingr commented 7 years ago

@Qard thanks! Do you mind showing me an example of how one would use such hypothetical tools?

sam-github commented 7 years ago

@benjamingr not sure what you mean by hypothetical, both @Qard and I as well as others in the APM vendor space do this. Here is actual code:

  var Module = Object.getPrototypeOf(module)

  Module.__compile = Module._compile

  Module._compile = function (content, filename) {
    if (fileTest(filename)) {
      content = tx.transform(content, filename)
    }
    return this.__compile(content, filename)
  }
}

Our tx.transform adds function entry/exit timing information using a source code transform. @Qard's probably does something similar, though what everyone does is specific to the type of information their APM wants to extract.

bmeck commented 7 years ago

@Qard @sam-github would https://github.com/v8/v8/blob/master/include/v8.h#L6288 or similar being exposed fix that?

matthewloring commented 7 years ago

@bmeck In our APM work we do need to patch non-exported module internals fairly frequently.

bmeck commented 7 years ago

@matthewloring is that to modify behavior or get timing etc.?

matthewloring commented 7 years ago

It is usually to either collect timing information or to fix userspace queuing issues (see 5 under desired features) that break async context propagation.

The FunctionEntryHook is quite heavyweight and will run on every generated functions when most APM solutions only care about a small number.

sam-github commented 7 years ago

@bmeck I don't know enough about the V8 to know if that would help, but it looks to me like as @matthewloring says, it runs for every single function, almost certainly cannot be used to reenter into V8 and run js code, probably is a singleton (it can't be used by an APM module and by the profiler), and it can't be used to pass closure info to solve the user-spacce queuing, aka "connection pooling", problem (though the async-hooks APIs may eventually help with that).

bmeck commented 7 years ago

@matthewloring for the queueing issue I defer to async_hooks. If the mutation is limited to a subset of functions, I assume the mutation is determined by parsing an AST and using some heuristic? If so, that might be something to look into. However, I think a problem existing here is that there is no AST standard given ESTree, Babylon, Shift, the V8 internal, etc.

The problem sounds real, but I am not convinced AST is the appropriate way to fix this even if it is the current approach.

sam-github commented 7 years ago

I'm not sure If we should add official APIs for rewriting. patch Module._compile actually works OK for now.

To make me feel good about blessing the practice in perpetuity with an official API, I'd want to see that its something that would help across a range of use-cases, and APMs have hacked around this sucessfully for a while. Unless the API was backported to LTS, we couldn't even use it until 8.x is no longer in LTS, which is years from now. Which is maybe a reason to get it into 9.x:-).

The babel/typescript/etc runtime transpiler use-case, for example, is something that we have traditionally told users to do as a pre-processing stage before passing the output javascript to Node. Will making it be easier to do with an ENV var or one of these mechanisms noticably improve the user-experience? How so?

benjamingr commented 7 years ago

@sam-github thanks, but I meant in terms of how the new potential API would help you?

sam-github commented 7 years ago

We do mutation by file-name matching, you can see that in the example code. We further use either heuristics or else hard-coded pre-knowledge of how a npm package works attempt to only patch code that is a top level API exported by the module, or that needs async context passing. Exactly what gets patched depends on why we are patching, I've worked on 4 different modules which do this by now which have all ended up at IBM (strong-agent, a method tracing tool, appmetrics, and BAM).

sam-github commented 7 years ago

@benjamingr I didn't say it would help me: https://github.com/nodejs/CTC/issues/104#issuecomment-295329449

Having to do this over and over by patching an undocumented Module method is a bit ugly. But just encoding ugliness in an API maybe is still ugly, too. Hopefully we'll see more examples reaised, maybe there is something that core could expose as an API that would make this better, maybe not. I'm undecided.

sam-github commented 7 years ago

@bmeck and I don't think absence of an AST standard is a problem. If the hook exposes javascript, how people rewrite the js is up to them. As you list, there are many options, I don't think there is any request here that node expose an AST and support rewrite of it. It might be cool if it did, but given esprima, its not really necessary.

bmeck commented 7 years ago

@sam-github if this isn't about AST/IR/bytecode as in the title, we should make a diff issue.

sam-github commented 7 years ago

@bmeck sorry! I totally confused this with https://github.com/nodejs/node/pull/12349, I'm awash in github notifications.

Qard commented 7 years ago

Okay, so in the case of TraceView, the product I used to work on, there is extensive need for modifying behaviour for the purpose of capturing meta data to describe the state of the async thing being measured.

Here's an example of collecting some metadata related to a mysql query by patching the query method:

https://github.com/tracelytics/node-traceview/blob/master/lib/probes/mysql.js#L120-L126

You'll also see that, further up in that file, I had to patch connect and getConnection. This was purely to pass the continuation context from the queuing end of the function to the callback running step. This is user-mode queuing, which is really not solvable with async_wrap without convincing userland modules to use the js API to async_wrap to signal these edges. This could be solved fairly easily without introducing the many extra closures involved by simply applying an AST transform.

There's also potential dangers to monkey-patching the world. Here, I'm patching the postgres driver:

https://github.com/tracelytics/node-traceview/blob/master/lib/probes/pg.js#L17-L30

It conditionally patches pg.native with a getter/setter hack because accessing the property turns the native driver on, which destructively modifies the rest of the module. This is seriously painful and, again, could be solved easily with an AST transform.

Here's the hapi instrumentation:

https://github.com/tracelytics/node-traceview/blob/master/lib/probes/hapi.js

That code jumps through a long list of hoops to detect the module version both at initial patch time and in various points at patch time, the hapi module changed dramatically over the years, so there's piles of checks required all over the place to see what things look like to determine how it should apply the next instrumentation step. It's also patching the vision module in there, which used to be part of hapi itself and was later pulled out. It has to do some scary things to figure out how to get at the view system at runtime and patch it live. There are patches in there which need to be applied individually for each request. Once again, with AST transforms, that could be applied once and with pre-checking to determine how the instrumentation should be applied.

All that benefits greatly from AST transforms, but it's not too overly difficult to hijack module.constructor.prototype._compile to add an AST transformer to that...however, built-in modules do not use module._compile, they follow a completely different flow of getting compiled with ContextifyScript here:

https://github.com/nodejs/node/blob/master/lib/internal/bootstrap_node.js#L466-L470

This means AST transforms just plain can not be applied to built-in modules. This means the http instrumentation for http is pretty gross looking. In particular, there's this nasty little hack here:

https://github.com/tracelytics/node-traceview/blob/master/lib/probes/http.js#L246

That right there uses this terrifying pile of hacks to produce a FILO queue for exit events to encapsulate the layers of the request and end the transaction trace when the request stream ends. The code that handles that is here:

https://github.com/tracelytics/node-traceview/blob/master/lib/index.js#L397-L436

Basically, if you want to report more meaningful data beyond simple timing, you're in for a rough time and a ton of potential performance pitfalls. 😱

Qard commented 7 years ago

So, as was referenced above already by @refack, I've been working on https://github.com/nodejs/node/pull/12349. I believe trying to expose AST manipulation in Node.js itself is likely a very significant challenge, especially considering efforts to be VM-agnostic.

It'd basically need to be done in a way that is purely disconnected from the VM itself, in which case, why not just let userland use the already well-established esprima or shift AST parsers/generators?

What I'm aiming for is a simple way to intercept a module between loading the file and handing it to the VM to parse, allowing whatever text-to-text transformation you like to be applied to it. It does mean a larger performance hit to require() calls of modules that get pushed through the parser/generator, but I'd much rather a slower startup than a slower runtime behaviour.

Qard commented 7 years ago

@benjamingr Here's what I've been building to do AST parse/generate based instrumentation:

https://github.com/Qard/node-module-compile-transform-ast/blob/master/test.js

For each require, it'll do a regex test against the filename and if a match is found it'll run the corresponding patch handler function on the AST.

Sorry for the simplicity of the sample there. I haven't had time to continue working on it for the past week as I've been travelling, but the intent is to simply define patch handler functions for loaded file paths that match a given regex pattern--exact string matching also works. The regex matching is simply to enable spotting nested dependencies.

I'm hoping to carve out some time to put together a proof-of-concept AST-based instrumentation for multiple versions of express sometime in the next few weeks. That'd prove the value on a userland module.

Again though, there's currently no way to intercept the a file contents pre-parse, thus my PR.

benjamingr commented 7 years ago

@Qard thanks, that's great as a motivating example. I think we should summarize this and put it at the top of the issue so the benefits can be obvious.

hzoo commented 7 years ago

If I'm not mistaken, node-module-compile-transform sounds like what babel-register does?

Qard commented 7 years ago

Basically. It's a bit lower-level though. It runs any arbitrary function that receives and returns a string rather than running babel on it. The node-module-compile-transform-ast module just wraps that in an AST parse/query/generate helper.

refack commented 7 years ago

I believe this is an interesting topic, but:

The premise: people in nodejs have ideas that require manipulation of the AST or of IR.

Who? Why?

@benjamingr I think also nodejs/node#12471 fits in this conversation, as is nodejs/node#11842 and IMHO embedding a "front-end" transpiler will help up drive language innovation (even your #12 and nodejs/node#12442 fits in here, as an AOP crosscutting-concern)

refack commented 7 years ago

The babel/typescript/etc runtime transpiler use-case, for example, is something that we have traditionally told users to do as a pre-processing stage before passing the output javascript to Node. Will making it be easier to do with an ENV var or one of these mechanisms noticably improve the user-experience? How so?

@sam-github having a nodejs endorsed "front-end compiler" solution has allot of "power", as you mentioned from LTS and other enterprise concerns, to reducing "getting-started" decision fatigue, to enable us to implement cross-cutting concerns, such a de-deopting (nodejs/node#12456), promisifiying (#12), and finaly simply driving language innovation, by being considered one of the tc39 "compatible implementations"

bmeck commented 7 years ago

@refack util.promisify does not need any AST or IR transform, if you read the PR there are special cases in it as well so AST transform are probably not a viable approach.

bmeck commented 7 years ago

I should state here, I am very concerned if we ship Signed Packages in any way we should/need to make all transforms opt-in by the Signed Package via per package config or an opt-in API. The more I think about this, the less I think AST transforms are right to land in core, and might cause problems if a Signed Package exists since it would not be transformed automagically.

refack commented 7 years ago

@refack util.promisify does not need any AST or IR transform

Agreed, but adding a simple rule to promisify everything will...

transformed automagically

I'm in line with minimising magic, everything should be opt-in (+ user rules with exclusions).

bmeck commented 7 years ago

@refack as shown in the PR it is not a single/simple rule : https://github.com/nodejs/node/pull/12442/files#diff-0a5d4868b2b9b17cf9e2c11f1bd1311eR390

refack commented 7 years ago

@refack as shown in the PR it is not a single/simple rule : https://github.com/nodejs/node/pull/12442/files#diff-0a5d4868b2b9b17cf9e2c11f1bd1311eR390

That's the implementation, I agree "good" problems, have "elaborate" solutions, and also as a core solution is should be complete. But once we have util.promisify what if an end-user wants to promisify all core modules, it should be simple. Or they could write an "incomplete" solution that works for them, and hook it in (instead of money-patching it in). Even a ubiquitous solution such as babel-register (and coffescript-register before that) is a monkey patch that (ab)uses the deprecated require.extentions

refack commented 7 years ago

Added a few summery point to first comment.

benjamingr commented 7 years ago

@refack I'm also not sure what the utility for util.promisify is, but I definitely see the appeal for instrumentors and the cases @Qard mentions and I think we should focus on them :)

benjamingr commented 7 years ago

@bmeck

The more I think about this, the less I think AST transforms are right to land in core, and might cause problems if a Signed Package exists since it would not be transformed automagically.

That's a very good point, signed packages need to opt out of this - at which point I'm wondering if the AoP value this brings is less strong.

I'm wondering if this is a deal breaker for the instrumentation use cases.

refack commented 7 years ago

@refack I'm also not sure what the utility for util.promisify is, but I definitely see the appeal for instrumentors and the cases @Qard mentions and I think we should focus on them :)

Not arguing, just think of promifyAll as another instrumentation use cases.

The more I think about this, the less I think AST transforms are right to land in core, and might cause problems if a Signed Package exists since it would not be transformed automagically.

That's a very good point, signed packages need to opt out of this - at which point I'm wondering if the AoP value this brings is less strong.

I'm wondering if this is a deal breaker for the instrumentation use cases.

Just quoting "Signed Package are not the holy grail" they are one of a few competing concerns. A user could always choose to either go "gold": LTS engine, "magic" opt-out and only "Signed Packages". Or they could go cutting edge: "latest" + "transpiler" + "esnext strawman" + optimize at "Ludicrous Mode"

refack commented 7 years ago

@bmeck @benjamingr @sam-github A real question, what happens when a client wants "gold standard" i.e. LTS engine + only Signed Packages, but also "most have" APM (as client always do ;) ) ? Maybe we need a mechanism for a "Signed Instrumentor"? (ugly example but real world: Microsoft "Trusted Publishers" for drivers)

sam-github commented 7 years ago

@refack Signed Packages don't exist, not even in proposal form, hard to speculate on how they will interact with other features.

bmeck commented 7 years ago

@sam-github https://github.com/dimich-g/webpackage exists and I have been working to try and make sure Node is able to adopt it when all parties are agreed on it. Electron, Web Browsers, and document publishers are also looking at it

@refack

A real question, what happens when a client wants "gold standard" i.e. LTS engine + only Signed Packages, but also "most have" APM (as client always do ;) ) ? Maybe we need a mechanism for a "Signed Instrumentor"? (ugly example but real world: Microsoft "Trusted Publishers" for drivers)

My guess is more that a hook mechanism could be exposed that allows this, async_hooks solves a lot of the issues, it looks like sync trace hooks are still wanted/needed. I am just stating that I don't think AST/IR is the right approach to solve these problems.

refack commented 7 years ago

My guess is more that a hook mechanism could be exposed that allows this, async_hooks solves a lot of the issues, it looks like sync trace hooks are still wanted/needed. I am just stating that I don't think AST/IR is the right approach to solve these problems.

@bmeck If I understand you right, you say that AST/IR manipulation might not be the right solution for APM, or similar structured instrumentation (that could be mapped to async_hooks or any future "trace_hooks"). But are leaving the door open for general AoP where pointcuts are not mapped to hooks?

If I understand you right I totally agree; well defined hooks hands-down beat generic instrumentation fiddling with code/AST/IR "in the dark". They are just harder to accomplish as a complete core solution (async_hooks are in the works for at least 3 years, plus an unsuccessful attempt with Domains)

Personally with this discussion I would want to focus on enabling more adventures innovation, not replacing other better solutions...

Trott commented 7 years ago

This issue has been inactive for a while and this repository is now obsolete. I'm going to close this, but feel free to open another issue in a relevant active repository (TSC perhaps?) and include a link back to this issue if this is a subject that should receive continued attention.

nodejs / CTC

AST, IR, bytecode and friends - standards and inovation #104