Positions (à la source maps)

sjrd commented 7 years ago

The current text of the proposal does not mention anything about mapping binary AST offsets back to source positions. If the original input is JavaScript source code, then I understand that debugging tools could show the pretty-printed version of the AST. However, if the input JS or the binary AST itself was produced by a compiler from another language, we need some way to map back to the original source file.

Directly leveraging source maps as they currently exist does not seem possible. They rely on very precise positions in the compiled .js file. If the .js file used for source mapping is the product of pretty-printing the AST, the source maps are very sensitively dependent on the pretty-printing algorithm. This would require to specify that algorithm along with the binary AST specification. IMO that is not a desirable situation.

Instead, I would suggest two possible paths to address this.

The first would be to store original positions of nodes directly in the AST. This has the advantage that it would produce extremely accurate source positions through the compilation pipeline of the VM, eventually resulting in better source mapping for the binary AST than what we enjoy with source maps at the moment. The disadvantage is that the binary AST itself is encumbered by positions, which would probably amount to a significant portion of the file size. Although useful for development "builds", it would unnecessary increase bloat for production files. That could be mitigated by a global flag at the beginning of the file telling whether positions are stored or not.

An alternative would be to define an equivalent to source maps, specifically designed for binary AST. Such source maps would map binary AST offsets to source positions.

bmeck commented 7 years ago

What would be wrong with the pretty printer also generating a new source map?

sjrd commented 7 years ago

What data would it generate the new source map from? It needs to know somehow what the original positions are. If it sees only a Binary AST file, how can it come up with "foobar.cljs, line 15, column 3"?

bmeck commented 7 years ago

@sjrd with chained source maps, this already occurs in the wild. Some transform like babel will generate a sourcemap, then some bundler like webpack will generate a new source map that points to the old one as well and chain lookups.

sjrd commented 7 years ago

For any kind of chained source mapping to work, you need at least what I propose as "An alternative" in the last paragraph. Otherwise you are lacking the first link of the chain.

bmeck commented 7 years ago

There are source map comments ( see example in https://developer.mozilla.org/en-US/docs/Tools/Debugger/How_to/Use_a_source_map ) that we could serialize. Also people can use HTTP headers that allow this ( https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/SourceMap ) if it needs to be out of band.

sjrd commented 7 years ago

From your last comment, I think you misunderstand the issue I'm describing. Telling the browser that there exists a source map for foo.binary-ast and that it's located at foo.binary-ast.map is not the problem. The problem is what is contained in that .map file.

A source map file is a function from source position to source position. The codomain of that function is clearly defined: it is a position in a source file (e.g., Foo.scala). The domain of that function, for a .js file, is also clearly defined: it is a position in that .js file. But in the case of the Binary AST, the domain of the source mapping function is currently not defined whatsoever.

The purpose of this issue is to fine some way to address this.

bmeck commented 7 years ago

@sjrd I don't understand. I think you want source maps to always have a well known transform from JS Source Text to the AST but I don't think that is reasonable. Converting could re-order things I'd imagine:

function f() {}
function g() {}
f(g());

Could have some transform that is done by a tool that generates an AST equivalent of:

function g() {}
function f() {}
f(g());

Are you trying to remove the ability for ASTs to map to non-JS Source Texts by mandating positions rather than leaving that up to code generators?

sjrd commented 7 years ago

On the contrary, I am trying to add the ability for ASTs to map to non-JS source texts. I want the code generators to be able to tell the VM where any given Binary AST node comes from, whatever language that source was written in.

I'm asking for basically the same feature as source maps, but where the domain of the function is a binary AST offset instead of a position in the compiled .js file.

I think you want source maps to always have a well known transform from JS Source Text to the AST

No, I don't.

bmeck commented 7 years ago

@sjrd wouldn't this be the same as a codegenerator that had a sourcemap that pointed to the .scala files by precomputing the chained offsets?

sjrd commented 7 years ago

It would if there was a chain to begin with. But right now the first link of the chain is missing.

I can compile a Foo.scala file to foo.js and emit a perfect foo.js.map that maps positions in foo.js to positions in Foo.scala. If I then parse and store the AST of foo.js into foo.binary-ast, I currently have no way to map the offsets in foo.binary-ast to either positions in foo.js (the first link of the chain) or Foo.scala (the whole chain). foo.js.map has become completely useless.

bmeck commented 7 years ago

If we serialize SourceMap comments it could be inline. If you send the HTTP header it works right now?

sjrd commented 7 years ago

No. I have addressed this rebuttal already in https://github.com/syg/ecmascript-binary-ast/issues/17#issuecomment-323385317.

bmeck commented 7 years ago

@sjrd I disagree with your rebuttal since we have mechanisms to get these source maps. The source maps can point to either a chain or to precomputed maps. I see no gain in functionality.

But in the case of the Binary AST, the domain of the source mapping function is currently not defined whatsoever

A source map is a function of position -> position, I see no reason you cannot use a byte offset as a position currently.

sjrd commented 7 years ago

I see no reason you cannot use a byte offset as a position currently.

Because a position, in a source map, is a triple (source file, line, column). An offset is an integer.

At this point I think we are talking past each other, and completely polluting this thread and the mailboxes of watchers. I will let others decide whether this issue needs more clarification.

bmeck commented 7 years ago

Because a position, in a source map, is a triple (source file, line, column). An offset is an integer.

We still need the source file reference. I would always expect the same line number, just like minified JS or WASM binary format. Column is just a number here.

This is hinted at somewhat in WASM's design for tooling.

Stating that the line should always be 0 explicitly might be enough?

sjrd commented 7 years ago

Stating that the line should always be 0 explicitly might be enough?

Sure, that would be enough. It has to be specified, though ;)

It might not be the most efficient encoding, but that's secondary, as long as there is something specified that works.

sjrd commented 7 years ago

Following @Yoric's suggestion, here are some thoughts derived from my experience in designing the serialized form of the Scala.js IR, which is in a tree-based format and contains source positions, optimized for size (so lives in a constraints space relatively close to the Binary AST).

Some of the ideas are also present in source maps (specification here), and some are improvements because the format of the map is binary instead of text, and the annotated target is an AST instead of source text.

Basic considerations

When we have a text format, positions are implied by the offsets in the file, and therefore weigh 0 byte. Once we go to a serialized AST, we must explicitly store positions, but to be competitive we need to strive for a short encoding that is fast to parse.

As a first approximation, an AST will follow some source code more or less closely. This means that AST nodes in the binary format will follow each other more or less as tokens follow each other in the input file. This means that we can store position diffs instead of absolute positions, because the diffs are likely much smaller than the absolute positions. This idea is exploited by the format for source maps, actually.

Note that, even when doing a relatively direct translation from the source language to its AST, the order of the nodes does not completely follow the order of the tokens in the input file. For example, for the source x + 3, the AST node for the + comes either before x and 3 (in a pre-order serialized form) or after both (in a post-order serialized form). This means that position diffs are often negative. A serialized format for positions should allow negative diffs as well as positive ones to be useful for an AST format. This is also exploited by the source maps format.

Sometimes, positions jump around arbitrarily. In particular, when considering code generated by a possibly optimizing compiler, two adjacent AST nodes can point to source positions in different files. It must therefore be possible to have absolute jumps as well as relative diffs.

There are also things that the source map format is not efficient at, if we consider an AST instead of textual source. We can improve on those aspects for a binary AST format.

First, since the format follows whole tokens in the generated .js, it assumes that each new position map entry (called a segment) is given for the next token, compared to the previous one. Since the next token can span an arbitrary number of characters, the source map format starts each segment with a character diff in the generated .js. When giving positions to AST nodes, 1 AST node almost always matches 1 or more tokens. Therefore, whereas adjacent characters would often be part of the same token and share their positions, it is not the case for adjacent AST nodes. Optimizing for "ranges of AST nodes" is therefore harmful. It is better to always imply 1 AST node = 1 stored position (segment). We therefore avoid storing the "character diff" that source maps need, improving size.

Second, source maps were by design restricted to a textual encoding, using only characters in the ASCII set, valid as unescaped characters in a JS string. This design restriction is legitimate for source maps of .js, which is text itself, but is an unjustified burden for source maps for a binary AST, which is binary anyway. We can therefore improve on source maps by storing things in a binary-efficient way, rather than a text-efficient way. In particular, this includes all 8 bits of each byte. Since parsing time is also a concern, we also want positions not to be "null"-terminated (for some definition of "null", such as , in source maps), but rather know their lengths from the beginning.

The format we used in the Scala.js IR

Based on all these considerations, we designed the format of serialized positions in the Scala.js IR as follows:

Each AST node is preceded by its encoded position (see rationale in the previous section).
The position is either relative to the previous one (in the same file), or absolute (including from another file), or "unknown". If relative and the previous one is unknown, it's relative to the node before the previous one, and so on until one is not unknown.
Smaller diffs are encoded with fewer bytes than larger diffs (with the expectation that smaller diffs are more common than larger diffs).
The length of the serialized position is known from the first byte (for faster parsing).

Concretely, this is the format we use for each position. The first byte is decomposed in bits, the next bytes in pseudo-hex (2 chars per byte):

1st byte	next bytes	description
`ccccccc0`		Column diff (7-bit signed)
`llllll01`	`CC`	Line diff (6-bit signed), column (8-bit unsigned)
`____0011`	`LL LL CC`	Line diff (16-bit signed), column (8-bit unsigned)
`____0111`	12 bytes	File index, line, column (all 32-bit signed)
`11111111`		Unknown position

Underscores are irrelevant and must be set to 0. The file index is an index to a constant table with URIs of all source files.

As you can see, we have the following sizes:

1 byte for a position on the same line, and within 64 characters left or right of the previous position (in source maps, even this is at least 3 bytes!).
2 bytes for a position on a line within 32 lines up or down the previous one
4 bytes for a position further away in the same file
13 bytes for a position in a different file (or very far away in the same file)

Note that there are some assumptions that we can make for .scala sources (which is very rarely a compiler target), that are probably not reasonable for .js (which is very often a compiler target). For example, we assume that line length does not exceed 255 characters (we have to use 13 bytes to represent any position further on a line), or that jumping around files is pretty rare.

To be clear, I am definitely not recommending that the exact same encoding be used for the Binary AST. Rather, I hope that some of the insights above can be useful in narrowing down the spectrum in which to look for the best encoding for the Binary AST.

Some suggestions for the Binary AST

As actually suggested by @Yoric on the HN thread, and I totally agree with this: store positions of all AST nodes further in the file, so that they can be ignored during parsing until they are needed. They should probably be batched per-function (so that, given a function, we can immediately jump to the serialized version of its nodes), which implies that the first AST node of each function always has an absolute position.
Use 16-bit unsigned integers for absolute column numbers, because lines could be quite long.
Use variable-length fields for the absolute position, because jumping around files might be more frequent in compiled .js than in source .scala.
And generally: perform benchmarks on slight variations of the format (what fits into how many bytes) on a corpus of actual codebases (both source JS and JS that is the result of some compilation) to find the best trade-off. The specific details shown above where the result on such a "benchmark" for .scala codebases.

HTH

tc39 / proposal-binary-ast