alandonovan opened this issue 6 years ago
We should do this for the new indexed format. The old formats are not produced anymore.
There is no golang.org/x/tools/go/internal/gcimporter/iexport.go, so go/types today can only export in the $$B format. Should I file a separate issue to port iexport.go to go/types?
@alandonovan Yes, please. I think this would have to be an x/tools issue. The std lib doesn't support this mechanism even for the $$B format. (And I am about to remove the support for importing the $$B format from the std lib's gcimporter.)
I think the first step is deciding how to encode position information.
iexport inherited the optimized position encoding from bexport, which used delta encoding for line numbers within the same source file, and then an escape sequence for file:line pairs in another file.
Offhand, I can imagine a handful of different ways we might try encoding file:line:column deltas, but I have no clear idea which would be best. The simplest would probably be to continue using the existing scheme for the "file:line" delta, and then add an extra int64() for a separate "column" delta.
Does DWARF have any clever (and hopefully not overly clever) way of efficiently encoding file:line:column tuples?
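For concreteness, here is a minimal sketch of that kind of scheme in Go. The posWriter type, the explicit file delta, and all names are made up for illustration; this is not the actual iexport.go code.

```go
package export

import (
	"encoding/binary"
	"io"
)

// posWriter sketches the delta scheme above: each position is written
// as signed varints, starting with a file delta, then either absolute
// line/column (when the file changed) or line/column deltas (same
// file). The real bexport/iexport encoders instead fold the file
// change into a sentinel line delta to save a byte per position.
type posWriter struct {
	w        io.Writer
	prevFile int // index into the export data's file table
	prevLine int
	prevCol  int
}

func (p *posWriter) int64(x int64) {
	var buf [binary.MaxVarintLen64]byte
	n := binary.PutVarint(buf[:], x)
	p.w.Write(buf[:n]) // error handling elided for brevity
}

func (p *posWriter) pos(file, line, col int) {
	p.int64(int64(file - p.prevFile))
	if file != p.prevFile {
		p.int64(int64(line))
		p.int64(int64(col))
	} else {
		p.int64(int64(line - p.prevLine))
		p.int64(int64(col - p.prevCol))
	}
	p.prevFile, p.prevLine, p.prevCol = file, line, col
}
```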
go/internal/gcimporter.fakeFileSet will need work too, to be able to return positions with columns:
https://github.com/golang/go/blob/master/src/go/internal/gcimporter/bimport.go#L340-L366
The cmd/compile side of things looks straightforward though.
For Starlark's bytecode format I found delta encoding of columns worked nicely. When encoding a sequence of file positions, column numbers tend to increase in small increments within each line. Even across newlines there is lots of locality because all the statements in a loop body (say) are similarly indented. See: https://github.com/google/starlark-go/blob/master/internal/compile/compile.go#L430
Change https://golang.org/cl/196963 mentions this issue: cmd/compile: add column details to export data
CL 196963 naively adds delta-encoded column position information, and it appears to grow export data by about 9%:
Binary size is out of control. Please don't do this until there is genuine progress on reining in the bloat.
> Binary size is out of control. Please don't do this until there is genuine progress on reining in the bloat.
This is export data size. Binary size is unaffected.
(Edit: Binary size might be affected if this means DWARF debugging information grows from having more precise column information for inlined function bodies. I meant though that I wasn't measuring binary size.)
Latest revision of CL 196963 uses a better encoding for adding column details, so it only increases export data size by about 4% on (geometric) average.
Also, as mentioned in the commit message, it actually shrinks k8s.io/kubernetes/cmd/kubelet's binary size by 24kB, but I haven't been able to explain exactly why. The best I've tracked it down so far is that it shrinks the .rodata and .zdebug_info sections for some reason.
Change https://golang.org/cl/197678 mentions this issue: go/internal/gcimporter: support reading column details from export data
Okay, cmd/compile is now exporting column position, and gcimporter knows how to read that extra information.
However, fakeFileSet is currently dropping that data on the floor when it converts it into token.Pos. So someone needs to work on a CL to extend fakeFileSet.
For what it's worth, cmd/compile currently only records 8 bits of column data:
So the naive solution would be to just reserve 256x as much token.Pos space, but I'm worried that's not an appealing option on 32-bit platforms (where token.Pos is 32 bits): reserving maxlines (64k) * maxcolumns (256) per referenced file is 24 bits. So 32-bit CPUs wouldn't be able to import packages that reference more than 256 source files. (Maybe only 128 because of sign?)
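Spelling out the arithmetic behind that worry (a back-of-the-envelope sketch of the naive reservation, not real code from the importer):

```go
const (
	maxLines = 1 << 16 // 64K lines reserved per referenced file
	maxCols  = 1 << 8  // 256 columns, matching the 8 bits of column data

	perFile = maxLines * maxCols // 1<<24 token.Pos values per referenced file

	// token.Pos is a signed int, so a 32-bit platform has roughly 1<<31
	// usable values, allowing only 1<<31 / 1<<24 = 128 referenced files
	// before Pos space runs out.
	maxFiles32 = (1 << 31) / perFile // == 128
)
```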
Aside: In retrospect, token.Pos probably should have been an int64 instead of int, since it refers to file offsets, not memory offsets. Can't change that now though.
I wonder: could we change token.Pos to int64? Technically it's a breaking change, but these values are never treated as integers (aside from zero); they're more like opaque pointers into a virtual address space defined by a FileSet.
It would double its space usage, which could be a concern.
Two thoughts:
1. Is it okay if behavior differs between 32-bit and 64-bit CPUs? E.g., can we use a lossier token.Pos encoding so that 32-bit CPUs can still reference more files at once?
2. One option is to use a split encoding scheme where we reserve column Pos-space for the first N lines (e.g., 3k), and past that we fall back to line-only positions (dropping the column). (That would still increase Pos-space usage by 12x though; see the sketch below.)
Looking at Kubernetes source code there are files as large as 161,105 lines, but the 99%ile line count is ~2700 lines.
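To make the split scheme concrete, here is one possible reading, sketched as a helper that maps a line/column pair to an offset within a file's reserved Pos range. The 3,000-line cutoff, the choice to drop columns (rather than lines) past the cutoff, and all names are assumptions for illustration:

```go
const (
	columnLines = 3000    // lines that keep full column resolution (assumed cutoff)
	maxCols     = 256     // 8 bits of column data
	maxLines    = 1 << 16 // 64K lines, as today

	// Reserved Pos space per file: 3000*256 + (64K-3000) ≈ 830K offsets,
	// roughly 12x the current 64K-per-file reservation.
	perFile = columnLines*maxCols + (maxLines - columnLines)
)

// fakeOffset maps a 1-based line/column pair to an offset in [0, perFile).
func fakeOffset(line, col int) int {
	if line < 1 || line > maxLines {
		line = maxLines // clamp out-of-range lines
	}
	if col < 1 || col > maxCols {
		col = 1 // unknown or overlong column
	}
	if line <= columnLines {
		return (line-1)*maxCols + (col - 1)
	}
	// Past the cutoff: one offset per line; column information is dropped.
	return columnLines*maxCols + (line - 1 - columnLines)
}
```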
> I wonder: could we change token.Pos to int64?
Maybe? Looks like as precedent we changed math/big.Word from uintptr (since Go 1) to uint in Go 1.9.
The primary reason for the choice of uint32 was the use of token.Pos values in the index of the (original) godoc. Not sure how much of that code we're still using and whether it matters. Increasing to 64 bits will significantly increase (roughly double) the index size, which used to be in memory.
@adonovan: why?
@robpike: why increase the width of token.Pos? @mdempsky suggested it above as a possible workaround to the problem of creating token.Pos values when parsing export data. See https://golang.org/src/go/internal/gcimporter/bimport.go#L340. The issue is that token.File contains a table of offsets of newlines within the source file, so that it can map byte offsets to line/column pairs. We don't have that information for source files when we load export data, so we currently use a fake table corresponding to a file containing only 64K newlines. The line number is simply the offset, and the column is zero. If we instead built a fake table corresponding to a file containing 64K lines each of 256 spaces, then we could represent any line < 64K and any column < 256 accurately. However, that uses up a lot of token.Pos "address space" for each file.
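For illustration, a rough sketch of how such a fake line table could be built with go/token (the constants, map, and function are hypothetical, not the actual gcimporter code):

```go
package importer

import "go/token"

const (
	maxLines = 1 << 16 // the fake file holds 64K lines...
	maxCols  = 1 << 8  // ...of 256 columns each
)

// fakeLines[i] is the byte offset of the start of line i+1 in the
// imaginary file: every line is exactly maxCols bytes wide.
var fakeLines = func() []int {
	lines := make([]int, maxLines)
	for i := range lines {
		lines[i] = i * maxCols
	}
	return lines
}()

// posFor returns a token.Pos for a 1-based line/column pair in the named
// file, adding a fake token.File to fset on first use.
func posFor(fset *token.FileSet, files map[string]*token.File, name string, line, col int) token.Pos {
	f := files[name]
	if f == nil {
		f = fset.AddFile(name, -1, maxLines*maxCols)
		f.SetLines(fakeLines)
		files[name] = f
	}
	if line < 1 || line > maxLines {
		line = maxLines // clamp; such lines lose precision
	}
	if col < 1 || col > maxCols {
		col = 1 // unknown or overlong column
	}
	return f.Pos((line-1)*maxCols + col - 1)
}
```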
Thanks, @adonovan but I still don't see why that needs a 64-bit integer.
A token.FileSet is conceptually a concatenation of token.Files: it provides a number (token.Pos) for every index in the concatenation, a bit like a bunch of files memory-mapped adjacently in virtual address space. Each File usually reserves a block of this concatenation equal to its length. However, for files described by export data, we don't know the length of the file (or even the highest offset referenced by a token.Pos in the export data), so we would reserve a range of offsets for them equivalent to a 64KLoC file containing lines of 256 spaces, allowing us to compute the Pos for any line/col pair in that range, even though of course most aren't needed. (The actual fake table need not be materialized, though today it is, as a shared constant.)
However, a FileSet is typically a global variable of an application, so an app that reads hundreds of files from export data within its lifetime will keep appending these 24-bit-sized chunks to the mapping, which allows only 1<<(32-24) = 256 iterations before we run out.
Besides increasing the "address space", there are two alternative approaches. One is to encode the line offset table of each file, which would compress well but would nonetheless occupy more space in the export data and in memory after parsing. The other is to have the export data specify the maximum significant line and column numbers in each file, in effect still giving a rectangular overapproximation of the true raggedy shape of the file, but using a smaller rectangle. The latter approach would have almost no space overhead, but may be trickier to encode.
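The exporter side of that second alternative could be as simple as some per-file bookkeeping; a hypothetical sketch (none of these names exist in iexport.go):

```go
// fileDims records the largest line and column referenced in one file.
type fileDims struct{ maxLine, maxCol int }

// dims accumulates dimensions per file as positions are exported; the
// importer would then reserve only maxLine*maxCol Pos values per file,
// a smaller rectangle than the fixed 64K*256 one.
var dims = map[string]fileDims{}

func notePos(file string, line, col int) {
	d := dims[file]
	if line > d.maxLine {
		d.maxLine = line
	}
	if col > d.maxCol {
		d.maxCol = col
	}
	dims[file] = d
}
```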
Got it, thanks for taking the time to explain it.
I must say though, so much tech and space and overhead for "column" information doesn't feel like the right call to me, but fortunately it's not up to me.
FileSet is the problem, really. It does a great job of making the marginal cost of a position very small (32 bits), but it imposes a number of other costs, such as a monotonically increasing memory footprint, the need for an application-wide global, and the inability to easily synthesize arbitrary positions.
If we were to design the syntax tree anew, I would look more carefully at simpler alternatives like (at the cost of another word per node) putting a reference to the token.File into every syntax node so that each node can report its position without the need for an external table, or, even simpler, putting just a pointer to a shared filename string in every node, and encoding the line and column of each token into the high 24 and low 8 bits of a uint32.
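A toy version of that second alternative, to make the packing concrete (names and overflow handling are illustrative only; this is not the cmd/compile/internal/syntax representation):

```go
// Pos is a self-contained position: one word for a shared filename
// pointer, plus the line in the high 24 bits and the column in the
// low 8 bits of a uint32.
type Pos struct {
	file *string // shared by all nodes from the same file
	lc   uint32  // line<<8 | col
}

func MakePos(file *string, line, col uint32) Pos {
	if line >= 1<<24 {
		line = 0 // overflow: report "unknown line"
	}
	if col >= 1<<8 {
		col = 0 // overflow: report "unknown column"
	}
	return Pos{file, line<<8 | col}
}

func (p Pos) Filename() string { return *p.file }
func (p Pos) Line() uint32     { return p.lc >> 8 }
func (p Pos) Col() uint32      { return p.lc & 0xff }
```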
For a bit of perspective: The reason for the FileSet data structure is historic. The search index for the original godoc needed a dense encoding of positions (to save space), which is where the 32-bit encoding used for token.Pos comes from. It worked well for that application, but unfortunately the popularity of go/ast led to the pervasiveness of token.Pos(itions), and with that the need to carry around FileSets, which are cumbersome to manipulate.
The compiler (cmd/compile) uses a more modern AST (cmd/compile/internal/syntax) which has a new position representation. Each position is "stand-alone" but also significantly larger (two 64-bit words); in essence it's one more word per position.
I'd love to replace FileSet and token.Pos but until modules are a widespread default we can't do that. Once modules are widely adopted, we can start making changes.
None of the compiler's historical or present export data formats ($$, $$B, indexed) support column information, even though the compiler and the go/{token,ast,types} packages are capable of producing and consuming this information. We should add it to the export data file using a suitably efficient encoding.
Otherwise various analysis tools will continue to report either the correct column, or column 1, depending on whether they are reporting the position of an object loaded from source or export data.
@ianthehat