perl11 / cperl

A perl5 with classes, types, compilable, company friendly, security
http://perl11.org/

oplines: precise error locations, less useless COPs #333

Open rurban opened 6 years ago

rurban commented 6 years ago

See https://news.ycombinator.com/item?id=15185383

Looking up COPs to find the error lines is inexact, slow, and about 8% heavier than using fewer COPs and moving the opline into each op. I had an oplines branch a couple of years ago; it just needs the warning and error cases added.

rocky commented 6 years ago

I'd like to get a little more clarity here. There are two separable ideas.

First, you don't need to store location info in the op tree; you can get by with just a source-to-tree location mapping (somewhat inappropriately called a "line-number table"). That could be stored as a separate structure outside the op tree. With this, the tree is smaller (by 8%, you say?) and execution is a little faster.

It is only when there is an error, caller() is invoked, or you are debugging that you need access to this information, which is pretty rare. There are additional steps needed to find the association, but since the need is rare, the additional overhead is not worth worrying about, and it is more than compensated for by the speedup in the normal non-error/debug situations.

The second idea that seems to be mentioned above is something recently observed: with a bit of work, you can reconstruct equivalent source code from just the op tree. For Perl specifically, it is eerily the same most of the time. And by using the tree you can often get a more precise idea of what's up for execution based on the position in the op tree.

These two ideas, though, are independent, although they can be blended and probably should be.

With respect to decompiling in Perl, I'll be giving a talk on this at the upcoming Glasgow YAPC. In preparation I just spent some time looking at both CPerl's and Perl's B::Deparse module. It is:

To that last point, that's why I wrote B::DeparseTree.

In preparation for the talk I looked at adapting it to handle CPerl 5.26.2, or getting greater coverage than what I had previously done for vanilla Perl 5.26.2. It is going to be a lot of work, so I'd better get back to it.

rurban commented 6 years ago

Hi Rocky, yes, I'm aware that we could store an extra line-number table to associate an op with file+line. We did that with perl6; in Parrot it was called annotations. It mimics stabs and DWARF a bit.

The 8% number comes from my implementation that removes unneeded COPs and stores the opline in each op. Then you only need COPs at the beginning of each new file scope, as jump targets, and when introducing or ending a scope. Currently every new line and every semicolon needs a new COP, which is slow.

Optree introspection and deparsing is already a supported feature. With optimizations going on (and perl and cperl are adding more and more), the exact source will not be reconstructable, but an equivalent source will be, which is enough. Inlining and loop unrolling matter more than reconstructing the fully equivalent source, but if there are free bits on the ops they will be used to help with this goal.

For error reporting the missing bits are in the lexer and parser. There are still wrong lines being reported, even without deparsing. This should be pretty easy to fix, but in the last 20 years nobody has had time to do it. Thanks for your YAPC talk, but I'm afraid I will not be able to make it there. I will watch it on YouTube later.

rocky commented 6 years ago

With optimizations going on (and perl and cperl are adding more and more), the exact source will not be reconstructable, but an equivalent source will be, which is enough.

Sure, that's fine. However, I'd ask for one other thing. It would be nice to have (at least on demand, as an option) a transaction log of what optimizations were performed on the stream, from a reverse-delta standpoint. ("Reverse delta" is the way GNU Emacs AntiNews is described, and the way the old revision-control system RCS first improved on the older SCCS; it is how all version-control deltas work today.) Here is a fictitious program and log

my ($x, $y) = (2, 3);       # 1
my @z;                      # 2
if ($x + $y == 5) {         # 3
    $z[0] = 2;              # 4
} else {                    # 5
    $z[0] = 3;              # 6
}                           # 7

The line numbers are more to get the idea across than what might really be used, which would be an instruction offset or some other way to track an instruction's position as it moves around.

The way to think of this is as the compiler writing to a transaction journal as it proceeds.

Also, or alternatively (behind a switch), some summary information could be stored in each instruction (a side table is okay). For example, let's say there is no "if" optimization on line 3, but we have done the constant folding to 5, and the instruction to load 5 is:

SVOP (0x55f89e64b0a8) const  IV (0x55f89e642d98) 5 

There could be a flag bit indicating that this instruction is the result of a constant-folding optimization. Other flags might mark an instruction that was moved, or instructions that were added as a result of loop unrolling, and so on. No flag bits on an instruction indicates it was part of the original source stream.

A program such as a debugger, or a poor programmer, could use this to descramble the resulting equivalent program and warn which parts might be funky due to optimization and which parts aren't.

I'd be happy to split the above off and log it as a separate feature request. Obviously this is a low-priority thing. Like this issue and idea, jotting it down gets the idea out there on the off chance that someone is interested and has the time.

For error reporting the missing bits are in the lexer and parser. There are still wrong lines being reported.

In my opinion the problem with working on B::Deparse.pm is its hugeness and monolithic-ness, its lack of high-level documentation on how it works, and its monolithic test cases. 6.5K lines in a single file, really? Yeah, that is slim compared to perl5db.pl, which is 10.5K, but Perl is not going to win friends with code like this.

It's not the 6.5K or 10.5K; the problem demands that. It's the fact that it is one file. By comparison, Devel::Trepan is 16.6K lines of just code (or 25K if you count comments and blank lines), but that's spread over 178 files, which averages about 150 lines per file. So largeness is less of an issue if the code is modular and broken down.

And the tests are likewise large and monolithic. Yeah, it is cool that Perl can read data from its own script, but when the tests number in the hundreds, it is best to put them in a separate file or files, segregating simple and complicated test cases and specific bugs, and to add a way to point the tests at your own cases. I will probably do that for DeparseTree soon. By doing this, testing can grow even larger and be more complete: for example, why not deparse Perl's entire test suite and run the result? (That's what I attempt to do in Python.)

Thanks for your YAPC talk, but I'm afraid I will not be able to make it there. I will watch it on YouTube later.

I've not had much success at having my talks recorded. However, there will definitely be slides and notes for the slides.

The most complete jotted-down ideas I have on decompiling are here. They haven't gotten much review, so I would welcome comments. Perl's decompiler is a little bit different since it already works off a tree; I should write a separate piece covering that and how it relates to the pipeline presented in that paper. I guess I'll have that done by the time Glasgow rolls around. There is definitely material that could be presented; which material exactly, alas, gets decided at the last minute.