Debugfriendly output possible?

jdelker commented 4 years ago

I would like to use the decompiled java code for debugging purposes (e.g. tracing code execution in Netbeans), which basically requires two things:

output of original line numbers
fitting (stretching/combining) code lines to match actual lines in output file

Procyon actually does a pretty good job producing such output, but unfortunately fails in other areas. Thus, I would love to see this being possible in CFR.

@leibnitz27:

Is this something you would think is feasible with CFR in a foreseeable future?
Considering the current code and data structures, would you say this is something rather easy or rather hard to implement?

jdelker commented 4 years ago

This somewhat includes issue #72 - assuming the author's intention was also to use line numbers for debugging.

Lanchon commented 4 years ago

stretching/combining lines would detract a lot from readability, just when you need it most. plus the original jar would need to include line numbers, which in non-open source bytecode is rare. for these reasons, IMHO this feature is not a good idea.

from a usability perspective, a better spin at it would be having CFR reoutput the class files with added (or replaced) debugging info matching its output.

unfortunately this is very probably off the table because:

in general, it would require completely new functionality to output class files, meaning a lot of effort for low returns and major feature creep for a decompiler.
and specifically for CFR, i believe this decompiler uses its own class file reader instead of a library like ASM, meaning that writing class files would probably require rolling its own class writer library too, and thus would be a huge effort for this project in particular.

but there is a compromise:

assuming CFR's infrastructure allows establishing a relationship between bytecode addresses and line numbers of generated source (which may very well not be true), CFR could output maps of line numbers to bytecode addresses for each method body in some suitable format (standard, easy to parse). then someone else can easily create a tool based on ASM that can rewrite classes/jars and insert the debugging info from these mappings.

this is doable but requires new infrastructure in CFR to output the maps, and thus it is not completely ideal.

so here is an improvement:

maybe the least intrusive way would be for the author to define a custom parseable comment structure added at the end of some source lines that links this line (and next lines until another such comment) to a particular bytecode offset in the current method. this would require the least effort (no new infrastructure), and would still allow the creation of a bytecode rewriting tool.

if considering this solution, another custom parseable comment structure would be needed at the beginning of each method, linking the following bytecode offset comments to a specific bytecode method. this would accomplish two things:

it would complicate the rewriting tool immensely if it needed to parse the structure of the java program to find the appropriate bytecode method body into which the following mappings have to be injected. i am not familiar with the low level structure of class files, but something very low level such as a comment including Ljava/lang/String;.<init>()V to refer to the string constructor would be perfect. in fact, looking at the ASM API to see how signatures are expressed and using that syntax would help other tool authors.
it would free CFR to do all the class and method renaming stuff it wants to do and still be able to link back to the original code that has to be instrumented. (in fact, an option to only provide the method comments could also be valuable for debugging.)
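
A minimal sketch of how a rewriting tool might consume such comments. The marker syntax (`/*@method ...*/` and `/*@bci ...*/`) is invented here purely for illustration; CFR does not emit it, and the class and method names are hypothetical too. The parser collects, per method marker, a map from bytecode offset to the 1-based source line carrying the marker.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MarkerCommentParser {
    // Hypothetical marker syntax, made up for this sketch (not something CFR outputs).
    private static final Pattern METHOD = Pattern.compile("/\\*@method (\\S+)\\*/");
    private static final Pattern OFFSET = Pattern.compile("/\\*@bci (\\d+)\\*/");

    /** Returns method signature -> (bytecode offset -> 1-based source line). */
    public static Map<String, TreeMap<Integer, Integer>> parse(List<String> sourceLines) {
        Map<String, TreeMap<Integer, Integer>> result = new LinkedHashMap<>();
        TreeMap<Integer, Integer> current = null;
        for (int i = 0; i < sourceLines.size(); i++) {
            String line = sourceLines.get(i);
            Matcher method = METHOD.matcher(line);
            if (method.find()) {
                // A method marker starts a fresh mapping for the offsets that follow.
                current = new TreeMap<>();
                result.put(method.group(1), current);
            }
            Matcher offset = OFFSET.matcher(line);
            if (offset.find() && current != null) {
                // Keep the first line seen for an offset.
                current.putIfAbsent(Integer.parseInt(offset.group(1)), i + 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> source = List.of(
            "public Foo() { /*@method LFoo;.<init>()V*/",
            "    super(); /*@bci 0*/",
            "    this.x = 1; /*@bci 4*/",
            "}");
        System.out.println(parse(source));
    }
}
```

Because the method marker carries the low-level signature, the tool never has to parse Java syntax or care about any renaming the decompiler applied.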

jdelker commented 4 years ago

@Lanchon: Thanks for your detailed comment on this. I probably do not have enough background in Java bytecode and decompiling techniques to fully follow your thoughts there. But I guess your bottom line is that this is rather hard to implement within CFR and would be better solved by some post-processing (reformatting) task.

As said, my goal is to use Java debugging features (with Netbeans) to be able to break code execution on particular lines or methods. This has worked surprisingly well with most of the non-open Java code I have encountered so far, when using Procyon for decompilation. Maybe I was very lucky with the bytecode I tried, but it generated almost perfectly formatted code for debugging. Unfortunately, it fails quite badly on some other things, due to internal bugs. That brought CFR to my attention, as it generates very good code, too.

Considering the availability of original line number information in the bytecode, it should be more an issue of stretching than of combining, unless the code was garbled in the first place. Bottom line: that "debug formatting" may not be suitable for every bytecode. And debugging the decompiled code is meaningless anyway if the original code wasn't formatted well or doesn't contain proper line number information. But where that applies, the decompiler would help tremendously if it reconstructed that formatting, too.

So, I've no clue how this could effectively be tackled within CFR. For me, it's this last, essential feature, which would make CFR the #1 choice.

leibnitz27 commented 4 years ago

So. @Lanchon has basically hit the nail on the head here. Long exposition follows, with TL;DR at the end.

There are a few points worth mentioning.

CFR has a bunch of normalisation passes that mutate bytecode graphs, duplicate instructions, deduplicate instructions, and basically make fairly aggressive changes while retaining the semantics.

This is one of the reasons why it actually behaves pretty well against obfuscated code (though I actually don't specifically target any obfuscators), it's more a question of 'ooh I wonder if I could cope with this'.

As such, the lines output often bear little resemblance to the ordering of the lines in the line number metadata table.

This can even come into play when not using aggressive normalisation - consider -

   0: iconst_0
   1: istore_1
   2: iload_1
   3: bipush        10
   5: if_icmpge     21
   8: getstatic     #2                  // Field java/lang/System.out:Ljava/io/PrintStream;
  11: iload_1
  12: invokevirtual #3                  // Method java/io/PrintStream.println:(I)V
  15: iinc          1, 1
  18: goto          2
  21: return

Was this a for loop? A while loop? (trick question - both a for and while loop generate this, if you format them appropriately).
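
For illustration, here are two source forms that a typical javac should compile to the instruction sequence above (a sketch; the class and method names are made up, and constant pool indexes may differ). The methods are instance methods so that the counter lands in local slot 1, as in the listing. Comparing the two with `javap -c -l LoopShapes` should show the same instructions, with any difference confined to the LineNumberTable.

```java
public class LoopShapes {
    // Counting loop written with `for`.
    void asFor() {
        for (int i = 0; i < 10; i++) {
            System.out.println(i);
        }
    }

    // The same loop written with `while`.
    void asWhile() {
        int i = 0;
        while (i < 10) {
            System.out.println(i);
            i++;
        }
    }

    public static void main(String[] args) {
        LoopShapes shapes = new LoopShapes();
        shapes.asFor();
        shapes.asWhile();
    }
}
```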

At this point, you could say 'hang on, using the line numbers would actually help you tell the difference'. That's true...

Decompilation quality + speed.

If we add a constraint to only allow normalisations to work if they preserve line number ordering, this is going to have to be enforced.

Any normalisation that can't be used will reduce CFR's ability to produce readable code.

But (and this is a pretty big deal for me) - CFR works by performing (sorry haskell folk) multiple graph mutations. This means that if we now add an invariant that line number ordering is preserved (and partial orderings of bytecode are only valid if they preserve this), then I'll need to add a clone, mutate, rewind, or change the internals to use a non-mutating approach. That's painful (and not cheap).

Coverage. (this is less important, but worth mentioning)

An output is only useful if it gets tested - realistically considering line numbers adds a whole new dimension to the problem.

Principle of least trust.

I mention this in my FAQ - but I have made a lot of effort to avoid trusting any part of a class file which could conceivably be populated with lies.

That's why I don't use the LocalVariableTypeTable (which could significantly reduce the complexity of CFR's type inference).

LineNumberTable is clearly one of these.

A decompiler isn't a debugger (I'm potentially out of step with the world here, but hear me out).

The principle of least surprise is such that people, when given a decompiler that gets line numbers right 90% of the time, will get confused/annoyed by the times when the decompiler doesn't get it spot on.

That x++ in your decompiled code that didn't do anything? WTF? Oh, it takes effect 3 lines later? WHAT?!

I know people use decompilers and debuggers all the time, (heck, I'm a hypocrite), but people should really be careful to not over-rely on these. They're not perfect, and the point at which you're trying to use a debugger is exactly the point at which you want (or need) them to be perfect.

Stretching really is lame (I had to say it).

Yep, sorry, but the naive 'hey, I'm done here, now can we squash/stretch to match' really is terrible. That's to say, this cannot be treated as a post processing step, it would need to be integral.

TL;DR.

I think @Lanchon 's suggestion has some legs - provide a way of emitting the line number that the byte code (supposedly) came from, and provide some help with tooling to allow this to be retrofitted - however I suspect that the additional effort involved in using this would put most people off.

Lanchon commented 4 years ago

thanks for all the info!

provide a way of emitting the line number that the byte code (supposedly) came from

well, i suggested this if it was low hanging fruit: ie, the mapping info was already mostly there in the IR from which you are generating the output. per your description it looks like a concern that cuts through a lot of the work you have already done: class reader, IR(s), and graph transforms.

it sounds like the amount of work outweighs the value provided by the feature, so unless there is something about this task that emotionally draws you in, you should probably leave it out.

leibnitz27 commented 4 years ago

FWIW - I think this is something that would have its purpose better served with an IDE specific plugin.

It's a large effort (and as described above, not foolproof, which is my main concern) to try to get a decompiler to match the line numbers perfectly.

However, it's fairly simple to say 'the instruction at bytecode XX is being OUTPUT on line YY'.

An IDE specific decompiler plugin could then load the classfile, but get the line number table from CFR at the same time it gets the decompilation.

If I had to guess, I'd say JetBrains' IDEA would be the easiest to demonstrate a proof of concept here.

This would be fairly nice, as it should enable near perfect matching of bytecode/text.
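
As a rough sketch of what a plugin could do with "instruction at bytecode XX is output on line YY" facts: collapse them into LineNumberTable-style entries, where each entry marks the first offset of a run that belongs to one output line. This is an illustration under assumed inputs, not CFR's or any IDE's API; the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

public class LineTableBuilder {
    /**
     * Input: bytecode offset -> decompiled output line.
     * Output: start offset -> line, keeping only offsets where the line changes,
     * which is the shape of a LineNumberTable (start_pc, line_number).
     */
    static TreeMap<Integer, Integer> build(Map<Integer, Integer> offsetToOutputLine) {
        TreeMap<Integer, Integer> table = new TreeMap<>();
        int previousLine = -1;
        // Walk offsets in ascending order so runs of the same line collapse.
        for (Map.Entry<Integer, Integer> e : new TreeMap<>(offsetToOutputLine).entrySet()) {
            int line = e.getValue();
            if (line != previousLine) {
                table.put(e.getKey(), line);
                previousLine = line;
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // Line 5 owns two disjoint bytecode ranges (offsets 0-3 and 12).
        System.out.println(build(Map.of(0, 5, 3, 5, 8, 6, 12, 5)));
    }
}
```

Note that one output line may own several disjoint bytecode ranges; a table of this shape can express that, since the same line number may appear in more than one entry.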

jdelker commented 4 years ago

Well, my particular use case (using the decompiled Java bytecode to trace execution) may be quite special, and I understand that it does not go along nicely with the concepts and principles you mentioned above. Unfortunately, I can't avoid the necessity of doing that, so I have to find the best decompiler matching the requirements. Maybe CFR is just too good at code reassembly to retain the original formatting ;).

The more logic CFR applies to parsing that bytecode and building smart code constructs, the harder it gets to "squeeze" the result into its original shape. That fact is probably inevitable, and giving up decompilation quality in favor of proper formatting is probably not a wise choice.

The idea of an "IDE plugin" sounds interesting. I've not touched any UI plugin development (particularly in Netbeans) yet.

Lanchon commented 4 years ago

IDEA is amazing. the only problem with it is: https://youtrack.jetbrains.com/issue/IDEA-225700 :)

It's a large effort [...] to try to get a decompiler to match the line numbers perfectly. However, it's fairly simple to say 'the instruction at bytecode XX is being OUTPUT on line YY'.

well i don't follow if it is simple to provide some output or not.

jdelker commented 4 years ago

However, it's fairly simple to say 'the instruction at bytecode XX is being OUTPUT on line YY'.

Is that something the current code already provides somehow (by method or lookup map?), or does this need to be implemented first?

leibnitz27 commented 4 years ago

No, it will require some non-trivial surgery. It's possible though. I have some concerns about the cost of keeping track of originator bytecode offsets, and quite how it would work when multiple lines of (disjoint) bytecode are combined into a single line.

It's interesting though, and much more realistic than trying to match the original LineNumberTable, so I'll give it a go.

Lanchon commented 4 years ago

worse for our purposes is when some bytecode impacts more than one source line.

so I'll give it a go

oops... i think i wouldn't. just remember to try to make the best use of your time.

modmuss50 commented 4 years ago

We are currently using a fork of the decompiler IntelliJ uses (fernflower) to output a line number map file.

Example of the line map file:

package/class
    74  75
    64  65
    69  70
    74  75
    19  20
    22  23
    24  24

Our gradle plugin then uses LineNumberRemapper.java to create a jar with the remapped line numbers, which is then run from the IDE.
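
A minimal sketch of reading a map in the shape shown above and looking up a remapped line. It assumes, without confirmation from the thread, that the first column is the line number stored in the class file and the second is the line in the decompiled source. It does not reproduce LineNumberRemapper.java, and the class name here is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LineMapReader {
    /** Parses "class name, then indented number pairs" into class -> (from -> to). */
    static Map<String, Map<Integer, Integer>> read(List<String> lines) {
        Map<String, Map<Integer, Integer>> byClass = new HashMap<>();
        Map<Integer, Integer> current = null;
        for (String raw : lines) {
            if (raw.isBlank()) {
                continue;
            }
            String[] parts = raw.trim().split("\\s+");
            if (!Character.isWhitespace(raw.charAt(0))) {
                // Unindented line: start of a new class section.
                current = new HashMap<>();
                byClass.put(parts[0], current);
            } else if (current != null && parts.length >= 2) {
                // Indented line: assumed to be "original line, decompiled line".
                current.put(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
            }
        }
        return byClass;
    }

    /** Lines without an entry are left unchanged. */
    static int remap(Map<Integer, Integer> classMap, int line) {
        return classMap.getOrDefault(line, line);
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> maps =
            read(List.of("package/class", "\t74\t75", "\t64\t65", "\t24\t24"));
        System.out.println(remap(maps.get("package/class"), 64));
    }
}
```

A rewriting step would then apply `remap` to every entry of each method's LineNumberTable before writing the class back out, for example with ASM.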

We did consider using CFR at the time, but this was one of the major sticking points. Having this as an option would be awesome, as fernflower isn't ideal.

I think remapping the line numbers is a better solution than forcing the source to match the input classes. CFR could have an optional argument to export a remapped class/jar.

Lanchon commented 4 years ago

amazing, there's a gradle task to do the remapping already!

CFR could have an optional argument to export a remapped class/jar.

don't think so because AFAIK CFR doesn't use ASM and its home grown library only reads class files.

but it could output a line map compatible with fernflower :)

Lanchon commented 4 years ago

could you maybe post a real FF-generated line map file for reference?

modmuss50 commented 4 years ago

Our gradle implementation isn't something to copy; it has a few major flaws that need working out. But it would definitely be possible to use gradle to automate the whole process.

The linemaps we have are quite large (1619KB), I have included a direct download link, as well as a gist.

minecraft-1.15.1-mapped-net.fabricmc.yarn-1.15.1+build.24-v2-sources.lmap

gist (large)

modmuss50 commented 4 years ago

I should make it clear, this is from a fork of FF, not the real thing. I wouldn't worry about trying to make anything compatible with what we have, it's all a bit of a hack.

Lanchon commented 4 years ago

I should make it clear, this is from a fork of FF

ah ok... :-/

thanks!

leibnitz27 commented 4 years ago

I definitely would not intend to generate a new class file or jar - I don't think (this is a matter of opinion, but I have a pretty firm one) that investigating something by altering it is good practice.

It's significantly simpler (conceptually, though goodness knows how IDEA etc will play, but I am somewhat optimistic) to just say 'hey, here's the source text and here's the line number map you WOULD have read from the classfile'.

leibnitz27 commented 4 years ago

While I don't advise anyone uses it (I expect to make it available via API, and thence to IDE plugins), it's possible to get reasonable REAL mappings now ;)

C:\code\cfr\target\classes>java org.benf.cfr.reader.Main C:\code\cfr_tests\output\java_8\org\benf\cfr\tests\LambdaTest6.class  --trackbytecodeloc true
/*
 * Decompiled with CFR 0.151-SNAPSHOT (b266dda).
 */
package org.benf.cfr.tests;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class LambdaTest6 {
    static <T, R> List<R> map(Function<T, R> function, List<T> source) {
        ArrayList<R> destiny = new ArrayList<R>();
        for (T item : source) {
            R value = function.apply(item);
            destiny.add(value);
        }
        return destiny;
    }

    public void test() {
        List<String> digits = Arrays.asList("1", "2", "3", "4", "5");
        List<Integer> numbers = LambdaTest6.map(Integer::new, digits);
    }
}
------------------
Line number table:

test()
----------
Line 22 : 0
Line 23 : 33

map(java.util.function.Function<T, R> java.util.List<T> )
----------
Line 13 : 3
Line 14 : 8
Line 15 : 32
Line 16 : 42
Line 18 : 54
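
Counting lines in the listing above, each "Line N : M" entry pairs output line N with bytecode offset M (for example, line 22 is the `digits` assignment in `test()`). As a stopgap until the API exposes this, the printed table could be scraped with something like the sketch below. It assumes the exact text layout shown here, which may change between CFR versions; the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CfrLineTableParser {
    private static final Pattern ENTRY = Pattern.compile("Line (\\d+) : (\\d+)");

    /** Returns method header -> (bytecode offset -> decompiled output line). */
    static Map<String, TreeMap<Integer, Integer>> parse(List<String> lines) {
        Map<String, TreeMap<Integer, Integer>> tables = new LinkedHashMap<>();
        TreeMap<Integer, Integer> current = null;
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            // A ruler of exactly ten dashes follows each method header.
            if (i > 0 && line.matches("-{10}")) {
                current = new TreeMap<>();
                tables.put(lines.get(i - 1).trim(), current);
                continue;
            }
            Matcher m = ENTRY.matcher(line);
            if (m.matches() && current != null) {
                current.put(Integer.parseInt(m.group(2)), Integer.parseInt(m.group(1)));
            }
        }
        return tables;
    }

    public static void main(String[] args) {
        List<String> dump = List.of(
            "------------------",
            "Line number table:",
            "",
            "test()",
            "----------",
            "Line 22 : 0",
            "Line 23 : 33");
        System.out.println(parse(dump));
    }
}
```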

modmuss50 commented 4 years ago

That's great news!

I took a peek at using it via the API and saw it didn't seem quite ready yet, especially with the SinkReturns.LineNumberMapping_DO_NOT_USE class name ;)

What's there of the API seems easy enough; it would just need a way to get the class name from it.

Awesome work, and I'm happy to be a guinea pig to help test anything.

leibnitz27 commented 4 years ago

Hm. I'm a little saddened by how IDEA handles decompiled line numbers - unless I read it wrong, it requires you to provide a mapping from the line numbers in the LineNumberTable (which could be missing, or could be lies) and the decompiled line numbers.

This means it's not possible to get IDEA to behave nicely if line table is stripped, and an extra correlation is required to match up before/after lines.

Oh well. Still possible to do SOMETHING (screenshot of internal state of an idea plugin I just threw together)

can be used in a nice plugin to get this sort of thing working. Getting there ;)

modmuss50 commented 4 years ago

Looks good, I have had a go at supporting this in our gradle plugin here: https://github.com/FabricMC/fabric-loom/pull/248/ I just need to figure out why getClassFileMappings doesn't seem to have a source line for all decompiled lines. (I'm probably doing something wrong.) I've run out of time to look into it much now.

hengyunabc commented 3 years ago

👍 Supported in arthas: https://arthas.aliyun.com/doc/en/jad

But the output seems not to be sorted by line numbers:

[arthas@61148]$ jad java.lang.String '<init>'

ClassLoader:

Location:

        public String(byte[] byArray) {
/*556*/     this(byArray, 0, byArray.length);
        }
        public String(byte[] byArray, int n, int n2) {
/*535*/     String.checkBounds(byArray, n, n2);
/*536*/     this.value = StringCoding.decode(byArray, n, n2);
        }
        public String(byte[] byArray, Charset charset) {
/*505*/     this(byArray, 0, byArray.length, charset);
        }
        public String(byte[] byArray, String string) throws UnsupportedEncodingException {
/*481*/     this(byArray, 0, byArray.length, string);
        }
        public String(byte[] byArray, int n, int n2, Charset charset) {
/*450*/     if (charset == null) {
                throw new NullPointerException("charset");
            }
/*452*/     String.checkBounds(byArray, n, n2);
/*453*/     this.value = StringCoding.decode(charset, byArray, n, n2);
        }

nbauma109 commented 2 years ago

👍 Supported in arthas: https://arthas.aliyun.com/doc/en/jad

@hengyunabc : I borrowed your code for the ecd (Enhanced Class Decompiler Plugin for Eclipse) https://github.com/alibaba/arthas/blob/931ce392fdc6bf675bbc3997917079c9ce3c9cb2/core/src/main/java/com/taobao/arthas/core/util/Decompiler.java#L34

But the output seems not to be sorted by line numbers:

Yes, I guess it's the order in the bytecode. I think an option could be created to sort the members (i.e. inner classes and methods) in the AST before generating the output? It could be either bytecode order (default) or source code order.
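
To make the idea concrete, here is a text-level stand-in for that sort (a sketch only; the real option would work on the decompiler's AST, and the class name is hypothetical). Each member is taken as a block of text, keyed by the first `/*N*/` line-number comment it contains, in the style of the arthas output above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MemberSorter {
    private static final Pattern LINE_COMMENT = Pattern.compile("/\\*(\\d+)\\*/");

    /** First line-number comment in the member, or MAX_VALUE if it has none. */
    static int firstLine(String memberText) {
        Matcher m = LINE_COMMENT.matcher(memberText);
        return m.find() ? Integer.parseInt(m.group(1)) : Integer.MAX_VALUE;
    }

    /** Stable sort: members without line info keep their relative order at the end. */
    static List<String> sortBySourceLine(List<String> members) {
        List<String> sorted = new ArrayList<>(members);
        sorted.sort(Comparator.comparingInt(MemberSorter::firstLine));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> members = List.of(
            "public String(byte[] b) {\n/*556*/ this(b, 0, b.length);\n}",
            "public String(byte[] b, Charset c) {\n/*505*/ this(b, 0, b.length, c);\n}");
        sortBySourceLine(members).forEach(System.out::println);
    }
}
```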

@leibnitz27 : does this make sense to you? I'd like to avoid re-parsing the output into an AST to re-order and re-output.

patric-r commented 2 years ago

+1 having an original-line-number-aligned decompilation output is a must for debugging 3rd-party-libraries (without source code) efficiently.

nbauma109 commented 2 years ago

You may want to check https://github.com/nbauma109/ecd for a realigned output of CFR

patric-r commented 2 years ago

@nbauma109 don't get me wrong and maybe I'm a little bit too cautious, but a "1 star" github is not really trustworthy (yet). Why have you created the fork from https://github.com/ecd-plugin/ecd and what is the difference?

How can your eclipse plugin realign the line numbers if CFR does not provide any line numbers in its decompilation output?

nbauma109 commented 2 years ago

CFR provides line number mappings in the latest release. The forked version of ecd produces line numbers as comments based on that mapping, and the existing post-process of ecd realigns line numbers using the parsed line number comments. As mentioned above, a post-process task is not ideal at all but I find it better than nothing. As for your comment on popularity and trust, I'd like to remind you that the original ecd used to be in the top rankings of the eclipse marketplace but was a whole spyware machine with privacy-violating contents that other people have done a good job of removing. Pretty ironic.
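
For readers unfamiliar with that kind of post-process, here is a deliberately naive sketch of the stretching half of it. It is not ecd's implementation; the class name is hypothetical and it assumes each statement line starts with a `/*N*/` comment. Blank lines are inserted so that a line tagged `/*N*/` lands on line N; lines that would have to move backwards are simply left where they fall.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineStretcher {
    // Assumed convention: the target line number is a comment at the start of the line.
    private static final Pattern LINE_COMMENT = Pattern.compile("^\\s*/\\*(\\d+)\\*/");

    static List<String> realign(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE_COMMENT.matcher(line);
            if (m.find()) {
                int target = Integer.parseInt(m.group(1));
                // Pad with blank lines until the next emitted line is number `target`.
                while (out.size() + 1 < target) {
                    out.add("");
                }
            }
            out.add(line);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> source = List.of(
            "class A {",
            "/*3*/ int x = 1;",
            "/*6*/ int y = 2;",
            "}");
        realign(source).forEach(System.out::println);
    }
}
```

Members emitted out of source order, or several original lines merged into one, cannot be fixed by padding alone, which is part of why this approach is called not ideal above.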

leibnitz27 commented 2 years ago

I've commented pretty extensively on this above - line numbers per se don't really make sense - you can't take this output and expect it to line up perfectly with the original code (even with stretching and squashing). The quality will always be poor, with lots of confusing jumps etc. (Yes, I know it's been done reasonably, you can do an adequate job, but you'll never do a great job)

The sensible thing to do (anyone know someone at jetbrains? :) ) is to decompile, then tweak the line number table to match the new code, such that the bytecode locations the debugger has map to the rebuilt output.

As nbauma109 says - Cfr supports generating a bytecode map (most entities output contain information about where they are in the bytecode), so effectively generates a NEW line number table.

(This maps back to modmuss50's comment, but it would be so much cleaner to do it as an extra input to the debugger)

One day.......

On Tue, 9 Aug 2022, 15:41 nbauma109, @.***> wrote:

CFR provides line number mappings in the latest release. The forked version of ecd produces line numbers as comments and the existing post-process of ecd realigns line numbers using eclipse AST. As mentioned above, a post-process task is not ideal at all but I find it better than nothing. As per your comment on popularity and trust, I'd like to remind that original ecd used to be in the top rankings of eclipse marketplace but was a whole spyware machine with privacy violating contents that other people have made a good job removing. Pretty ironical.


nbauma109 commented 2 years ago

IntelliJ uses the fernflower decompiler, which doesn't support line number realignment. However, the debugger is able to point you to the right line in the code. So the debugger is somehow able to process the line mappings provided by fernflower. If JetBrains has 'opened' the possibility to provide a decompiler implementation, we can possibly debug with any decompiler which provides line number mappings.

leibnitz27 / cfr

Debugfriendly output possible? #73