Open jdelker opened 4 years ago
This somewhat includes issue #72 - assuming the author's intention was also to use line numbers for debugging.
stretching/combining lines would detract a lot from readability, just when you need it most. plus the original jar would need to include line numbers, which in non-open source bytecode is rare. for these reasons, IMHO this feature is not a good idea.
from a usability perspective, a better spin at it would be having CFR reoutput the class files with added (or replaced) debugging info matching its output.
unfortunately this is very probably off the table because:
but there is a compromise:
assuming CFR's infrastructure allows establishing a relationship between bytecode addresses and line numbers of generated source (which may very well not be true), CFR could output maps of line numbers to bytecode addresses for each method body in some suitable format (standard, easy to parse). then someone else can easily create a tool based on ASM that can rewrite classes/jars and insert the debugging info from these mappings.
this is doable but requires new infrastructure in CFR to output the maps, and thus it is not completely ideal.
so here is an improvement:
maybe the least intrusive way would be for the author to define a custom parseable comment structure added at the end of some source lines that links this line (and next lines until another such comment) to a particular bytecode offset in the current method. this would require the least effort (no new infrastructure), and would still allow the creation of a bytecode rewriting tool.
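A minimal sketch of how a third-party tool might read such comments, assuming a purely hypothetical marker syntax of a trailing `// @bc <offset>` (CFR does not emit this; the marker, the class name `BytecodeMarkerScanner` and the method `scan` are all made up for illustration). It collects one entry per marked line, in the same start_pc to line shape that a LineNumberTable uses:

```java
import java.util.List;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BytecodeMarkerScanner {
    // Hypothetical marker: a trailing "// @bc <offset>" comment. Not a CFR feature.
    private static final Pattern MARKER = Pattern.compile("//\\s*@bc\\s+(\\d+)\\s*$");

    /** Returns bytecode offset -> 1-based source line for every marked line of one method body. */
    static TreeMap<Integer, Integer> scan(List<String> sourceLines) {
        TreeMap<Integer, Integer> table = new TreeMap<>();
        for (int i = 0; i < sourceLines.size(); i++) {
            Matcher m = MARKER.matcher(sourceLines.get(i));
            if (m.find()) {
                // Unmarked lines contribute nothing; they belong to the previous marker.
                table.put(Integer.parseInt(m.group(1)), i + 1);
            }
        }
        return table;
    }
}
```

A rewriting tool could then feed each entry to whatever class-writing library it uses. Handling several methods per file would additionally need the per-method header comment discussed next.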
if considering this solution, another custom parseable comment structure would be needed at the beginning of each method, linking the following bytecode offset comments to a specific bytecode method. this would accomplish two things:
Ljava/lang/String;.<init>()V
to refer to the string constructor would be perfect. in fact, looking at the ASM API to see how signatures are expressed and using that syntax would help other tool authors.
@Lanchon: Thanks for your detailed comment on this. I probably do not have enough background in Java bytecode and decompiling techniques to fully follow your thoughts there. But I guess your bottom line is that this is rather hard to implement within CFR and would be better solved by some post-processing (reformatting) task.
As said, my goal is to use Java debugging features (with Netbeans) to be able to break code execution on particular lines or methods. So far this has worked surprisingly well with most of the non-open Java code I encountered, when using Procyon to decompile. Maybe I was very lucky with the bytecode I tried, but it generated almost perfectly formatted code for debugging. Unfortunately, it fails quite badly on some other things, due to internal bugs. Which brought CFR to my attention, as it generated very good code, too.
Considering the availability of original line number information in the bytecode, it should be more an issue of stretching, rather than combining - unless the code wasn't garbled in the first place. Bottom line: that "debug formatting" may not be suitable for every bytecode. And debugging the decompiled code is meaningless anyway if the original code wasn't formatted well or doesn't contain proper line number information. But where that applies, the decompiler would help tremendously if it reconstructed that formatting, too.
So, I've no clue how this could effectively be tackled within CFR. For me, it's this last, essential feature, which would make CFR the #1 choice.
So. @Lanchon has basically hit the nail on the head here. Long exposition follows, with TL;DR at the end.
There are a few points worth mentioning.
CFR has a bunch of normalisation passes that mutate bytecode graphs, duplicate instructions, deduplicate instructions, and basically make fairly aggressive changes while retaining the semantics.
This is one of the reasons why it actually behaves pretty well against obfuscated code (though I actually don't specifically target any obfuscators), it's more a question of 'ooh I wonder if I could cope with this'.
As such, the lines output often bear little resemblance to the ordering of the lines in the line number metadata table.
This can even come into play when not using aggressive normalisation - consider -
0: iconst_0
1: istore_1
2: iload_1
3: bipush 10
5: if_icmpge 21
8: getstatic #2 // Field java/lang/System.out:Ljava/io/PrintStream;
11: iload_1
12: invokevirtual #3 // Method java/io/PrintStream.println:(I)V
15: iinc 1, 1
18: goto 2
21: return
Was this a for loop? A while loop? (trick question - both a for and while loop generate this, if you format them appropriately).
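For illustration, here are the two source forms in question. With a typical javac, both methods would be expected to compile to a listing like the one above (that is an assumption about the compiler, not something verified here). The class and method names are made up.

```java
final class LoopForms {
    // Instance methods, so the loop counter lands in local slot 1 as in the listing.
    void asFor() {
        for (int i = 0; i < 10; i++) {
            System.out.println(i);
        }
    }

    void asWhile() {
        int i = 0;
        while (i < 10) {
            System.out.println(i);
            i++;
        }
    }

    public static void main(String[] args) {
        new LoopForms().asFor();
        new LoopForms().asWhile();
    }
}
```

The only thing distinguishing the two in a class file would be the line number metadata, which is exactly the part that cannot be trusted.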
At this point, you could say 'hang on, using the line numbers would actually help you tell the difference'. That's true...
If we add a constraint to only allow normalisations to work if they preserve line number ordering, this is going to have to be enforced.
Any normalisation that can't be used will reduce CFR's ability to produce readable code.
But (and this is a pretty big deal for me) - CFR works by performing (sorry haskell folk) multiple graph mutations. This means that if we now add an invariant that line number ordering is preserved (and partial orderings of bytecode are only valid if they preserve this), then I'll need to add a clone, mutate, rewind step, or change the internals to use a non-mutating approach. That's painful (and not cheap).
An output is only useful if it gets tested - realistically considering line numbers adds a whole new dimension to the problem.
I mention this in my FAQ - but I have made a lot of effort to avoid trusting any part of a class file which could conceivably be populated with lies.
That's why I don't use the LocalVariableTypeTable (which could significantly reduce the complexity of CFR's type inference).
LineNumberTable is clearly one of these.
The principle of least surprise is such that people, when given a decompiler that gets line numbers right 90% of the time, will get confused/annoyed by the times when the decompiler doesn't get it spot on.
That x++ in your decompiled code that didn't do anything? WTF? Oh, it takes effect 3 lines later? WHAT?!
I know people use decompilers and debuggers all the time, (heck, I'm a hypocrite), but people should really be careful to not over-rely on these. They're not perfect, and the point at which you're trying to use a debugger is exactly the point at which you want (or need) them to be perfect.
Yep, sorry, but the naive 'hey, I'm done here, now can we squash/stretch to match' really is terrible. That's to say, this cannot be treated as a post processing step, it would need to be integral.
TL;DR.
I think @Lanchon 's suggestion has some legs - provide a way of emitting the line number that the byte code (supposedly) came from, and provide some help with tooling to allow this to be retrofitted - however I suspect that the additional effort involved in using this would put most people off.
thanks for all the info!
provide a way of emitting the line number that the byte code (supposedly) came from
well, i suggested this if it was low hanging fruit: ie, the mapping info was already mostly there in the IR from which you are generating the output. per your description it looks like a concern that cuts through a lot of the work you have already done: class reader, IR(s), and graph transforms.
it sounds like the amount of work outweighs the value provided by the feature, so unless there is something about this task that emotionally draws you in, you should probably leave it out.
FWIW - I think this is something that would have its purpose better served with an IDE specific plugin.
It's a large effort (and as described above, not foolproof, which is my main concern) to try to get a decompiler to match the line numbers perfectly.
However, it's fairly simple to say 'the instruction at bytecode XX is being OUTPUT on line YY'.
An IDE specific decompiler plugin could then load the classfile, but get the line number table from CFR at the same time it gets the decompilation.
If I had to guess, I'd say JetBrains' Idea would be the easiest to demonstrate a proof of concept here.
This would be fairly nice, as it should enable near perfect matching of bytecode/text.
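As a rough sketch of what such a plugin would do with the table: given entries of "bytecode offset XX is output on line YY", resolving the line for an arbitrary program counter is a floor lookup, which mirrors how a LineNumberTable entry applies from its start_pc onwards. The class `OutputLineLookup` and its methods are hypothetical and are not part of CFR or any IDE API.

```java
import java.util.Map;
import java.util.TreeMap;

final class OutputLineLookup {
    // start_pc -> line of the decompiled text, for a single method
    private final TreeMap<Integer, Integer> startPcToLine = new TreeMap<>();

    void add(int startPc, int outputLine) {
        startPcToLine.put(startPc, outputLine);
    }

    /** Line of the entry with the greatest start_pc <= pc, or -1 if there is none. */
    int lineFor(int pc) {
        Map.Entry<Integer, Integer> e = startPcToLine.floorEntry(pc);
        return e == null ? -1 : e.getValue();
    }
}
```

This says nothing about the harder cases, such as one bytecode range contributing to several output lines.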
Well, my particular use case (using the decompiled Java bytecode to trace execution) may be quite special and I understand that it does not go along nicely with the concepts and principles you mentioned above. Unfortunately, I can't avoid the necessity of doing that, so I have to find the best decompiler matching the requirements. Maybe CFR is just too good at code reassembly to retain the original formatting ;).
The more logic CFR applies to parsing that bytecode and building some smart code constructs, the harder it gets to "squeeze" this into its original shape. That fact is probably inevitable, and giving up decompile quality in favor of proper formatting is probably not a wise choice.
The idea of an "IDE plugin" sounds interesting. I've not touched any UI plugin development (particularly in Netbeans) yet.
IDEA is amazing. the only problem with it is: https://youtrack.jetbrains.com/issue/IDEA-225700 :)
It's a large effort [...] to try to get a decompiler to match the line numbers perfectly. However, it's fairly simple to say 'the instruction at bytecode XX is being OUTPUT on line YY'.
well i don't follow if it is simple to provide some output or not.
However, it's fairly simple to say 'the instruction at bytecode XX is being OUTPUT on line YY'.
Is that something, the current code is already providing somehow (by method or lookup map?), or does this need to be implemented first?
No, it will require some non-trivial surgery. It's possible though. I have some concerns about the cost of keeping track of originator bytecode offsets, and quite how it would work when multiple lines of (disjoint) bytecode are combined into a single line.
It's interesting though, and much more realistic than trying to match the original LineNumberTable, so I'll give it a go.
worse for our purposes is when some bytecode impacts more than one source line.
so I'll give it a go
oops... i think i wouldn't. just remember to try to make the best use of your time.
We are currently using a fork of the decompiler IntelliJ uses (Fernflower) to output a line number map file.
Example of the line map file:
package/class
74 75
64 65
69 70
74 75
19 20
22 23
24 24
Our gradle plugin then uses LineNumberRemapper.java to create a jar with the remapped line numbers, which is then run from the IDE.
We did consider using CFR at the time, but this was one of the major sticking points. Having this as an option would be awesome, as Fernflower isn't ideal.
I think remapping the line numbers is a better solution than forcing the source to match the input classes. CFR could have an optional argument to export a remapped class/jar.
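A minimal sketch of reading a file shaped like the example above, under two loudly stated assumptions: a non-numeric line starts a new class record, and each pair is read as "line in the class file, then line in the decompiled source". The thread does not confirm either, and this is not the actual LineNumberRemapper.java. The class `LineMapFile` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class LineMapFile {
    // class name -> (assumed original line -> assumed decompiled line)
    private final Map<String, Map<Integer, Integer>> byClass = new HashMap<>();

    static LineMapFile parse(String text) {
        LineMapFile result = new LineMapFile();
        Map<Integer, Integer> current = null;
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            String[] parts = line.split("\\s+");
            if (Character.isDigit(line.charAt(0))) {
                if (current == null || parts.length < 2) {
                    throw new IllegalArgumentException("unexpected line: " + raw);
                }
                current.put(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
            } else {
                // Assumption: the first token of a header line is the class name.
                current = result.byClass.computeIfAbsent(parts[0], k -> new HashMap<>());
            }
        }
        return result;
    }

    /** Mapped line if one is recorded, otherwise the input line unchanged (a design guess). */
    int remap(String className, int line) {
        Map<Integer, Integer> m = byClass.get(className);
        return m == null ? line : m.getOrDefault(line, line);
    }
}
```

Returning the input line for unmapped entries is only one possible policy; a real remapper might prefer the nearest mapped line instead.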
amazing, there's a gradle task to do the remapping already!
CFR could have an optional argument to export a remapped class/jar.
don't think so because AFAIK CFR doesn't use ASM and its home grown library only reads class files.
but it could output a line map compatible with fernflower :)
could you maybe post a real FF-generated line map file for reference?
Our gradle implementation isn't something to copy; it has a few major flaws that need working out. But it would definitely be possible to use gradle to automate the whole process.
The linemaps we have are quite large (1619KB), I have included a direct download link, as well as a gist.
minecraft-1.15.1-mapped-net.fabricmc.yarn-1.15.1+build.24-v2-sources.lmap
I should make it clear, this is from a fork of FF, not the real thing. I wouldn't worry about trying to make anything compatible with what we have; it's all a bit of a hack.
I should make it clear, this is from a fork of FF
ah ok... :-/
thanks!
I definitely would not intend to generate a new class file or jar - I don't think (this is a matter of opinion, but I have a pretty firm one) that investigating something by altering it is good practice.
It's significantly simpler (conceptually, though goodness knows how IDEA etc will play, but I am somewhat optimistic) to just say 'hey, here's the source text and here's the line number map you WOULD have read from the classfile'.
While I don't advise that anyone use it (I expect to make it available via API, and thence to IDE plugins), it's possible to get reasonable REAL mappings now ;)
C:\code\cfr\target\classes>java org.benf.cfr.reader.Main C:\code\cfr_tests\output\java_8\org\benf\cfr\tests\LambdaTest6.class --trackbytecodeloc true
/*
* Decompiled with CFR 0.151-SNAPSHOT (b266dda).
*/
package org.benf.cfr.tests;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
public class LambdaTest6 {
    static <T, R> List<R> map(Function<T, R> function, List<T> source) {
        ArrayList<R> destiny = new ArrayList<R>();
        for (T item : source) {
            R value = function.apply(item);
            destiny.add(value);
        }
        return destiny;
    }
    public void test() {
        List<String> digits = Arrays.asList("1", "2", "3", "4", "5");
        List<Integer> numbers = LambdaTest6.map(Integer::new, digits);
    }
}
------------------
Line number table:
test()
----------
Line 22 : 0
Line 23 : 33
map(java.util.function.Function<T, R> java.util.List<T> )
----------
Line 13 : 3
Line 14 : 8
Line 15 : 32
Line 16 : 42
Line 18 : 54
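Until the API is available, a tool could in principle scrape the printed table. Below is a minimal sketch that parses text shaped exactly like the dump above, reading each `Line YY : XX` entry as output line YY and bytecode offset XX. That text layout comes from a snapshot build and may well change, so treat the parser as illustrative only; the class `CfrLineTableParser` is hypothetical and not part of CFR.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CfrLineTableParser {
    private static final Pattern ENTRY = Pattern.compile("Line\\s+(\\d+)\\s*:\\s*(\\d+)");

    /** Returns method header text -> (bytecode offset -> output line). */
    static Map<String, TreeMap<Integer, Integer>> parse(String cfrOutput) {
        Map<String, TreeMap<Integer, Integer>> result = new LinkedHashMap<>();
        TreeMap<Integer, Integer> current = null;
        boolean inTable = false;
        for (String raw : cfrOutput.split("\\R")) {
            String line = raw.trim();
            if (!inTable) {
                // Skip the decompiled source until the table marker is seen.
                inTable = line.equals("Line number table:");
                continue;
            }
            if (line.isEmpty() || line.matches("-+")) continue; // separator rows
            Matcher m = ENTRY.matcher(line);
            if (m.matches()) {
                if (current != null) {
                    current.put(Integer.parseInt(m.group(2)), Integer.parseInt(m.group(1)));
                }
            } else {
                // Anything else is treated as a method header such as "test()".
                current = new TreeMap<>();
                result.put(line, current);
            }
        }
        return result;
    }
}
```

Methods whose printed headers are identical would overwrite each other in this sketch, which is one more reason to prefer a proper API over scraping.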
That's great news!
I took a peek at using it via the API and saw it didn't seem quite ready yet, especially with the SinkReturns.LineNumberMapping_DO_NOT_USE class name ;)
What's there of the API seems easy enough; it would just need a way to get the class name from it.
Awesome work, and I'm happy to be a guinea pig to help test anything.
Hm. I'm a little saddened by how IDEA handles decompiled line numbers - unless I read it wrong, it requires you to provide a mapping from the line numbers in the LineNumberTable (which could be missing, or could be lies) and the decompiled line numbers.
This means it's not possible to get IDEA to behave nicely if line table is stripped, and an extra correlation is required to match up before/after lines.
Oh well. Still possible to do SOMETHING (screenshot of internal state of an idea plugin I just threw together)
can be used in a nice plugin to get this sort of thing working. Getting there ;)
Looks good. I have had a go at supporting this in our gradle plugin here: https://github.com/FabricMC/fabric-loom/pull/248/ I just need to figure out why getClassFileMappings doesn't seem to have a source line for all decompiled lines. (I'm probably doing something wrong.) I've run out of time to look into it much for now.
👍 Supported in arthas: https://arthas.aliyun.com/doc/en/jad
But the output seems not to be sorted by line numbers:
[arthas@61148]$ jad java.lang.String '<init>'
ClassLoader:
Location:
public String(byte[] byArray) {
/*556*/ this(byArray, 0, byArray.length);
}
public String(byte[] byArray, int n, int n2) {
/*535*/ String.checkBounds(byArray, n, n2);
/*536*/ this.value = StringCoding.decode(byArray, n, n2);
}
public String(byte[] byArray, Charset charset) {
/*505*/ this(byArray, 0, byArray.length, charset);
}
public String(byte[] byArray, String string) throws UnsupportedEncodingException {
/*481*/ this(byArray, 0, byArray.length, string);
}
public String(byte[] byArray, int n, int n2, Charset charset) {
/*450*/ if (charset == null) {
throw new NullPointerException("charset");
}
/*452*/ String.checkBounds(byArray, n, n2);
/*453*/ this.value = StringCoding.decode(charset, byArray, n, n2);
}
👍 Supported in arthas: https://arthas.aliyun.com/doc/en/jad
@hengyunabc : I borrowed your code for the ecd (Enhanced Class Decompiler Plugin for Eclipse) https://github.com/alibaba/arthas/blob/931ce392fdc6bf675bbc3997917079c9ce3c9cb2/core/src/main/java/com/taobao/arthas/core/util/Decompiler.java#L34
But the output seems not to be sorted by line numbers:
Yes, I guess it's the order in the bytecode. I think an option could be created to sort the members (i.e. inner classes and methods) in the AST before generating the output ? It could be either bytecode order (default) or source code order.
@leibnitz27 : does this make sense to you ? I'd like to avoid to re-parse the output into an AST to re-order and re-output.
+1 having an original-line-number-aligned decompilation output is a must for debugging 3rd-party-libraries (without source code) efficiently.
You may want to check https://github.com/nbauma109/ecd for a realigned output of CFR
@nbauma109 don't get me wrong and maybe I'm a little bit too cautious, but a "1 star" github is not really trustworthy (yet). Why have you created the fork from https://github.com/ecd-plugin/ecd and what is the difference?
How can your eclipse plugin realign the line numbers if CFR does not provide any line numbers in its decompilation output?
CFR provides line number mappings in the latest release. The forked version of ecd produces line numbers as comments based on that mapping, and the existing post-process of ecd realigns line numbers using the parsed line number comments. As mentioned above, a post-process task is not ideal at all, but I find it better than nothing. As for your comment on popularity and trust, I'd like to remind you that the original ecd used to be in the top rankings of the Eclipse marketplace but was a whole spyware machine with privacy-violating contents that other people have done a good job of removing. Pretty ironic.
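To make the idea concrete, here is a minimal stretch-only realignment sketch, assuming each statement line carries a leading `/*N*/` comment with its original line number. This is not ecd's actual algorithm; the class `LineRealigner` and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineRealigner {
    private static final Pattern MARKER = Pattern.compile("^\\s*/\\*(\\d+)\\*/");

    /** Pads with blank lines so that a line marked N lands on output line N where possible. */
    static List<String> realign(List<String> decompiled) {
        List<String> out = new ArrayList<>();
        for (String line : decompiled) {
            Matcher m = MARKER.matcher(line);
            if (m.find()) {
                int target = Integer.parseInt(m.group(1));
                // Stretch only: if the output is already past the target, nothing can be done here.
                while (out.size() < target - 1) {
                    out.add("");
                }
            }
            out.add(line);
        }
        return out;
    }
}
```

Because this only ever stretches, it cannot help when members are emitted out of source order, as in the arthas output above where the constructors appear in descending line order. Those cases would need reordering or line combining first.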
I've commented pretty extensively on this above - line numbers per se don't really make sense - you can't take this output and expect it to line up perfectly with the original code (even with stretching and squashing). The quality will always be poor, with lots of confusing jumps etc. (Yes, I know it's been done reasonably; you can do an adequate job, but you'll never do a great job.)
The sensible thing to do (anyone know someone at jetbrains? :) ) is to decompile, then tweak the line number table to match the new code, such that the bytecode locations the debugger has map to the rebuilt output.
As nbauma109 says - Cfr supports generating a bytecode map (most entities output contain information about where they are in the bytecode), so effectively generates a NEW line number table.
(This maps back to modmus' comment, but it would be so much cleaner to do it as an extra input to the debugger)
One day.......
IntelliJ uses the Fernflower decompiler, which doesn't support line number realignment. However, the debugger is able to point you to the right line in the code, so it is somehow able to process the line mappings provided by Fernflower. If JetBrains has 'opened' the possibility to provide a decompiler implementation, we could possibly debug with any decompiler which provides line number mappings.
I would like to use the decompiled java code for debugging purposes (e.g. tracing code execution in Netbeans), which basically requires two things:
Procyon does actually a pretty good job producing such output, but unfortunately fails in other areas. Thus, I would love to see this being possible in CFR.
@leibnitz27: