Enumerator FSM not recognized when preceeded by early-return code.

tamlin-mike commented 2 years ago

I have encountered a method using an idiom I have not seen previously, that prevents the identification of the FSM and therefore decompilation. Chances are the early-return is generated code (injected IL?).

If we take a method

IEnumerator foo(args) {
    if (condition) yield break;
    yield return smth;
    yield return smth_else;
}

it is commonly transformed into

IEnumerator foo(args) {
    return new Compiler_generated_FSM { /*captures*/ }
}

with the condition check baked into the FSM MoveNext.

I have encountered a method where the early-return condition test is not in the FSM, but at the head of foo() itself, making ILSpy not recognize that it is an FSM, and therefore decompilation fails. I.e. the assembly contains

IEnumerator foo(args) {
    if (condition) { /* logging */ yield break; /* actually returns default(IEnumerator) */ }
    return new Compiler_generated_FSM { /*captures*/ }
}

If manually plowing through the FSM MoveNext, recreate compilable C# code, and finally compile it, we find that the condition check+code indeed ends up at case 0 in the FSM, as we would expect.

The fact that the condition test in the original assembly is directly in foo() itself, in combination with the code it executes (verbose logging), makes me suspect this part could be machine-generated.

Still, since everything else is identical to the pattern ILSpy do recognize and properly decompile, would it be possible to amend the FSM-recognition to handle this scenario?

To provide details, the only differences are:

Original assembly have the following IL "header" injected.

IL_0000: call check_condition
     brtrue IL_CONTINUE

     /* logging_code */
     ldloca.s 0
     initobj [mscorlib]System.Collections.IEnumerator
     ldloc 0
     ret
IL_CONTINUE:
/* FSM construction, identical to re-compiled code that ends up at IL_0000 */

That early-return code is in foo() itself, instead of in the FSM for case 0.

siegfriedpammer commented 2 years ago

What should the C# code produced by ILSpy look like? How can we ensure semantic equivalence? Note that all code prior to state-machine creation is executed immediately, while code inserted inside the state-machine body might not...

dgrunwald commented 2 years ago

If we want to support this (which I'm not sure we do), ILSpy would have to generate a pair of methods, e.g.:

IEnumerator foo(args) {
    if (condition) { /* logging */ return null; }
    return foo_fsm(args);

    IEnumerator foo_fsm(args) { /* body of MoveNext */ }
}

tamlin-mike commented 2 years ago

It seems to me that semantical equivalence would be preserved by decompiling it as-if the "header" code was at FSM case 0. From my findings the only difference is that the generated code skips the allocation+init of the FSM, if it bails out early from foo().

In other words, correct decompilation could be produced (in this scenario) by simply cut+paste the machine-generated IL part from foo into the FSM's MoveNext case 0 (*).

Daniel, does that alleviate your concern? I see no need to generate a synthetic extra method.

(*) This is identical to the actual generated IL, after manually decompiling it, turning it into compilable C#, and finally compiling it.

dgrunwald commented 2 years ago

The extra method is necessary to avoid changing the semantics if some code calls the outer method without actually iterating over the resulting enumerable. FSM case 0 only gets to run on the first MoveNext call, the code in the outer method already gets to run before iterating.

This difference is even more noticeable with IEnumerable instead of IEnumerator: it's possible to trigger FSM case 0 multiple times by starting multiple iterations over the enumerable, despite only calling the outer method a single time.

tamlin-mike commented 2 years ago

FSM case 0 only gets to run on the first MoveNext call, the code in the outer method already gets to run before iterating.

Ah, right you are! Damn.

While the code I encountered is simply generated code for "Are we active yet? If not, bail out early" there is indeed the behaviour difference you described.

Still, it bugs me that ILSpy failed to handle it, especially when we know it otherwise decompiles the FSM perfectly fine.

But... since it currently totally fails, even a slight semantic difference in decompiled code could be preferred, if it generated otherwise valid code, with the addition of a warning as a comment.

Should this be considered, I believe a reasonably straight-forward matching for this case could be: Are the last instructions in the method the FSM creation call+return (whereas currently, I suspect, it's matching on first instructions).

Optionally, and this is an idea I've had for quite some time anyway (read: years), would be to introduce a new (interactive-only) decompilation mode for generated FSM's - decompile MoveNext as-if it's the source sequence of statements + yield's from the "actual" call-site. Sure, it would reference the FSM's own captured copies of variables, but at least it could produce otherwise valid code and would help immensely for cases where this transformation is not handled (e.g. this current issue).

Speaking from experience, this is something that could have helped reduce a lot of painful, tedious and error-prone manual transformation of FSM MoveNext, especially ones with loops, into valid code.

Could that perhaps be a "better" way forward? It could keep the current code-paths in ILSpy unchanged, while adding a feature that has anyway been wanted for quite some time.

Comments?

dgrunwald commented 2 years ago

What's the problem with not decompiling the FSM? ILSpy should detect that the compiler-generated class is still necessary and output it as well (while escaping invalid characters in the class name), so you should still get valid code without FSM decompilation.

tamlin-mike commented 2 years ago

What's the problem with not decompiling the FSM?

8-O

I never thought I'd see an author of ILSpy ever ask "What's the problem of not converting a low-level hard to understand construct into its more readable, understandable and modifiable normalized form?". :-)

I think the joke is too subtle for me to fully appreciate, or even recognize at all.

As I see it the main problems are

The FSM's are not intended for human consumption. Some might even say they are quite unreadable.
Manually converting such FSM's into functioning code can be quite complex, but worse it's error-prone. Consider a scenario with two nested loops that both yield, N captured variables where some refers to more compiler-generated code such as lambda's. It can quickly become quite hairy. Recurse such constructs one or more levels, and the complexity can grow exponentially. This is a scenario where ILSpy can be a boon, untangling such mess into understandable code.
ILSpy can already decompile such FSM's into fully readable and functional C# source. It's just that sometimes, due to e.g. surrounding (machine-generated) IL it fails.

For those cases (that unfortunately are not so rare) where it fails to recognize the idiom at the call site (e.g. this issue) it would be good to have the mentioned ability to view the FSM's MoveNext as-if it was at the call-site, as a fallback. To get the proper code flow, including loops, lambda's and so on, expressed in the more natural form the tool normally is already capable of producing.

christophwille commented 2 years ago

There was no joke at all.

siegfriedpammer commented 2 years ago

To expand on Daniel's and Christoph's response: Yes, ILSpy is able to reconstruct such FSMs, however, we only officially support FSMs that are generated by compilers (more specifically the Mono C# compiler, the Microsoft C# compiler and the Roslyn C# compiler in various versions). There have been some contributions to extend that support to include some non-standard patterns and/or support for patterns used in other languages/compilers, including VB and F#. However, our official stance remains that we don't want to support all sorts of deobfuscation techniques and IL weaving/rewriting.

We have been working hard to make sure that the decompiler correctly interprets semantics, therefore we want to only apply a transformation, if there is no semantic difference. As Daniel already mentioned, in order to compensate for the "weakness" of not being able to recognize every pattern, we have added the ability to include the compiler-generated code in the output, if there is user-code that references compiler-generated members.

Of course, we are generally open to suggestions and contributions, however, in this case, we have not yet reached a definitive conclusion, whether we want such a feature in ILSpy. From my point of view there are bigger fish to fry and unfortunately, my time is currently very limited.

dgrunwald commented 2 years ago

@tamlin-mike I took your "If manually ..., recreate compilable C# code" as an indication that there's an additional problem which prevents the generated FSM code decompiled by ILSpy from being re-compilable. If that's not the case, then @siegfriedpammer's explanation applies.

icsharpcode / ILSpy

Enumerator FSM not recognized when preceeded by early-return code. #2848