open-goal / jak-project

Reviving the language that brought us the Jak & Daxter Series
https://opengoal.dev
ISC License
2.81k stars 173 forks source link

[decompiler] Better Type Analysis #569

Closed water111 closed 2 years ago

water111 commented 3 years ago

This issue is to collect ideas for improving the type analysis pass. IMO it's the worst part of the decompiler and errors here cause the decompilation to completely fail rather than just looking messy.

The current type analysis pass has a few issues:

The current implementation isn't that big: about 250 lines for the algorithm, ~1000 lines for the implementation for all of the atoms, and ~500 lines for the type system reverse lookups. Doing a significant rewrite is reasonable.

Idea 1: More expressive TP_Type

The type propagation pass doesn't use TypeSpec directly, but instead TP_Type, which is like a TypeSpec plus some extra information. For example, it can represent <int * 4> or <the type res-lump>.

It should be possible to nest some parts of this type. For example you would want something like <int * 3 + int * 20 + int * 190 + 16> in order to be able to figure out some nested inline access. Maybe you could just nest these infinitely?

Idea 2: "Multi-Type" and checking stores/calls

In cases where it's possible to follow different paths of dereferencing, we could build a tree of possible deref decisions and their associated type. We would then modify the type analysis pass to add these extra steps:

To pick the best:

To handle all cases, these trees might need to show up in every use of the variable:

(func var)
(func (-> var 0))

both have no explicit add/load, but have multiple possible cases where either could be right. I believe we can only build this type of tree if there is a store or call that fails a typecheck, and avoid it in all other cases.

The chosen paths could then be passed on to the expression pass.

It might be quite tricky to make this work for a variable that's used like this:

(let ((var (something)))
  (dotimes (i 10)
    (set! var (something-that-depends-on-the-type-of-var))
    )
  ;; after the loop
  (some-func (-> var field2))

where there a multiple choices for (something-that-depends-on-the-type-of-var) that work, but only one works for (some-func (-> var field2)) after the loop. But I'm having a really hard time coming up with an example of code like this that's not annoying on purpose. In most cases there is immediate feedback if you have the right type or not before you have to do any least-common-ancestor. I don't think this is a problem because you can just handle these cases with a deref path hint, or maybe we could come up with a way to do LCA with trees as well - try all pairs and pick the one with the largest depth in the type tree (most specific, and most likely to keep a loop variable a consistent type).

All of this can be seen as "greedy" and trying to get as far as possible until there's a merge/phi node like thing, which is definitely not going to give you the perfect answer in every case, but solving this problem in general is really hard:

The implementation of this would be tricky - these trees might get big and correctly updating previous trees when pruning may be hard. With a good design of the tree, it should be possible.

Idea 3: Handle a few cases that are common problems:

Idea 4: Better Errors

When type propagation fails, it should spit out detailed error data. This could be used to generate good error messages:

Register a1 contains type string from (-> <res-lump in a3> data blah) which does not have a field (4-byte unsigned) at offset 32.

Or it could be used to search some database of info collected from PCSX2 and suggest a different type, or where to set a breakpoint to capture this state in the game for more analysis.

It could also have a mode for "ignore errors" where it just tries as hard as possible to succeed, even if it generates very ugly code. This wouldn't be the default, but might be useful for debugging.

water111 commented 3 years ago

Some more features that came up:

water111 commented 3 years ago

Implementation plan:

water111 commented 2 years ago

the multi-type stuff was attempted, but didn't work out well. The new type pass with "tags" accomplishes most of this stuff and remaining problems are captured in more recent issues.