Closed — Mm7 closed this 5 years ago
@Mm7 do you have a telegram? We have a dedicated radeco channel - would be awesome if you join it.
I'd prefer to keep my Telegram personal, but if this is the only way to coordinate I'm fine with joining.
You can join the #radeco IRC channel on Freenode if you like.
@chinmaydd have a look please too.
Hi @Mm7, this is great work! I have some comments which I would like to hear your thoughts on:
* I feel that the `AnalyzerResult` trait should specify more requirements for an implementation. This could include mechanisms to verify/collect changes that were made during the analysis passes.
* Some analysis passes might inherently depend on each other for better/optimal results. I would like it if you could incorporate a way to take such information into account, especially for the `run_all_analysis()` call. For example, it might be beneficial to run DCE after a high-level analysis pass (such as CSE) to take advantage of more optimizations. (I could be wrong here, in which case ignore it.)

> I feel that the `AnalyzerResult` trait should specify more requirements for an implementation. This could include mechanisms to verify/collect changes that were made during the analysis passes.

To do that, I think the best approach is splitting the job into two different passes:
1. `analyze`, to get a list of `Change`s that can be applied to the IR (which is left untouched). `analyze` could also be made to return some extra info.
2. The `Change`s are inspected and applied as desired (by the user).

> Some analysis passes might inherently depend on each other for better/optimal results. I would like it if you could incorporate a way to take such information into account, especially for the `run_all_analysis()` call.

To do that you need analysis-specific knowledge (for example, you need to know that DCE should be run after CSE). `run_all_analysis` was a "dumb" way to run all the analyses without knowing which ones are available and when they should be called. We should probably remove that function.
In my opinion, the user (aka the one that calls the analyses) should be implemented somewhere else (like in a new module). There we can take care of calling the most appropriate passes, applying the useful `Change`s and dropping the useless ones.
`Analyzer` should only provide a uniform interface to all the possible analyzer passes.
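The split described above could be sketched roughly as follows (all names are hypothetical; a plain `Vec` stands in for radeco's SSA graph, and `DeadCodeElim` is an invented toy pass, not radeco's DCE):

```rust
// Hypothetical sketch of the two-pass design: `analyze` inspects the IR
// without mutating it and returns a list of `Change`s; the caller then
// decides which changes to apply. Names are illustrative, not radeco's API.
#[derive(Debug, Clone, PartialEq)]
enum Change {
    RemoveNode(usize),        // drop the node with this id
    ReplaceValue(usize, i64), // overwrite a node's constant value
}

// Toy stand-in for the IR: node id -> (value, is_used)
type Ir = Vec<(i64, bool)>;

trait Analyzer {
    /// Pass 1: read-only analysis producing proposed changes.
    fn analyze(&self, ir: &Ir) -> Vec<Change>;
}

struct DeadCodeElim;

impl Analyzer for DeadCodeElim {
    fn analyze(&self, ir: &Ir) -> Vec<Change> {
        ir.iter()
            .enumerate()
            .filter(|&(_, &(_, used))| !used)
            .map(|(id, _)| Change::RemoveNode(id))
            .collect()
    }
}

// Pass 2: the *user* applies (or drops) changes; the analyzer never mutates.
fn apply(ir: &mut Ir, change: &Change) {
    match *change {
        Change::RemoveNode(id) => {
            ir.remove(id);
        }
        Change::ReplaceValue(id, v) => ir[id].0 = v,
    }
}

fn main() {
    let ir: Ir = vec![(10, false), (20, true)];
    let changes = DeadCodeElim.analyze(&ir);
    assert_eq!(changes, vec![Change::RemoveNode(0)]);
    let mut ir2 = ir.clone();
    for c in &changes {
        apply(&mut ir2, c);
    }
    assert_eq!(ir2, vec![(20, true)]);
}
```

The point of the sketch is only the shape of the interface: the analysis pass borrows the IR immutably, so the type system already guarantees it cannot mutate the graph.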
@Mm7 I agree on also preserving the original IR tree before the analysis pass. This way we can run two non-dependent analysis passes in parallel, and maybe compare the results and apply only one, or merge them into one.
I'm trying to implement a graph-preserving version of `copy_propagation`, but I'm facing a problem: as soon as you start collecting changes to apply, you have no way to analyze the graph as it would look after applying all the previous changes; you can only see it in its original form!
For example, imagine that DCE is called on the following IR. The `changes` vector (aka the list of changes to return to the user) is empty.
```
%1 = 10
%2 = 20
%3 = 30
return %2
```
The analyzer will notice that `%1` is unused, so a `removeNode(%1)` is pushed to the `changes` vec. Let's suppose that, when executed, `removeNode` not only removes `%1` but also renames `%2 -> %1` and `%3 -> %2` (in order to keep ids small). The analyzer is oblivious to this because the changes are applied by the user after the execution of the `analyze` func. It will then notice that `%3` is unused, so again a `removeNode(%3)` is pushed.
When the user tries to apply the changes, this leads to wrong results (return of `%3` instead of `%2`, removal of a non-existing `%3`).
You may argue that my example is useless because `removeNode` does not rename stuff, and you would probably be correct. But still, in the case of complex changes (not just removing nodes), you could get some nasty side effects messing up the graph. Keeping track of all of this would be a nightmare, IMHO.
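The staleness problem is easy to reproduce with a plain `Vec` standing in for the graph (a contrived sketch, not radeco code): removals recorded against the original indices hit the wrong elements once earlier removals shift the layout, which is exactly the id-shifting effect described above.

```rust
fn main() {
    // Stand-in "IR": values at indices 0..4; suppose the analyzer decides
    // that the values at indices 0 and 2 (10 and 30) are dead, and records
    // both removals against the ORIGINAL layout.
    let mut ir = vec![10, 20, 30, 40];
    let changes = vec![0usize, 2]; // remove index 0, then index 2

    for idx in &changes {
        ir.remove(*idx); // Vec::remove shifts all later elements left
    }

    // The second removal hit the element that *used to be* at index 3:
    // we deleted 10 and 40, not 10 and 30.
    assert_eq!(ir, vec![20, 30]);
}
```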
Here are a few solutions I have in mind. For instance: if `removeNode` breaks the property `nodeIdsNotTouched`, then the analyzer will not be allowed to run IR accessors that would change their result in case of a change in node ids (Rust's borrow checker could be exploited to enforce these rules).
@XVilka Maybe we could also add an indirection layer to access the graph: that layer should provide normal access while we are not inside an analysis, but during an analysis it should collect all the changes applied to the graph; when an accessor is called, it should dynamically patch the result according to the previously collected changes. This way `Change`s could still be applied during the analysis pass without touching the graph.
This idea should be simpler to implement than 2 because it sits at a "lower" level (aka the graph-storage level rather than the SSA-accessor level).
The access mechanism would look like:
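Roughly, such a layer records pending changes and patches accessor results against them (a minimal sketch with invented names; a deleted-node set stands in for the full change list):

```rust
use std::collections::HashSet;

// Sketch of an indirection layer: during analysis, mutations are recorded
// as pending changes instead of touching the underlying graph, and
// accessors patch their answers against those pending changes.
// All names here are invented, not radeco's actual types.
struct GraphView<'a> {
    nodes: &'a Vec<i64>,     // untouched underlying storage
    removed: HashSet<usize>, // pending "remove node" changes
}

impl<'a> GraphView<'a> {
    fn new(nodes: &'a Vec<i64>) -> Self {
        GraphView { nodes, removed: HashSet::new() }
    }

    // Recorded, not applied: the real graph is left alone.
    fn remove_node(&mut self, id: usize) {
        self.removed.insert(id);
    }

    // The accessor patches its result with the pending changes, so the
    // analyzer sees the graph "as if" earlier changes were applied.
    fn node(&self, id: usize) -> Option<i64> {
        if self.removed.contains(&id) {
            None
        } else {
            self.nodes.get(id).copied()
        }
    }
}

fn main() {
    let nodes = vec![10, 20, 30];
    let mut view = GraphView::new(&nodes);
    view.remove_node(1);
    assert_eq!(view.node(1), None); // the view reflects the pending change...
    assert_eq!(nodes[1], 20);       // ...while the real graph is untouched
}
```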
While 1 and 2 look saner it will be a pain to run it on any sufficiently large programs indeed. So looks like 3 is the only viable choice. @kriw @chinmaydd @radare what do you think?
I think the combination of 1 and 3 is also viable. We can reduce the number of clones with 3 and possibly a few clones are acceptable.
Sorry for my delay in response.
> `run_all_analysis` was a "dumb" way to run all the analyses without knowing which ones are available and when they should be called. Probably we should remove that function.
Agree. I do not see any consumer interested in calling `run_all_analysis()` since it does not really provide any control over order and choice.
> Maybe we could also add an indirection layer to access the graph
The design looks good, but I agree with @kriw. How much of a performance hit due to cloning are we talking about here? Don't think it should be too much.
@XVilka @chinmaydd @kriw I just uploaded a new version.
* Added an `Action` enum to store "atomic" changes to the IR. For now only `replaceValue` is implemented. ASAP I'll add one variant for each one of the `SSAMod` accessors.
* For now it is implemented the `clone` way.

@Mm7 ping?
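The `Action` enum Mm7 describes might be sketched roughly like this (the `NodeId` alias, the extra variants, and the `apply` helper are invented placeholders, not radeco's actual `SSAMod` API):

```rust
// Sketch of an `Action`-style enum of atomic IR changes; only a
// ReplaceValue-like variant is shown working here, mirroring the
// "one variant per SSAMod accessor" plan. Names are illustrative.
type NodeId = usize;

#[derive(Debug, PartialEq)]
enum Action {
    ReplaceValue { node: NodeId, value: u64 },
    // ...one variant per mutating accessor would follow, e.g.:
    // RemoveNode(NodeId),
    // ReplaceOperand { node: NodeId, index: usize, with: NodeId },
}

// Applying an action mutates a toy value table standing in for the IR.
fn apply(values: &mut Vec<u64>, action: &Action) {
    match *action {
        Action::ReplaceValue { node, value } => values[node] = value,
    }
}

fn main() {
    let mut values = vec![1, 2, 3];
    apply(&mut values, &Action::ReplaceValue { node: 1, value: 42 });
    assert_eq!(values, vec![1, 42, 3]);
}
```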
Sorry for the delay. Currently I'm studying hard for my exams, but I'm still working on this.
@Mm7 ah, ok then. Good luck with your exams!
@XVilka @kriw @chinmaydd After some local work on this patch it seems that things will get messy when dropping `Change`s.
Suppose an analyzer produces the changes X, Y, Z. Due to the way they are created, you must first apply X, then Y, then Z. You cannot skip Y and apply Z directly, because the analyzer generated Z while analyzing the IR as it looks after you apply Y.
For example, let's apply DCE on the following:
```
%1 = 5
%2 = %1
%3 = %2
return 100
```
A possible output could be: `changes = ['remove %3', 'remove %2', 'remove %1']`. You can't apply `changes[2]` without first applying `changes[0]` and `changes[1]` (`%2` would be referencing the non-existing `%1`).
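The ordering constraint can be seen with a toy copy-chain (a contrived sketch, not radeco code): applying only the last recorded removal leaves a dangling reference in the IR.

```rust
// Toy IR: each entry is (name, operand it copies from, if any),
// mirroring the %1 = 5; %2 = %1; %3 = %2 example above.
fn main() {
    let ir = vec![("%1", None), ("%2", Some("%1")), ("%3", Some("%2"))];
    // DCE records removals in dependency order: %3 first, %1 last.
    let changes = ["%3", "%2", "%1"];

    // Applying ONLY the last change (skip removing %3 and %2):
    let partial: Vec<_> = ir.iter().filter(|&&(n, _)| n != changes[2]).collect();

    // %2 now copies from %1, which no longer exists -> dangling reference.
    let dangling = partial.iter().any(|&&(_, src)| {
        src.map_or(false, |s| !partial.iter().any(|&&(n, _)| n == s))
    });
    assert!(dangling);
}
```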
In general, I don't really see a solution unless some mechanism is added to sort of "cherry-pick" (let me borrow this term from the git world :smile: ) a change and apply it to a different IR. But honestly I don't think this is a good idea because 1) that mechanism would have to be analyzer-specific, 2) debugging won't be easy in case of errors during the cherry-pick procedure, and 3) the implementation will probably be complex.
If you have better ideas let me know: I would be happy to implement them.
On the other hand, we could drop this analyze-without-touching thing and simply let the analyzers do their job. I agree that collecting/applying changes on the consumer side is nice, but IMHO it is not worth it.
@Mm7 well, if it is too hard to implement, then it is probably the wrong solution. Let's proceed with mutable transformations then.
@kriw @chinmaydd please take a look again at the proposed API.
I think this is a good starting point for a standard API. LGTM :)
Yep, seems good @Mm7 :-)
OK, thanks for your feedback. Next week my exams will be over, so I'll have more time to finish this.
@XVilka @chinmaydd @kriw IMHO this is a good point to merge this PR. Please, can you take a look? Thank you. Note: https://github.com/radareorg/radeco/pull/66 updates radeco to use this API.
Looks good. With this API the code is much clearer, good job!
LGTM :)
Initial attempt to solve #172. If you like this proposal I'll add support for more analyses. Suggestions and comments are welcome! Thank you.