google / bindiff

Quickly find differences and similarities in disassembled code
https://zynamics.com/bindiff.html
Apache License 2.0
2.16k stars 131 forks source link

Trying to add more filtering in IDA; Results seem counterintuivite #47

Open po0p opened 1 month ago

po0p commented 1 month ago

In advance: pardon my ignorance and noobishness.

Is your feature request related to a problem? Please describe. Bindiff takes a lot of time to process binaries with large number of functions (currently, working with bins with 150k+ functions). Filtering by address might not always help.

Also, consider my use case: I am trying to preserve my work between patches in the program; alas, i am not really interested in what exactly has changed between patches; I have names 20 functions that i'm interested in; and my goal in using bindiff (for the lack of more lightweight tool) is to identify those 20 functions in the new, patched binary: see if they are present, offsets have moved, etc etc.

Describe the solution you'd like More filtering to be added to diff database filtered option of ida plugin.

Describe alternatives you've considered I have tried to implement more filtering myself, and it almost works, but a little bit confusing with results. differ.cc->SetupGraphsFromProto:

if (flow_graph_infos) {
      const auto address = flow_graph->GetEntryPointAddress();
      auto& info = (*flow_graph_infos)[address];
      info.address = address;
      auto name = flow_graph->GetName();
      if (skip_sub_funcs && starts_with(name, "sub_")) {
        //this is not a named function!
          flow_graph_infos->erase(address);
          flow_graph.reset();
          skipped++;
          continue;

      }

skip sub func parameter is passed from a gui checkbox added to diff database filtered dialog. Reasoning: i have named 20 out of 150k functions, which means most of the functions i dont care about still have original ida naming (which is sub_address), so i try to skip those in the secondary (pre-patch) binary to not do unnecessary comparisons. Adding this matches all 20 target functions correctly, while taking in total 4.5 minutes instead of 1.5 hours as usual.

What is confusing:

  1. all those sub_ functions are still in bindiff results (included into secondary unmatched window mostly)
  2. trying to add that same exact code into FilterFunctions (main_plugin.cc) gives 0 performance gain. (I havent tested for full 1.5 hours, but after 10 minutes of seeing "Performing Diff" window, i got tired of waiting):
    if (entry_point < start || entry_point > end) {...}
    else if (skip_sub)
    {
    if (starts_with(flow_graph->GetName(), "sub_")
    {
        flow_graph_infos->erase(entry_point);
        delete flow_graph;
        flow_graphs->erase(it++);
    }
    }
    else
    {
    ++it;
    }

(I assumed that FilterFunctions would be more appropriate place to add the filtering).

Additional context It would be nice to add some progress indication to performing diff wait_box, or maybe even a progress bar. In 1.5hr waiting timespan I start to question my sanity and think if its even doing anything (considering I also see only 2 running threads in IDA process at the time and taking 0.01% cpu i think it doesnt; but after 1.5hr automagically result always appears :D)

cblichmann commented 1 month ago

Hi there! Thank you for this comprehensive write-up.

Couple of points: You rightly point out that to omit default named sub_-functions from IDA, you'd want to exclude them from the secondary (pre-patch). However, in the primary, you'll want to keep them, otherwise BinDiff will not be able to perform any matches on default names. So skip_sub/skip_sub_funcs in your code snippets above should only be true for the primary.

The sub_-functions being listed in "secondary unmatched" is because they are, in fact not being matched. So it's not wrong, but I agree this is maybe unexpected.

re 2): I agree that putting the filtering code inside of FilterFunctions() should be the correct place. I don't have the time to debug this right now, though.

It would be nice to add some progress indication to performing diff wait_box, or maybe even a progress bar. In 1.5hr waiting timespan I start to question my sanity and think if its even doing anything (considering I also see only 2 running threads in IDA process at the time and taking 0.01% cpu i think it doesnt; but after 1.5hr automagically result always appears :D)

Hmm, yes. Ideally, you'd be able to just continue working in IDA while BinDiff does it's thing in the background.\ We decided against that way back 15-ish years ago, as back then it was kinda risky to work with the primary database and change its state - also IDA would sometimes crash resulting in both lost diffs and lost changes in the IDB. Also, running a CPU-heavy task back then meant everything else would become sluggish and unreliable (at least on Windows).\ I think you're right, it's time to revisit that decision.

As for providing a status window: We don't know up front how long it will take, but we can at least provide an indeterminate status window that at least allows the IDA main UI thread to update so users know it's still active.

And finally, for the two processes with only one being active, this is due to BinDiff being inherently single-threaded. It's not clear how one would parallelize the core BinDiff algorithm as the various steps refine the results and depend on each other.

po0p commented 1 month ago

Hi, thanks for reading!

However, in the primary, you'll want to keep them, otherwise BinDiff will not be able to perform any matches on default names. So skip_sub/skip_sub_funcs in your code snippets above should only be true for the primary.

I think you mistyped; it is backwards, however, that is what i am already doing, e.g. skip_sub boolean from gui is passed only to Read(secondary db) (If i am not wrong and primary is current opened database in ida (new patch); secondary is one that is supposed to be closed (pre-patch)). Idea is to just exclude functions that i dont care about from OLD database; since i assume internally bindiff algorithm does kinda something like 2 foreach loops, this reduces number of iterations a lot e.g.

foreach(fun in primaryfuncs) 
    (foreach(funold in oldfuncs) 
        {findmatch...}

Also might be a nice idea to allow to filter out nullsubs/empty functions, but i dont think matching them takes any significant amount of time.

The sub_-functions being listed in "secondary unmatched" is because they are, in fact not being matched. So it's not wrong, but I agree this is maybe unexpected.

Its fine tbh; i dont care about them being there as long as i find my target ones.

re 2): I agree that putting the filtering code inside of FilterFunctions() should be the correct place. I don't have the time to debug this right now, though.

I will test it more, it seems counter intuitive that just a simple 150k string prefix comparisons take a significant amount of time, maybe i make a logic mistake somewhere; because Read is basically the same iteration count, and there it takes almost no time.

About the progress bar: I dont mind it being blocking; my request is to just indicate that something is going on. Just because, you know, seeing a hanged (because it hangs if you click it) dialog window in a process with 2 running threads taking 0.1% cpu does not really indicate that something is going on :) Here is what i added while tinkering with it:

progress_test += absl::StrCat("\nExported primary in",
                              HumanReadableDuration(diff_timer.elapsed()));
wait_box.ReplaceText(progress_test);
diff_timer.restart();
....
progress_test += absl::StrCat("\nExported secondary in",
                              HumanReadableDuration(diff_timer.elapsed()));
wait_box.ReplaceText(progress_test);
diff_timer.restart();
....
progress_test += absl::StrCat("\nTwo read and context ctor in",
                              HumanReadableDuration(diff_timer.elapsed()));
wait_box.ReplaceText(progress_test);
diff_timer.restart();
....
progress_test += absl::StrCat("\nFilter funcs (source) in",
                              HumanReadableDuration(diff_timer.elapsed()));
wait_box.ReplaceText(progress_test);
diff_timer.restart();
....

For Diff(...) function - maybe have an std::atomic counter inside for loop and periodically update wait box text with e.g. "processed x out of y matching steps" in a thread?

Also, see https://github.com/nihilus/IDA_WaitBoxEx - idk if its up to date, but seems like a nice thing to test!