joxeankoret / diaphora

Diaphora, the most advanced Free and Open Source program diffing tool.
http://diaphora.re
GNU Affero General Public License v3.0
3.6k stars 371 forks source link

Missed results in Diaphora #168

Closed alvarofe closed 4 years ago

alvarofe commented 5 years ago

The target that I usually RE updates regularly and I use Diaphora to import functions names and such or study difference between two different releases - basically like everyone. However, I always noticed that some functions were quite obvious they were the same but Diaphora missed somewhow and I lost quite a time importing everything. Basically, although the function might have suffered some changes they contained the same debug strings such as.

if (debugEnabled())
{
   log(__FILE__, __LINE__, "Error message at function blabla");
}

Therefore, I was expecting to see them on "Partial matches" or "Best matches" with the description "Same constants" - after all the function in main and diff contain exactly the same constants.

In contrast, what I was seeing was that the function was matching by "Some rare constant" against another function because this one contained this.

if (debugEnabled())
{
   log(__FILE__, __LINE__, "other debug message");
}

The same constant for both functions were __FILE__. But it did not take into account that there is another function that contains the same constants and even with better ratio 0.6 vs 0.34 in my context.

The issue resides in add_matches_from_query_ratio_max - although it would apply to any add_matches_from_*. The logic of these functions is to run SQL queries (in a thread) and calculate the ratio and save the result. To avoid repeat the same function over and over, adding multiple items, it uses the sets matched1, matched2. Therefore, the query "Some rare constant" might run before that "Same constant" adding a match but on the other hand it would forbid that way a better match from "Same constant".

In conclusion on my target the function blabla was matched against lala with a ratio of 0.34 using "Same rare constant", but using "Same constant" it finds a match of blabla against jijij with a ratio of 0.6 but due to blabla was found in matched1 it was discarded.

Is this a known issue or accepted compromise? It would require to cache the results and pick up the best ratio but that will slow down everything. Likewise, it might help to prioritize threads and assign each heuristic to a queue that should be run before others since "Same constant" is better than "Same rare constant".

joxeankoret commented 5 years ago

Diaphora refuses duplicate results; instead, it assumes the first one is the best one. Usually, this model works "good enough", as the heuristics are launched from best to worst. However, it's probably time to change this behaviour to choose the best results. I agree.

Also, please could you share some binaries to test this specific issue? It's extremely hard to reproduce otherwise.

joxeankoret commented 4 years ago

The fix for this bug is going to be published "soon". The fix is to, simply, move "Same constant" to the "Best" category instead of "Partial". However, I believe it's more a workaround for you specific problem than a real fix.

joxeankoret commented 4 years ago

Fixed in version 2.0.