joxeankoret / diaphora

Diaphora, the most advanced Free and Open Source program diffing tool.
http://diaphora.re
GNU Affero General Public License v3.0
3.58k stars 371 forks source link

Callgraph Algorithm issue #83

Open michaelii opened 6 years ago

michaelii commented 6 years ago

Hi I am Michael

I am the new student at this tool Diaphora.

I want to realize what algorithm of callgraph you use in this program.

There are two values callgraph_primes callgraph_all_primes

I have been check the code .I want to realize what's main idea of this two data.

Why these two data can proof the Binary Difference of Callgraph?

If you want me support more detail. Please contact me.

Thanks

University of Texas at Arlington Michael Ho

joxeankoret commented 6 years ago

Hi Michael,

The call graph's primes are calculated in the following way: the n-th prime number corresponding to the cyclomatic complexity of the flow graph of each function is calculated and, then, all generated primes are multiplied together (small-primes-product, SPP for short). The mathematical formula if you want is this.

This is an attempt to generate a fuzzy graph hash that can match quickly (just a comparison of a hash instead of applying graph matching algorithms, which are very costly) if 2 call graphs "look like" structurally equal. In big enough programs, the odds of the cyclomatic complexities of 2 different programs being the same are extremely low.

Can the generated call graph hash proof that binaries are equal? No. But the collision possibilities are so low that false positives are negligible.

Then, there is another part: callgraph_primes is the final multiplication of all the calculated primes(the SPP). What is callgraph_all_primes? A dictionary with each prime number and the number of times each prime number appears in the binary. Why is this data saved? Let's say that the SPP of the cyclomatic complexities are different for 2 binaries. It doesn't mean that they are totally different. How can I quickly compare how many functions are different? Easy, just perform the intersection of both groups and check how many functions are different.

I hope my explanation helps. If anything is yet unclear, tell me!

michaelii commented 6 years ago

Hi Appreciate your help. It's very clear for me to understand this idea.

I have further question of cyclomatic complexity and call graph signature.

For example Graph A: Edge = 3 Node = 4 --> CC= 3-4+2 =1 Graph B: Edge = 4 Node = 5 --> CC= 4-5+2 =1

We can easily draw this two graph by hand written. Obviously, this two callgraph is different. However, they share the same CC value.

How can we distinguish in this situation? Briefly, in other words, The same CC value may include different callgraph.

If you need more information, please contact me

Thanks

Michael

joxeankoret commented 6 years ago

Hi Michael,

You're confusing the cyclomatic complexity of the flow graphs with the calculated hash of the call graph. The cyclomatic complexity is just calculated for the flow graphs, not for the whole call graph.

Also, while many functions might share the same CC for similar flow graphs, for non trivial binaries, it's highly unlikely that 2 binaries will cause collisions.

Thanks & regards, Joxean Koret

michaelii commented 6 years ago

Hi Thank you for your opinion Still a little bit confusing. And sorry to disturb you again and again

Q1: What's the different between flow graph and call graph? Q2: The CC value is defined CC = E-N+2 which comes from the call graph from each function. is it right? Q3: "You're confusing the cyclomatic complexity of the flow graphs with the calculated hash of the call graph" I don't understand the part of "hash"

Thanks

Michael

joxeankoret commented 6 years ago

Hi Michael,

Q1: Flow graphs are functions and the call graph is the relationship between functions. Q2: It comes from the Control Flow Graph of each function, not from the call graph. BTW, the definition for CC is not that, but is the one I'm using because is "good enough". The formal definition is "CC = E - N + 2P", where "P" is the number of connected points. Q3: Consider "hash" as a mathematical function that takes as input graphs an outputs a big number. The final big number generated by that function is the actual "hash". Similarly to what happens with cryptographic hash functions (i.e., MD5 or SHA1), you input data and it outputs a result, the resulting data is what is considered the final hash.

michaelii commented 6 years ago

Hi

Thank you for your feedback. You light me about the call graph and flow graph However, I still cannot connect the idea between "generate a fuzzy graph hash"

"small-primes-product, SPP" is a kind of hash number for fuzzy graph hash am I right?

What's the term of "fuzzy graph"? I check your study material but cannot find some detail about it.

Appreciate

Michael

michaelii commented 6 years ago

Hi One more tricky question I realize the Prime table correspond to the CC value.

Question : Why you need to use "prime"

we can just use CC value to be the signature. ( Also can used to do Binary difference funciton) Why we need to use prime table ( for storage issue? unique? )

Thanks

Michael

joxeankoret commented 6 years ago

Hi Michaelii,

We're mixing things here :) So, we have a "graph". Then, we have a "hash" that is the result of applying a function to a "graph". The function to output a hash given an input graph will have some properties. In the algorithm I'm using one of the properties I wanted to have is to be able to match "similar things". So, my algorithms uses "fuzzy logic", in the sense that similar things will have the same hash. This is why I say that it's a "fuzzy graph hash". Perhaps it's less confusing to say it's a graph's "fuzzy hash".

Now, the small-primes-product. We cannot properly compare graphs as, say, a list. Nodes might be re-ordered although they are the same. The SPP algorithm assures that even if functions are re-ordered, it will not affect the output of the algorithm as it will just multiply the generated primes associated to each function regardless of its position.

I didn't invent it myself. Actually, if I'm not wrong that was "an invention" of either @halvarflake, @erocarrera or @rolfrolles. You can read more about it here: https://static.googleusercontent.com/media/www.zynamics.com/es//downloads/bindiffsstic05-1.pdf

The explanation in that paper, although a bit dense, maybe, will be better than the one I can give you with my broken English & math skills.

michaelii commented 6 years ago

Hi

You are very nice!! I very love this tool ! I want to dig in it

your PDF link is fail. Can you upload the new one ?

Thanks

Michael

joxeankoret commented 6 years ago

Google for "Graph-based comparison of Executable Objects"

jiangming commented 6 years ago

The link is here: https://static.googleusercontent.com/media/www.zynamics.com/zh-CN//downloads/bindiffsstic05-1.pdf

RolfRolles commented 6 years ago

(To clarify, credit rightfully belongs to Halvar for the prime products technique.)

michaelii commented 6 years ago

Hi joxeankoret

Thank you for your information. We really appreciate your help. This week I read the paper and do some study on the Diaphora material. I have another further question

Q1:
In the Diaphora.py In the function -- "def check_callgraph(self)" In this function, I saw the code

  cg1 = decimal.Decimal(rows[0]["callgraph_primes"])
  cg_factors1 = json.loads(rows[0]["callgraph_all_primes"])
  cg2 = decimal.Decimal(rows[1]["callgraph_primes"])
  cg_factors2 = json.loads(rows[1]["callgraph_all_primes"])
    FACTORS_CACHE[cg1] = cg_factors1
    FACTORS_CACHE[cg2] = cg_factors2

It is very simple to get the factor of prime product. However, in the code you do not use those variable FACTORS_CACHE[cg1] FACTORS_CACHE[cg2]

Why do you put this code in the diaphora.py? Moreover, in the "difference(cg1, cg2)", you re-generate the factor again in the code, why don't you just use cg_factors1 and cg_factors2? It may save more time to re-calculate the factor

Q2:
I have question about Mnemonics small-primes-product. In the diahpora_ida.py, in the function " def read_function(self, f, discard=False )"

    if mnem in cpu_ins_list:
      mnemonics_spp += self.primes[cpu_ins_list.index(mnem)]

Based on the another spp function. I always saw " * = " , why here put " += " ? Is it a misunderstanding ?

If you need more information, please send the comment.

Thanks !!

Michael

joxeankoret commented 6 years ago

I will need some more time to answer you Q1 (not today, sorry) but Q2 looks like you found a bug!

michaelii commented 6 years ago

Hi Thank you for your respond. This week I figure out the partial Q1 by myself.

Let me modify my question

Q1: Why we need "callgraph_primes"

I check the code again. You use the Factor_cache to save the factor of the product of primes. It's fine. It make sense for us to understand it. However, why we need the "callgraph_primes"? d1: callgraph_all_primes can generate the product. We do not need to waste 1 column for SQL d2: In the factor.py in function "def _difference(num1, num2):"

for num in nums: if FACTORS_CACHE.has_key(num): x = FACTORS_CACHE[num] else: x = factorization(long(num)) FACTORS_CACHE[num] = x s.append(x)

 which kind of situation can make program to enter "else". In my side, It is impossible that the 
 FACTOR_CACHE[] = NULL

 In general, just want to know "callgraph_primes"? why we need this spp.  I can say 
 that"callgraph_all_primes" is more powerful than it. We don't need to waste 1 column for it.

If you need more information, please contact me

Thanks

Michael

joxeankoret commented 6 years ago

The reason is automation. Yeah, we can factor the primes. However, if I'm doing massive analysis of thousands of databases I don't want to factor again the primes. This is the only reason.

joxeankoret commented 6 years ago

Leaving it open forever as I believe it's pretty useful for people that want to understand how it works.

kuqadk3 commented 4 years ago

In mnem_spp hash algorithm, have you considered the case where there is a call but to different function For example

mov

xor

mov

call function_1
mov

xor

mov

call function_2

It will have the same hash but two different function/blocks of code. It also happens with jmp,...

joxeankoret commented 4 years ago

That is a known problem. It isn't usually a problem for "big enough" functions, but it is a big problem with small functions.

kuqadk3 commented 4 years ago

Is there any technique/algorithm to deal with this?