googleprojectzero / functionsimsearch

Some C++ example code to demonstrate how to perform code similarity searches using SimHashing.
Apache License 2.0
558 stars 97 forks source link

Unefficient creation of repulsion sets #17

Open MohamadMansouri opened 5 years ago

MohamadMansouri commented 5 years ago

When extracting the symbols of the binary files of the dataset, base64 of the function prototype is used to build a ground truth of same functions. but with different compilers, and platforms the function prototype does not remain the same. Thus making the algorithm possibly put the same function (with different prototype) in the repulsion file in the training and validation sets. Since this is indeed a frequent case, I believe this may have affected the evaluation significantly.

For example: WideToChar(wchar_t const, char, unsigned long) BASE64 : V2lkZVRvQ2hhcih3Y2hhcl90IGNvbnN0KiwgY2hhciosIHVuc2lnbmVkIGxvbmcp WideToChar(wchar_t const, char, unsigned long) BASE64: V2lkZVRvQ2hhcih3Y2hhcl90IGNvbnN0KiwgY2hhciosIHVuc2lnbmVkIGludCk= these are 2 functions each exists in 22 distinct files of the dataset. these refere to the same function but the training will try to make them look different

QuickOpen::ReadRaw(RawRead&) BASE64: UXVpY2tPcGVuOjpSZWFkUmF3KFJhd1JlYWQmKQ== QuickOpen::ReadRaw( RawRead&) BASE64: UXVpY2tPcGVuOjpSZWFkUmF3KCBSYXdSZWFkJik= The first function appeared in 64 files while the second appeared in 41 different files. Some of the compilers have just put a space before the parameter and this will make troubles in the training.

There are many cases as this issue.

Suggestion: Use the name of the function as a symbol (without the parameters)

thomasdullien commented 5 years ago

Hey there,

thanks for the report, and ouch. This indeed looks like it would cause troubles.

The converse (e.g. only using the function name) may cause different issues, though -- not for the repulsion sets, but for the attraction sets (two different functions with same name but different arguments, for example). So I guess the "best" solution would be to use the prototypes for attraction pairs, and not use them for the repulsion sets?

Cheers, Thomas

MohamadMansouri commented 5 years ago

Hey, Thanks for the fast reply, I would like to add that creating symbols from function prototypes results in finding 1176 distinct function in the unrar dataset (ELF + PE) while creating symbols from function name will give 879 distinct functions this means that near 297 are repeated functions which they represent 25% of the dataset. On the other hand and after thinking twice about it, its low chance that this may cause a big problem since, for a problem to occur the probability is eq
the number of problamatic pairs are 0.00021 * N where N is the number of repulsion pairs for 500000 repulsion pair we may have 107 pairs of function that are declared as repulsion pairs but they belong to the same function

I agree to what you said regarding the solution

I will create a pull request base on what you said. Please have a look.

thomasdullien commented 5 years ago

Hey there,

one thing I was wondering -- stemsymbols is normally responsible for trying to "canonicalize" symbols from different compilers, thinking about the problem in question, it seems like the best place to fix this would be to 'fix' stemsymbols?

Do you happen to have a few example strings that are incorrectly viewed as different?

Cheers, Thomas

Am Fr., 28. Dez. 2018 um 01:37 Uhr schrieb Mohamad Mansouri < notifications@github.com>:

Hey, Thanks for the fast reply, I would like to add that creating symbols from function prototypes results in finding 1176 distinct function in the unrar dataset (ELF + PE) while creating symbols from function name will give 879 distinct functions this means that near 297 are repeated functions which they represent 25% of the dataset. On the other hand and after thinking twice about it, its low chance that this may cause a big problem since, for a problem to occur the probability is [image: eq] https://camo.githubusercontent.com/1e1b1d5abc24fc1318b2b14130f0d6c3ee9f7b3b/68747470733a2f2f6c617465782e636f6465636f67732e636f6d2f6769662e6c617465783f282535436672616325374232393725374425374231313736253744292825354366726163253742312537442537423131373525374429 the number of problamatic pairs are 0.00021 * N where N is the number of repulsion pairs for 500000 repulsion pair we may have 107 pairs of function that are declared as repulsion pairs but they belong to the same function

I agree to what you said regarding the solution

I will create a pull request base on what you said. Please have a look.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/googleprojectzero/functionsimsearch/issues/17#issuecomment-450262768, or mute the thread https://github.com/notifications/unsubscribe-auth/AEYBvJ02QCO4PQoPSSw_2qaZntJcIHDRks5u9WfRgaJpZM4ZjDlA .

MohamadMansouri commented 5 years ago

running this on the output of the generated training data of Unrar. cat extractedsymbols | awk '{system("echo "$4 "| base64 -d; echo")}' | sort | uniq This what you get func_uniq.txt Regarding to what you thought I am afraid you are not completely right as if you only take into account the ELF file (which are not undergoing the stemsymbol thing) you find the same problem. This can be proved by running this command cat extractedsymbols | awk '$2 ~ /ELF/{system("echo " $4 " | base64 -d; echo ")}' | sort | uniq This what you get func_ELF_uniq.txt

Regards, Mansouri

thomasdullien commented 5 years ago

Hey there,

interesting, thanks for checking. I will ponder the next two days how to best deal with this...

Cheers, Thomas

Am So., 30. Dez. 2018 um 02:58 Uhr schrieb Mohamad Mansouri < notifications@github.com>:

running this on the output of the generated training data of Unrar. cat extractedsymbols | awk '{system("echo "$4 "| base64 -d; echo")}' | sort | uniq This what you get func_uniq.txt https://github.com/googleprojectzero/functionsimsearch/files/2716851/func_uniq.txt Regarding to what you thought I am afraid you are not completely right as if you only take into account the ELF file (which are not undergoing the stemsymbol thing) you find the same problem. This can be proved by running this command cat extractedsymbols | awk '$2 ~ /ELF/{system("echo " $4 " | base64 -d; echo ")}' | sort | uniq This what you get func_ELF_uniq.txt https://github.com/googleprojectzero/functionsimsearch/files/2716873/func_ELF_uniq.txt

Regards, Mansouri

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/googleprojectzero/functionsimsearch/issues/17#issuecomment-450533963, or mute the thread https://github.com/notifications/unsubscribe-auth/AEYBvMiaiv0L2lLEL2vaGyuq6ihvbpFFks5u-B3jgaJpZM4ZjDlA .