googleprojectzero / functionsimsearch

Some C++ example code to demonstrate how to perform code similarity searches using SimHashing.
Apache License 2.0

More training steps lead to inaccurate matching result #15

Open 4B5F5F4B opened 5 years ago

4B5F5F4B commented 5 years ago

I generated a weight file by setting -train_steps to 100 and got pretty good matching results, but then I generated another weight file by setting -train_steps to 500 and got no matches at all. I thought more training steps should give more accurate matching results. Am I right?

I attached the executable files (libpng 1.2.54 compiled by gcc-6.3, gcc-7.3, and gcc-8.2 with different options) that I used for training, and the input file I used for matching: ELF.zip, pngtest_libpng_12_54.zip

[image: matching]

4B5F5F4B commented 5 years ago

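[image: matching2] https://user-images.githubusercontent.com/19218802/48325194-667c9880-e66f-11e8-8ddb-164abb8cee1c.png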

thomasdullien commented 5 years ago

Hey there,

awesome report, thanks for this.

The answer is a bit complicated: more training steps are only guaranteed to give you better matching results on the examples that you train on.

For "unseen" examples, overtraining / overfitting can occur; it can be seen in the diagram on this slide:

https://docs.google.com/presentation/d/16r_AUSWmtGw0CNxRg60VlTqkjBRxlvjEgxF10O0imk4/edit#slide=id.g427b6e6213_2_37

For the example of "find more variants of a function we already have N examples for" and the training set in the presentation, the training starts making results worse from about 420 steps onward. For the example "get better at recognizing functions you have never seen before, just learn about compilers", this happens much earlier -- before 100 training steps.
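A minimal sketch of how one might choose the step count in practice, assuming you hold out some functions that are never used for training and score each weight file on them. The helper name `pick_train_steps` and the scores below are hypothetical placeholders for illustration, not measurements from this project and not part of functionsimsearch:

```python
def pick_train_steps(validation_scores):
    """Return the step count with the best held-out score.

    validation_scores maps a candidate -train_steps value to a score
    measured on functions that were NOT in the training set
    (higher is better). Ties go to the smaller step count.
    """
    if not validation_scores:
        raise ValueError("need at least one (steps, score) pair")
    # Iterate in ascending step order so that max() keeps the
    # smallest step count when several candidates share the top score.
    return max(sorted(validation_scores), key=lambda s: validation_scores[s])


# Hypothetical held-out scores (placeholders, not real results):
# the score rises at first, then drops once the weights overfit.
example_scores = {50: 0.61, 100: 0.70, 200: 0.66, 500: 0.12}
best_steps = pick_train_steps(example_scores)
print(best_steps)
```

The point is only that the step count should be judged on data the training never saw. The training loss keeps improving with more steps, so it cannot tell you when to stop.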

One of the steps I want to take in the future to reduce overfitting is to migrate the training code to either TensorFlow or Julia and switch from L-BFGS to SGD-based algorithms. This should allow increasing the training data significantly, which should help reduce the risk of overfitting...

Cheers, Thomas

