inoueke-n / optimization-detector

Optimization detection over compiled binaries
MIT License

About installation #2

Open anzosasuke opened 3 years ago

anzosasuke commented 3 years ago

Hi, I'm trying to install it. I'm trying to generate the image in Docker, but I'm running into a python3-pip dependency problem. Could you tell me which system you used for this tool?

davidepi commented 3 years ago

I tried to rebuild the docker image and... yes, that's a bug I didn't notice (because my docker build was using the cache).

Thank you very much, I'm gonna fix it soon.

In the meantime, I strongly suggest using the dataset we already generated, available at this URL for the architecture you are interested in, instead of creating it yourself.

I left the instructions in the readme just in case somebody wants to replicate the dataset generation as well. However, that step is very CPU intensive and would probably take around one week to generate all architectures. Moreover, we created x86_64 and aarch64 manually, so that part is not present in the script.

anzosasuke commented 3 years ago

Yeah, I will check out the pre-trained models first, then try to move to dataset generation eventually. I am a grad student and I want to build a compiler detector with ML, so I think learning to use your tool would be a great starting point. I am pretty new at this (binary analysis and ML). Let's hope I can get it to work. Thanks for the prompt reply.

anzosasuke commented 3 years ago

Hi, I tried something: python3 optimization-detector.py infer -m mips-flags-lstm.h5 -o output.csv busybox. Also, is this command just for detecting the optimization level?

And could you tell me the difference between these two trained models x86_64-flags-lstm.h5 and x86_64-compiler-lstm.h5?

Can you check out the output?

output.csv

busybox.zip

I could be doing something horribly wrong, but I don't understand the output. Sorry, I asked a lot of questions.

davidepi commented 3 years ago

Welp, you are right: it appears I forgot to provide any explanation of the output file.

Essentially, when you submit an executable, its .text section is divided into chunks of 2048 bytes and each chunk is classified independently. The output.csv contains the classification of each chunk. I then usually pick the value that appears most often. Note that the evaluation was done "per-chunk" rather than "per-binary", so the accuracy reported in the paper refers to every single 2048-byte chunk.
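The per-binary aggregation described above can be sketched as a majority vote over the per-chunk predictions. This is a minimal sketch, not the tool's own code: it assumes output.csv has a header and that the predicted class sits in a column named "prediction" (check the actual header of your file and adjust the column name accordingly).

```python
import csv
from collections import Counter


def majority_label(csv_path, column="prediction"):
    """Return the class predicted for the most 2048-byte chunks.

    Each row of the CSV is assumed to hold the classification of one
    chunk of the binary's .text section; the per-binary label is simply
    the most frequent per-chunk label.
    """
    with open(csv_path, newline="") as f:
        labels = [row[column] for row in csv.DictReader(f)]
    if not labels:
        raise ValueError(f"no rows found in {csv_path}")
    return Counter(labels).most_common(1)[0][0]
```

For example, if 90 chunks are classified as O2 and 10 as O0, this returns O2 as the label for the whole binary.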

Regarding the two trained models:

anzosasuke commented 3 years ago
  1. Regarding the output, the chunks seem to contain all of the optimization levels 1, 2, 3, 4; how would you identify which optimization level was used?