googleprojectzero / functionsimsearch

Some C++ example code to demonstrate how to perform code similarity searches using SimHashing.
Apache License 2.0
558 stars 97 forks

Can I generate the training data using my own file? #21

Closed Sunxj888 closed 2 years ago

Sunxj888 commented 2 years ago

For example, I want to generate the training data from OpenSSL, so I first compiled OpenSSL for x86-64, ARM, and MIPS on Linux. However, when I run ./generate_training_data.py, errors occur, for example:

functionfingerprints: /code/functionsimsearch/third_party/dyninst-9.3.2/boost/src/boost/boost/smart_ptr/shared_ptr.hpp:693: typename boost::detail::sp_member_access::type boost::shared_ptr::operator->() const [with T = Dyninst::InstructionAPI::InstructionDecoderImpl; typename boost::detail::sp_member_access::type = Dyninst::InstructionAPI::InstructionDecoderImpl]: Assertion `px != 0' failed.
Failure to run functionfingerprints (ELF:./ELF/openssl/openssl-arm.ELF->32ee2803b259760e)
Done with functionfingerprints. (ELF:./ELF/openssl/openssl-arm.ELF->32ee2803b259760e)
Running dotgraphs on all files.
dotgraphs: /code/functionsimsearch/third_party/dyninst-9.3.2/boost/src/boost/boost/smart_ptr/shared_ptr.hpp:693: typename boost::detail::sp_member_access::type boost::shared_ptr::operator->() const [with T = Dyninst::InstructionAPI::InstructionDecoderImpl; typename boost::detail::sp_member_access::type = Dyninst::InstructionAPI::InstructionDecoderImpl]: Assertion `px != 0' failed.
Failure to run dotgraphs (ELF:./ELF/openssl/openssl-arm.ELF->32ee2803b259760e)
Done with dotgraphs. (ELF:./ELF/openssl/openssl-arm.ELF->32ee2803b259760e)
Obtaining function symbols from ./ELF/openssl/openssl-arm.ELF... got 743 symbols...Getting disassembled functions. got 0 functions...Opening and writing extracted_symbols_32ee2803b259760e.txt. Sorting...Writing...Done (wrote 0 symbols)
Processing PE training files to extract features...
./PE/*/.exe
Returning list of files from PE directory: []
Running functionfingerprints on all files.
Running dotgraphs on all files.
Loading all extracted symbols and grouping them...
Checking filename functions_32ee2803b259760e.txt
Checking filename extracted_symbols_32ee2803b259760e.txt
Processing file extracted_symbols_32ee2803b259760e.txt
Checking filename json_32ee2803b259760e
Splitting into validation set and training set...
Writing unseen training attract.txt and repulse.txt...
Attraction: Requested 200000 pairs with 0 available.
Traceback (most recent call last):
  File "./generate_training_data.py", line 624, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "./generate_training_data.py", line 614, in main
    WriteUnseenTrainingAndValidationData(symbol_to_files_and_address, FLAGS)
  File "./generate_training_data.py", line 515, in WriteUnseenTrainingAndValidationData
    number_of_pairs=FLAGS.unseen_training_samples)
  File "./generate_training_data.py", line 451, in WriteAttractAndRepulseFromMap
    repulsion_set = GenerateRepulsionPairs( input_map, number_of_pairs )
  File "./generate_training_data.py", line 464, in GenerateRepulsionPairs
    replace=False )
  File "mtrand.pyx", line 908, in numpy.random.mtrand.RandomState.choice
ValueError: 'a' cannot be empty unless no samples are taken
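
Reading the output above, the final ValueError looks like a downstream symptom: both tools abort on the ARM binary, 0 functions and 0 symbols are written, and the pair sampling then has nothing to draw from. The sketch below scans captured output for those signs so that the failing files can be spotted before the sampling step. It is only an illustration: the regular expressions are copied from the messages quoted above, while the helper name `summarize_log` and the layout of its result are hypothetical and not part of functionsimsearch.

```python
import re

# Patterns taken from the log messages quoted in this issue. The helper name
# and the returned dictionary layout are hypothetical, for illustration only.
FAILURE_RE = re.compile(r"Failure to run (\w+) \(ELF:(\S+?)->([0-9a-f]+)\)")
EMPTY_RE = re.compile(r"got 0 functions|wrote 0 symbols")
PAIRS_RE = re.compile(r"Requested (\d+) pairs with (\d+) available")


def summarize_log(log_text):
    """Collect the signs that feature extraction produced no usable data."""
    # Each failure line names the tool, the input file, and the file ID.
    failures = [
        {"tool": tool, "file": path, "id": file_id}
        for tool, path, file_id in FAILURE_RE.findall(log_text)
    ]
    # Messages reporting that nothing was disassembled or written.
    empty_outputs = EMPTY_RE.findall(log_text)
    # (requested, available) pairs where fewer samples exist than requested.
    shortfalls = [
        (int(requested), int(available))
        for requested, available in PAIRS_RE.findall(log_text)
        if int(available) < int(requested)
    ]
    return {
        "failures": failures,
        "empty_outputs": empty_outputs,
        "shortfalls": shortfalls,
    }
```

To use it, redirect the script's output to a file and pass the file contents to `summarize_log`; a non-empty `failures` or `shortfalls` entry suggests the training set is empty before any sampling happens.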

What is the solution? And can I generate the training data using my own files? I look forward to your reply. Thank you.