packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection
GNU General Public License v3.0
49 stars 10 forks

Add multiprocessing to dataset feature extraction #105

Closed AlexVanMechelen closed 5 months ago

AlexVanMechelen commented 6 months ago

This PR adds:

dhondta commented 6 months ago

@AlexVanMechelen Not sure parallelism will significantly improve performance, as feature extraction operations are essentially IO-bound rather than CPU-bound, so multi-threading may be more suitable. Did you test your proposal and notice a significant increase in performance?

AlexVanMechelen commented 6 months ago

@dhondta For the new CFG-based features, CPU processing power is the limiting factor.
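To make the IO-bound versus CPU-bound distinction concrete, here is a minimal sketch using Python's standard `concurrent.futures`. The function `count_nodes` is a hypothetical CPU-bound stand-in for a CFG-based feature, not the extractor used in this PR, and `extract_all` is an illustrative helper. Both executor classes share the same interface, so switching between threads and processes is a one-argument change.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_nodes(sample):
    """Hypothetical CPU-bound stand-in for a CFG-based feature
    such as "number_of_nodes" (NOT the project's real extractor)."""
    total = 0
    for i in range(sample * 1000):
        total += i % 7
    return total


def extract_all(samples, executor_cls, workers=4):
    """Compute the feature for every sample with the given executor class.

    Executor.map returns results in input order, so the output lines up
    with `samples` regardless of which worker finished first.
    """
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(count_nodes, samples))


if __name__ == "__main__":
    samples = list(range(1, 9))
    serial = [count_nodes(s) for s in samples]
    # Threads: all workers share one interpreter, so pure-Python
    # CPU-bound work is typically serialized by the GIL.
    assert extract_all(samples, ThreadPoolExecutor) == serial
    # Processes: each worker has its own interpreter, so CPU-bound
    # work can run on several cores at once.
    assert extract_all(samples, ProcessPoolExecutor) == serial
```

Threads tend to suit work that mostly waits on disk or network, while separate processes are the usual choice when the computation itself is the bottleneck, as reported here for the CFG-based features.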

Experiment with 100 samples of various packer categories + non-packed samples -> computation of 1 feature, "number_of_nodes":

1) No multiprocessing: 38:02 for 100 samples -> 0.044 sample/s
2) Multiprocessing with 32 CPU cores: 9:22 for 100 samples -> 0.178 sample/s -> 4x increase compared to no multiprocessing

Note: The speedup is this modest (4x increase while the number of CPU cores went up 32x) because of issue #106: the last few executables require WAY more extraction time. The rate for the first 89 samples was:

0:57 for the first 89 samples -> 1.561 sample/s -> 35x increase compared to no multiprocessing
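The figures above can be re-derived from the reported timings. The following is a small arithmetic check, not part of the PR. The helper names are illustrative, and the tail computation assumes the 0:57 figure comes from the same 32-core run as the 9:22 total.

```python
def to_seconds(mmss):
    """Convert a "MM:SS" timing string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def rate(n_samples, mmss):
    """Throughput in samples per second."""
    return n_samples / to_seconds(mmss)


serial = rate(100, "38:02")    # no multiprocessing
parallel = rate(100, "9:22")   # 32 CPU cores, all 100 samples
first_89 = rate(89, "0:57")    # 32 CPU cores, first 89 samples only

print(round(serial, 3), round(parallel, 3), round(first_89, 3))
# -> 0.044 0.178 1.561

print(round(parallel / serial, 1))   # -> 4.1
print(round(first_89 / serial, 1))   # -> 35.6

# Assumption: same run, so the remaining 11 samples account for the rest.
tail_seconds = to_seconds("9:22") - to_seconds("0:57")
print(tail_seconds, round(tail_seconds / 11, 1))   # -> 505 45.9
```

Under that assumption, the last 11 samples account for 505 of the 562 seconds, which is consistent with a handful of slow executables dominating the wall-clock time. The 35x figure compares the first-89 parallel rate against the serial rate over all 100 samples, since no serial timing for the first 89 alone is reported.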