packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection
GNU General Public License v3.0

Unrealistic model performance #110

Closed AlexVanMechelen closed 7 months ago

AlexVanMechelen commented 7 months ago

Issue

I kept getting unrealistic model performance of 100% on every metric in every experiment, so I pushed it to the extreme as a POC:

Demo experiment

Using just one randomly selected feature, byte_17_after_ep, which I believe has little predictive power for datasets with a high variation of packer families, an RF model was trained on a dataset containing many different compressor families (it is very unlikely that the 17th byte after the EP follows a common trend across all of them while never occurring in any of the not-packed samples). A sketch of what I assume this feature computes follows the commands below.

for P in ASPack BeRoEXEPacker MEW MPRESS NSPack Packman PECompact UPX; do dataset update tmp -n 50 -s dataset-packed-pe/packed/$P -l dataset-packed-pe/labels/labels-compressor.json; done
dataset update tmp -s dataset-packed-pe/not-packed -n 400
dataset select -n 200 -s tmp tmp2
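
Sketch of my assumption about the feature (my reading only, not the box's own extraction code): byte_17_after_ep taken as the raw value of the 17th byte after the entry point.

import pefile

def byte_17_after_ep(path):
    # Parse the PE and locate the entry point in the memory-mapped image
    pe = pefile.PE(path)
    ep_rva = pe.OPTIONAL_HEADER.AddressOfEntryPoint
    image = pe.get_memory_mapped_image()
    # Raw value (0-255) of the 17th byte following the entry point
    return image[ep_rva + 17]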

Listing the datasets:

dataset list

Datasets (10)

  Name    #Executables   Size    Files       Formats            Packers      
 ───────────────────────────────────────────────────────────────────────────
  tmp     600            164MB   yes     PE                 compressor{307}  
  tmp2    200            32MB    yes     PE                 compressor{93}

Training the model gives perfect metrics:

model train tmp -A rf
<<snipped>>
Classification metrics                                              

    .     Accuracy   Precision   Recall    F-Measure    MCC    AUC  
 ────────────────────────────────────────────────────────────────── 
  Train   100.00%    100.00%     100.00%   100.00%     0.00%   -    
  Test    100.00%    100.00%     100.00%   100.00%     0.00%   -   

Testing the model on a dataset with no overlap also gives perfect metrics:

model test tmp_pe_600_rf_f1 tmp2
<<snipped>>
Classification metrics                                                        

  Accuracy   Precision   Recall    F-Measure    MCC    AUC   Processing Time  
 ──────────────────────────────────────────────────────────────────────────── 
  100.00%    100.00%     100.00%   100.00%     0.00%   -     10.816ms        

Question

Am I maybe doing something wrong?

dhondta commented 7 months ago

If I get it right, you use a single feature to train your model?


AlexVanMechelen commented 7 months ago

@dhondta For the above demo, yes, to emphasise that 100% on all metrics is unrealistic in that scenario. Besides the above experiment, I've tried many other configurations, always resulting in perfect metrics.

AlexVanMechelen commented 7 months ago

Conclusion

The binary classifier maps samples labeled "not-packed" to False and any other label to True. Unlabeled samples are rejected and never reach model training. As a result, only one class ends up in the training set, which yields perfect metrics.
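
To illustrate the effect outside the box (a minimal scikit-learn sketch, not the tool's own code): once label filtering leaves a single class, any classifier trivially scores 100% on accuracy, precision, recall and F-measure, while MCC collapses to 0, which matches the tables above.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(600, 1))    # one byte-valued feature, pure noise
y = np.ones(600, dtype=int)                # only the "packed" class survived filtering

clf = RandomForestClassifier(random_state=0).fit(X, y)
pred = clf.predict(X)
print(accuracy_score(y, pred))             # 1.0
print(f1_score(y, pred))                   # 1.0
print(matthews_corrcoef(y, pred))          # 0.0 (degenerate case, reported as 0)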

Feature

It would be useful to be able to specify, for example via a flag "-L" on the dataset convert command, that the "not-packed" label should be assigned to such samples. This would allow experiments where class 1 consists of "cryptors" and class 2 comprises non-cryptors (samples packed with packers not belonging to the cryptor category, but also not-packed samples), all labeled "not-packed" so the tool interprets them correctly. A sketch of the intended relabelling is given below.
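
As a sketch of the intended relabelling (illustrative Python only; the family names are placeholders, not the tool's API):

# Hypothetical mapping applied when converting a dataset for a cryptor-vs-rest experiment
CRYPTOR_FAMILIES = {"SomeCryptorA", "SomeCryptorB"}   # placeholder family names

def relabel(label):
    # Keep cryptor labels as the positive class; fold every other packer,
    # as well as genuinely not-packed samples, into "not-packed" so the
    # binary classifier sees two populated classes.
    if label in CRYPTOR_FAMILIES:
        return label
    return "not-packed"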

dhondta commented 7 months ago

@AlexVanMechelen see commit 8112fc59; you can now use -T with model train to solve this issue. Please test and report.

AlexVanMechelen commented 7 months ago

Tested & functional. Encountered one issue, fixed in #114