packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection
GNU General Public License v3.0

Labels all set to 1 in visualization #66

Closed smarbal closed 1 year ago

smarbal commented 1 year ago

Description

When visualizing a model, all executables appear as packed, even though only some of them are.

Steps to reproduce

  1. Generate the dataset: dataset make upx-PE -p upx -f PE
  2. Train the model: model train upx-PE -a mbkmeans
  3. Visualize the model: model visualize -e upx-PE_pe32-pe64_99_mbkmeans_f111

Additional information

Printing params['target'] in visualization.py shows that all labels are indeed set to 1, so this is not a visualization problem. Datasets used:

   Name    #Executables   Size    Files    Formats    Packers  
 ───────────────────────────────────────────────────────────── 
  fs-upx   99             114KB   no      PE32,PE64   upx{35}  
  upx-PE   99             45MB    yes     PE32,PE64   upx{35} 

fs-upx is the fileless version of the dataset; it yields the same bug.

dhondta commented 1 year ago

@smarbal If you run model browse upx-PE_pe32-pe64_99_mbkmeans_f111, do you see the right labels in the cluster column?

dhondta commented 1 year ago

@smarbal I got it; if you run model -v test upx-PE_pe32-pe64_99_mbkmeans_f111 upx-PE, you will see that it is the true labels that are all 1's, not the predicted labels (which may even all be correct). This likely comes from a bug in the label mapping of the y_true vector. I will try to fix this ASAP.

smarbal commented 1 year ago

@dhondta The issue seems to come from line 212 in ../learning/model.py. After line 209, the labels of all NOT_PACKED instances are replaced by None. Then, at line 212, fillna() replaces those None values with NOT_LABELLED. The mapping at line 214 therefore cannot work correctly, since NOT_PACKED instances now carry a '?' label, which is incorrect.

Maybe changing the value of NOT_PACKED in LABELS_BACK_CONV from None to 0 could be a solution?
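
To illustrate the suspected sequence, here is a minimal standalone sketch. It is not the code from model.py: the constant values, the binarize helper and the exact form of the final mapping are assumptions made for illustration. Only the order of operations (back-conversion, then fillna(), then mapping to binary labels) follows the description above.

```python
import pandas as pd

# Hypothetical stand-ins for the project's constants; the real values may differ.
NOT_PACKED = "-"
NOT_LABELLED = "?"


def binarize(labels, not_packed_sentinel):
    """Illustrative three-step label conversion (not the project's actual code)."""
    # Step 1 (cf. line 209): back-convert NOT_PACKED labels to the sentinel value.
    converted = [not_packed_sentinel if l == NOT_PACKED else l for l in labels]
    series = pd.Series(converted, dtype=object)
    # Step 2 (cf. line 212): fill missing labels with NOT_LABELLED.
    # pandas treats None as missing, so a None sentinel is overwritten here.
    series = series.fillna(NOT_LABELLED)
    # Step 3 (cf. line 214, assumed form): sentinel -> 0, anything else -> 1.
    return [0 if l == not_packed_sentinel else 1 for l in series]


labels = ["upx", NOT_PACKED, "upx", NOT_PACKED]
buggy = binarize(labels, None)  # sentinel None is swallowed by fillna()
fixed = binarize(labels, 0)     # sentinel 0 survives fillna()
print(buggy, fixed)
```

With None as the sentinel, the not-packed samples are indistinguishable from unlabelled ones after fillna(), so every sample ends up labelled 1; with 0 as the sentinel they keep their own value.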

dhondta commented 1 year ago

Solved with 3c8a40fcb710feef073b865022d50632da014ebb