Labels all set to 1 in visualization

smarbal commented 1 year ago

Description

When visualizing a model, all executables appear as packed, even though some of them are not.

Steps to reproduce

Generate dataset: dataset make upx-PE -p upx -f PE
Train model: model train upx-PE -a mbkmeans
Visualize model: model visualize -e upx-PE_pe32-pe64_99_mbkmeans_f111

Additional information

Printing params['target'] in visualization.py shows that all labels are indeed set to 1, so this is not a visualization problem. Datasets used:

   Name    #Executables   Size    Files    Formats    Packers  
 ───────────────────────────────────────────────────────────── 
  fs-upx   99             114KB   no      PE32,PE64   upx{35}  
  upx-PE   99             45MB    yes     PE32,PE64   upx{35}

fs-upx is the fileless version of the dataset, which also yields the same bug.

dhondta commented 1 year ago

@smarbal If you run model browse upx-PE_pe32-pe64_99_mbkmeans_f111, do you see the correct labels in the cluster column?

dhondta commented 1 year ago

@smarbal Got it: if you run model -v test upx-PE_pe32-pe64_99_mbkmeans_f111 upx-PE, you will see that it is the true labels that are all 1's, not the predicted labels (which may even all be correct). This likely comes from a bug in the label mapping of the y_true vector. I will try to fix this ASAP.

smarbal commented 1 year ago

@dhondta The issue seems to come from line 212 in ../learning/model.py. After line 209, all labels of NOT_PACKED instances are replaced by None. Then, at line 212, fillna() replaces those None labels with NOT_LABELLED. The mapping at line 214 therefore cannot work correctly, since NOT_PACKED instances end up with a '?' label, which is incorrect.

Maybe changing the value of NOT_PACKED in LABELS_BACK_CONV from None to 0 could be a solution?
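
For illustration, here is a minimal pure-Python sketch of the three steps described above. It is not the actual code from model.py: the "-" sentinel for NOT_PACKED, the binarize() helper and the final "0 stays 0, anything else becomes 1" mapping are assumptions made for the example, and the list comprehension in step 2 only stands in for the pandas fillna() call. Only the '?' value for NOT_LABELLED comes from the description above.

```python
# Hypothetical constants: "-" is an assumed sentinel; "?" is the value quoted in this thread.
NOT_PACKED = "-"
NOT_LABELLED = "?"


def binarize(raw_labels, back_conv):
    """Mimic the three steps discussed in this issue (not the real model.py code)."""
    # Step 1 (cf. line 209): pass raw labels through the back-conversion table.
    converted = [back_conv.get(v, v) for v in raw_labels]
    # Step 2 (cf. line 212): stand-in for fillna(NOT_LABELLED); None counts as missing.
    filled = [NOT_LABELLED if v is None else v for v in converted]
    # Step 3 (cf. line 214): assumed binary mapping; 0 stays 0, anything else becomes 1.
    return [0 if v == 0 else 1 for v in filled]


raw = ["upx", NOT_PACKED, NOT_PACKED, "upx"]

# Current behaviour: NOT_PACKED -> None, then None -> '?', so every label ends up as 1.
buggy = binarize(raw, {NOT_PACKED: None})

# Proposed change: NOT_PACKED -> 0 survives the fill step and stays 0.
fixed = binarize(raw, {NOT_PACKED: 0})

print(buggy)
print(fixed)
```

Under these assumptions, the None placeholder cannot be told apart from a genuinely missing label once the fill step has run, which would explain why y_true contains only 1's.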

dhondta commented 1 year ago

Solved with commit 3c8a40fcb710feef073b865022d50632da014ebb
