packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection
GNU General Public License v3.0

Add parameter to drop labels for clustering metrics #48

Closed smarbal closed 1 year ago

smarbal commented 1 year ago

Improvement suggestion

Because clustering algorithms can be evaluated with metrics that either require ground-truth labels or do not, it would be useful to add a parameter to the train method that keeps only the metrics that do not use labels.
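For context, the distinction maps directly onto scikit-learn (on which the pipeline is built): label-free metrics such as the Silhouette, Calinski-Harabasz, and Davies-Bouldin scores shown later in this thread need only the feature matrix and the predicted clusters, while label-based metrics also need ground truth. A minimal sketch on toy data (the data and setup are illustrative, not the project's code):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score, adjusted_rand_score)

# Toy feature matrix with two well-separated blobs, standing in for the
# features computed from a dataset of executables.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Label-free metrics: computable from X and the predicted clusters alone,
# so they remain available when labels are dropped.
print(silhouette_score(X, pred))
print(calinski_harabasz_score(X, pred))
print(davies_bouldin_score(X, pred))

# Label-based metric: needs ground-truth labels, which an --ignore-labels
# option would drop.
y_true = np.array([0] * 50 + [1] * 50)
print(adjusted_rand_score(y_true, pred))
```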

dhondta commented 1 year ago

@smarbal Please test...

smarbal commented 1 year ago

Hello, since commit 95b94c4, the metrics are no longer printed at the end of training.

Steps to reproduce

Example

model train upx-PE -a kmeans --ignore-labels
00:00:03.692 [INFO] Selected algorithm: K-Means clustering
00:00:03.693 [INFO] Reference dataset:  upx-PE(PE32,PE64)
00:00:03.694 [INFO] Computing features...
00:00:37.711 [INFO] Making pipeline...
00:00:37.714 [INFO] Training model...
00:00:37.714 [INFO] (step 1/2) Processing standardize (StandardScaler)
00:00:37.718 [INFO] (step 2/2) Processing kmeans

Name: upx-PE_pe32-pe64_100_kmeans_f109

00:00:38.224 [INFO] Parameters:
- n_clusters = 8
- n_init = 10
- max_iter = 300
- tol = 0.0001
- algorithm = lloyd
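(For reference, the parameters printed above are scikit-learn's stock K-Means settings; an equivalent explicit instantiation, assuming the pipeline forwards them to sklearn.cluster.KMeans unchanged:)

```python
from sklearn.cluster import KMeans

# Mirrors the parameters printed by the training run above; that packing-box
# forwards them verbatim to scikit-learn is an assumption.
km = KMeans(n_clusters=8, n_init=10, max_iter=300, tol=0.0001,
            algorithm="lloyd")
print(km.get_params()["n_clusters"])
```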

The clustering metrics are present in the algorithms configuration file, and before that commit the command output the following:

$ model train upx-PE -a kmeans --ignore-labels
00:00:03.502 [INFO] Selected algorithm: K-Means clustering
00:00:03.503 [INFO] Reference dataset:  upx-PE(PE32,PE64)
00:00:03.505 [INFO] Computing features...
00:00:39.249 [INFO] Making pipeline...
00:00:39.252 [INFO] Training model...
00:00:39.252 [INFO] (step 1/2) Processing standardize (StandardScaler)
00:00:39.256 [INFO] (step 2/2) Processing kmeans

Name: upx-PE_pe32-pe64_100_kmeans_f109

  ─────  ────────────────  ───────────────────────  ────────────────────
  .      Silhouette Score  Calinski Harabasz Score  Davies Bouldin Score
  Train  -0.216            8.474                    10.921
  ─────  ────────────────  ───────────────────────  ────────────────────
...
dhondta commented 1 year ago

@smarbal You can see why by using model -v train .... I guess the predictions are all -1, meaning no label. I think something is failing in the pipeline but have not figured out what yet.

smarbal commented 1 year ago

@dhondta Verbosity didn't give any more information. I tried changing parts of the code, and by reverting _convert_output to its previous state I got the metrics back. My guess is that the problem comes from this condition: if all(x == LABELS_BACK_CONV[NOT_LABELLED] for x in yp).
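That guard can be exercised in isolation; the constant names below are taken from the quoted condition, but their values (a -1 "not labelled" sentinel) are assumptions about the project's internals:

```python
# Hypothetical values for the project's constants; only the names come from
# the condition quoted above.
NOT_LABELLED = "?"
LABELS_BACK_CONV = {NOT_LABELLED: -1}

def metrics_suppressed(yp):
    """Return True when every prediction equals the 'not labelled' sentinel,
    the case in which metric computation would be skipped."""
    return all(x == LABELS_BACK_CONV[NOT_LABELLED] for x in yp)

print(metrics_suppressed([-1, -1, -1]))   # all sentinel: metrics skipped
print(metrics_suppressed([0, 1, 0, -1]))  # real cluster IDs present: metrics kept
```

If a conversion step maps every cluster ID to the sentinel, this guard fires on all training runs and suppresses the metrics, which would match the observed behaviour.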

dhondta commented 1 year ago

@smarbal Please test.

smarbal commented 1 year ago

Works as intended. Thank you.