Refactor `metrics` method

smarbal commented 1 year ago

Improvement suggestion: as discussed, supporting a metrics method for unsupervised models requires moving the method from utils.py to learning.py.

dhondta commented 1 year ago

Hi @smarbal! Support for extended performance metrics was just introduced with e8b4f42122155325f67b2925afafd0f65f0495ab. You can start adding clustering metrics from there.

dhondta commented 1 year ago

@smarbal kind reminder...

dhondta commented 1 year ago

@smarbal 304dd5d00455b2e510428999f1a571225e024605 introduced support for using several metrics categories with algorithms. Model training and testing have been adapted accordingly. Can you please test this on your current use cases and provide feedback?

smarbal commented 1 year ago

@dhondta Following #43, I now run into the following error when training an unsupervised model:

┌──[user@packing-box]──[/mnt/share]──[main|+2]────────                                          ────[172.18.0.3]──[15:33:32]────
$ model train upx-PE -a kmeans 
00:00:02.582 [INFO] Selected algorithm: K-Means clustering
00:00:02.583 [INFO] Reference dataset:  upx-PE(PE32,PE64)
00:00:02.584 [INFO] Computing features...
00:08:55.270 [INFO] Making pipeline...
00:08:55.275 [INFO] Training model...
00:08:55.275 [INFO] (step 1/1) Processing kmeans
Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 118, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 595, in train
    self._train.predict_proba = self.pipeline.predict_proba(self._train.data)[:, 1]
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 70, in __getattribute__
    return object.__getattribute__(object.__getattribute__(self, "pipeline"), name)
  File "/home/user/.local/lib/python3.10/site-packages/sklearn/utils/_available_if.py", line 32, in __get__
    if not self.check(obj):
  File "/home/user/.local/lib/python3.10/site-packages/sklearn/pipeline.py", line 47, in check
    getattr(self._final_estimator, attr)
AttributeError: 'KMeans' object has no attribute 'predict_proba'

dhondta commented 1 year ago

@smarbal f4f145a61553f68334702b7648a130e64c7a6fce should fix this (some algorithms do not support the predict_proba method).
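
For readers following along, a minimal sketch of the kind of guard this implies; it is not the code from the commit. The two estimator classes and `positive_class_proba` are hypothetical stand-ins. The only point is that `getattr` with a default absorbs the `AttributeError` seen in the traceback, so estimators without `predict_proba` yield `None` instead of crashing.

```python
class ClusteringEstimator:
    """Hypothetical stand-in for an estimator such as KMeans: labels only."""

    def predict(self, data):
        return [0 for _ in data]


class ProbabilisticEstimator:
    """Hypothetical stand-in for a classifier that exposes predict_proba."""

    def predict(self, data):
        return [1 for _ in data]

    def predict_proba(self, data):
        # One [P(class 0), P(class 1)] row per sample.
        return [[0.2, 0.8] for _ in data]


def positive_class_proba(estimator, data):
    """Return P(class 1) per sample, or None if predict_proba is unavailable."""
    predict_proba = getattr(estimator, "predict_proba", None)
    if predict_proba is None:
        return None
    return [row[1] for row in predict_proba(data)]


print(positive_class_proba(ClusteringEstimator(), [[1.0], [2.0]]))
print(positive_class_proba(ProbabilisticEstimator(), [[1.0], [2.0]]))
```

Any metric that needs probabilities then has to accept `None` for that input, which is a design choice this sketch leaves open.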

smarbal commented 1 year ago

@dhondta The following error occurs:

┌──[user@packing-box]──[/mnt/share]──[main|✓]────────                                           ────[172.18.0.3]──[17:42:31]────
$ model train upx-PE -a kmeans 
00:00:01.608 [INFO] Selected algorithm: K-Means clustering
00:00:01.609 [INFO] Reference dataset:  upx-PE(PE32,PE64)
00:00:01.610 [INFO] Computing features...
00:08:36.536 [INFO] Making pipeline...
00:08:36.540 [INFO] Training model...
00:08:36.540 [INFO] (step 1/1) Processing kmeans

Name: upx-PE_pe32-pe64_1909_kmeans_f126

Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 118, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 611, in train
    if len(s) > 0:
TypeError: object of type 'Dummy' has no len()

dhondta commented 1 year ago

@smarbal Stupid mistake of mine: I forgot to include the data attribute. This is fixed in 542e044d4e457765d134998a6a290cd3c4359564.
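
To illustrate the failure mode only, here is a sketch under assumptions. The `Dummy` class below is a hypothetical bare attribute container, not pbox's own class, and `has_samples` is an invented helper. Calling `len()` on such a container raises `TypeError` because it defines no `__len__`, whereas checking the length of its `data` attribute works.

```python
class Dummy:
    """Hypothetical bare attribute container (no __len__ defined)."""

    def __init__(self, **attrs):
        self.__dict__.update(attrs)


def has_samples(subset):
    """True if the container carries a non-empty 'data' attribute."""
    data = getattr(subset, "data", None)
    return data is not None and len(data) > 0


train = Dummy(data=[[0.1, 0.2], [0.3, 0.4]], target=[0, 1])
empty = Dummy()

print(has_samples(train))
print(has_samples(empty))
```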

smarbal commented 1 year ago

@dhondta With classification metrics, the following error occurs:

┌──[user@packing-box]──[/mnt/share]──[main|✓]────────                                ────[172.18.0.3]──[18:25:33]────
$ model train upx-PE -a kmeans 
00:00:01.146 [INFO] Selected algorithm: K-Means clustering
00:00:01.147 [INFO] Reference dataset:  upx-PE(PE32,PE64)
00:00:01.148 [INFO] Computing features...
00:09:04.796 [INFO] Making pipeline...
00:09:04.801 [INFO] Training model...
00:09:04.801 [INFO] (step 1/1) Processing kmeans

Name: upx-PE_pe32-pe64_1909_kmeans_f126

Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 118, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 612, in train
    m, h = self._metrics(s.data, s.target, s.predict, s.predict_proba, metric)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 142, in _metrics
    values, headers = m(data, prediction, y_true=target, y_proba=proba, proctime=proctime, logger=self.logger)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/metrics.py", line 61, in _wrapper
    r = f(*a, **kw)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/metrics.py", line 148, in classification_metrics
    yt, yp, ypr, d = _map_values_to_integers(y_true, y_pred, y_proba, **kw)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/metrics.py", line 118, in _map_values_to_integers
    tn, fp, fn, tp = confusion_matrix(*arrays[:2]).ravel()
ValueError: too many values to unpack (expected 4)

I tried with only clustering metrics, and the following error occurred:

model train fileless-upx-PE -a kmeans 
00:00:01.458 [INFO] Selected algorithm: K-Means clustering
00:00:01.459 [INFO] Reference dataset:  fileless-upx-PE(PE32,PE64)
00:00:01.460 [INFO] Loading features...
00:00:02.489 [INFO] Making pipeline...
00:00:02.495 [INFO] Training model...
00:00:02.495 [INFO] (step 1/1) Processing kmeans

Name: fileless-upx-PE_pe32-pe64_1909_kmeans_f126

00:00:03.661 [ERROR] Bad metrics type 'clustering'
Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 118, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 612, in train
    m, h = self._metrics(s.data, s.target, s.predict, s.predict_proba, metric)
TypeError: cannot unpack non-iterable NoneType object
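
The log above suggests that the metrics lookup reported "Bad metrics type" and then returned `None`, which the caller tried to unpack into two values. A minimal sketch of a lookup that fails loudly instead follows. The registry `METRICS`, the function `compute_metrics` and the simplified `classification_metrics` are all hypothetical and do not reflect pbox's actual code.

```python
def classification_metrics(y_true, y_pred):
    """Hypothetical metric function returning (values, headers)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return [correct / len(y_true)], ["Accuracy"]


# Hypothetical registry mapping a metrics category to its function.
METRICS = {"classification": classification_metrics}


def compute_metrics(category, y_true, y_pred):
    """Look up the category; raise instead of logging and returning None."""
    try:
        func = METRICS[category]
    except KeyError:
        raise ValueError(f"Bad metrics type '{category}'") from None
    return func(y_true, y_pred)


values, headers = compute_metrics("classification", [0, 1, 1], [0, 1, 0])
print(headers, values)
```

With this shape, an unregistered category such as `'clustering'` stops at the lookup with a clear message rather than surfacing later as an unpacking error.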

dhondta commented 1 year ago

@smarbal Fixed with 2cebfccaff800c300788d3c3bdc5b75544ee890f.

smarbal commented 1 year ago

@dhondta The following issue now occurs (for clustering metrics):

┌──[user@packing-box]──[/mnt/share]──[main|+1]────────                                                             ────[172.18.0.3]──[11:13:14]────
$ model train fileless-upx-PE -a kmeans 
00:00:02.965 [INFO] Selected algorithm: K-Means clustering
00:00:02.966 [INFO] Reference dataset:  fileless-upx-PE(PE32,PE64)
00:00:02.967 [INFO] Loading features...
00:00:02.000 [INFO] Making pipeline...
00:00:02.006 [INFO] Training model...
00:00:02.006 [INFO] (step 1/1) Processing kmeans

Name: fileless-upx-PE_pe32-pe64_1909_kmeans_f126

Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 118, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 613, in train
    m, h = self._metrics(s.data, s.target, s.predict, s.predict_proba, metric)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 142, in _metrics
    values, headers = m(data, prediction, y_true=target, y_proba=proba, proctime=proctime, logger=self.logger)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/metrics.py", line 62, in _wrapper
    r = f(*a, **kw)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/metrics.py", line 172, in clustering_metrics
    yt, yp, _ = _map_values_to_integers(y_true, y_pred, **kw)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/metrics.py", line 119, in _map_values_to_integers
    tn, fp, fn, tp = confusion_matrix(*arrays[:2]).ravel()
ValueError: too many values to unpack (expected 4)
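
A likely cause, which the thread does not confirm, is that the predictions contain more than two distinct labels. A confusion matrix over n labels has n × n cells, so unpacking its flattened form into `tn, fp, fn, tp` only works for n = 2. The pure-Python sketch below reproduces that behaviour without scikit-learn. `confusion_counts` and `binary_counts` are hypothetical helpers, and the label values are made up.

```python
def confusion_counts(y_true, y_pred):
    """Flattened confusion matrix over the sorted union of labels.

    Rows are true labels, columns are predicted labels.
    """
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return [count for row in matrix for count in row]


def binary_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for exactly two labels, else None."""
    flat = confusion_counts(y_true, y_pred)
    if len(flat) != 4:
        return None
    return tuple(flat)


# Two true classes, but three cluster labels in the predictions:
print(len(confusion_counts([0, 1, 0, 1], [0, 1, 2, 2])))
print(binary_counts([0, 1, 0, 1], [0, 1, 2, 2]))
# Strictly binary case:
print(binary_counts([0, 1, 1, 0], [0, 1, 0, 0]))
```

If this is indeed the cause, the binary counts either need to be skipped for clustering output or the cluster labels need to be mapped onto the two true classes first.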
