Closed by smarbal 1 year ago
Hi @smarbal ! Support for extended performance metrics was just introduced with e8b4f42122155325f67b2925afafd0f65f0495ab. You can start adding clustering metrics from here.
@smarbal kind reminder...
@smarbal 304dd5d00455b2e510428999f1a571225e024605 introduced support for using several metrics categories with algorithms. Model training and testing have been adapted accordingly. Can you please test this on your current use cases and provide feedback?
@dhondta Following #43, I now stumble on the following error when training an unsupervised model :
┌──[user@packing-box]──[/mnt/share]──[main|+2]──────── ────[172.18.0.3]──[15:33:32]────
$ model train upx-PE -a kmeans
00:00:02.582 [INFO] Selected algorithm: K-Means clustering
00:00:02.583 [INFO] Reference dataset: upx-PE(PE32,PE64)
00:00:02.584 [INFO] Computing features...
00:08:55.270 [INFO] Making pipeline...
00:08:55.275 [INFO] Training model...
00:08:55.275 [INFO] (step 1/1) Processing kmeans
Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 118, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 595, in train
    self._train.predict_proba = self.pipeline.predict_proba(self._train.data)[:, 1]
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 70, in __getattribute__
    return object.__getattribute__(object.__getattribute__(self, "pipeline"), name)
  File "/home/user/.local/lib/python3.10/site-packages/sklearn/utils/_available_if.py", line 32, in __get__
    if not self.check(obj):
  File "/home/user/.local/lib/python3.10/site-packages/sklearn/pipeline.py", line 47, in check
    getattr(self._final_estimator, attr)
AttributeError: 'KMeans' object has no attribute 'predict_proba'
@smarbal f4f145a61553f68334702b7648a130e64c7a6fce should fix this (some algorithms do not support the predict_proba method).
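For readers following along, here is a minimal sketch of the kind of guard that avoids this AttributeError. It is not the code from the commit above: the helper name positive_class_proba and the two stub classes are hypothetical, and it only assumes that an estimator without probability support simply lacks a predict_proba attribute, as the traceback shows for KMeans.

```python
def positive_class_proba(estimator, data):
    """Return P(class 1) per sample, or None if predict_proba is unavailable."""
    # getattr with a default swallows the AttributeError seen in the traceback
    predict_proba = getattr(estimator, "predict_proba", None)
    if predict_proba is None:
        return None
    return [row[1] for row in predict_proba(data)]


class StubClassifier:
    """Hypothetical stand-in for an estimator that exposes predict_proba."""

    def predict_proba(self, data):
        return [[0.25, 0.75] for _ in data]


class StubClusterer:
    """Hypothetical stand-in for a KMeans-like estimator (predict only)."""

    def predict(self, data):
        return [0 for _ in data]


print(positive_class_proba(StubClassifier(), [[1.0], [2.0]]))  # [0.75, 0.75]
print(positive_class_proba(StubClusterer(), [[1.0], [2.0]]))   # None
```

Callers then have to accept None for the probabilities, which is where the later metrics code needs to be tolerant as well.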
@dhondta The following error occurs :
┌──[user@packing-box]──[/mnt/share]──[main|✓]──────── ────[172.18.0.3]──[17:42:31]────
$ model train upx-PE -a kmeans
00:00:01.608 [INFO] Selected algorithm: K-Means clustering
00:00:01.609 [INFO] Reference dataset: upx-PE(PE32,PE64)
00:00:01.610 [INFO] Computing features...
00:08:36.536 [INFO] Making pipeline...
00:08:36.540 [INFO] Training model...
00:08:36.540 [INFO] (step 1/1) Processing kmeans
Name: upx-PE_pe32-pe64_1909_kmeans_f126
Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 118, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 611, in train
    if len(s) > 0:
TypeError: object of type 'Dummy' has no len()
@smarbal Stupid mistake of mine, I forgot to include the data attribute. This is fixed in 542e044d4e457765d134998a6a290cd3c4359564.
@dhondta With classification metrics, the following error occurs :
┌──[user@packing-box]──[/mnt/share]──[main|✓]──────── ────[172.18.0.3]──[18:25:33]────
$ model train upx-PE -a kmeans
00:00:01.146 [INFO] Selected algorithm: K-Means clustering
00:00:01.147 [INFO] Reference dataset: upx-PE(PE32,PE64)
00:00:01.148 [INFO] Computing features...
00:09:04.796 [INFO] Making pipeline...
00:09:04.801 [INFO] Training model...
00:09:04.801 [INFO] (step 1/1) Processing kmeans
Name: upx-PE_pe32-pe64_1909_kmeans_f126
Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 118, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 612, in train
    m, h = self._metrics(s.data, s.target, s.predict, s.predict_proba, metric)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 142, in _metrics
    values, headers = m(data, prediction, y_true=target, y_proba=proba, proctime=proctime, logger=self.logger)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/metrics.py", line 61, in _wrapper
    r = f(*a, **kw)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/metrics.py", line 148, in classification_metrics
    yt, yp, ypr, d = _map_values_to_integers(y_true, y_pred, y_proba, **kw)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/metrics.py", line 118, in _map_values_to_integers
    tn, fp, fn, tp = confusion_matrix(*arrays[:2]).ravel()
ValueError: too many values to unpack (expected 4)
I tried with only clustering metrics and the following error occurred :
model train fileless-upx-PE -a kmeans
00:00:01.458 [INFO] Selected algorithm: K-Means clustering
00:00:01.459 [INFO] Reference dataset: fileless-upx-PE(PE32,PE64)
00:00:01.460 [INFO] Loading features...
00:00:02.489 [INFO] Making pipeline...
00:00:02.495 [INFO] Training model...
00:00:02.495 [INFO] (step 1/1) Processing kmeans
Name: fileless-upx-PE_pe32-pe64_1909_kmeans_f126
00:00:03.661 [ERROR] Bad metrics type 'clustering'
Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 118, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 612, in train
    m, h = self._metrics(s.data, s.target, s.predict, s.predict_proba, metric)
TypeError: cannot unpack non-iterable NoneType object
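The log and traceback above suggest that the metrics lookup logged "Bad metrics type" and then returned None, which the caller tried to unpack into two values. A minimal sketch of one way to make that failure explicit is shown below; the registry METRICS, the function compute_metrics and the sample values are all hypothetical and do not reflect the actual pbox code.

```python
# Hypothetical registry mapping a metrics category to a handler that
# returns a (values, headers) pair.
METRICS = {
    "classification": lambda: ([0.9], ["Accuracy"]),
}


def compute_metrics(category):
    """Return (values, headers) or raise instead of silently returning None."""
    try:
        handler = METRICS[category]
    except KeyError:
        # Raising here keeps the caller from hitting
        # "cannot unpack non-iterable NoneType object" later on
        raise ValueError(f"Bad metrics type {category!r}") from None
    return handler()


values, headers = compute_metrics("classification")
print(values, headers)
```

With this shape, an unknown category such as "clustering" fails at the lookup with a single clear error rather than at the unpacking site.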
@smarbal Fixed with 2cebfccaff800c300788d3c3bdc5b75544ee890f.
@dhondta The following issue happens now (for clustering metrics) :
┌──[user@packing-box]──[/mnt/share]──[main|+1]──────── ────[172.18.0.3]──[11:13:14]────
$ model train fileless-upx-PE -a kmeans
00:00:02.965 [INFO] Selected algorithm: K-Means clustering
00:00:02.966 [INFO] Reference dataset: fileless-upx-PE(PE32,PE64)
00:00:02.967 [INFO] Loading features...
00:00:02.000 [INFO] Making pipeline...
00:00:02.006 [INFO] Training model...
00:00:02.006 [INFO] (step 1/1) Processing kmeans
Name: fileless-upx-PE_pe32-pe64_1909_kmeans_f126
Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 118, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 613, in train
    m, h = self._metrics(s.data, s.target, s.predict, s.predict_proba, metric)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 142, in _metrics
    values, headers = m(data, prediction, y_true=target, y_proba=proba, proctime=proctime, logger=self.logger)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/metrics.py", line 62, in _wrapper
    r = f(*a, **kw)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/metrics.py", line 172, in clustering_metrics
    yt, yp, _ = _map_values_to_integers(y_true, y_pred, **kw)
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/metrics.py", line 119, in _map_values_to_integers
    tn, fp, fn, tp = confusion_matrix(*arrays[:2]).ravel()
ValueError: too many values to unpack (expected 4)
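A likely cause, offered as an assumption rather than a confirmed diagnosis: unpacking a raveled confusion matrix into four values only works for exactly two labels, while K-Means returns one id per cluster (scikit-learn defaults to 8 clusters), so the matrix is larger than 2x2. The sketch below reproduces the idea with a pure-Python stand-in for the confusion matrix; the names confusion_matrix_rows and binary_counts are hypothetical.

```python
def confusion_matrix_rows(y_true, y_pred):
    """Stand-in confusion matrix: rows are true labels, columns are predictions.

    Labels are the sorted union of both inputs, so extra cluster ids in
    y_pred enlarge the matrix.
    """
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def binary_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp), or None when the labels are not binary."""
    matrix = confusion_matrix_rows(y_true, y_pred)
    if len(matrix) != 2:
        return None  # tn/fp/fn/tp are undefined outside the 2x2 case
    (tn, fp), (fn, tp) = matrix
    return tn, fp, fn, tp


print(binary_counts([0, 1, 1, 0], [0, 1, 1, 1]))  # (1, 1, 0, 2)
print(binary_counts([0, 1, 1, 0], [3, 7, 7, 5]))  # None
```

The second call mimics cluster ids that do not line up with the two ground-truth labels, which is the situation a clustering run would produce.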
Improvement suggestion
As discussed, in order to have a metrics method for unsupervised models, it would be necessary to move the method from utils.py to learning.py.