packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection
GNU General Public License v3.0

`ValueError` for `train_test_split()` using unsupervised model #41

Closed smarbal closed 1 year ago

smarbal commented 1 year ago

When training an unsupervised model, the following error occurs:

$ model train upx-merged -a kmeans 
00:00:03.540 [INFO] Selected algorithm: K-Means clustering
00:00:03.542 [INFO] Reference dataset:  upx-non(PE32,PE64)
00:00:03.543 [INFO] Computing features...
00:17:51.300 [INFO] Making pipeline...
Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 116, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 519, in train
    if not self._prepare(**kw):
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 296, in _prepare
    train_test_split(self._data, self._target, test_size=tsize, random_state=42, stratify=self._target)
  File "/home/user/.local/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 2448, in train_test_split
    n_train, n_test = _validate_shuffle_split(
  File "/home/user/.local/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 2071, in _validate_shuffle_split
    raise ValueError(
ValueError: test_size=0 should be either positive and smaller than the number of samples 1909 or a float in the (0, 1) range
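
For readers hitting the same traceback: `train_test_split()` rejects `test_size=0`, so a caller that wants to keep every sample for training (as is common with unsupervised algorithms) has to skip the split. The snippet below is a minimal sketch of that idea, not the change made in the repository. The helper name `split_or_keep_all` and the toy arrays are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_or_keep_all(data, target, test_size, seed=42):
    """Hypothetical helper: skip the split when no test set is requested."""
    if not test_size:
        # Keep all samples for training and return empty test partitions.
        return data, data[:0], target, target[:0]
    return train_test_split(data, target, test_size=test_size,
                            random_state=seed, stratify=target)


# Toy data standing in for the feature matrix and labels.
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0, 1] * 10)

# Calling train_test_split() directly with test_size=0 raises ValueError.
raised = False
try:
    train_test_split(X, y, test_size=0, random_state=42, stratify=y)
except ValueError:
    raised = True

# The guarded version returns all 20 samples as training data.
X_train, X_test, y_train, y_test = split_or_keep_all(X, y, test_size=0)

# A non-zero test_size still goes through the regular stratified split.
Xs_train, Xs_test, ys_train, ys_test = split_or_keep_all(X, y, test_size=0.25)
```

Whether an empty test set is acceptable downstream depends on how the rest of the training code uses it, so this guard is only one possible design.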

dhondta commented 1 year ago

@smarbal This should be solved by f7dabd7d82455d4028adc2d781056be345f36031. Please test.

smarbal commented 1 year ago

The following error now occurs:

┌──[user@packing-box]──[/mnt/share]──[main|✓]──[✘ INT]────────                                                                           ────[172.17.0.4]──[19:38:41]────
$ model train upx-PE -a kmeans 
00:00:03.400 [INFO] Selected algorithm: K-Means clustering
00:00:03.401 [INFO] Reference dataset:  upx-PE(PE32,PE64)
00:00:03.403 [INFO] Computing features...
00:00:59.784 [INFO] Making pipeline...
00:00:59.787 [INFO] Training model...
Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 118, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 588, in train
    self.pipeline.fit(self._train.data, self._train.target.values.ravel())
AttributeError: 'numpy.ndarray' object has no attribute 'values'

Removing both `to_numpy()` calls on line 295 of `model.py` seems to fix the issue.
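
To illustrate why the `to_numpy()` calls trigger this error: `.values` is a pandas attribute, so it disappears once the object has been converted to a NumPy array. The snippet below is a minimal sketch with made-up data. The helper `flat_labels` is hypothetical and is not part of `pbox`; it only shows one way to accept either type.

```python
import numpy as np
import pandas as pd

# Toy target column standing in for the dataset labels.
target_df = pd.DataFrame({"label": [0, 1, 1, 0]})

# On a DataFrame, .values exists and ravel() flattens it to 1-D.
flat_from_df = target_df.values.ravel()

# After to_numpy(), the object is an ndarray and has no .values attribute.
target_np = target_df.to_numpy()
message = ""
try:
    target_np.values
except AttributeError as exc:
    message = str(exc)


def flat_labels(target):
    """Hypothetical helper: flatten labels from a DataFrame or an ndarray."""
    return np.asarray(target).ravel()


# Both input types yield the same 1-D label array.
same = np.array_equal(flat_labels(target_df), flat_labels(target_np))
```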

dhondta commented 1 year ago

@smarbal My bad, I thought the variables were of type numpy.array. I will fix this ASAP.

dhondta commented 1 year ago

@smarbal 304dd5d00455b2e510428999f1a571225e024605 should fix this. Please test.

smarbal commented 1 year ago

@dhondta Works as intended, thanks.