packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection
GNU General Public License v3.0

ValueError while using `model train` #2

lagitannerie closed this issue 2 years ago

lagitannerie commented 2 years ago
# dataset -v make pe-upx-dataset -f PE -a -p upx -n 400 -s /mnt/share/dataset-packed-pe/not-packed
...
# model train pe-upx-dataset --algorithm dt

00:00:01.818 [INFO] Selected algorithm: Decision Tree
00:00:01.819 [INFO] Reference dataset:  pe-upx-dataset(PE32)
00:00:01.820 [INFO] Computing features...
00:00:54.732 [INFO] Making pipeline...
Traceback (most recent call last):
  File "/opt/tools/model", line 117, in <module>
    getattr(m, args.command)(**vars(args))
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 511, in train
    if not self._prepare(**kw):
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 295, in _prepare
    train_test_split(self._data, self._target, test_size=.2, random_state=42, stratify=self._target)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_split.py", line 2430, in train_test_split
    arrays = indexable(*arrays)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 433, in indexable
    check_consistent_length(*result)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 387, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [409, 400]
dhondta commented 2 years ago

Hi @lagitannerie! Thank you for reporting this. I took the liberty of splitting your issue in two, as the errors you got pertain to different things.

dhondta commented 2 years ago

Problem: This comes from a bug in the dataset generation that I have not fixed yet. When packing fails for a file, its copy in the `files` folder of the dataset's folder structure remains, although it should have been removed. `model train` relies on both the dataset's data.csv and the content of the `files` folder for computing features, so it finds two different counts: data.csv is not updated when packing fails, but the file is still present in `files`. In your example, you got 9 errors with the UPX packer while 400 samples were retained, hence 409 executables copied in the dataset's `files` folder.

Workaround: Use `dataset fix pe-upx-dataset` to fix your dataset.
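To confirm that a dataset is affected before training, the two counts can be compared by hand. The snippet below is only a rough sketch and not part of the toolkit: the function name `count_samples` is hypothetical, and it assumes that data.csv starts with a single header row and that the `files` folder contains one regular file per executable.

```python
import csv
from pathlib import Path


def count_samples(dataset_dir, csv_name="data.csv", files_name="files"):
    """Return (rows in data.csv, files in the files folder) for a dataset folder.

    Hypothetical helper: assumes data.csv has exactly one header row and that
    every regular file in the files folder is one sample.
    """
    root = Path(dataset_dir)
    with open(root / csv_name, newline="") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the (assumed) header row
        n_rows = sum(1 for row in reader if row)  # ignore blank lines
    n_files = sum(1 for p in (root / files_name).iterdir() if p.is_file())
    return n_rows, n_files


def is_consistent(dataset_dir):
    """True when data.csv and the files folder describe the same number of samples."""
    n_rows, n_files = count_samples(dataset_dir)
    return n_rows == n_files
```

With the dataset from this issue, such a check would be expected to report 400 rows against 409 files, which matches the `[409, 400]` in the ValueError.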

I will try to troubleshoot this issue very soon.