packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection
GNU General Public License v3.0

`KeyError` while using `model train` with ELF #8

Closed dhondta closed 2 years ago

dhondta commented 2 years ago

For a dataset composed of ELF files, `model train` produces this error:

model train test-dataset -a bnb
00:00:02.732 [INFO] Selected algorithm: Bernoulli Naive Bayes
00:00:02.734 [INFO] Reference dataset:  test-dataset(ELF64)
00:00:02.735 [INFO] Computing features...
Traceback (most recent call last):
  File "/opt/tools/model", line 117, in <module>
    getattr(m, args.command)(**vars(args))
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 521, in train
    if not self._prepare(**kw):
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 220, in _prepare
    __parse(ds.files.listdir(is_executable), False)
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 203, in __parse
    self._features.update(exe.features)
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/executable.py", line 32, in features
    return {n: f.description for n, f in Features.registry[self.format].items()}
KeyError: None

dhondta commented 2 years ago

Hi @smarbal, I adapted the code to report this error more precisely. Please retry and post the traceback.

smarbal commented 2 years ago

Here's the new traceback:

model train test-dataset -a kmeans
00:00:02.276 [INFO] Selected algorithm: K-Means clustering
00:00:02.278 [INFO] Reference dataset:  test-dataset(ELF64)
00:00:02.279 [INFO] Computing features...
Traceback (most recent call last):
  File "/opt/tools/model", line 117, in <module>
    getattr(m, args.command)(**vars(args))
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 521, in train
    if not self._prepare(**kw):
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 220, in _prepare
    __parse(ds.files.listdir(is_executable), False)
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 203, in __parse
    self._features.update(exe.features)
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/executable.py", line 32, in features
    return {n: f.description for n, f in Features.registry[self.format].items()}
  File "/usr/lib/python3.8/functools.py", line 967, in __get__
    val = self.func(instance)
  File "/usr/local/lib/python3.8/dist-packages/pbox/common/executable.py", line 126, in format
    raise ValueError("'%s' has signature '%s' which is not supported" % (self, self.filetype))
ValueError: '/root/.packing-box/datasets/test-dataset/files/06d986b913b685936b365565b5204867aa2d388cdec3c5d5f9810561c31fb8f9' has signature 'POSIX shell script executable (binary data)' which is not supported

dhondta commented 2 years ago

For an unknown reason, a script seems to have been included when the dataset was generated, which makes executable-related computations fail because its `format` attribute is `None`. I need to inspect the dataset generation workflow to prevent it from adding files whose `format` attribute is `None`.
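A minimal sketch of the kind of filtering described above, assuming the format is derived from a file signature string as the `ValueError` message suggests. The names `SIGNATURE_PREFIXES`, `detect_format` and `keep_supported` are hypothetical and do not come from the pbox code base; the ELF signature string is only an illustrative example.

```python
# Hypothetical sketch; none of these names come from the pbox code base.
from typing import Dict, List, Optional

# Assumed mapping from a signature prefix to a format name.
SIGNATURE_PREFIXES: Dict[str, str] = {
    "ELF 32-bit": "ELF32",
    "ELF 64-bit": "ELF64",
}


def detect_format(signature: str) -> Optional[str]:
    """Return the format name for a signature, or None if it is unsupported."""
    for prefix, fmt in SIGNATURE_PREFIXES.items():
        if signature.startswith(prefix):
            return fmt
    return None


def keep_supported(signatures: Dict[str, str]) -> List[str]:
    """Keep only the file names whose signature maps to a known format."""
    return [name for name, sig in signatures.items()
            if detect_format(sig) is not None]


candidates = {
    "sample-elf": "ELF 64-bit LSB executable, x86-64",
    "sample-script": "POSIX shell script executable (binary data)",
}
kept = keep_supported(candidates)
print(kept)
```

Filtering at generation time keeps unsupported files out of the dataset, so later steps such as feature computation never see an item without a format.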

dhondta commented 2 years ago

@smarbal You can try `dataset fix test-dataset` and retry your command.

smarbal commented 2 years ago

@dhondta I ran `dataset fix test-dataset` and hit this error after retrying the command:

# model train test-dataset -a kmeans
00:00:02.016 [INFO] Selected algorithm: K-Means clustering
00:00:02.017 [INFO] Reference dataset:  test-dataset(ELF64)
00:00:02.018 [INFO] Computing features...
00:00:03.087 [WARNING] Bad expression: checksum == 0
00:00:03.087 [ERROR] name 'checksum' is not defined
Traceback (most recent call last):
  File "/opt/tools/model", line 117, in <module>
    getattr(m, args.command)(**vars(args))
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 521, in train
    if not self._prepare(**kw):
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 220, in _prepare
    __parse(ds.files.listdir(is_executable), False)
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 203, in __parse
    self._features.update(exe.features)
TypeError: 'NoneType' object is not iterable

dataset : test-dataset 20 3MB yes ELF64 {11},upx{3},gzexe{1},midgetpack{1},upx-3.92{1},upx-3.94{1},upx-3.95{1},ward{1}

By creating a new dataset with only the UPX packer, I was able to train a model and didn't run into this issue.

dhondta commented 2 years ago

@smarbal OK, this is expected when `exe.format` is `None`. It means that, even after running the `fix` command, at least one non-executable file remains in your dataset. This part is fixed with e81b4eada878ae13aedc0a9e0199046238c95674. I did not spot the issue in the dataset generation, though.
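To illustrate why a missing format can surface as `TypeError: 'NoneType' object is not iterable`, here is a small standalone sketch. It assumes that the feature lookup yields `None` for an unsupported file; `REGISTRY` and `describe_features` are hypothetical stand-ins, not the pbox implementation, and the actual fix in the referenced commit may differ.

```python
# Hypothetical stand-ins; not the pbox implementation.
from typing import Dict, Optional

# Assumed registry of feature descriptions, keyed by format name.
REGISTRY: Dict[str, Dict[str, str]] = {
    "ELF64": {"entropy": "Entropy of the whole file"},
}


def describe_features(fmt: Optional[str]) -> Optional[Dict[str, str]]:
    """Return feature descriptions for a format, or None when it is unknown."""
    return REGISTRY.get(fmt)


collected: Dict[str, str] = {}

# Unguarded: dict.update(None) raises a TypeError.
error_message = ""
try:
    collected.update(describe_features(None))
except TypeError as exc:
    error_message = str(exc)

# Guarded: skip items without features instead of passing None to update().
skipped = 0
for fmt in ("ELF64", None):
    features = describe_features(fmt)
    if features is None:
        skipped += 1
        continue
    collected.update(features)

print(error_message, skipped, sorted(collected))
```

The guard only hides the symptom at training time; removing the unsupported file from the dataset, as `dataset fix` is meant to do, addresses the cause.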

dhondta commented 2 years ago

@smarbal You can try `dataset fix test-dataset` again (after a `pbox-update`, of course) and retry your command.

smarbal commented 2 years ago

@dhondta An error now occurs on `dataset fix test-dataset`:

dataset fix test-dataset
Traceback (most recent call last):
  File "/opt/tools/dataset", line 149, in <module>
    getattr(ds, args.command)(**vars(args))
  File "/usr/local/lib/python3.8/dist-packages/pbox/common/utils.py", line 147, in _wrapper
    return f(s, *a, **kw)
  File "/usr/local/lib/python3.8/dist-packages/pbox/common/dataset.py", line 339, in fix
    if exe.format is None:  # unsupported or bad format (e.g. Bash script)
AttributeError: 'Path' object has no attribute 'format'

dhondta commented 2 years ago

@smarbal My bad. Once again, please.

smarbal commented 2 years ago

@dhondta Worked well, thanks!