`KeyError` while using `model train` with ELF

dhondta commented 2 years ago

For a dataset composed of ELF files, `model train` produces this error:

model train test-dataset -a bnb
00:00:02.732 [INFO] Selected algorithm: Bernoulli Naive Bayes
00:00:02.734 [INFO] Reference dataset:  test-dataset(ELF64)
00:00:02.735 [INFO] Computing features...
Traceback (most recent call last):
  File "/opt/tools/model", line 117, in <module>
    getattr(m, args.command)(**vars(args))
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 521, in train
    if not self._prepare(**kw):
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 220, in _prepare
    __parse(ds.files.listdir(is_executable), False)
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 203, in __parse
    self._features.update(exe.features)
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/executable.py", line 32, in features
    return {n: f.description for n, f in Features.registry[self.format].items()}
KeyError: None
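
The last frame indexes `Features.registry` with `self.format`, so a file whose format could not be determined ends up as a lookup with the key `None`. A minimal sketch of that failure and of a tolerant lookup follows. `REGISTRY` and `feature_descriptions` are hypothetical stand-ins, since the real contents of `Features.registry` are not shown in this thread:

```python
# Hypothetical stand-in for Features.registry:
# format name -> {feature name: description}.
REGISTRY = {
    "ELF64": {"entropy": "Shannon entropy of the file"},
}


def feature_descriptions(fmt):
    """Return the feature descriptions for a format, or {} if it is unknown."""
    # REGISTRY[fmt] raises KeyError when fmt is None; .get() avoids that.
    return dict(REGISTRY.get(fmt, {}))


# Reproduce the error from the traceback with a None key.
try:
    REGISTRY[None]
except KeyError as err:
    print("KeyError:", err)

# The tolerant lookup returns an empty mapping instead of raising.
print(feature_descriptions(None))
print(feature_descriptions("ELF64"))
```

Returning an empty mapping hides the bad file, which may or may not be desirable; raising a descriptive error is the other obvious option.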

dhondta commented 2 years ago

Hi @smarbal, I adapted the code to report this error more precisely. Please retry and post the traceback.

smarbal commented 2 years ago

Here's the new traceback:

model train test-dataset -a kmeans
00:00:02.276 [INFO] Selected algorithm: K-Means clustering
00:00:02.278 [INFO] Reference dataset:  test-dataset(ELF64)
00:00:02.279 [INFO] Computing features...
Traceback (most recent call last):
  File "/opt/tools/model", line 117, in <module>
    getattr(m, args.command)(**vars(args))
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 521, in train
    if not self._prepare(**kw):
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 220, in _prepare
    __parse(ds.files.listdir(is_executable), False)
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 203, in __parse
    self._features.update(exe.features)
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/executable.py", line 32, in features
    return {n: f.description for n, f in Features.registry[self.format].items()}
  File "/usr/lib/python3.8/functools.py", line 967, in __get__
    val = self.func(instance)
  File "/usr/local/lib/python3.8/dist-packages/pbox/common/executable.py", line 126, in format
    raise ValueError("'%s' has signature '%s' which is not supported" % (self, self.filetype))
ValueError: '/root/.packing-box/datasets/test-dataset/files/06d986b913b685936b365565b5204867aa2d388cdec3c5d5f9810561c31fb8f9' has signature 'POSIX shell script executable (binary data)' which is not supported

dhondta commented 2 years ago

For an unknown reason, it seems that a script was included when the dataset was generated. Its `format` attribute is `None`, which causes the executable-related computations to fail. I need to inspect the dataset generation workflow to prevent files whose `format` attribute is `None` from being added.
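
One way to keep such files out of a dataset is to check each candidate's header before adding it. The sketch below is an illustration under stated assumptions, not the actual packing-box code: `detect_format` and `supported_files` are hypothetical helpers, and only ELF is recognised here, using the standard ELF magic bytes and the `EI_CLASS` byte.

```python
from pathlib import Path

ELF_MAGIC = b"\x7fELF"
# Byte 4 of an ELF header (EI_CLASS): 1 = 32-bit, 2 = 64-bit.
ELF_CLASSES = {1: "ELF32", 2: "ELF64"}


def detect_format(path):
    """Return "ELF32" or "ELF64" for ELF files, or None for anything else."""
    with open(path, "rb") as f:
        header = f.read(5)
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None  # e.g. a shell script starting with "#!"
    return ELF_CLASSES.get(header[4])


def supported_files(folder):
    """Yield only the files in `folder` whose format could be identified."""
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and detect_format(path) is not None:
            yield path
```

With a filter like this applied when files are added, a shell script would be skipped up front rather than surfacing later during feature computation.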

dhondta commented 2 years ago

@smarbal You can try `dataset fix test-dataset` and then retry your command.

smarbal commented 2 years ago

@dhondta I ran `dataset fix test-dataset` and got this error after retrying the command:

# model train test-dataset -a kmeans
00:00:02.016 [INFO] Selected algorithm: K-Means clustering
00:00:02.017 [INFO] Reference dataset:  test-dataset(ELF64)
00:00:02.018 [INFO] Computing features...
00:00:03.087 [WARNING] Bad expression: checksum == 0
00:00:03.087 [ERROR] name 'checksum' is not defined
Traceback (most recent call last):
  File "/opt/tools/model", line 117, in <module>
    getattr(m, args.command)(**vars(args))
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 521, in train
    if not self._prepare(**kw):
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 220, in _prepare
    __parse(ds.files.listdir(is_executable), False)
  File "/usr/local/lib/python3.8/dist-packages/pbox/learning/model.py", line 203, in __parse
    self._features.update(exe.features)
TypeError: 'NoneType' object is not iterable

dataset : test-dataset 20 3MB yes ELF64 {11},upx{3},gzexe{1},midgetpack{1},upx-3.92{1},upx-3.94{1},upx-3.95{1},ward{1}

By creating a new dataset with only the UPX packer, I was able to train a model and didn't run into this issue.

dhondta commented 2 years ago

@smarbal OK, this is expected when `exe.format` is `None`. It means that, even after running the `fix` command, at least one non-executable file remains in your dataset. This part is fixed with e81b4eada878ae13aedc0a9e0199046238c95674. I have not yet found the cause in the dataset generation.
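
For reference, the `TypeError` is what `dict.update()` raises when it is handed `None` instead of a mapping. A minimal sketch of skipping such entries while merging features is given below. It uses plain dicts as hypothetical stand-ins for executables and is not the code from the commit above:

```python
def collect_features(executables):
    """Merge feature dicts, skipping entries whose format is None.

    Each item is a hypothetical stand-in for an executable:
    {"path": str, "format": str or None, "features": dict or None}.
    """
    features, skipped = {}, []
    for exe in executables:
        if exe["format"] is None:
            # Unsupported file (e.g. a shell script): remember it and move on.
            skipped.append(exe["path"])
            continue
        features.update(exe["features"])
    return features, skipped


# dict.update(None) is the call that fails in the traceback.
try:
    {}.update(None)
except TypeError as err:
    print("TypeError:", err)

samples = [
    {"path": "a", "format": "ELF64", "features": {"entropy": "Shannon entropy"}},
    {"path": "b", "format": None, "features": None},
]
print(collect_features(samples))
```

Returning the skipped paths alongside the features makes it possible to log which files were ignored.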

dhondta commented 2 years ago

@smarbal You can try `dataset fix test-dataset` again (after a `pbox-update`, of course) and then retry your command.

smarbal commented 2 years ago

@dhondta An error occurs on `dataset fix test-dataset`:

dataset fix test-dataset
Traceback (most recent call last):
  File "/opt/tools/dataset", line 149, in <module>
    getattr(ds, args.command)(**vars(args))
  File "/usr/local/lib/python3.8/dist-packages/pbox/common/utils.py", line 147, in _wrapper
    return f(s, *a, **kw)
  File "/usr/local/lib/python3.8/dist-packages/pbox/common/dataset.py", line 339, in fix
    if exe.format is None:  # unsupported or bad format (e.g. Bash script)
AttributeError: 'Path' object has no attribute 'format'

dhondta commented 2 years ago

@smarbal My bad. Once again, please.

smarbal commented 2 years ago

@dhondta Worked well, thanks!

packing-box / docker-packing-box #8