packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection
GNU General Public License v3.0

`KeyError` using modified features set #44

Closed smarbal closed 1 year ago

smarbal commented 1 year ago

Description

The issue occurs when training a model on a file-less dataset that was created with a reduced feature set. The box is up to date, and the features configuration file is in the same state as when the dataset was made.

Traceback

┌──[user@packing-box]──[/mnt/share]──[main|+2]──[172.17.0.4]──[13:41:20]────
$ model train fs-ds-PE -a kmeans
00:00:03.157 [INFO] Selected algorithm: K-Means clustering
00:00:03.165 [INFO] Reference dataset:  fs-ds-PE(PE32,PE64)
00:00:03.167 [INFO] Loading features...
Traceback (most recent call last):
  File "/home/user/.opt/tools/model", line 118, in <module>
    getattr(name, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 527, in train
    if not self._prepare(**kw):
  File "/home/user/.local/lib/python3.10/site-packages/pbox/learning/model.py", line 213, in _prepare
    self._data = ds._data[list(exe.features.keys())]
  File "/home/user/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 3811, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "/home/user/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6113, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/home/user/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6176, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['byte_0_after_ep', 'byte_1_after_ep', 'byte_2_after_ep', 'byte_3_after_ep', 'byte_4_after_ep', 'byte_5_after_ep', 'byte_6_after_ep', 'byte_7_after_ep', 'byte_8_after_ep', 'byte_9_after_ep', 'byte_10_after_ep', 'byte_11_after_ep', 'byte_12_after_ep', 'byte_13_after_ep', 'byte_14_after_ep', 'byte_15_after_ep', 'byte_16_after_ep', 'byte_17_after_ep', 'byte_18_after_ep', 'byte_19_after_ep', 'byte_20_after_ep', 'byte_21_after_ep', 'byte_22_after_ep', 'byte_23_after_ep', 'byte_24_after_ep', 'byte_25_after_ep', 'byte_26_after_ep', 'byte_27_after_ep', 'byte_28_after_ep', 'byte_29_after_ep', 'byte_30_after_ep', 'byte_31_after_ep', 'byte_32_after_ep', 'byte_33_after_ep', 'byte_34_after_ep', 'byte_35_after_ep', 'byte_36_after_ep', 'byte_37_after_ep', 'byte_38_after_ep', 'byte_39_after_ep', 'byte_40_after_ep', 'byte_41_after_ep', 'byte_42_after_ep', 'byte_43_after_ep', 'byte_44_after_ep', 'byte_45_after_ep', 'byte_46_after_ep', 'byte_47_after_ep', 'byte_48_after_ep', 'byte_49_after_ep', 'byte_50_after_ep', 'byte_51_after_ep', 'byte_52_after_ep', 'byte_53_after_ep', 'byte_54_after_ep', 'byte_55_after_ep', 'byte_56_after_ep', 'byte_57_after_ep', 'byte_58_after_ep', 'byte_59_after_ep', 'byte_60_after_ep', 'byte_61_after_ep', 'byte_62_after_ep', 'byte_63_after_ep'] not in index"
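
For readers unfamiliar with this pandas behaviour, the sketch below reproduces the same kind of `KeyError` on a toy frame and lists which requested columns are absent. The column values, the `expected` list, and the `missing_columns` helper are hypothetical stand-ins for illustration, not code from packing-box.

```python
import pandas as pd

# Hypothetical stand-in for the dataset's stored features: only two columns exist.
data = pd.DataFrame({"entropy": [7.2, 4.1], "number_sections": [3, 5]})

# Hypothetical list of features requested at training time;
# the last two are absent from the frame above.
expected = ["entropy", "byte_0_after_ep", "byte_1_after_ep"]


def missing_columns(frame, wanted):
    """Return the wanted column names that are absent from the frame, in order."""
    present = set(frame.columns)
    return [name for name in wanted if name not in present]


error_message = None
try:
    # Selecting with a list of labels requires every label to be a column.
    data[expected]
except KeyError as err:
    error_message = str(err)

# Checking up front names the mismatch without a long traceback.
missing = missing_columns(data, expected)
print(missing)
```

Running such a check against the real dataset just before the failing selection in `_prepare` would show whether the 64 `byte_N_after_ep` columns are the only ones missing.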

dhondta commented 1 year ago

@smarbal Please share your data.csv. I suspect the data is corrupted in some way.

smarbal commented 1 year ago

The dataset and its features.yml file are here: https://github.com/packing-box/experiments-unsupervised-learning/tree/main/datasets/reduced-bytes-after-EP-features

Today I was able to create a dataset and train a model on it with a reduced feature set by completely removing the features from the configuration file. So this might be a problem with how `keep: False` is processed. I've left the keyword in the linked configuration file so the error is reproducible.
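
To make the suspicion concrete, here is a minimal sketch of how such a mismatch could arise if dataset creation honoured the `keep` flag while training ignored it. Everything in it is an assumption for illustration: the `feature_defs` dictionary layout, the `kept_features` helper, and the two code paths are hypothetical and do not reflect packing-box's actual schema or implementation.

```python
# Hypothetical feature definitions, loosely modelled on a YAML file in which
# `keep: False` marks a feature to exclude. This is NOT packing-box's real schema.
feature_defs = {
    "entropy": {"description": "file entropy"},
    "byte_0_after_ep": {"description": "first byte after entry point", "keep": False},
    "byte_1_after_ep": {"description": "second byte after entry point", "keep": False},
}


def kept_features(defs):
    """Return the names of features whose `keep` flag is not False (default: kept)."""
    return [name for name, spec in defs.items() if spec.get("keep", True)]


# Assumed behaviour at dataset creation: excluded features are never stored.
dataset_columns = kept_features(feature_defs)

# Assumed behaviour at training time if the flag were ignored: every name is requested.
model_features = list(feature_defs)

# Names requested by the model but absent from the dataset; these would trigger
# the KeyError when used to select columns.
unexpected = [name for name in model_features if name not in dataset_columns]
print(unexpected)
```

If this is what happened, applying the same filter on both sides would make the two lists agree, which matches the observation that deleting the entries outright avoids the error.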

dhondta commented 1 year ago

@smarbal Do you still experience the same issue?

dhondta commented 1 year ago

I can no longer reproduce the issue. It may have been fixed by an earlier commit.