packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection
GNU General Public License v3.0

`KeyError` in `dataset update -l labels.json` #111

Closed AlexVanMechelen closed 7 months ago

AlexVanMechelen commented 7 months ago

Issue

When expanding a dataset with the `dataset update` tool and specifying the samples' labels with `-l path/to/labels.json`, a `KeyError` occurs when the tool encounters a sample that is already part of the dataset. No error occurs when no labels are specified.

Reproduce

dataset update tmp -s dataset-packed-pe/not-packed -n 400 -l ./dataset-packed-pe/labels.json
dataset update tmp -s dataset-packed-pe/not-packed -n 400 -l ./dataset-packed-pe/labels.json

Traceback

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.11/site-packages/pbox/core/dataset/__init__.py", line 82, in __getitem__
    row = self._data[self._data.hash == h].iloc[0]
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
  File "/home/user/.local/lib/python3.11/site-packages/pandas/core/indexing.py", line 1191, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pandas/core/indexing.py", line 1752, in _getitem_axis
    self._validate_integer(key, axis)
  File "/home/user/.local/lib/python3.11/site-packages/pandas/core/indexing.py", line 1685, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/.opt/tools/dataset", line 206, in <module>
    getattr(ds, args.command)(**vars(args))
  File "/home/user/.local/lib/python3.11/site-packages/pbox/core/dataset/__init__.py", line 22, in _wrapper
    return f(s, *a, **kw)
           ^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pbox/core/dataset/__init__.py", line 820, in update
    _update(exe)
  File "/home/user/.local/lib/python3.11/site-packages/pbox/core/dataset/__init__.py", line 798, in _update
    self[h] = (self._compute_features(e), True)  # True: force updating the row
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pbox/core/dataset/__init__.py", line 177, in _compute_features
    d = self[exe.basename, True]  # retrieve executable's record as a dictionary
        ~~~~^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pbox/core/dataset/__init__.py", line 85, in __getitem__
    raise KeyError(h)
KeyError: 'shmnview.exe'
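
For readers following the traceback, here is a minimal sketch of the lookup pattern that produces this pair of chained exceptions. It only mirrors the two lines visible at `__getitem__` (the filtered `.iloc[0]` and the `raise KeyError(h)`); the `TinyDataset` class, its columns and the sample rows are hypothetical, and the real `pbox` implementation is more involved (for instance, it also accepts a tuple key, as the `_compute_features` frame shows).

```python
import pandas as pd


class TinyDataset:
    """Hypothetical stand-in for the dataset class; not the pbox implementation."""

    def __init__(self, rows):
        # Hypothetical schema: one row per executable, keyed by a "hash" column
        self._data = pd.DataFrame(rows, columns=["hash", "label"])

    def __getitem__(self, h):
        try:
            # The boolean filter yields an empty frame when no row matches,
            # and .iloc[0] on an empty frame raises IndexError
            row = self._data[self._data["hash"] == h].iloc[0]
        except IndexError:
            # Raising inside the except block chains both exceptions, hence
            # "During handling of the above exception, another exception occurred"
            raise KeyError(h)
        return row.to_dict()


ds = TinyDataset([{"hash": "abc123", "label": "upx"}])

found = ds["abc123"]  # a matching row exists, so a dict is returned

try:
    ds["shmnview.exe"]  # no row matches this key
    missing = None
except KeyError as exc:
    missing = exc.args[0]
```

In this sketch the `KeyError` simply means that the filter matched no row for the given key. It does not say why the record was absent in the original `tmp` dataset.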

dhondta commented 7 months ago

@AlexVanMechelen I do not get this error. Are you sure you did not perform any other action on the tmp dataset that may have corrupted it?

AlexVanMechelen commented 7 months ago

@dhondta I retried and no longer get the error either. My bash history doesn't go back as far as the creation of this temporary dataset, and the dataset itself is gone, so I can't retrace the context of the issue. I tried several variations of updating datasets with labeled and unlabeled samples, including duplicates, but couldn't reproduce the issue. I propose to close it for now, and I'll open a new one if I encounter this again.