packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection
GNU General Public License v3.0

Inconsistent metadata while making dataset #9

Closed dhondta closed 1 year ago

dhondta commented 2 years ago

Steps to reproduce:

  1. dataset make test-upx -n 5 -f PE -p upx
  2. dataset make test-upx -n 20 -f PE -p upx
  3. dataset show test-upx

Issue: Although every sample is labelled, the Labelled value shown is not 100%. Inspecting the dataset's metadata.json also shows that the total is not 25, the count expected after making 5 and then 20 samples.

dhondta commented 2 years ago

The issue seems to come from pandas, which is used when saving a dataset:

>>> import pandas as pd
>>> data = pd.read_csv("/root/.packing-box/datasets/test-upx/data.csv", sep=";", parse_dates=['ctime', 'mtime'])
>>> data.label.value_counts().to_dict()
{'upx': 13}

When the dataset is saved after making new samples, the counts in the metadata include a "" (empty string) key whose count is inconsistent.
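
For reference, here is a minimal sketch of one pandas behaviour that could produce this kind of mismatch. It is an assumption, not a confirmed diagnosis of this bug. By default, read_csv parses empty fields as NaN, and value_counts() leaves NaN out, so the per-label counts can add up to less than the number of rows. The CSV content and the hash column below are hypothetical stand-ins for the real data.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv: 5 rows, 2 of them with an empty label.
CSV = "hash;label\na;upx\nb;\nc;upx\nd;\ne;upx\n"

# Default parsing: empty fields become NaN, and value_counts() skips NaN.
df = pd.read_csv(io.StringIO(CSV), sep=";")
default_counts = df.label.value_counts().to_dict()

# Counting NaN explicitly makes the per-label counts add up to the row count.
full_counts = df.label.value_counts(dropna=False).to_dict()

# keep_default_na=False keeps empty fields as "" instead of converting to NaN.
raw = pd.read_csv(io.StringIO(CSV), sep=";", keep_default_na=False)
raw_counts = raw.label.value_counts().to_dict()

print(default_counts)                        # labelled rows only
print(sum(full_counts.values()), len(df))    # totals match once NaN is counted
print(raw_counts)                            # includes a "" key
```

If this is the mechanism at play, comparing sum(counts.values()) against len(df) before writing the metadata would be a cheap consistency check.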

dhondta commented 1 year ago

This issue was fixed with a previous commit.