packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection
GNU General Public License v3.0

Dataset merge issue #146

Closed jramhani closed 3 weeks ago

jramhani commented 1 month ago

Merging issue

I tried to merge a baseline dataset with its altered version into a new mixed dataset. The resulting dataset contains samples from only one of the two source datasets, not both.

I suspect the cause is that the hash used as the filename is not updated after alteration, so the merge command sees conflicting names.

Datasets (87)

                 Name                   #Executables   Altered   Size    Files      Formats                     Packers                
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 

  upx_baseline                          500            -         121MB   yes     PE32,PE64        upx{500}                             
  original_move_ep                      500            100.00%   122MB   yes     PE32,PE64        upx{500}                             
  mixed_original_move_ep                500            -         121MB   yes     PE32,PE64        upx{500}                             

Command used:

dataset merge original_move_ep upx_baseline -n mixed_original_move_ep

dhondta commented 1 month ago

@jramhani The issue comes from a design choice: samples are named after their SHA256, whether they are cleanware or packed, and that name is not recomputed once a sample has been packed. So if you create a dataset of N cleanware samples, mass-pack those same samples and merge the result with the cleanware dataset, you will not get 2*N samples. You will get your original dataset, because packed samples do not update the cleanware ones that share their name (with the current behavior, the samples' metadata is not updated either).
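To make the collision concrete, here is a minimal sketch in Python. It is not the tool's actual code: `sample_name` and `merge` are hypothetical helpers, and datasets are modelled as plain dictionaries keyed by sample name, assuming the name is derived from the original file and that a merge keeps existing entries.

```python
import hashlib


def sample_name(original_bytes: bytes) -> str:
    """Hypothetical naming rule: SHA256 of the original (unpacked) content."""
    return hashlib.sha256(original_bytes).hexdigest()


def merge(*datasets: dict) -> dict:
    """Hypothetical merge: an existing sample name is kept, never updated."""
    merged = {}
    for dataset in datasets:
        for name, metadata in dataset.items():
            merged.setdefault(name, metadata)
    return merged


originals = [b"sample-a", b"sample-b", b"sample-c"]

# The cleanware dataset and its packed counterpart end up with identical names,
# because the name does not change when the sample is packed.
cleanware = {sample_name(o): {"label": None} for o in originals}
packed = {sample_name(o): {"label": "upx"} for o in originals}

mixed = merge(cleanware, packed)
print(len(mixed))  # N entries instead of 2*N
```

Every key of `packed` already exists in `cleanware`, so the merged result is just the first dataset.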

The normal way of working is to use separate datasets.

Workaround: You can ingest samples from the packed dataset with dataset ingest ... so that they are imported and renamed according to the SHA256 of their packed version, but you will need to provide the labels in a JSON file.
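As a rough illustration of preparing such a labels file, the sketch below hashes every file in a folder of packed samples and writes a JSON object mapping each SHA256 to a label. The layout of the JSON (hash to label) and the helper names `build_labels` and `write_labels` are assumptions made for this example; check the tool's documentation for the format that dataset ingest actually expects.

```python
import hashlib
import json
from pathlib import Path


def build_labels(folder: str, label: str) -> dict:
    """Map the SHA256 of each file in `folder` to `label` (assumed layout)."""
    labels = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            labels[digest] = label
    return labels


def write_labels(labels: dict, out_file: str) -> None:
    """Write the mapping as pretty-printed JSON."""
    Path(out_file).write_text(json.dumps(labels, indent=2))
```

For example, `write_labels(build_labels("packed_samples", "upx"), "labels.json")` would produce one entry per packed file, where `packed_samples` and `labels.json` are placeholder paths.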