honeynet / cuckooml

CuckooML: Machine Learning for Cuckoo Sandbox
https://honeynet.github.io/cuckooml/

Sorting in clustering_results.csv #17

Closed ghost closed 7 years ago

ghost commented 7 years ago

Hi @So-Cool

The sorting issue in clustering_results.csv is as follows: the samples are listed as 1, 10..19, 2, 20.. [my samples end at 62], 7, 8, 9 rather than in numeric order.

I'm currently trying to create my own ground truth labels list, which means I will have to account for that sorting mistake when creating my own list. I'm wondering whether the ground truth labels generated by CuckooML are in sync with the clustering results, i.e. are they subject to the same bug or does it only affect the one list?

So-Cool commented 7 years ago

Hi @dueland, sorry for the delay; I've been quite busy recently.

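Because the sample ID is a string rather than an integer (i.e. '10' and not 10), all of the dataframes are sorted lexicographically, which gives the ordering that you have mentioned. So the ordering is consistent across the dataframes and not specific to that one file. It's not a bug per se, but you need to be really careful to match entries by sample ID rather than by position to avoid any kind of mismatching.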
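
To illustrate, here is a minimal sketch of the effect in plain Python. It is not taken from the CuckooML code: the ID range, the `my_labels` dictionary, and the `align_labels` helper are all hypothetical, and it only assumes that the IDs are strings, as described above.

```python
# Hypothetical sample IDs stored as strings: '1' .. '62'.
sample_ids = [str(i) for i in range(1, 63)]

# Lexicographic (string) order, i.e. the ordering described in this thread.
string_order = sorted(sample_ids)

# Numeric order, which a hand-made list would most likely follow.
numeric_order = sorted(sample_ids, key=int)

print(string_order[:4])   # ['1', '10', '11', '12']
print(string_order[-3:])  # ['7', '8', '9']

# Hypothetical hand-made ground truth, keyed by sample ID (not by position).
my_labels = {sid: "family_%d" % (int(sid) % 3) for sid in sample_ids}


def align_labels(labels_by_id, id_order):
    """Return labels in the given ID order; fail loudly if an ID has no label."""
    missing = [sid for sid in id_order if sid not in labels_by_id]
    if missing:
        raise KeyError("no label for sample IDs: %s" % missing)
    return [labels_by_id[sid] for sid in id_order]


# Labels reordered to follow the string ordering of the IDs.
aligned = align_labels(my_labels, string_order)
```

Keeping your own labels keyed by sample ID and reordering them explicitly, as above, should avoid the mismatch regardless of which ordering a given file uses.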