iosifache / DikeDataset

Dataset with labeled benign and malicious files 🗃️
MIT License

What should I do from labeling step 3? #1

Closed: recsater closed this issue 1 year ago

recsater commented 2 years ago

To compute the membership on each malware family, a transformer was developed (see the observation above) to "vote" for each available family. For example, if an antivirus engine tag was Trj, then one vote for the trojan family was offered. All tags were consumed in this way and the votes for all families were normalized.

(see the observation above)

https://github.com/iosifache/dike/blob/main/codebase/scripts/continuous_vt_scan.py

I followed this link, but I could not work out how to proceed from labeling step 3.

What should I do from labeling step 3?

From: [image]

To: [image]

iosifache commented 2 years ago

Hi, @recsater. The script only dumps the raw data from Google Cloud Storage into a CSV file. After completing the scanning step, you need to create your own labeling strategy or adapt dike's. You can check dike's implementation in the update_malware_labels function from the dataset module. There, the votes and tags are processed to obtain the malice and the membership in each family.
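
For readers who want a concrete picture of the voting idea quoted from the README, here is a minimal sketch. It is not dike's update_malware_labels implementation: the tag-to-family mapping, the family list, and the function name family_membership are hypothetical, and normalization is assumed to mean dividing each family's votes by the total number of votes.

```python
from collections import Counter

# Hypothetical mapping from lowercase antivirus tags to malware families.
# The real tags and families come from dike's configuration, not from here.
TAG_TO_FAMILY = {
    "trj": "trojan",
    "trojan": "trojan",
    "worm": "worm",
    "ransom": "ransomware",
}

# Hypothetical list of families for which a membership is reported.
FAMILIES = ["trojan", "worm", "ransomware"]


def family_membership(tags):
    """Turn antivirus engine tags into normalized per-family votes."""
    votes = Counter()
    for tag in tags:
        family = TAG_TO_FAMILY.get(tag.lower())
        if family is not None:
            # Each recognized tag gives one vote to its family.
            votes[family] += 1

    total = sum(votes.values())
    if total == 0:
        # No recognized tag: no membership in any family.
        return {family: 0.0 for family in FAMILIES}

    # Assumed normalization: memberships sum to 1 across families.
    return {family: votes[family] / total for family in FAMILIES}


# "Generic" is not in the mapping, so it casts no vote.
membership = family_membership(["Trj", "Trojan", "Worm", "Generic"])
print(membership)
```

With these assumed tags, two of the three recognized votes go to the trojan family and one to the worm family, so the memberships are two thirds, one third, and zero.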

recsater commented 2 years ago

First of all, thank you for your reply.

As an additional question, I would like to know the exact constants that were used to create the DikeDataset labels, because I'm working on a project that classifies malicious code using the labels (malware.csv, benign.csv) from DikeDataset.

To do that, could you tell me the following values from the DataFolderScanner class?

self._malware_families, self._malicious_benign_votes_ratio, self._min_ignored_percent

These are defined as shown here: [image]

I am sorry for my bad English. Thank you.

iosifache commented 2 years ago

dike uses a YAML configuration file that contains all the configurable aspects of how it works. You can find the values you mentioned in the dataset section of the configuration.yaml file.

And I'm glad to hear that these repositories are useful! Please let me know if you have any other questions. I'm happy to help.