sophos / SOREL-20M

Sophos-ReversingLabs 20 million sample dataset
Apache License 2.0

is_malware field meaning #19

Open edvinhallvaxhiu opened 2 years ago

edvinhallvaxhiu commented 2 years ago

Hello! The README states that no benign samples are included in the dataset. While exploring the meta.db at s3://sorel-20m/09-DEC-2020/processed-data/meta.db, I noticed that the database contains a field "is_malware". For almost 50% of the dataset the value is set to 0. Could you provide some more information on how to read this field?

Thank you!

gxenos commented 2 years ago

The dataset contains features for both benign and malware samples (50% benign, 50% malware), which is why is_malware is 0 for about half of the rows. The authors have also published the actual executables of all malware samples in the dataset. They are unable to provide the benign executables for various reasons (copyright, etc.), so the line from the README you are quoting refers to the executables, not to the features.
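If you want to check the split yourself, here is a minimal sketch using Python's built-in sqlite3 module. It assumes meta.db is a SQLite file and that the table is named `meta`; the table name and the helper `count_by_label` are my assumptions, not something confirmed in this thread, so adjust them to whatever your copy actually contains. The example runs against a tiny in-memory stand-in rather than the real file.

```python
import sqlite3


def count_by_label(conn, table="meta", column="is_malware"):
    """Return {label_value: row_count} for the given column.

    The default table name "meta" is an assumption; pass the real
    table name if your copy of meta.db uses a different one.
    """
    # Identifiers cannot be bound as SQL parameters, so they are quoted inline.
    query = f'SELECT "{column}", COUNT(*) FROM "{table}" GROUP BY "{column}"'
    return dict(conn.execute(query).fetchall())


# Toy stand-in for meta.db with a hypothetical two-column schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (sha256 TEXT PRIMARY KEY, is_malware INTEGER)")
conn.executemany(
    "INSERT INTO meta VALUES (?, ?)",
    [("a" * 64, 1), ("b" * 64, 0), ("c" * 64, 1), ("d" * 64, 0)],
)

counts = count_by_label(conn)
print(counts)
```

To run it on the real data, replace the in-memory connection with `sqlite3.connect("meta.db")` after downloading the file. Going by the explanation above, rows with is_malware = 0 would be the benign samples (features only) and rows with is_malware = 1 the malware samples.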