
CJY-01 commented 3 years ago

Is the data set on GitHub consistent with the data set in the paper?

kppw99 commented 3 years ago

Yes, the data set in GitHub is consistent with the paper. Thanks for your interest.

CJY-01 commented 3 years ago

It appears that the API is not used as the slicing criterion when performing program slicing. We also found that only the C code in the SARD dataset was sliced. The data set processed in the project is also quite different from that in the paper. For example, in the NVD data set, the number of samples with label 0 is much larger than the number with label 1. Could you elaborate further on these questions?

kppw99 commented 3 years ago

Only the C-language portion of the dataset has been uploaded. Sorry for the confusion caused by my answer above. As for the slicing criterion, we applied some heuristic points, such as arithmetic and array operations, in addition to the vulnerable API list. The dataset has more benign data (label 0) than vulnerable data (label 1), so we applied an oversampling technique such as SMOTEENN. We will specify the differences from the paper's dataset on the README page. Thank you for your valuable advice.
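
For readers following along, here is a rough sketch of what "heuristic points in addition to the vulnerable API list" could look like when picking slicing criteria. It is not the project's actual code. The API list, the regular expressions, and the names `VULNERABLE_APIS`, `find_criteria`, and `SAMPLE` are all assumptions made up for illustration, and regex matching is a crude stand-in for a real C parser (for example, it would misread a pointer declaration as arithmetic).

```python
import re

# Hypothetical list of risky C library calls; the project's real list
# is not shown in this thread.
VULNERABLE_APIS = {"strcpy", "strcat", "sprintf", "gets", "memcpy"}

# Crude patterns, assumed for illustration only.
CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")          # name followed by "("
ARRAY_RE = re.compile(r"\b[A-Za-z_]\w*\s*\[[^\]]*\]")   # name[...]
ARITH_RE = re.compile(r"[\w\)\]]\s*(?:[+\-*/%]=?|<<|>>)\s*[\w\(]")


def find_criteria(source: str):
    """Return (line_number, kind) pairs for lines that could seed a slice."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("//", 1)[0]  # drop trailing line comments (crude)
        if any(name in VULNERABLE_APIS for name in CALL_RE.findall(code)):
            hits.append((lineno, "api"))
        elif ARRAY_RE.search(code):
            hits.append((lineno, "array"))
        elif ARITH_RE.search(code):
            hits.append((lineno, "arithmetic"))
    return hits


# A tiny C snippet, joined line by line so the line numbers are obvious.
SAMPLE = "\n".join([
    "int main(void) {",
    "    char buf[16];",
    "    int n = 4;",
    "    int total = n * 8;",
    '    strcpy(buf, "hi");',
    "    return 0;",
    "}",
])

if __name__ == "__main__":
    for lineno, kind in find_criteria(SAMPLE):
        print(lineno, kind)
```

On the sample snippet this flags line 2 as an array point, line 4 as an arithmetic point, and line 5 as a vulnerable API call. A slicer would then compute slices from each flagged line.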

kppw99 / AutoVAS

data set #1