Closed CJY-01 closed 3 years ago
Yes, the data set in GitHub is consistent with the paper. Thanks for your interest.
It appears that API calls are not used as the slicing criterion when performing program slicing. We also found that only the C code in the SARD dataset was sliced. In addition, the dataset processed in the project is quite different from the one in the paper. For example, in the NVD dataset the number of samples with label 0 is much larger than the number with label 1. Could you elaborate further on these questions?
Only the C-language portion of the dataset has been uploaded. Sorry for the confusion caused by the answer above. As for the slicing criterion, we applied some heuristic points such as arithmetic and array operations, in addition to the vulnerable API list. The dataset contains more benign data (label 0) than vulnerable data (label 1), so we applied an oversampling technique such as SMOTEENN. We will specify the differences from the paper's dataset on the README page. Thank you for your valuable advice.
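For readers unfamiliar with the resampling step mentioned above, here is a minimal sketch of SMOTE-style oversampling in plain Python. It is not the project's actual pipeline: the function name `smote_like_oversample`, the toy feature vectors, and the parameters `k` and `seed` are all hypothetical and chosen for illustration. It also covers only the interpolation half of the idea; SMOTEENN as provided by the imbalanced-learn library (`imblearn.combine.SMOTEENN`) additionally removes noisy samples with Edited Nearest Neighbours, which is omitted here.

```python
import random
from collections import Counter


def smote_like_oversample(X, y, minority_label=1, k=3, seed=0):
    """Add synthetic minority samples until the minority class matches the
    largest class. Simplified SMOTE-style sketch (no ENN cleaning step)."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    counts = Counter(y)
    needed = max(counts.values()) - counts[minority_label]
    if len(minority) < 2 or needed <= 0:
        # Nothing to interpolate between, or classes already balanced.
        return list(X), list(y)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(needed):
        base = rng.choice(minority)
        # Up to k nearest minority neighbours of `base`, excluding `base`.
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: sq_dist(base, m),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])

    return list(X) + synthetic, list(y) + [minority_label] * needed


# Toy data (made up): six benign samples (label 0), two vulnerable (label 1).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.1],
     [2.0, 2.0], [2.2, 2.1]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = smote_like_oversample(X, y)
print(Counter(y_res))
```

Each synthetic sample lies on the line segment between two existing label-1 samples, so the minority class grows without exact duplicates. Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the original class ratio.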
Is the data set in GitHub consistent with the data set in the paper?