fancyandew opened 9 months ago
@gsalzer Thank you very much for providing the dataset, and thank you for the answer. I figured out your approach: a source counts as a true negative only when the `property_holds` column is `f` for every row sharing the same `fp_sol` value. The total is 510, which I think is too small. If this dataset is used to evaluate smart contract analysis tools, the primary task of existing tools is to detect whether a contract has vulnerabilities at all, and only then to classify the vulnerability types. The current dataset has a large imbalance between positive and negative samples. Could the authors provide some more true negative samples, or explain how they were collected?
I think you are counting the header line as well, as I get only 509 distinct Solidity sources where all assessed properties do not hold (`property_holds=='f'`). Note that the same query, but counting unique runtime codes (on-chain contracts) instead of Solidity sources (using `fp_runtime` instead of `fp_sol`), yields 390 distinct contracts. This means that your criterion is fragile, and it needs careful consideration to define what you actually need.
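The counting described above can be sketched in Python. This is a minimal illustration, not the authors' actual query: the toy CSV below stands in for `consolidated.csv` (the real file has more columns and rows), while the column names `fp_sol`, `fp_runtime`, and `property_holds` follow the repository README.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for consolidated.csv; the real file has more columns.
sample = """fp_sol,fp_runtime,property,property_holds
aaa,r1,reentrancy,f
aaa,r1,overflow,f
bbb,r2,reentrancy,t
bbb,r2,overflow,f
ccc,r3,reentrancy,f
"""

def all_negative(rows, key):
    """Return the keys (fp_sol or fp_runtime) whose rows all have property_holds == 'f'."""
    holds = defaultdict(set)
    for row in rows:
        holds[row[key]].add(row["property_holds"])
    return {k for k, vals in holds.items() if vals == {"f"}}

rows = list(csv.DictReader(io.StringIO(sample)))
print(sorted(all_negative(rows, "fp_sol")))      # ['aaa', 'ccc']
print(sorted(all_negative(rows, "fp_runtime")))  # ['r1', 'r3']
```

Grouping by `fp_sol` and by `fp_runtime` can give different counts because several Solidity sources may compile to the same runtime code, which is exactly why the two queries on the real dataset disagree (509 vs. 390).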
Regarding the data selection process: it is all described in the paper linked to in the README of the repository. We started from data and assessments collected by others; see the folder `construction/originalSets` for the original data. The authors of these original datasets employed different methods for collecting the data, and also had different motivations.
I'm not sure that it makes sense to search for source codes with only negative assessments (`property_holds=='f'`). This just means that all that is known about this source code is that someone found that one or more properties do not hold. It does not mean that the source code does not have any vulnerabilities, as nobody has defined what "all" means, and nobody has checked the code for all known vulnerabilities. Our dataset is just a summary and consolidation of what is known, and very probably there is much more to know about the contracts.
If you take the time to explain what you are actually looking for and what you are planning to do, I may be able to tell you in which way this dataset may help, and whether there are other sources around. If there is anything confidential about your project, write me an email; it shouldn't be difficult to find my address.
@gsalzer Thank you for your patient answer. I've looked at the papers related to the dataset. I am conducting a study on using deep learning to detect contract vulnerabilities. Since your data integrates previous vulnerability datasets, I would like to cite it. I want to train a model to distinguish vulnerable from non-vulnerable contracts, but I lack normal contracts that do not contain the vulnerability types (such as SWC and DASP) covered in the paper. I need these contracts as the negative set for training. So I would like to ask: did you retain such contracts when you manually consolidated the previous datasets? If not, where can these contracts be found, or by what means can they be collected?
Not sure I'm understanding your question correctly. If you are looking for the actual Solidity source code, it can be found in the folder `source`. As described in the main README:

> `fp_sol` [hex string without 0x]: hash computed from the source code, after removing white space and comments. For the actual source code, see `source/<fp_sol>.sol`.

As an example, if the column `fp_sol` contains the value `7391ec54eca9eb37bd2cd1d7263097a6`, then the source code is in `source/7391ec54eca9eb37bd2cd1d7263097a6.sol`.
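The README convention above can be expressed as a one-line path mapping. A minimal sketch (the helper name `source_path` and the `repo_root` parameter are mine, not part of the repository):

```python
from pathlib import Path

def source_path(fp_sol: str, repo_root: str = ".") -> Path:
    """Map an fp_sol fingerprint to its Solidity file, per the README convention."""
    return Path(repo_root) / "source" / f"{fp_sol}.sol"

print(source_path("7391ec54eca9eb37bd2cd1d7263097a6").as_posix())
# source/7391ec54eca9eb37bd2cd1d7263097a6.sol
```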
This is a good point: when evaluating a tool, (meaningful) true negatives are almost as important as true positives. In fact, the dataset contains some true negatives: see all rows in `consolidated.csv` where the fourth column, `property_holds`, contains the value `f`. The purpose of the project was to consolidate existing datasets, not to create a new one with new assessments. Hence it contains only those true negatives that were present in one of the original datasets.
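Extracting those rows is a simple filter. A minimal sketch with a toy stand-in for `consolidated.csv` (the real file has more columns; the column names follow the README):

```python
import csv
import io

# Toy stand-in for consolidated.csv (the real file has more columns and rows).
sample = """fp_sol,fp_runtime,property,property_holds
aaa,r1,reentrancy,f
bbb,r2,reentrancy,t
ccc,r3,overflow,f
"""

# Keep only the rows whose property_holds column is 'f' (negative assessments).
negatives = [row for row in csv.DictReader(io.StringIO(sample))
             if row["property_holds"] == "f"]
print([row["fp_sol"] for row in negatives])  # ['aaa', 'ccc']
```

Note that this filters individual assessments; as discussed elsewhere in this thread, a source should only be treated as an overall true negative if *all* of its rows are negative.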