gsalzer / cgt

Consolidated Ground Truth (CGT) for Weaknesses of Ethereum Smart Contracts
MIT License

This data set contains all contracts with vulnerabilities. Are there no contracts without vulnerabilities? #1

Open fancyandew opened 9 months ago

gsalzer commented 9 months ago

This is a good point: when evaluating a tool, (meaningful) true negatives are almost as important as true positives. In fact, the data set does contain some true negatives: see all rows in consolidated.csv where the fourth column, property_holds, contains the value f. The purpose of the project was to consolidate existing data sets, not to create a new one with new assessments. Hence it contains only those true negatives that were present in one of the original datasets.

fancyandew commented 9 months ago

@gsalzer Thank you very much for providing the data set and for the answer. Regarding the number of true negatives: following your hint, I counted a Solidity source as a true negative only when property_holds is f in every row with the same fp_sol value. The total is 510, which I think is too small. If this data set is used to evaluate smart contract analysis tools, the primary task of such a tool is to detect whether a contract has vulnerabilities at all, and only then to determine their types. The current data set is heavily imbalanced between positive and negative samples. Could you provide some true negative samples, or tell me how they were collected?

gsalzer commented 9 months ago

I think you are counting the header line as well, as I get only 509 distinct Solidity sources where all assessed properties do not hold (property_holds=='f'). Note that the same query, but counting unique runtime codes (on-chain contracts) instead of Solidity sources (using fp_runtime instead of fp_sol), yields 390 distinct contracts. This means that your criterion is fragile, and defining what you actually need requires careful consideration.
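The query described above can be sketched as follows. This is only an illustration, not the query actually used for the numbers quoted here: the helper name `count_all_negative` and the three-column sample data are made up, and the sketch assumes a comma-separated file with a header row naming the columns fp_sol, fp_runtime and property_holds.

```python
import csv
import io

def count_all_negative(rows, key):
    """Count distinct values of `key` for which every row has property_holds == 'f'."""
    all_f = {}
    for row in rows:
        value = row[key]
        if not value:  # skip rows without a fingerprint in this column
            continue
        # A single 't' anywhere in the group disqualifies the whole group.
        all_f[value] = all_f.get(value, True) and row["property_holds"] == "f"
    return sum(all_f.values())

# Hypothetical sample data; the real consolidated.csv has more columns.
SAMPLE = """fp_sol,fp_runtime,property_holds
aaa,r1,f
aaa,r1,f
bbb,r1,f
ccc,r2,t
ccc,r3,f
ddd,r3,f
"""

# DictReader consumes the header line, so it is never counted as data.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
by_source = count_all_negative(rows, "fp_sol")
by_runtime = count_all_negative(rows, "fp_runtime")
print(by_source, by_runtime)
```

On the sample data the two groupings give different counts, which is the fragility mentioned above: the result depends on whether rows are grouped by source fingerprint or by runtime fingerprint.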

Regarding the data selection process: it is all described in the paper linked to in the README of the repository. We started from data and assessments collected by others, see the folder construction/originalSets for the original data. The authors of these original datasets employed different methods for collecting the data, and also had different motivations.

I'm not sure that it makes sense to search for source codes with negative assessments (property_holds=='f') only. This just means that all that is known about this source code is that someone found that one or more properties do not hold. It does not mean that the source code does not have any vulnerabilities, as nobody defined what "all" means, and nobody cared to check the code for all known vulnerabilities. Our dataset is just a summary and consolidation of what is known, and very probably there is much more to know about the contracts.

If you take the time to explain what you are actually looking for and what you are planning to do, I may be able to tell you in which way this dataset may help, and whether there are other sources around. If there is anything confidential about your project, write me an email; it shouldn't be difficult to find my address.

fancyandew commented 9 months ago

@gsalzer Thank you for your patient answer. I have read the papers related to the dataset. I am conducting a study on using deep learning to detect contract vulnerabilities. Your data set integrates previous vulnerability data sets, so I would like to cite it. I want to train a model to distinguish vulnerable from non-vulnerable contracts, but I lack normal contracts that contain none of the vulnerability types, such as those of SWC and DASP, covered in the article. I need these contracts as the negative set for training. So I would like to ask whether you retained such contracts when you manually consolidated the previous data sets, where they can be found, or by what means they can be collected.

gsalzer commented 9 months ago

Not sure I'm understanding your question correctly. If you are looking for the actual Solidity source code, it can be found in the folder source. As described in the main README: "fp_sol [hex string without 0x]: hash computed from the source code, after removing white space and comments. For the actual source code, see source/<fp_sol>.sol." As an example, if the column fp_sol contains the value 7391ec54eca9eb37bd2cd1d7263097a6, then the source code is in source/7391ec54eca9eb37bd2cd1d7263097a6.sol
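As a minimal sketch of that naming rule, the mapping from an fp_sol value to its file could look like this. The function name `source_path` and its `root` parameter are hypothetical; only the source/<fp_sol>.sol layout comes from the README.

```python
from pathlib import PurePosixPath

def source_path(fp_sol, root="source"):
    """Build the repository-relative path of the Solidity file for an fp_sol value."""
    # fp_sol is a hex string without a 0x prefix, so it is used as the file name as-is.
    return PurePosixPath(root) / f"{fp_sol}.sol"

example = str(source_path("7391ec54eca9eb37bd2cd1d7263097a6"))
print(example)
```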