manifoldfinance / mevETH2

mevETH LST Protocol. The repo has migrated; see:
https://github.com/MEV-Protocol/meveth

Testing: Saturation Effects and creating a High Assurance CI workflow for increasing confidence #9

Open sambacha opened 1 year ago

sambacha commented 1 year ago

Saturation Effect in Fuzzing

The primary source for this is https://blog.regehr.org/archives/1796 (the author works at TOB). The TL;DR: we should consider adding mutation testing and assessing Diffusc, a new tool from TOB. We should also fuzz against actual execution environments (i.e. geth) and consider dedicating a small subset of nodes to a CI pipeline for engineering feedback (as opposed to the current DevOps orientation of our testnet clusters). Coverage is useless without knowing saturation, i.e. how the space of test cases that triggers a particular bug (e.g. a program crash) intersects with the probability distribution implicit in the fuzzer.
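To make that last point concrete, a rough formalization (my notation, not from the blog post):

```latex
% Let T_B be the set of inputs that trigger bug B, and let p_F be the
% input distribution implicitly sampled by fuzzer F. The per-execution
% probability of hitting B is the mass p_F places on T_B:
P(\text{hit } B \mid F) \;=\; \sum_{t \in T_B} p_F(t)
% Saturation: once every bug with non-negligible mass under p_F has been
% found, more wall-clock time yields little; changing p_F (a different
% fuzzer, mutator, or corpus) is what shifts the curve.
```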

Revising the assurance of testing results

https://dl.acm.org/doi/10.1145/3339836

They suggest splitting a testing effort into k subsets, each with its own coverage measure, and stopping when all k subsets have identical coverage. The larger k is, the longer it will take to detect saturation, but the more confident you can be that testing is “done.” The stopping point does not need to be in terms of coverage; it could instead use automated bug triage to see if all runs have found identical sets of bugs.
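A minimal sketch of that k-subset stopping rule (the coverage-report paths and JSON format are hypothetical; an Echidna/Foundry run would need a small adapter to emit per-subset covered-location sets):

```python
"""Sketch of the k-subset saturation stopping rule.

Split the fuzzing budget into k independent runs, each producing a set of
covered program locations (or triaged bug IDs). Declare saturation when all
k sets are identical. File names and report format are hypothetical.
"""
import json
from pathlib import Path

def load_covered(report: Path) -> frozenset[str]:
    # Assumed format: {"covered": ["Contract.sol:123", ...]}
    data = json.loads(report.read_text())
    return frozenset(data["covered"])

def saturated(reports: list[Path]) -> bool:
    sets = [load_covered(r) for r in reports]
    return all(s == sets[0] for s in sets[1:])

if __name__ == "__main__":
    k_reports = sorted(Path("coverage").glob("subset-*.json"))  # k runs
    if len(k_reports) < 2:
        raise SystemExit("need at least k=2 subsets to assess saturation")
    print("saturated" if saturated(k_reports) else "keep fuzzing")
```

The same check works with sets of triaged bug identifiers instead of coverage, as the paper suggests.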

Additionally, we could "do a better job at integrating feedback from the system under test" (e.g. an in-memory database or a websocket connection that lets us directly observe and instrument CI processes).
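As an illustration of that feedback loop, a sketch that streams events from a fuzzing CI run over a websocket into an in-memory SQLite table (the endpoint URL and event schema are made up for the example):

```python
"""Sketch: stream execution/trace events from a CI fuzz run into an
in-memory SQLite table for live triage. The ws:// endpoint and event
schema are hypothetical placeholders."""
import asyncio
import json
import sqlite3

import websockets  # pip install websockets

async def collect(uri: str = "ws://ci-runner.internal:8551/feedback") -> None:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (run_id TEXT, kind TEXT, payload TEXT)")
    async with websockets.connect(uri) as ws:
        async for raw in ws:
            evt = json.loads(raw)
            db.execute(
                "INSERT INTO events VALUES (?, ?, ?)",
                (evt.get("run_id"), evt.get("kind"), raw),
            )
            if evt.get("kind") == "crash":
                print("crash observed in run", evt.get("run_id"))

if __name__ == "__main__":
    asyncio.run(collect())
```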

Fuzzing Upgradable Contracts

https://blog.trailofbits.com/2023/07/07/differential-fuzz-testing-upgradeable-smart-contracts-with-diffusc/

“Diffusc combines static analysis with differential fuzz testing to compare two upgradeable smart contract (USC) implementations, which can uncover unexpected differences in behavior before an upgrade is performed on-chain. Built on top of Slither and Echidna, Diffusc performs differential taint analysis, uses the results to generate differential fuzz testing contracts in Solidity, and then feeds them into Echidna for fuzzing. It is, to my knowledge, the first implementation of differential fuzzing for smart contracts and should be used in combination with other auditing tools before performing an upgrade.”

In my opinion the Diffusc approach is a must-implement: since we rely on upgradeability, differential testing of upgrades is a must-have.
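For intuition, here is the differential-testing idea that Diffusc automates, not Diffusc itself: feed identical random inputs to two implementations that should be equivalent and flag divergences. Diffusc generates a Solidity wrapper over the two USC implementations and Echidna supplies the inputs; the stand-in functions below are hypothetical share-conversion math, not mevETH code.

```python
"""Sketch of differential fuzzing: two implementations that are supposed
to be equivalent are driven with the same random inputs; any divergence
is a finding. Both functions are hypothetical stand-ins."""
import random

WAD = 10**18

def convert_to_shares_v1(assets: int, total_assets: int, total_shares: int) -> int:
    # "V1": single integer division, rounds down once.
    if total_assets == 0:
        return assets
    return assets * total_shares // total_assets

def convert_to_shares_v2(assets: int, total_assets: int, total_shares: int) -> int:
    # "V2": a hypothetical upgrade that scales through WAD first,
    # introducing an extra truncation step.
    if total_assets == 0:
        return assets
    return (assets * WAD // total_assets) * total_shares // WAD

def fuzz(trials: int = 100_000) -> None:
    rng = random.Random(0)
    for _ in range(trials):
        assets = rng.randrange(0, 10**24)
        total_assets = rng.randrange(1, 10**24)
        total_shares = rng.randrange(1, 10**24)
        a = convert_to_shares_v1(assets, total_assets, total_shares)
        b = convert_to_shares_v2(assets, total_assets, total_shares)
        if a != b:
            print(f"divergence: {assets=} {total_assets=} {total_shares=} -> {a} vs {b}")
            return
    print("no divergence found in", trials, "trials")

if __name__ == "__main__":
    fuzz()
```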

References:

Evaluating Fuzz Testing (a standardised set of benchmarks for fuzzers): https://arxiv.org/pdf/1808.09700.pdf

https://dl.acm.org/doi/10.1145/3339836 “..finding linearly more instances of a given set of bugs requires linearly more computation power directed towards fuzzing a target; finding linearly more distinct bugs requires exponentially more fuzzing compute power.”

Breaking the Solidity Compiler with a Fuzzer https://blog.trailofbits.com/2020/06/05/breaking-the-solidity-compiler-with-a-fuzzer/

Safe Transaction Service Gateway

We need our own gateway service for the MultiSig, for both Mainnet and Testnet.

RPC Endpoints we need

– `debug_storageRangeAt`
– `eth_getStorageAll`: custom RPC method; see the implementation in fuzzland/bsc-dumper@f8124b4c0092a981de33229383b9aa1cd397ab32
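For reference, a sketch of calling geth's `debug_storageRangeAt` over JSON-RPC (the node must expose the `debug` API; the endpoint, block hash, and contract address below are placeholders):

```python
"""Sketch: dump a contract's storage range via geth's debug_storageRangeAt.
The RPC URL, block hash, and contract address are placeholders."""
import json
import requests

RPC = "http://localhost:8545"

def storage_range(block_hash: str, tx_index: int, contract: str,
                  start_key: str = "0x" + "00" * 32, limit: int = 256) -> dict:
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "debug_storageRangeAt",
        # params: blockHash, txIndex, contractAddress, startKey, maxResults
        "params": [block_hash, tx_index, contract, start_key, limit],
    }
    resp = requests.post(RPC, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["result"]  # {"storage": {...}, "nextKey": ...}

if __name__ == "__main__":
    result = storage_range(
        "0x" + "00" * 32,   # placeholder block hash
        0,                   # transaction index within that block
        "0x" + "11" * 20,   # placeholder contract address
    )
    print(json.dumps(result, indent=2))
```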

Criteria

effectiveness

Effectiveness, as used by the fuzzing community, relates to the quality of inputs rather than the oracle used.

efficacy

We use efficacy to denote oracle quality, since the fuzzing literature has no established term for oracle quality and "efficacy" appears unused there.

efficiency

Finally, the speed at which vulnerabilities can be found seems to be related to the speed at which one can execute the tests (assuming a uniform distribution and difficulty of finding them). Hence, we use efficiency to denote the speed of execution.


We note that there is no agreement on what these terms mean in fuzzing research. For example, Böhme et al. [10] defines efficiency as the rate at which vulnerabilities are discovered, and effectiveness as the total number of vulnerabilities that a fuzzer can discover in the limit (i.e. given infinite time). However, given that code coverage is used to measure effectiveness in a majority of papers [23], we believe that effectiveness as used by the fuzzing community relates to the quality of inputs rather than the oracle used. Hence, we use efficacy to denote the oracle quality here as it seems unused in fuzzing literature (and we could not find a corresponding term for oracle quality in fuzzing research). Finally, the speed at which vulnerabilities can be found seems to be related to the speed at which one can execute the tests (assuming a uniform distribution and difficulty of finding them). Hence, we use efficiency to denote the speed of execution.

Fuzzing Criteria

C.1. How to fuzz more types of software.
C.2. How to identify more types of vulnerabilities.
C.3. How to find more deep bugs.
C.4. What kind of vulnerabilities are not found by fuzzing.
C.5. How to leverage the auditor.
C.6. How to improve the usability of fuzzing tools.
C.7. How to assess the residual security risk.
C.8. What are the limitations of fuzzing.
C.9. How to evaluate more specialized fuzzers.
C.10. How to prevent overfitting to a specific benchmark?
C.11. Are synthetic bugs representative?
C.12. Are bugs discovered by fuzzers representative?
C.13. Is coverage a good measure for fuzzer effectiveness?
C.14. What is a fair time budget?

Flaw Classification

Each flaw is classified according to its severity, considering the potential impact of the exploit to be:

– High if it affects a large number of users, or has serious legal and financial implications;
– Medium if it affects individual users’ information, or has possible legal implications for clients and moderate financial impact;
– Low if the risk is relatively small or is not a risk the customer has indicated is important;
– Informational if the issue does not pose an immediate risk, but is relevant to security best practices.

Difficulty

Another important property of each finding is how difficult it is to exploit:

– Low for commonly exploited flaws where public tools exist or exploitation can be easily automated;
– Medium for flaws that require in-depth knowledge of a complex system;
– High for flaws where an attacker must have privileged insider access to the system, or must discover other weaknesses, for exploitation.

Unit Testing

It seems fair to say that even extensive unit tests are not the most effective way to avoid the kind of problems found in high-quality security audits.[^1]

Finally, while it is impossible to make strong claims based on a set of only 23 audits, it seems likely that unit tests, even quite substantial ones, do not provide an effective strategy for avoiding the kinds of problems detected during audits. Unit tests, of course, have other important uses, and should be considered an essential part of high-quality code development, but developer-constructed manual unit tests may not really help detect high-severity security issues. It does seem likely that the effort involved in writing high-quality unit tests would be very helpful in dynamic analysis: Generalizing from unit tests to invariants and properties for property-based testing seems likely to be an effective way to detect some of what the audits exposed.
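As an example of that generalization step, a single hand-picked unit test can become a property over all inputs. The sketch below uses Hypothesis against a hypothetical pure-Python model of share conversion; on-chain, the same property would be written as an Echidna or Foundry invariant.

```python
"""Sketch: generalizing a unit test into a property. The share-conversion
model is a hypothetical stand-in, not the mevETH implementation."""
from hypothesis import given, strategies as st

def to_shares(assets: int, total_assets: int, total_shares: int) -> int:
    return assets if total_assets == 0 else assets * total_shares // total_assets

def to_assets(shares: int, total_assets: int, total_shares: int) -> int:
    return shares if total_shares == 0 else shares * total_assets // total_shares

# Unit-test style: one hand-picked case.
def test_single_case() -> None:
    assert to_shares(10, 100, 100) == 10

# Property style: rounding must never favour the depositor.
@given(
    assets=st.integers(min_value=0, max_value=10**24),
    total_assets=st.integers(min_value=1, max_value=10**24),
    total_shares=st.integers(min_value=1, max_value=10**24),
)
def test_round_trip_never_mints_value(assets, total_assets, total_shares) -> None:
    shares = to_shares(assets, total_assets, total_shares)
    assert to_assets(shares, total_assets, total_shares) <= assets
```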

[^1]: https://arxiv.org/pdf/1911.07567.pdf page 13

Appendix A

| Category | % | High-Low | Sev. High | Sev. Med. | Sev. Low | Sev. Info. | Sev. Und. | Diff. High | Diff. Med. | Diff. Low | Diff. Und. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| data validation | 36% | 11% | 21% | 36% | 24% | 13% | 6% | 27% | 16% | 55% | 2% |
| access controls | 10% | 25% | 42% | 25% | 12% | 21% | 0% | 33% | 12% | 54% | 0% |
| race condition | 7% | 0% | 41% | 41% | 6% | 12% | 0% | 100% | 0% | 0% | 0% |
| numerics | 5% | 23% | 31% | 23% | 38% | 8% | 0% | 31% | 8% | 62% | 0% |
| undefined behavior | 5% | 23% | 31% | 15% | 31% | 8% | 15% | 15% | 8% | 77% | 0% |
| patching | 7% | 11% | 17% | 11% | 39% | 28% | 6% | 6% | 11% | 61% | 22% |
| denial of service | 4% | 10% | 20% | 30% | 30% | 20% | 0% | 50% | 0% | 40% | 10% |
| authentication | 2% | 25% | 50% | 25% | 25% | 0% | 0% | 50% | 0% | 50% | 0% |
| reentrancy | 2% | 0% | 50% | 25% | 25% | 0% | 0% | 50% | 25% | 0% | 25% |
| error reporting | 3% | 0% | 29% | 14% | 0% | 57% | 0% | 43% | 29% | 29% | 0% |
| configuration | 2% | 0% | 40% | 0% | 20% | 20% | 20% | 60% | 20% | 20% | 0% |
| logic | 1% | 0% | 33% | 33% | 33% | 0% | 0% | 100% | 0% | 0% | 0% |
| data exposure | 1% | 0% | 33% | 33% | 0% | 33% | 0% | 33% | 33% | 33% | 0% |
| timing | 2% | 25% | 25% | 0% | 75% | 0% | 0% | 75% | 0% | 25% | 0% |
| coding-bug | 2% | 0% | 0% | 67% | 33% | 0% | 0% | 17% | 0% | 83% | 0% |
| front-running | 2% | 0% | 0% | 80% | 0% | 20% | 0% | 100% | 0% | 0% | 0% |
| auditing and logging | 4% | 0% | 0% | 0% | 33% | 44% | 22% | 33% | 0% | 56% | 11% |
| missing-logic | 1% | 0% | 0% | 0% | 67% | 33% | 0% | 0% | 0% | 100% | 0% |
| cryptography | 0% | 0% | 0% | 0% | 100% | 0% | 0% | 100% | 0% | 0% | 0% |
| documentation | 2% | 0% | 0% | 0% | 25% | 50% | 25% | 0% | 0% | 75% | 25% |
| API inconsistency | 1% | 0% | 0% | 0% | 0% | 100% | 0% | 0% | 0% | 100% | 0% |
| code-quality | 1% | 0% | 0% | 0% | 0% | 100% | 0% | 0% | 0% | 100% | 0% |