TON Static Analysis based on Symbolic Execution

korifey commented 3 months ago

Summary

Static analysis tool for TON offers following advantages:

Interprocedural and context-sensitive
- Path-sensitive
- Bytecode level
- Bit-precise
- Minimal false-positives
- Extensible with new user rules
- Can be extended to automated test generation and property based testing tool

Context

TON requires advanced tools for testing smart contracts so we'd like to implement the best possible bunch of tools on the same symbolic execution core. We want to start with Static Analysis but a single core will enormously reduce development costs for other tools like Property-based testing and Automated test generation in future.

From a business perspective, we believe we can create a unique selling point for TON. Our tool will not only find bugs and vulnerabilities but will also significantly automate smart contracts audit and reduce costs for users.

Scenarios

Static analyzer usually takes source/bytecode of a smart contract as input and provides analysis results as output verifying some rules. Contemporary static analysis tool provides output in SARIF format supported by many tools. There are several forms how user can run the tool:

CLI (Command Line Interface) - the simplest form that we'll support first.
IDE Plugin (e.g. vscode) - can run static analyzer in background while modifying code, so developer gets results immediately.
CI Plugin (e.g. github) - static analysis is one of the quality gates for code.

Typically static analysis can have predefined and custom rules. Some predefined examples:

Integer overflows/underflows
Division by zero
Dead code
Cells overflow
Cells underflow
Sending of invalid messages
Non-deterministic storage growth

It's extremely important to provide flexible API to define user-defined checks for certain classes of analysis such as:

Taint analysis
Typestate analysis
Temporal properties checks (like sum_backers <= balance)
Gas usage warnings

Technical advantages of the proposed approach

The core of analyzer is built upon a cutting-edge symbolic execution engine for TVM bytecode. It uses symbolic execution with SMT solving and taint analysis to find bugs, vulnerabilities, and automatically generate unit tests with mathematical precision.

Interprocedural and context-sensitive. The analyzer considers all possible behaviors of a contract taking into account complex data flow with function calls and recursion.
Path-sensitive. The analyzer takes into account all branching conditions in a contract. The majority of existing static analyzers do not support this, leading to multiple false alarms. Our engine checks the feasibility of various execution paths of a contract.
Bytecode level. Compilers have bugs. If an analyzer hasn't found an error in source code, strictly speaking, this does not mean the contract which is actually executed doesn't have it. Our engine works on TVM bytecode level and misses nothing. Additionally, libraries are written on bytecode. As we noticed developers love to inject bytecode right into their smart contracts. So, source level analysis is not enough.
Bit-precise: the engine will take into account all potential integer overflows, bit-level hacks, etc. This is especially important, taking into account the TVM memory model.
Minimal false-positives. The semantics of Ton Virtual Machine is modeled with mathematical precision. This means the analyzer understands what the contract does and how much gas it costs without over-approximation of language features. Thus, the only source of false alarms is the interaction of a contract with the environment, but user assistance might fix that.
Extensible with new user rules. Users can give the analyzer more information on what the environment does: which invariants are true for the data in the message etc. Also, the engine supports customizable taint analysis, and users can configure it with new rules to check, for instance, that potentially malicious data does not get into sensitive APIs.

Team

We are a team of professionals with a rich industry and academic background: 10+ years in developer tools, 7+ years in code analysis and symbolic execution. We’ve been delivering precise and scalable symbolic execution engines for C/C++, Java and .NET Core into industrial settings. We gained awards in acknowledged academic competitions:

kleef, our overhaul of the well-known klee engine, was chosen the best symbolic execution engine in TestComp 2024.
UTBot Java, was chosen the best symbolic execution engine in SBFT 2023.
UTBot Python, was chosen the best symbolic execution engine for Python in SBFT 2024.

Plan for milestone 1

Since we've already spent 1 month on R&D How to implement symbolic execution for TVM based on USVM, we need additional 6 weeks to deliver a demo of our static analysis capabilities.

Sprint 0 (already done)

Symbolic interpreter supporting bytecode instructions:

Jumps
Conditions
Cells

Sprint 1 (2 weeks)

Support bytecode instructions to verify some exiting real smart contracts:

Dictionaries
Loops

Sprint 2 (2 weeks)

Try to interpret some real contracts. Support base checks for demo:

Integer overflow/underflow (TVM exit code 4) in all instructions that can throw it (such as ADD, SUB, etc, except divisions).
Division by zero (also TVM exit code 4). Checking all instructions DIV, MOD, DIVMOD, etc.
Cell underflow (TVM exit code 9) in all cell parsing instructions (such as LDU, LDREF, etc)
Cell overflow (TVM exit code 8) in all builder writing instructions (such as STU, STREF, etc)

Sprint 3 (2 weeks)

Polish checks on real contracts, maybe support additional instructions. Provide CLI that takes a contract and returns check results in SARIF format. Provide documentation and report how it works.

References

Similar engines have been successfully applied for the analysis of Ethereum smart contracts on EVM bytecode level. Among them are mythril, manticore, and oyente.

Some of mentioned links above:

Estimate suggested reward

💰 9000 USD 🕑 6 weeks (3 sprints)

anton-trunov commented 3 months ago

This is an absolutely fantastic proposal based on state-of-the-art research! Compared to the one in #436 it works at the level of TVM which makes it suitable for any language that compiler down to TVM.

@korifey I'm wondering in case some bugs are found at the level of TVM instructions, how do you report the error to the programmer so they would understand where to look for it in the source code which is FunC or Tact at the moment?

howardpen9 commented 3 months ago

happy to see more dev tool guys! (also support in Tact plz)

korifey commented 3 months ago

This is an absolutely fantastic proposal based on state-of-the-art research! Compared to the one in #436 it works at the level of TVM which makes it suitable for any language that compiler down to TVM.

@korifey I'm wondering in case some bugs are found at the level of TVM instructions, how do you report the error to the programmer so they would understand where to look for it in the source code which is FunC or Tact at the moment?

Thanks for you good words.

How to properly render in source code errors found by bytecode analyzer? Usually high-level languages compilers remain some debug information (like line number) when compile source code into low level code. For example in Java: public equals(Ljava/lang/Object;)Z L0 LINENUMBER 149 L0 ALOAD 0 ALOAD 1 IF_ACMPNE L1 ICONST_1 GOTO L2 L1

So, if we get such kind of support from Tact/FuncC compiler, analyzer will return nice SARIF traces.

Another option is to generate test that lead to error. This is an ultimate power of symbolic execution - not only show error but create test (exploit) that demonstrate error behavior.

anton-trunov commented 3 months ago

Usually high-level languages compilers remain some debug information (like line number) when compile source code into low level code

This is a nice solution and is certainly doable.

I have one more concern, though. Symbolic execution can be quite slow, so it could be challenging to integrate analyzers based on it into some interactive dev tools like IDEs like Intellij IDEA or programmer's editors like VS Code. How do you solve this issue?

korifey commented 3 months ago

Usually high-level languages compilers remain some debug information (like line number) when compile source code into low level code

This is a nice solution and is certainly doable.

I have one more concern, though. Symbolic execution can be quite slow, so it could be challenging to integrate analyzers based on it into some interactive dev tools like IDEs like Intellij IDEA or programmer's editors like VS Code. How do you solve this issue?

In fact for traditional programming languages programs can be very large (millions lines of code) so symbolic execution alone is too slow. In this case we use primary filter of vulnerabilities by dataflow (IFDS) + point-to. Then symbolic execution comes to play, filtering out false positives of previous stage. But smart contract are relatively small programs and highly optimized summary-based symbolic execution seems like enough for small contracts.
Speaking about interactive mode (background analysis in IDE during user typing), I'd say it's not a target scenario for this bounty. Default way of running delivered analysis in IDE is "push-button" scenario when user explicitly say "analyze this" and wait until compilation (to get bytecode) and wait some time until precise analysis results.

I'm not sure compilation phase for Tact is fast enough for real interactive mode. Our combined dataflow & symbolic execution engine allows different tradeoffs between precision and speed. We could implement source code level analysis but it requires another bounty and more R&D.

korifey commented 3 months ago

Hi guys, what is your opinion about the proposal? Do you need more clarification or not?

Damtev commented 2 months ago

Hi @anton-trunov. Is there any progress on the remaining debug information (especially line numbers) during compilation?

anton-trunov commented 2 months ago

Hi @Damtev, no progress so far from our side, unfortunately. I opened an issue https://github.com/tact-lang/tact/issues/296, so you can subscribe to it and track progress. We have an upcoming Tact v1.3.0 release this week or next week tops. We'll try our best to get line number information in Tact v1.4.0

korifey commented 1 month ago

Hi, we finished first milestone and prepared some video and script for you to show capabilities of proposed approach.

Here is video report: https://youtu.be/0r5TbLAXF9o Here is open source repository with examples, README: https://github.com/explyt/ton-bounties/

P.S. We are ready to continue evolve our engine for TON static analysis, property based testing and beyond

ton-society / grants-and-bounties