[Q] naming convention for predicates, class fields, and arguments

izlatkin commented 2 years ago

For instance, this encodes a call to function set: (summary_6_function_set__110_111_0 error_0 this_0 crypto_0_ecrecover_21 crypto_0_keccak256_22 crypto_0_ripemd160_23 crypto_0_sha256_24 tx_0_block.basefee_3 tx_0_block.chainid_4 tx_0_block.coinbase_5 tx_0_block.difficulty_6 tx_0_block.gaslimit_7 tx_0_block.number_8 tx_0_block.timestamp_9 tx_0_blockhash_10 tx_0_msg.data_11_bytes_tuple_accessor_array_1 tx_0_msg.data_11_bytes_tuple_accessor_length_2 tx_0_msg.sender_12 tx_0_msg.sig_13 tx_0_msg.value_14 tx_0_tx.gasprice_15 tx_0_tx.origin_16 state_0_balances_0 x_2_0 _x_102_0 state_1_balances_0 x_2_1 _x_102_1)

fun:

    function set(uint _x) public {
        x = _x;
    }

The contract has one field x, and set has one argument _x . I see them encoded as two last arguments of summary_6_function_set__110_111_0:x_2_1 and _x_102_1

But we need a rule to determine, based on the encoding only, how many arguments represent inputs. Do all of them by convention have names that begin with an underscore (silly question)?

the ADT encoding also has this:

(summary_6_function_set__110_111_0 error_0 this_0 abi_0 crypto_0 tx_0 state_0 x_2_0 _x_102_0 state_1 x_2_1 _x_102_1)

The last two (x_2_1, _x_102_1) are the the field and argument, resp.

leonardoalt commented 2 years ago

x and _x must be different program variables, since the actual name is used as a prefix. There are always two sets of state variables (here x), two sets of function input parameter variables (here _x) and one set of function output parameter variables (here none).

izlatkin commented 2 years ago

comments from @leonardoalt about the naming convention: the first number is a unique counter, then it's the node id then the contract node id

example: https://github.com/ethereum/solidity/blob/develop/libsolidity/formal/CHC.cpp#L1429 https://github.com/ethereum/solidity/blob/develop/libsolidity/formal/CHC.cpp#L1342

grigoryfedyukovich commented 2 years ago

The question is about how to figure out which var is a field, and which is a function argument. I understand the difference between x and _x in the code, but I don't understand how to get it from CHCs.

leonardoalt commented 2 years ago

Note that this is valid for the names of predicate names. It's not valid for tx_*, crypto_*, state_* etc

leonardoalt commented 2 years ago

Variables (state or function input/output parameters) have the pattern name_id_uniqueCounter.

leonardoalt commented 2 years ago

The question is about how to figure out which var is a field, and which is a function argument. I understand the difference between x and _x in the code, but I don't understand how to get it from CHCs.

You can extract the AST id of a variable and find that AST node in the AST output of the compiler.

grigoryfedyukovich commented 2 years ago

Still unclear. How can we figure out the arity of function?

grigoryfedyukovich commented 2 years ago

You can extract the AST id of a variable and find that AST node in the AST output of the compiler.

We only deal with CHCs at this point.

leonardoalt commented 2 years ago

You can try to guess, but to be sure one needs to match the AST.

grigoryfedyukovich commented 2 years ago

No, that's not going to work

leonardoalt commented 2 years ago

It's not possible to infer the arity of the function just from the encoding since input variables come after state variables, so you don't know which is which.

grigoryfedyukovich commented 2 years ago

This pattern would fix the issue: name_id_F_uniqueCounter where F would be 0 if the variable is a field, 1 if it is an input, and 2 if it is an output.

grigoryfedyukovich commented 2 years ago

so you don't know which is which.

I figured, and that's exaclty what I was asking.

leonardoalt commented 2 years ago

As far as inferring things go, there are also things like function overriding. Even if you know the name of a function that was called, you don't know from which contract just by looking at the encoding.

grigoryfedyukovich commented 2 years ago

One step at a time. Can we assume there is no overriding for now?

grigoryfedyukovich commented 2 years ago

And that there is only one contract...

grigoryfedyukovich commented 2 years ago

If we are able to only discover all the necessary info from CHC encoding, we won't have blockers and will be able to deliver the first prototype by the end of the week. If we start integrating the compiler now, I've no idea about ETA.

leonardoalt commented 2 years ago

The encoding already handles virtual calls and these things, so you don't have to worry about that on the encoding side. The only thing that may be ambiguous is when you report that a certain was called in the presence of such functions. Ofc no need to worry about that now.

grigoryfedyukovich commented 2 years ago

The thing is about changing the encoding slightly (as I suggested), which will allow us to generate tests. Without that, we can only "guess" (as you suggested) but it will result in a pile of uncompilable tests.

grigoryfedyukovich commented 2 years ago

So to be absolutely clear, the question is "Could you please implement the new naming convention for us"?

leonardoalt commented 2 years ago

Ah sorry, the comment with your suggestion only showed now in this thread for me. Sure, that shouldn't be a problem.

izlatkin commented 2 years ago

As a workaround, I can extract # of input parameters and types (signature) It is already implemented

For example:

[[['Crypto', 'contract'], ['hash', 'bytes32']], [['C', 'contract'], ['f1', 'bytes32'], ['inv']]]

from

contract Crypto {
    function hash(bytes32) external pure returns (bytes32) {
        return bytes32(0);
    }
}

contract C {
    address owner;
    bytes32 sig_1;
    bytes32 sig_2;
    Crypto d;

    constructor() {
        owner = msg.sender;
    }

    function f1(bytes32 _msg) public {
        address prevOwner = owner;
        sig_1 = d.hash(_msg);
        sig_2 = d.hash(_msg);
        assert(prevOwner == owner);
    }

    function inv() public view {
        assert(sig_1 == sig_2);
    }
}

leonardoalt commented 2 years ago

Nice! How are you doing that?

grigoryfedyukovich commented 2 years ago

I think this is just lexical/syntax analysis which might be imprecise.

grigoryfedyukovich commented 2 years ago

Is there any fundamental difference in signatures:

function hash(bytes32) external pure returns

and

function f1(bytes32 _msg) public

The first function has input type, but does not name it.

leonardoalt commented 2 years ago

I know you don't want to implement compiler integration just yet, which makes total sense, but let me share here what the AST thing looks like so you'll see it's quite simple:

contract C {
    uint x;

    function f(uint _x) public {
        x = _x;
    }
}

compiling the above with solc a.sol --ast-compact-json gives

{
  "absolutePath": "a.sol",
  "exportedSymbols": {
    "C": [
      13
    ]
  },
  "id": 14,
  "nodeType": "SourceUnit",
  "nodes": [
    {
      "abstract": false,
      "baseContracts": [],
      "canonicalName": "C",
      "contractDependencies": [],
      "contractKind": "contract",
      "fullyImplemented": true,
      "id": 13,
      "linearizedBaseContracts": [
        13
      ],
      "name": "C",
      "nameLocation": "9:1:0",
      "nodeType": "ContractDefinition",
      "nodes": [
        {
          "constant": false,
          "id": 2,
          "mutability": "mutable",
          "name": "x",
          "nameLocation": "19:1:0",
          "nodeType": "VariableDeclaration",
          "scope": 13,
          "src": "14:6:0",
          "stateVariable": true,
          "storageLocation": "default",
          "typeDescriptions": {
            "typeIdentifier": "t_uint256",
            "typeString": "uint256"
          },
          "typeName": {
            "id": 1,
            "name": "uint",
            "nodeType": "ElementaryTypeName",
            "src": "14:4:0",
            "typeDescriptions": {
              "typeIdentifier": "t_uint256",
              "typeString": "uint256"
            }
          },
          "visibility": "internal"
        },
        {
          "body": {
            "id": 11,
            "nodeType": "Block",
            "src": "51:14:0",
            "statements": [
              {
                "expression": {
                  "id": 9,
                  "isConstant": false,
                  "isLValue": false,
                  "isPure": false,
                  "lValueRequested": false,
                  "leftHandSide": {
                    "id": 7,
                    "name": "x",
                    "nodeType": "Identifier",
                    "overloadedDeclarations": [],
                    "referencedDeclaration": 2,
                    "src": "55:1:0",
                    "typeDescriptions": {
                      "typeIdentifier": "t_uint256",
                      "typeString": "uint256"
                    }
                  },
                  "nodeType": "Assignment",
                  "operator": "=",
                  "rightHandSide": {
                    "id": 8,
                    "name": "_x",
                    "nodeType": "Identifier",
                    "overloadedDeclarations": [],
                    "referencedDeclaration": 4,
                    "src": "59:2:0",
                    "typeDescriptions": {
                      "typeIdentifier": "t_uint256",
                      "typeString": "uint256"
                    }
                  },
                  "src": "55:6:0",
                  "typeDescriptions": {
                    "typeIdentifier": "t_uint256",
                    "typeString": "uint256"
                  }
                },
                "id": 10,
                "nodeType": "ExpressionStatement",
                "src": "55:6:0"
              }
            ]
          },
          "functionSelector": "b3de648b",
          "id": 12,
          "implemented": true,
          "kind": "function",
          "modifiers": [],
          "name": "f",
          "nameLocation": "33:1:0",
          "nodeType": "FunctionDefinition",
          "parameters": {
            "id": 5,
            "nodeType": "ParameterList",
            "parameters": [
              {
                "constant": false,
                "id": 4,
                "mutability": "mutable",
                "name": "_x",
                "nameLocation": "40:2:0",
                "nodeType": "VariableDeclaration",
                "scope": 12,
                "src": "35:7:0",
                "stateVariable": false,
                "storageLocation": "default",
                "typeDescriptions": {
                  "typeIdentifier": "t_uint256",
                  "typeString": "uint256"
                },
                "typeName": {
                  "id": 3,
                  "name": "uint",
                  "nodeType": "ElementaryTypeName",
                  "src": "35:4:0",
                  "typeDescriptions": {
                    "typeIdentifier": "t_uint256",
                    "typeString": "uint256"
                  }
                },
                "visibility": "internal"
              }
            ],
            "src": "34:9:0"
          },
          "returnParameters": {
            "id": 6,
            "nodeType": "ParameterList",
            "parameters": [],
            "src": "51:0:0"
          },
          "scope": 13,
          "src": "24:41:0",
          "stateMutability": "nonpayable",
          "virtual": false,
          "visibility": "public"
        }
      ],
      "scope": 14,
      "src": "0:67:0",
      "usedErrors": []
    }
  ],
  "src": "0:68:0"
}

In this JSON you have direct knowledge about a contract's state variables and a function's input/output parameters.

izlatkin commented 2 years ago

A very straightforward approach for now and, as mentioned, might be imprecise for some cases just couple methods https://github.com/izlatkin/solidity_testgen/blob/main/scripts/SolidityTestGen.py#L188 https://github.com/izlatkin/solidity_testgen/blob/main/scripts/SolidityTestGen.py#L221

leonardoalt commented 2 years ago

Is there any fundamental difference in signatures

For the encoding the differences will be in the name and return types, but not naming a parameter shouldn't influence the predicate's name.

leonardoalt commented 2 years ago

A very straightforward approach for now and, as mentioned, might be imprecise for some cases just couple methods https://github.com/izlatkin/solidity_testgen/blob/main/scripts/SolidityTestGen.py#L188 https://github.com/izlatkin/solidity_testgen/blob/main/scripts/SolidityTestGen.py#L221

Right. If you already have the source code and solc, I think you could detect the things you need without much more code using the python JSON lib.

izlatkin commented 2 years ago

A very straightforward approach for now and, as mentioned, might be imprecise for some cases just couple methods https://github.com/izlatkin/solidity_testgen/blob/main/scripts/SolidityTestGen.py#L188 https://github.com/izlatkin/solidity_testgen/blob/main/scripts/SolidityTestGen.py#L221

Right. If you already have the source code and solc, I think you could detect the things you need without much more code using the python JSON lib.

right, I think we can parse json now

grigoryfedyukovich commented 2 years ago

Is there any fundamental difference in signatures

For the encoding the differences will be in the name and return types, but not naming a parameter shouldn't influence the predicate's name.

Don't get it, still. What is the SMT name of the argument of hash function?

leonardoalt commented 2 years ago

should be "_uniqueCounter_ssaCounter" as far as I remember

leonardoalt commented 2 years ago

(argh GH formatting is destroying my msg)

grigoryfedyukovich commented 2 years ago

right, I think we can parse json now

I don't see how it helps us now. All our CHC-related code is in C++ now. We get an SMT model directly via Z3 API. We'd need to write code to serialize it and parse again from python.

leonardoalt commented 2 years ago

I don't know what the exact workflow is, I just suggested it as the precise replacement for the python code that @izlatkin sent.

leonardoalt commented 2 years ago

How is that code used currently in the workflow?

grigoryfedyukovich commented 2 years ago

I totally agree that @izlatkin should not wast time writing his own parser since it's already available

grigoryfedyukovich commented 2 years ago

How is that code used currently in the workflow?

@izlatkin

grigoryfedyukovich commented 2 years ago

I think our CHC-based CEX generator and the python-frontend are disconnected, still

grigoryfedyukovich commented 2 years ago

In order to connect, we need a correct mapping from an SMT model to a sequence of function calls in Solidity

grigoryfedyukovich commented 2 years ago

And for that, we need to know arities, etc..

grigoryfedyukovich commented 2 years ago

Ideally, we need this info together in the same module where we deal with SMT, that is in the C++ module.

grigoryfedyukovich commented 2 years ago

Alternatives are possible, but will take longer to develop

leonardoalt commented 2 years ago

I will make a branch with the naming changes and send it to you tomorrow.

leonardoalt commented 2 years ago

In order to connect, we need a correct mapping from an SMT model to a sequence of function calls in Solidity

This will require a few more naming changes, but should also be possible.

grigoryfedyukovich commented 2 years ago

P.S. I think it's kind of onanism to chat in the comments of github. Should move to Discord;)

grigoryfedyukovich commented 2 years ago

Can you please re-invite me to your server?

leonardoalt commented 2 years ago

Can you please re-invite me to your server?

just sent you an email with the invite

izlatkin commented 2 years ago

The brief workflow is currently following (executes by Python scripts):

Encode sol file to smt.
run ADT transform scripts - get ADT free encoding
get smart contract signature (it used to be just brute force parsing, but now can be a preprocessed json from solc a.sol --ast-compact-json )
run main algorithm with parameters from step 2 and 3 and get tests
execute tests
gather coverage
generate report

izlatkin commented 2 years ago

comments: here's a new soljson.js that has the types of the variables in their SMT names: https://output.circle-artifacts.com/output/job/72b84605-390a-4d0b-9a9b-202170bb7033/artifacts/0/soljson.js varName_nodeId_varType_SSAIndex is the new format where varType is: 0: state variable 1: function input var 2: function output var 3: function local var

izlatkin / solidity_testgen

[Q] naming convention for predicates, class fields, and arguments #7