Runtime bytecode without metadata cannot get a "full match"

Neurone commented 8 months ago

Description

In the absence of metadata in the on-chain code, the verification always returns a partial match when it would be better to specify that it is a “full match” but precisely without metadata.

In this example (https://hashscan.io/testnet/contract/0.0.5783975), I still had to attach fake metadata to get the verification through.

Steps to reproduce

Deploy a smart contract without metadata
Try to validate the contract (you can use any metadata you prefer, just make sure the declared source code hash matches)
See the verification returns a partial match.

Additional context

No response

Hedera network

mainnet, testnet, previewnet, other

Version

N/A

Operating system

None

ericleponner commented 8 months ago

Just curious: how did you build your bytecode without metadata ? I tried solc with --metadata-hash none and deployed: it removes IPFS hash but metadata hash encoding and compiler version remains. See 0.0.5821552.

Neurone commented 8 months ago

Hi @ericleponner, you can use the --no-cbor-metadata flag.

ericleponner commented 8 months ago

That's right: I read that just after click Comment button :/ Sorry for the noise. Ok… that means that even the compiler version can be missing from the bytecode available from mirror node :( I understand better why verification tools ask user to provide that piece of info …

svienot commented 7 months ago

This is one of the principles of the Sourcify verification (not saying that we shouldn't change it). I am wondering whether it would be weird to report a FULL MATCH when we have no way to know whether the sources have been edited...

acuarica commented 6 months ago

I believe reporting full match in the scenario you described may be misleading. From [1] (emphasis mine)

In other words, the deployed contract and the given source code + metadata function as the same but there are differences in source code comments, variable names, or other metadata fields such as source paths.

This type of match is similar to how Etherscan verifies contracts. Yes, the matching source code in theory functions the same as the deployed contract but the displayed source code can be misleading or the bytecode can contain excecutable instructions not seen in the source code.

Essentially this means that different contracts can lead to the same runtime bytecode (excluding metadata bytecodes). This happens for example when you change comments, variables names or even the contract name. That's why it's better to report it as partial match instead of full match.

@ericleponner @svienot @Nana-EC thoughts?

@Neurone if you are happy with this definition we can close this.

[1] https://docs.sourcify.dev/docs/full-vs-partial-match/#partial-matches

svienot commented 6 months ago

I am in agreement with that @acuarica

ericleponner commented 6 months ago

I am also in agreement with your proposal @acuarica

Neurone commented 6 months ago

Sourcify's definition of a full match is arbitrarily limiting without adding value to the verification operation.

I'm not saying the ecosystem is not used to those definitions; I'm saying we can improve them by adding a new definition and explaining how and why it's fine to have a full match for a smart contract without metadata attached.

Metadata is an off-chain invention unrelated to EVM, and it derives from the source code to help developers.

Metadata allows developers to quickly pass information about verifying their contract; they were never meant to be a security feature where you cannot get a full match just because you want to improve a comment or a variable name.

Putting multi-hash of the metadata alongside the smart contract code was meant to facilitate sharing that metadata in a decentralized manner, not as a security feature of the smart contract.

Smart contract verification means providing input to someone and proving that input produces the same output they see on-chain.

The inputs are the source code, the compiler's version, and its parameters.

You have a full match if you can obtain the same output autonomously with those inputs.

If you get a different output, but the only difference is in the bytes exceeding the executable code (essentially gibberish - regardless of whether it contains somehow meaningful metadata for someone), that is a partial match.

If the executable bytecode does not match, that is a nonmatch.

Having different source code producing the same output is fine.

I can send you a Vyper contract different from a Solidity contract the developer initially used to produce the bytecode and specific Vyper compiler parameters, and that can still produce a valid and full matching output; there's no security or hidden risk with that.

If I provide the source code and the correct compiler parameters (--no-cbor-metadata, --optimize, --optimize-runs, etc.) to let you reach the same result on-chain, that is a full match.

Nana-EC commented 6 months ago

Thanks @Neurone From a decentralized point of view I do see what your highlighting. Given this repo essentially forks the sourcify repo and uses their verification logic and use of metadata files it makes sense to stay in line with it at this juncture.

However, we can certainly revisit this and see how to modify/enhance/divert from the implemented dependency logic.

acuarica commented 6 months ago

After discussing with @Neurone, we didn't reach a clear cut consensus. However, we agreed that the distinction between partial and full match may seem partial match look as a somewhat incomplete match. We have discussed whether renaming it to something different, e.g., bytecode match maybe a better option.

The issue I see with this is that we deviate from Sourcify and Etherscan definitions, which can be problematic for the user used to those definitions.

Neurone commented 6 months ago

I'm still convinced I should obtain a full match badge, even under Sourcify's definition (https://docs.sourcify.dev/docs/full-vs-partial-match/#full-perfect-matches), if I can verify a smart contract with no garbage - additional data after the executable bytecode - bit-by-bit. I can also derive the metadata from the source code and verify them.

But let's assume we don't want to modify Sourcify's code to get a full (perfect) match assignment; I suggest at least improving the current partial match assignment with something different, like a full match (no metadata provided) or something similar.

Below are the reasons why we should fix Sourcify's verification source code instead.

Starting from an Example.sol source code like this:

// SPDX-License-Identifier: GPL-3.0
pragma solidity >=0.7.0 <0.9.0;

contract Example {
    string public publicname = "I'm an example";
}

Compile with this parameters:

$ solc --metadata --bin --no-cbor-metadata --overwrite -o . Example.sol

You get this Example_medata.json metadata file:

{
    "compiler": {
        "version": "0.8.23+commit.f704f362"
    },
    "language": "Solidity",
    "output": {
        "abi": [
            {
                "inputs": [],
                "name": "publicname",
                "outputs": [
                    {
                        "internalType": "string",
                        "name": "",
                        "type": "string"
                    }
                ],
                "stateMutability": "view",
                "type": "function"
            }
        ],
        "devdoc": {
            "kind": "dev",
            "methods": {},
            "version": 1
        },
        "userdoc": {
            "kind": "user",
            "methods": {},
            "version": 1
        }
    },
    "settings": {
        "compilationTarget": {
            "Example.sol": "Example"
        },
        "evmVersion": "shanghai",
        "libraries": {},
        "metadata": {
            "appendCBOR": false,
            "bytecodeHash": "ipfs"
        },
        "optimizer": {
            "enabled": true,
            "runs": 200
        },
        "remappings": []
    },
    "sources": {
        "Example.sol": {
            "keccak256": "0xb7b5decb8134b6e5dc8c0cded692eb0dc2aa573de5fa772ec5845580d24c0a07",
            "license": "GPL-3.0",
            "urls": [
                "bzz-raw://e1515e4412634066bda9f71ed5fb030ba720148e94a078989a72a53c96a62d24",
                "dweb:/ipfs/QmV9MqJRxQfePmpnXjUr2oTvS3jMqQ8rebCFtCLvDnNpmU"
            ]
        }
    },
    "version": 1
}

Please note the "appendCBOR": false option there, which is a valid option that impact the final bytecode precisely as any other compiler's parameter, i.e., "optimizer": true.

You get also this Example.bin runtime bytecode:

608060405234801561000f575f80fd5b5060043610610029575f3560e01c80639001b1a21461002d575b5f80fd5b61003561004b565b60405161004291906100d6565b60405180910390f35b5f805461005790610122565b80601f016020809104026020016040519081016040528092919081815260200182805461008390610122565b80156100ce5780601f106100a5576101008083540402835291602001916100ce565b820191905f5260205f20905b8154815290600101906020018083116100b157829003601f168201915b505050505081565b5f602080835283518060208501525f5b81811015610102578581018301518582016040015282016100e6565b505f604082860101526040601f19601f8301168501019250505092915050565b600181811c9082168061013657607f821691505b60208210810361015457634e487b7160e01b5f52602260045260245ffd5b5091905056

Deploy that smart contract, and you will obtain this result: https://hashscan.io/testnet/contract/0.0.6799542

Now, provide the source code and metadata file to someone - or just the source code and the compiler's parameters above - and they will be able to obtain the same bytecode they see on-chain.

They will be able to prove also the metadata are correct, because the metadata are generated by the compiler starting from the source code. And using the correct parameters they will get the exact same metadata file.

So, this should be a regular full match.

Currently, we cannot get a full match using the Sourcify repo and validation code, and we obtain a partial match.

This case does not fit Sourcify's partial match definition because it's not a case where there's a mismatch in the metadata. This is why I consider this a bug.

acuarica commented 6 months ago

Hi @Neurone, thanks for taking time to investigate this.

In your example, when you set "appendCBOR": false it removes the metadata generation from the final bytecode. Thus the Example.bin does not contain any metadata. To verify this, paste the bytecode into or paste contract bytecode box of https://playground.sourcify.dev/. You will see the error Unable to decode bytecode with CBOR.

This case does not fit Sourcify's partial match definition because it's not a case where there's a mismatch in the metadata.

Given there is no metadata available on-chain, there is no metadata to compare. That's why it reports it as partial match.

Neurone commented 6 months ago

Yeah, that is because Sourcify's lacks useful features for smart contracts validation, so they don't understand there's no metadata to compare ;) Or, even if the metadata reporting the appendCBOR:false is provided by the user, still Sourcify is unable to derive it from the source code, as any verifier can do instead.

acuarica commented 6 months ago

so they don't understand there's no metadata to compare

But there is no metadata to compare on-chain, which is the only source of information to retrieve the metadata from.

still Sourcify is unable to derive it from the source code

They can derive a metadata (from solc) but given there is no metadata on-chain, the source code you provided might differ in comments, variable names, source map (opcode <--> ast node), etc. Hence the partial match.

as any verifier can do instead.

Can you provide an example? that would be great.

Neurone commented 6 months ago

Metadata, source map, etc., is generated from the source code and the compiler parameters.

Elements that don't contribute to the final bytecode are irrelevant when operating a smart contract verification.

As a verifier, I want source code and compiler parameters. If I can recreate the same bytecode I see on-chain, bit-by-bit, it means I have a valid and usable source code and metadata.

By definition, it's irrelevant if variable names or comments are different from those used by the original author. But if smart contract authors care instead, they can set compiler parameters accordingly.

It's up to the authors to decide these options, not to any validators like Sourcify.

If the smart contract authors don't want metadata attached to their bytecode, they set the compiler's options to do it. If they want the smart contract optimized, they set the options to do so. If they want to compile with a specific compiler version, they set the options to do so.

It's not Sourcify's - or any validator's - duty to define what is good and what is bad; they should provide a good implementation of the verification process and say if the verification was successful.

The verification process of a smart contract is always the same for everyone:

I write the source code
I run the compiler, creating the metadata file
I deploy the contract
I send you the source code and the metadata file
You run the compiler over the source code using the parameter in the metadata file and obtain the bytecode and the metadata file
You compare the bytecode on-chain and the metadata file generated
If they match with what you expect, the contract is entirely valid and 100% verified

Sourcify's implementation is limited because they consider full (perfect) match only cases they arbitrarily decided they care about - only those smart contracts with specific data attached at the end of the executable bytecode.

Etherscan, for example, has a better implementation.

Here is the Uniswap v3 router contract that Etherscan considers correctly a verified smart contract with an exact match.

That same smart contract cannot get a full (perfect) match with Sourcify at the moment because it does not contain metadata in the specific format requested by Sourcify's implementation.

This is not a problem per se, nor a problem for me; each service can assign arbitrary verification badges and develop the verification implementation differently, and we can develop our implementation or improve existing ones if we like.

And again, it's not a problem because it's not a matter of trust.

Users should not trust Etherscan, Sourcify, our service, or any other external validator; they should use tools that let them verify smart contracts autonomously.

If a validator - i.e., Sourcify - decides to arbitrarily reduce the options for the validation process, it's just a specific limited implementation with fewer features than other implementations.

Sourcify is not the source of truth for smart contract verification, and it's not the best implementation today.

It's something better than nothing, it's open source, and we are all grateful for that, but it can definitely be improved.

Nobody forces us to improve the current Sourcify implementation, and we can decide to stick precisely with that source code.

However, I cannot be convinced that their current implementation is the sole and only correct path to follow, and those who don't comply with their implementations or definitions are offering a worse service.

kuzdogan commented 6 months ago

Hey all. I've been trying to reach out to you (@svienot) through Matrix and Discord. I see you guys are working hard :) and think we should keep an open line of communication. What would be the best way to connect?

acuarica commented 5 months ago

Hey @Neurone, interesting discussion so far 😄 . I guess the core point where we disagree is

Elements that don't contribute to the final bytecode are irrelevant when operating a smart contract verification.

My view here aligns with that of Sourcify's. The statement above should only apply to partial matches. Full matches should include other elements that don't contribute to the final bytecode. A good argument can be found here[1]

If you verified that the code you deployed is an exact match to what you open-sourced that is great! Developers now can verify themselves that your dapp is actually doing what it promised.

However, it will still be hard for end-users to understand what they are doing when interacting with your code. Only by translating what the contract interactions mean into NatSpec you can help them comprehend what's going on under the hood. To ensure that you don't comment code with inadequate comments to mislead the users, all of the comments and metadata also get published and verified. Malicious actors could of course still use "wrong" NatSpec comments, however, Sourcify would guarantee that the comments match those used at compilation time. Other developers could now check the NatSpec comments in the open-source code and verify their accuracy.

[1] https://soliditylang.org/blog/2020/06/25/sourcify-faq/ § How is source verification & commenting your code with NatSpec connected and why is it more powerful if you do both?

acuarica commented 2 months ago

Hi @Neurone, as are closely aligned with Sourcify verification strategy, we are planning to use their images directly https://github.com/hashgraph/hedera-sourcify/issues/147 (instead of having our own custom images). Thus, I'm not sure how we can move forward with this issue. Let me know if it's ok for you to close this.

In any case, thanks for opening this, it has been a valuable conversation.

Neurone commented 2 months ago

Hi @acuarica, I don't have any other argument to add to this issue. If the decision is not to implement this, yes, we should close this with the motivation that we want to strictly adhere to the Sourcify verification rules.

hashgraph / hedera-sourcify