makerdao / spells-mainnet

Staging repo for MakerDAO weekly executive spells
GNU Affero General Public License v3.0
107 stars 43 forks source link

Refactor: change `testAuth*` strategy #374

Closed amusingaxl closed 5 months ago

amusingaxl commented 6 months ago

Context

A couple of months ago, Dewiz wrote a blog post exploring the execution costs of the spell test suite.

During the analysis, we found out that the testAuth and testAuthInSources test cases alone correspond to the majority of the requests made against the RPC node.

Those tests aim to prevent the deployer of contracts being added to the protocol from keeping wards privilege over them.

The main issue is that they are structured as a "shotgun" solution: the tests run every time (even if there are no new contracts), iterate over the entire chainlog (which contains 400+ contracts on Mainnet) and over the deployers list on addresses_deployers.sol (20+ addresses) in a nested loop.

That alone leads to 8000+ unique requests to the RPC provider, which ends up being heavily throttled, leading to very large execution times.

Proposal

To alleviate the problem, I propose we change testAuth* to a more incremental approach: contracts are only checked for dangling wards when they are first added to the protocol.

If the permissions are correctly set during the onboarding, the only way to modify the wards of a given contract is through another spell, however the current spell review process would not allow for that. In other words, there is no additional gain in terms of security bought by checking every single contract against every single deployer address on every single test run, we just need to ensure that the onboarding is properly done.

To make sure this step won't be missed, there's an accompanying PR on https://github.com/makerdao/pe-checklists/pull/17 to explicitly instruct crafters and reviewers to include such tests when they are required.

wei3erHase commented 6 months ago

Outsider comment, curious about your opinion, totally agree that these tests are making the testing suite last A LOT more and unresponsive until they're finally checked (~3m in my computer), also creating a bunch of requests to the RPC that could trigger a DoS when using a free service.

I wonder if we could think of a more holistic approach, that doesn't involve adding even one more check to both crafter and reviewer tasks, considering that, as long as these tasks keep growing, we can expect to each of them be "less important" to check, and perhaps to be overlooked in some PR. I see the handling of private vs public in tests a human hazard that could be avoided.

Questions on the underlying rationale of these tests (again, have nothing to do with this, so feel free to answer or ignore):

And the more holistic approach comment, perhaps to be implemented with another workflow i'm thinking about (gonna open an issue about):

wei3erHase commented 6 months ago

Submitted a PoC PR in #375 @amusingaxl I'm eager to learn what do you think about that process.

amusingaxl commented 6 months ago
  • What's the point on checking 20 addresses, while an address that's not on the list won't be checked

Every wallet that deploys contracts to be used in Maker Protocol MUST be added to the addresses_deployers.sol file. If the crafter fails to do so, there's a specific item in the review checklist that would catch that and the spell would not be approved until it's fixed.

There's no easy way to perform an on-chain check that no other address is present in wards due to the non-enumerability property of mappings. The closest we could get would be checking events emitted by the contract if they are fully compliant with Maker standards: there should be a corresponding Deny event with the same address in the payload for every Rely, except the Rely on MCD_PAUSE_PROXY.

The process is designed to work with this whitelist because any address outside of it SHOULD NOT be allowed. Of course, since the list is manually kept, there's room for human error, which we try to mitigate by having a comprehensive checklist that must be strictly followed and a minimum requirement of 2 reviewers to be able to ship the code.

  • What if the deployed contract chooses to revert when called with any of the deployers? (the test will pass as success)

This check is only relevant for contracts that have special permissions within the Maker Protocol. A contract with such would never get past internal reviews or external audits.

External contracts never have any permissions in the system, so even in the unlikely scenario a contract behaves like that (it would be suspicious to say the minimum), there would be no effect, so the test output can be ignored.

Regarding your suggestions, I'll answer in the PR you opened.

amusingaxl commented 6 months ago

i think the danger of this design is that it increases the trust assumptions on the correct completion of the added steps of https://github.com/makerdao/pe-checklists/pull/17 and the effect would be "a test not running"

Notice that this is not the only check performed. On L110 there's a manual check for dangling permissions.

Regarding the effect of a "test not running", I believe we can leverage the skip(bool) trick to make it explicit. Reviewers would then see in the test output that some tests were skipped, which would reduce the chances of overlooking the problem.

amusingaxl commented 6 months ago

I want to ask if you considered using multicall pattern to potentially speed up the tests without modifying their logic and the spell process?

I am not aware that we can use multicall within Foundry directly. We could potentially write a proxy server that would collect the requests, build a multicall and then answer those, but that seems like a huge effort in terms of maintenance.

Add new _checkWardsOrSrcWards that would automatically check if wards are present in the contract or in the source and check them, instead of requiring developer to pick the right function

I don't oppose this idea directly, however I don't believe there is a lot to gain here.

One of the reasons why I renamed the test from testAuthInSources to testOsmAuth is that this pattern is only applicable to OSMs, which have an underlying Median contract with the actual price feed. In the checklist PR I even added a note stating that the test is only relevant to OSM instances.

OSMs are being deprecated, the next generation of oracles will not feat them. Also, the latest PIP added to the system was PIP_MKR. It does not have an OSM, even though it is still built under the current version of oracles.

I believe it's very unlikely that we will onboard new OSMs to the protocol at this point, which means this specific test could potentially be removed in the future.

Run _checkWardsOrSrcWards inside _checkChainlogKey to always check for wards in all newly added contracts, instead of requiring developer to add a separate test

That's not a bad idea. The only drawback is that the check is not explicit anymore, and the test would serve multiple purposes. We could potentially move the auth tests closer to testNewChainlogKeys and make their structure similar.

Alternatively or additionally to the above, we can also run _checkWardsOrSrcWards on all chainlog addresses as soon as there were calls to the chainlog from the spell (e.g. detecting it via vm.expectCall)

Please correct me if I'm wrong, but vm.expectCall is meant to be used as an assertion that a specific call happened, not as a hook to react to a specific call.

SidestreamColdMelon commented 6 months ago

I am not aware that we can use multicall within Foundry directly

Multicall is just a contract that can batch generic read requests together. Maker even have a set of maker-developed and maker-deployed contracts here. I think that packing all ward(deployer) requests to the same address together into a single call would be possible as well as the other way around.

Run _checkWardsOrSrcWards inside _checkChainlogKey to always check for wards in all newly added contracts, instead of requiring developer to add a separate test

The only drawback is that the check is not explicit anymore

I don't agree that it's a drawback, since it specifically meant to prevent human error. _checkChainlogKey is also generic name which does not need to be changed to do this additional check. It would then not require changes to the process and following archive patterns would still comply with the change.

Please correct me if I'm wrong, but vm.expectCall is meant to be used as an assertion that a specific call happened, not as a hook to react to a specific call.

vm.expectCall can be used inside try/catch to detect if any call to the chainlog contract was made inside the spell and then run additional checks. Maybe there is better way to check that, it is just a quick idea how to execute the full check only when it's needed.

amusingaxl commented 6 months ago

Multicall is just a contract that can batch generic read requests together

Oh, I thought you were referring to RPC multicall. Do you have one of these contract at hand so we could benchmark?


I'll push some changes addressing part of your suggestions.

amusingaxl commented 6 months ago

vm.expectCall can be used inside try/catch to detect if any call to the chainlog contract was made inside the spell and then run additional checks. Maybe there is better way to check that, it is just a quick idea how to execute the full check only when it's needed.

Oh, ok. I think I got the idea! You want to perform a full auth check on the chainlog if it has been modified to make sure we don't miss anything on testNewChainlogValues, right?

wei3erHase commented 6 months ago

Multicall is just a contract that can batch generic read requests together. Maker even have a set of maker-developed and maker-deployed contracts here. I think that packing all ward(deployer) requests to the same address together into a single call would be possible as well as the other way around.

I believe this is a good practice when dealing with a JS + Solidity environment, we actually use it on a lot of frontends, deploy a contract that batches the requests into a single call and let the RPC handle the rest.

function checkAuth(bytes32[] memory contractNames, address deployer) {...}

In Foundry, tho, since tests are already a contract call that does that, the benefits of "batching" are not present, the processing would happen or in this "batch contract", or inside the test method, in the end, is the same. The problem with these tests are really the amount of data to be read, +8000 storage reads.

I agree that the test wasn't also a line of defense vs unwanted authorizations happening, since it was only checking the deployer addresses, and currently like the current design, where these tests run when a contract is added to the Chainlog instead. 👍

amusingaxl commented 6 months ago

Regarding the vm.expectCall suggestion, apparently a failure there is not catchable.

I'm trying locally with:

    function _executeSpell() external {
        _vote(address(spell));
        _scheduleWaitAndCast(address(spell));
        assertTrue(spell.done());
    }

    function _checkSpellDoesNotModifyAddressesInChainlog() external {
        vm.expectCall(address(chainLog), abi.encodeWithSelector(chainLog.setAddress.selector), 0);
        this._executeSpell();
    }

    function _testFullAuth() internal {
        uint256 initial = vm.snapshot();
        try this._checkSpellDoesNotModifyAddressesInChainlog() {
            console2.log("[testFullAuth] Chainlog not modified, skipping...");
            return;
        } catch {
            console2.log("[testFullAuth] Chainlog modified, running...");
        }
        vm.revertTo(initial);

        _vote(address(spell));
        _scheduleWaitAndCast(address(spell));
        assertTrue(spell.done());

        bytes32[] memory keys = chainLog.list();
        for(uint256 i = 0; i < keys.length; i++) {
            console2.log("[testFullAuth] Checking key:", string(abi.encodePacked(keys[i])));
            // _checkAuth(keys[i]); // skiping this for testing purposes
        }
    }

If the spell violates _checkSpellDoesNotModifyAddressesInChainlog, the test will fail regardless of the outcome of the remaining code, but the execution path will follow the try block, not catch and it will log this message:

Chainlog not modified, skipping...

I even tried to put calls in external helper functions to force the CALL opcode to be used, but no luck so far.

SidestreamColdMelon commented 6 months ago

I believe this is a good practice when dealing with a JS + Solidity environment

Yes, there are even libraries which nicely do those calls for you. I also know this approach from the frontend world where it makes sense for multiply reasons (getting info on the same block, saving request quota, etc). But looking at your explanation I'm not yet convinced it wouldn't help. From the investigation we know that the main bottleneck is throttling of the requests that happening on the provider side. Multicall suppose to help with that.

Do you have one of these contract at hand so we could benchmark?

I think you can use the address mentioned here or the newer multicall3 address.

You want to perform a full auth check on the chainlog if it has been modified to make sure we don't miss anything on testNewChainlogValues, right?

Yes, correct. Having checks inside _checkChainlogKey already improves the process. Having an automatic check of everything is even better, if we want to be extra cautions.

Regarding the vm.expectCall suggestion, apparently a failure there is not catchable.

Thanks for the code, I will also look into it 🙂

wei3erHase commented 6 months ago

If the spell violates _checkSpellDoesNotModifyAddressesInChainlog, the test will fail regardless of the outcome of the remaining code, but the execution path will follow the try block, not catch and it will log this message:

With vm.expectCall test will [FAIL], but the execution of the call will not revert, so it will not follow the catch path if there's no call. If I recall correctly, there's no way of "getting" the amount of calls in Solidity (which would be processable into a revert), just a cheatcode to expect them (and pass/fail the test).

There is tho, if willing to do an extra test, a way of calling vm.expectCall(..., n) being n the total amount of calls made, comparable with the length of the array of keys to input to the test, verifying "n amount of addresses were added to the chainlog" (and not more).

amusingaxl commented 6 months ago

I made some additional changes to reduce some redundancy when testing the chainlog.

Please notice that testChainlogValues already checks the chainlog version and that all key-value pairs on-chain match the entries in addresses_mainnet.sol.

There is no need to do that again for keys being added or updated, so the only additional check we need is for the auth in the updated contracts.

Following @wei3erHase suggestion, I added a vm.expectCall to ensure chainLog.setAddress is called the exact amount of times as the number of modified keys. That prevents us from missing new keys.

Last, but not least, I improved the _testChainlogVersionBump so any modifications made in the chainlog that do not produce a version bump will fail. There's a small chance for a false positive here, when the spell removes the key and add it again, but I believe that is something we want to avoid anyway.

amusingaxl commented 5 months ago

With the latest commit, I think I managed to eliminate the human error factor from the equation. The new testChainlogIntegrity test case should be able to automatically cover everything we've been discussing so far.

It supersedes testChainlogVersionBump and also checks for dangling wards in any updated values in the chainlog.

Please take a look and share your feedback.