sherlock-audit / 2024-06-makerdao-endgame-judging

1 stars 1 forks source link

mrhudson890 - `VoteDelegate`'s `reserveHatch` allows multi-block MEV to grief LSE users during time-sensitive voting periods #95

Closed sherlock-admin3 closed 1 month ago

sherlock-admin3 commented 1 month ago

mrhudson890

Medium

`VoteDelegate`'s `reserveHatch` allows multi-block MEV to grief LSE users during time-sensitive voting periods

Summary

LockstakeEngine (LSE) and VoteDelegate (VD) interaction allows malicious actors to exploit multi-block MEV (MMEV) which is possible for roughly 30% of blocks. This allows griefing users of significant funds, and disrupting time-sensitive governance functions. By manipulating VD's reserveHatch()(RH) in the first block of MMEV, attackers can cause users' LSE transactions (selectVoteDelegate, lock, lockNgt) to fail during crucial voting periods, potentially leading to losses exceeding 0.5%, or even 5% of user value in gas fees. This exploit not only results in financial losses but also prevents time-sensitive voting delegations, manipulating governance outcomes, especially in emergency situations. The non-upgradeable nature of the system amplifies the long-term impact given MMEV.

The cost to attacker is minimal: 28K gas, while the cost to user is much larger, e.g., 900K gas for a large multicall batch. At 200 gwei and 3500 USD per ETH, this translates to attacker cost (3500×28000×200÷10^9): 19 USD, and user cost of 630 USD (more than 5% for 10K user value) in the case of a large multicall batch.

Root Cause

The VoteDelegate contract includes a reserveHatch (RH) mechanism that, when called, prevents locking (entering) for a set number of blocks (5), starting from the next block. This prevents single block MEV griefing.

function lock(uint256 wad) external {
    require(block.number == hatchTrigger || block.number > hatchTrigger + HATCH_SIZE,
            "VoteDelegate/no-lock-during-hatch");
    ...
}

However, this does not prevent MMEV: MEV over consecutive blocks controlled by a single builder or proposer. This is exacerbated by the use of Multicall which will cause multiple batched user actions to all fail, wasting a very high amount of gas fees.

The LockstakeEngine uses this lock function in its _selectVoteDelegate internal method, that's used throughout selectVoteDelegate, lock, and lockNgt user facing functions:

function _selectVoteDelegate(address urn, uint256 wad, address prevVoteDelegate, address voteDelegate) internal {
    if (wad > 0) {
        // ...
        if (voteDelegate != address(0)) {
            // ...
            VoteDelegateLike(voteDelegate).lock(wad);
        }
    }
    // ...
}

This design allows an attacker to exploit the reserveHatch mechanism to prevent users from executing these functions, particularly during critical voting periods.

This griefing vector is exacerbated by the fact that a Multicall batch, carrying potentially several gas expensive user actions, will fail when one of the vulnerable actions is blocked. This is because calling RH can be as cheap as 28K gas, but the loss for the user can be much higher, potentially Millions in gas units (depending on the batched operations). This creates a very high griefing factor (cost to damage ratio) of e.g., 3200% griefing ratio (damage / cost) for a 911K batch (from the PoC).

The vulnerability is enabled by the current state of Ethereum's block production, where a small number of builders control a large portion of block production (54% for beaverbuild alone), enabling multi-block MEV attacks via builders (controlling many consecutive blocks) as well as individual proposers (controlling significant stake and occasional consecutive blocks). For example coinbase controlled proposers control 15% of the stake right now.

If a builder controls 54% of the blocks, the chances of the next block being controlled by them is 54%, which translates to 29% for any given block. Thus, if supported by beaverbuild, MMEV would enable this vector for 29% of all blocks. If MMEV becomes widely adopted by MEV infra software, this could rise to as much 60%, since MEV infra is used by around 80% of stake.

While this finding focuses on selectVoteDelegate, lock, and lockNgt, it's worth noting that free, freeNGT can also be blocked by frontrunning with a deposit into the VoteDelegate due cheif's flashloan protection. This opens even more MMEV options, due to the ability to block multicalls selectively, by either targeting selectVD / lock actions, or targetting the free actions for different batches. Both attack paths are linked via the tradeoff imposed by the reserveHatch and cheif functionality between only being able to either free or lock during a single block.

Internal pre-conditions

Target VD's RH was not called prior to this block. This is highly likely, since calling RH must be planned, and requires the user to then wait for 6 blocks (more than a minute) before sending their intended transactions (due to the block).

External pre-conditions

A builder supporting MMEV, OR a single proposer controls two consecutive blocks.
Gas prices are high.

Attack Path

Scenario 1: Builder-based MMEV Attack

An attacker observes queued mempool multicall transactions involving delegation updates in LSE during a time-sensitive voting period.
The attacker users a builder supporting MMEV to execute the following:
- Block N: Include the attacker's reserveHatch() call to the target VoteDelegate.
- Block N+1: Include the user's multicall transaction, which will fail when it tries to execute lock() in the VoteDelegate.

Scenario 2: Proposer-based MMEV Attack

A malicious proposer is scheduled to propose two consecutive blocks during a time-sensitive voting period and observes vulnerable transactions in the mempool.
The proposer executes:
- Block N: Include the attacker's reserveHatch() call.
- Block N+1: Include the user's transaction, which will fail.

Scenario 3: Selective Governance Attack

A controversial or emergency governance proposal is submitted.
The attacker repeatedly executes the reserveHatch attack during the voting period when possible (outside of cooldown periods) against users trying to vote against the attacker's proposal.
Many users will have their transactions fail if not submitted during RH cooldown periods.
This griefs a large amount of user, increases the user costs for governance participation for one outcome over the over, increasing the chance of the attacker influencing the vote.
This delays users' time-sensitive vote delegations in times of emergency.

Impact

Loss of funds: Users lose significant gas fees when their time-sensitive transactions fail. Depending on gas prices and complexity of multicall transactions, losses could amount to hundreds of dollars per failed transaction. For example, at 3500 USD per ETH, and 200 gwei gas price:
- Attack cost: 28K gas, 19 USD.
- Large multicall revert cost (PoC - 8 ops): 911K gas, 637 USD
- Smaller multicall revert cost (PoC - 5 ops): 709K gas, 496 USD
- No multicall, islocated selectVoteDelegate revert: 101K gas, 70 USD
Governance Interference: The attack can prevent users from updating their vote delegations during critical periods, potentially swaying governance decisions. This is particularly impactful for emergency proposals that require quick action.
Long-term Participation Decline: Consistent attacks could discourage users from participating in governance, leading to more vulnerable governance due to lower honest user participation.
Systemic Risk: Given that the LockstakeEngine is designed to be non-upgradeable and used for multiple years, this vulnerability poses a long-term risk to the system's integrity and decentralization due to the likelihood of MMEV prevalence.

PoC

The PoC measures the gas costs for:

Calling reserveHatch (attacker): 28K gas
Gas costs for the griefed user for isolated selectVoteDelegate revert: 101K gas
Gas costs for a 5 ops multicall batch revert: 709K gas
Gas costs for a 8 ops multicall batch revert: 911K gas

The PoC adds tests for LockstakeEngineBenchmarks class in Benchmarks.t.sol file in this measurement PR https://github.com/makerdao/lockstake/pull/38/files of branch https://github.com/makerdao/lockstake/tree/benchmarks. I've also merged latest dev into it to ensure it's using latest contracts.

The output of the PoC:

>>> forge test --mc LockstakeEngineBenchmarks --mt testGas -vvv

Logs:
    reserveHatch cost: 28248
    multicall (5 ops) revert cost: 709731
    multicall (8 ops) revert cost: 911424

Logs:
    reserveHatch cost: 23594
    selectVoteDelegate revert cost: 101133

The added tests:

function testGasMulticallRevert() public {  
    uint startGas = gasleft();  
    VoteDelegate(voteDelegate).reserveHatch();  
    uint gasUsed = startGas - gasleft();  
    console2.log("  reserveHatch cost:", gasUsed);  

    vm.roll(block.number + 1);  

    mkr.approve(address(engine), 1_000_000 * 10**18);  

    address urn = engine.getUrn(address(this), 0);  

    bytes[] memory data = new bytes[](5);  
    data[0] = abi.encodeCall(engine.open, (0));  
    data[1] = abi.encodeCall(engine.lock, (urn, 100_000 * 10**18, 5));  
    data[2] = abi.encodeCall(engine.draw, (urn, address(this), 2_000 * 10**18));  
    data[3] = abi.encodeCall(engine.selectFarm, (urn, address(farm), 0));  
    data[4] = abi.encodeCall(engine.selectVoteDelegate, (urn, voteDelegate));  

    vm.expectRevert();  
    startGas = gasleft();  
    engine.multicall(data);  
    gasUsed = startGas - gasleft();  
    console2.log("  multicall (5 ops) revert cost:", gasUsed);  

    data = new bytes[](8);  
    data[0] = abi.encodeCall(engine.open, (0));  
    data[1] = abi.encodeCall(engine.selectFarm, (urn, address(farm), 0));  
    data[2] = abi.encodeCall(engine.lock, (urn, 100_000 * 10**18, 5));  
    data[3] = abi.encodeCall(engine.draw, (urn, address(this), 1_000 * 10**18));  
    data[4] = abi.encodeCall(engine.lock, (urn, 50_000 * 10**18, 5)); // simulates lockNgt  
    data[5] = abi.encodeCall(engine.draw, (urn, address(this), 1_000 * 10**18));  
    data[6] = abi.encodeCall(engine.getReward, (urn, address(farm), address(this)));  
    data[7] = abi.encodeCall(engine.selectVoteDelegate, (urn, voteDelegate));  

    vm.expectRevert();  
    startGas = gasleft();  
    engine.multicall(data);  
    gasUsed = startGas - gasleft();  
    console2.log("  multicall (8 ops) revert cost:", gasUsed);  
}  

function testGasSelectVDRevert() public {  
    address voter2 = address(1234);  
    vm.prank(voter2);  
    address voteDelegate2 = delFactory.create();  

    address[] memory yays = new address[](5);  
    for (uint256 i; i < 5; i++) yays[i] = address(uint160(i + 1));  
    vm.prank(voter); VoteDelegate(voteDelegate).vote(yays);  
    vm.prank(voter2); VoteDelegate(voteDelegate2).vote(yays);  
    address urn = _urnSetUp(true, true);  

    uint startGas = gasleft();  
    VoteDelegate(voteDelegate2).reserveHatch();  
    uint gasUsed = startGas - gasleft();  
    console2.log("  reserveHatch cost:", gasUsed);  
    vm.roll(block.number + 1);  

    vm.expectRevert();  
    startGas = gasleft();  
    engine.selectVoteDelegate(urn, voteDelegate2);  
    gasUsed = startGas - gasleft();  
    console2.log("  selectVoteDelegate revert cost:", gasUsed);  
}

Mitigation

To mitigate this issue, the VoteDelegate contract's reserveHatch mechanism should be modified to introduce a longer delay before the hatch takes effect. Instead of starting from the next block, it should start after M blocks, where M is a governance-controlled parameter. The parameters can be stored on the VD factory, and the VDs can query it there. This makes the attack more expensive and less likely to succeed, as controlling M consecutive blocks is more expensive and less likely.

The total time the hatch is open and the duration of the cooldown cycle is not changed. The only modification is the timing of the hatch in the cycle. Instead of the cycle being 25 blocks starting with the 5 hatch blocks, the 5 hatch blocks are moved later in the cycle.

Proposed change:

function lock(uint256 wad) external {
-        require(block.number == hatchTrigger || block.number > hatchTrigger + HATCH_SIZE, "VoteDelegate/no-lock-during-hatch");
+        uint256 hatchDelay = factory.getHatchDelay(); // set by gov on the factory
+        bool beforeHatch = block.number < hatchTrigger + hatchDelay;
+        bool afterHatch = block.number > hatchTrigger + hatchDelay + HATCH_SIZE;
+        require( beforeHatch || afterHatch, "VoteDelegate/no-lock-during-hatch");
        ...
    }

Governance can adjust hatchDelay based on the likelihood of MMEV issues.

sunbreak1211 commented 1 month ago

The reserve hatch mechanism includes both a period for allowing locks (15 blocks) and a period for allowing frees (5 blocks). Thus if an action is critical enough, its issuer can do it in the respective sequence of blocks where it should succeed regardles of MMEV griefing. It is also stated in the submission that the lockstake engine does not have an upgrade path, which it has. So in case MMEV griefing attacks such as the ones described become a real problem in the Ethereum ecosystem and in lockstake, a migration to another contract can be done.

mrhudson890 commented 1 month ago

I'd like to escalate.

Both direct financial loss, and broader risk to protocol governance are likely and are not addressed by the sponsor's response.

Impact unchallenged

The griefing impact was not disputed by the sponsor, instead they pointed out a workaround either for the user to use after being griefed, that is unlikely to be used prior to griefing, and which was already mentioned in the submission:

Internal pre-conditions

Target VD's RH was not called prior to this block. This is highly likely, since calling RH must be planned, and requires the user to then wait for 6 blocks (more than a minute) before sending their intended transactions (due to the block).

However, this workaround is unlikely to prevent the griefing attack if the attacks are not already prevalent - because relies on otherwise wasteful and implausible user behavior: always submitting extra transactions, and waiting between sequential steps for a long period of time for no good reason (most of the time).

Unreasonable user behavior assumptions

The reserve hatch mechanism includes both a period for allowing locks (15 blocks) and a period for allowing frees (5 blocks). Thus if an action is critical enough, its issuer can do it in the respective sequence of blocks where it should succeed regardles of MMEV griefing.

The sponsor assumes one of two things:

Before the users submits their transaction, for their VD - which can be as numerous as users - reserveHatch for that user's VD in the preceding 20 blocks was called, so the attacker cannot call it. This is not the scenario raised. In my scenario, reserveHatch for that VD was not called prior.
- Note that while pre-emptive calls to RH on VDs of unsafe urns for liquidations are reasonable (unsafe urns are known). It's impossible to periodically call RH on all existing VDs - since it's impossible to predict where regular users would like to delegate.
- There's no reason that all existing VD's RH are always called.

So that assumption is invalid for the scenarios covered in the report. The other alternative:

The user is assumed, before their intended transaction, to always, and without suspecting anything:
- First call reserveHatch on their VD
- Wait for at least one minute (5 blocks) for the cooldown period to begin
- Only then submit their transaction

This assumption is highly unreasonable and is unlikely wasteful user behavior - instead of submitting a transaction, submitting a "pre-transaction" transaction, waiting for more than a minute, and only than submitting the actual needed transaction.

Specifically, the only circumstances where users will behave in this way is only if a substantial MMEV grieving attack is already taking place. However, if we must assume the attack has happened for the workaround to be plausible - the attack is validated.

Coordinated occasional attacks

The most destructive and motivated attack would be if these attacks be coordinated around specific time as detailed in Scenario 3. The attacks would happen in the time preceding some specific votes: especially controversial or emergency actions, in a selective way, and because of the usual user behavior will grief a large amount of unprepared users as well as influence the vote.

These attacks will be especially effective, since rare (but critically timed) attacks will ensure most users are not engaging in wasteful "pre-transaction evasive maneuvers" - and so are vulnerable in large numbers during the attacks.

Lack of upgrade path

It is also stated in the submission that the lockstake engine does not have an upgrade path, which it has. So in case MMEV griefing attacks such as the ones described become a real problem in the Ethereum ecosystem and in lockstake, a migration to another contract can be done.

There isn't a straightforward upgrade path for the LSE mechanism. The only two directly upgradeable contracts in the contest scope are NST and SNST due to usage of UUPS proxy pattern.

The LSE system, with it's per user LSUrn, that holds vat debt and vat collateral does not have an easy upgrade path. If the LSE logic needs to be upgraded, a painful and complex migration will need to take place. Again, not only MKR and NGT balances will need to be moved, but also all the debt for each single urn will need to be moved with its collateral, among other things.

Thus, in an extreme case, and extremely painful and expensive migration is possible, but it's unlikely to be done to resolve griefing attacks. Instead users will suffer from griefing, and Maker will suffer due to reduced voter participation and reduced trust.

zzykxx commented 1 month ago

Escalate

Escalating on behalf of @mrhudson890, refer to his/her comment.

sherlock-admin3 commented 1 month ago

Escalate

Escalating on behalf of @mrhudson890, refer to his/her comment.

You've created a valid escalation!

To remove the escalation from consideration: Delete your comment.

You may delete or edit your escalation comment anytime before the 48-hour escalation window closes. After that, the escalation becomes final.

WangSecurity commented 1 month ago

Could you elaborate on the multiblock part of your submission? I don't see a multiblock MEV in the attack paths and only a one-time block. I see that you're saying it may have a very large impact on emergency voting, but for this issue to actually have an affect, the voting has to be several minutes or even less (again depends on how many blocks is meant by multiblock).

Moreover, is there a material loss besides paying the gas for failed transaction?

mrhudson890 commented 4 weeks ago

@WangSecurity thanks for your comment and questions.

Could you elaborate on the multiblock part of your submission? I don't see a multiblock MEV in the attack paths and only a one-time block.

Multiblock here refers to the MEV type. It's the ability to extract MEV - control the order of transactions - in two or more consecutive Ethereum blocks. This is what allows the attacker to grief the user because reserveHatch activates in the next block, and for 5 more blocks (until cooldown) the user will experience an unexpected revert (because re-ordering after attacker's RH call). In the report, the attack paths' N and N+1 blocks refer to the first and second blocks controlled by the attacker. They will move the user's call from N to N+1, and place their RH call in N.

I see that you're saying it may have a very large impact on emergency voting, but for this issue to actually have an affect, the voting has to be several minutes or even less (again depends on how many blocks is meant by multiblock).

Every attack blocks one user transaction (and wastes their gas) for at least 1 minute (6 blocks until cooldown), but the total impact on voting is over a longer period, due to different users (with different VDs) voting and being griefed at different times during the time-sensitive voting period. Please see more detailed scenario below.

Detailed scenario for selective governance attack (Scenario 3 in the report)

I'll expand the scenario using the recent attack on Compound DAO, and the previous attacks on Balancer DAO and Sushiswap DAO .

Humpy's attack on Maker DAO:

Extractive proposal to "GOLD-MKR initiative". Contentious, but Humpy controls a large amount of MKR.
Initially 1000 community voters, varying levels of conviction.
High gas fees.
Without griefing Humpy would be outvoted 55%-45%.
Humpy selectively targets users that vote against him. Users delegate to different VDs at different times and are blocked by Humpy separately.
Many of the users are griefed. Some re-vote again in time (using the "RH + wait 1 minute + re-vote" workaround).
Some users are deterred by the griefing loss (and high gas prices) from re-voting.
Some more determined users do not vote in time because: the workaround takes additional time to apply, and it takes the different users additional time to investigate and figure out what has happened and what to do.
Only 700 out of the 1000 community votes succeeded.
Instead of losing with 55-45 split, Humpy wins with 49-51 split.

The two impacts are demonstrated here:

Possible substantial loss to MakerDAO due to the governance attack, in either this, or the next attack.
Certain and material griefing loss to many users - whether Humpy succeeds this time or not, whether all users re-vote in 1 minute or not. Many users have lost substantial funds.

Rational users

Expanding why the sponsor response ("there's a workaround") does not address the issue.

Because users are rational, they will not user the workaround proactively - it is more expensive, complex, and takes more time. If the attacks are not frequent enough, rational users will just default to submit their intended transaction. This is because when they submit, their wallet's simulation will pass. Only after they submit, it will be MEVed to revert.

Griefing is in scope

Moreover, is there a material loss besides paying the gas for failed transaction?

The griefing loss is material and substantial on itself (separately from the governance attack) - possibly more than 5% of the users' value (up to 630 USD in the PoC).

Furthermore, the griefing attack is explicitly in scope by both Sherlock and MakerDAO specific scope rules:

Sherlock rules:

Griefing for gas (frontrunning a transaction to fail, even if can be done perpetually) is considered a DoS of a single block, hence only if the function is clearly time-sensitive, it can be a Medium severity issue.

Because voting in involved, the transaction is clearly time-sensitive for contentious, emergency, and even regular votes.

MakerDAO specific contest rules explicitly include griefing for ratios (damage/cost) above 66% (100 / 150):

Issues where there is damage to the protocol/users but the net attack cost exceeds the damage caused significantly (50%+ more) are considered low severity.

As shown in the report, the damage ratio for this attack exceeds this low threshold, at up to 3200% in the PoC. This is because calling RH by the attacker is much cheaper than the attack's damage to user.

To conclude, the losses are material and likely for both of the impacts. The sponsor's response does not address / mitigate the attack. The griefing loss by itself explicitly qualifies by scope rules.

WangSecurity commented 4 weeks ago

Firstly, the loss of gas is not a "loss of funds".

Secondly, about the multi-block MEV. it allows to DOS the transaction for 2-3 blocks. To cause a larger impact or a constant DOS, the attack has to be repeated. Hence, it falls under this rule:

Griefing for gas (frontrunning a transaction to fail, even if can be done perpetually) is considered a DoS of a single block, hence only if the function is clearly time-sensitive, it can be a Medium severity issue.

Thirdly, this function being time-sensitive. As I know, the emergency voting is at least a few hours. Hence, DOSing users for 2-3 blocks is not enough to cause a discrepancy in voting leading to a loss of funds.

I believe this issue is low severity since it's about blocking the voting process for 2-3 blocks. Planning to reject the escalation and leave the issue as it is.

Note: I see there's a Humpy scenario that leads to a "loss of funds" via influencing the voting process. But, it should've been in the report. The report only shows a loss of gas and how the users can be DOSed for 2-3 blocks.

mrhudson890 commented 4 weeks ago

@WangSecurity thanks for your comments, and I apologize for the long responses, but I have 4 comments on the above and would be grateful for your opinion on all 4.

1. Humpy attack in original report

Note: I see there's a Humpy scenario that leads to a "loss of funds" via influencing the voting process. But, it should've been in the report. The report only shows a loss of gas and how the users can be DOSed for 2-3 blocks.

It is in the original report, please see "Attack path > Scenario 3", it describes this exact scenario.

Scenario 3: Selective Governance Attack

A controversial or emergency governance proposal is submitted.

The attacker repeatedly executes the reserveHatch attack during the voting period when possible (outside of cooldown periods) against users trying to vote against the attacker's proposal.

Many users will have their transactions fail if not submitted during RH cooldown periods.

This griefs a large amount of user, increases the user costs for governance participation for one outcome over the over, increasing the chance of the attacker influencing the vote.

This delays users' time-sensitive vote delegations in times of emergency.

And then also in "Impacts":

Governance Interference: The attack can prevent users from updating their vote delegations during critical periods, potentially swaying governance decisions. This is particularly impactful for emergency proposals that require quick action.

Long-term Participation Decline: Consistent attacks could discourage users from participating in governance, leading to more vulnerable governance due to lower honest user participation.

2. DOS duration is 6 blocks

it allows to DOS the transaction for 2-3 blocks

The blockage (DOS) is for at least 6 blocks, or 1 minute and 12 seconds. This is because once reserveHatch is called, delegation is blocked until the cooldown period, which starts on 7th block after the reserveHatch call (next block after 1 + 5 reserve hatch blocks).

3. Time-sensitivity constraint

Thirdly, this function being time-sensitive. As I know, the emergency voting is at least a few hours. Hence, DOSing users for 2-3 blocks is not enough to cause a discrepancy in voting leading to a loss of funds.

As the blockage is for at least 1 minute and 12 seconds (6 blocks), a "last minute vote" can be blocked. While last minute votes are the most strict constraint, they are nevertheless very plausible. As support, I'd like to cite your opinion in https://github.com/sherlock-audit/2024-06-makerdao-endgame-judging/issues/107

@WangSecurity :

I see that the constraints are extreme, but still the issue is possible under special conditions which qualifies for medium severity.

The constraints in that issue (rare adjustment of dust + rare liquidations + liquidation of a just-right-dusty amount) are more extreme than the most extreme constraint in this issue: voting in last minute. Last minute voting is almost certain to occur for some users during any voting period. These users can be blocked with a 30% chance using this attack.

Additionally, time-sensitivity on longer time scale (e.g., last hour of voting) is also demonstrated for scenarios involving repeated blocking of a large amount of users (Scenario 3) due to the aggregated effect on a large amount of users.

4. Gas fees as loss of funds

Firstly, the loss of gas is not a "loss of funds".

Can you please explain why user losing a substantial amount of ETH due to malicious MEV is excluded from loss of funds?

I see it explicitly included in both Sherlock and MakerDAO contest rules as cited above in "Griefing is in scope" section of my my previous message.

WangSecurity commented 3 weeks ago

Yeah, you're right, excuse me for overlooking.
I see, thank you for clarifying, I've got confused by this:

Multiblock here refers to the MEV type. It's the ability to extract MEV - control the order of transactions - in two or more consecutive Ethereum blocks
I still believe that DOSing votes for only 1 minute 12 seconds is low severity.
Paying the gas is not a loss of funds on Sherlock. "Griefing" in Sherlock rules is not about making the user paying the gas. Griefing is when the attacker has no profit from the attack in the first place. I assume you refer to this in Sherlock rules:

Griefing for gas (frontrunning a transaction to fail, even if can be done perpetually) is considered a DoS of a single block, hence only if the function is clearly time-sensitive, it can be a Medium severity issue.

Note that the rule says the issue is only valid if the function is time-sensitive. Not because the user pays too much gas for a failed transaction.

Here, I believe DOS of 6 blocks for voting is not medium severity. Even if it's the last minute of voting, I believe 6 blocks is lot severity.

The decision remains, planning to reject the escalation and leave the issue as it is.

mrhudson890 commented 3 weeks ago

@WangSecurity I will respond within a day or so, thanks for you patience.

mrhudson890 commented 3 weeks ago

@WangSecurity

1. Time-sensitivity

I still believe that DOSing votes for only 1 minute 12 seconds is low severity. ... Here, I believe DOS of 6 blocks for voting is not medium severity. Even if it's the last minute of voting, I believe 6 blocks is lot severity.

This belief contradicts the rules, and is unsupported by any arguments. The rule says: "only if the function is clearly time-sensitive, it can be a Medium severity issue". "Clearly time-sensitive" applies to "1 minutes 12 seconds" just as to any other duration. Time-sensitivity is not a duration, but is difference in outcome because of time passing:

For scenarios 3: The total vote result can be different due to a large amount of griefed voters during a period of high gas prices. Leading to potentially loss of funds for the whole protocol.
For scenarios 1 and 2: The total vote result can be different due to a single last minute vote regardless of gas price. Leading to potentially loss of funds for the whole protocol.

While both short and long duration attacks are presented, both are time-sensitive, since both lead to loss of funds due to blocking functionality in a time-sensitive way.

2. Loss of gas fees

Paying the gas is not a loss of funds on Sherlock.

This claim also contradicts both sources of truth for the contest - Sherlock rules and MakerDAO contest docs (as detailed above ).

Both those sources of truth contain no exclusion of this type of loss, and explicitly cite specific constraints to satisfy for validity: time-sensitivity, and cost to damage ratios. Both these constraints are met.

Additionally, while not official sources of truth, these past findings further show the claim is not only wrong in principle, but also in practice. In all these issues the only impact was loss of gas fees, some issues were even resolved by you personally as loss of funds:

3 Scenario 3 - still unaddressed

Yeah, you're right, excuse me for overlooking.

This scenario was overlooked - and not addressed still, despite the apology.

Scenario 3 - "Humpy's" attack was discussed in the report. Then in further detail in comment. It shows time-sensitivity and two impacts of losses to MakerDAO and large amounts of users. The scenario is highly likely, as shown by the attacks on Compoound, Balancer, and Sushiswap.

Additionally, this scenario's time-scale is both significantly longer, and its impact is larger: both on more individual users, and on the protocol itself (loss to governance attack).

Given the overlooked Scenario 3, the clear time-sensitivity of the issue as defined by the rules, and the fact that gas fee losses are considered valid loss of funds by the rules, I respectfully request a reconsideration of your opinion. The evidence presented demonstrates that this vulnerability is by all rules a medium severity issue, posing significant risks to both individual users and the protocol as a whole.

WangSecurity commented 3 weeks ago

This belief contradicts the rules, and is unsupported by any arguments. The rule says: "only if the function is clearly time-sensitive, it can be a Medium severity issue". "Clearly time-sensitive" applies to "1 minutes 12 seconds" just as to any other duration. Time-sensitivity is not a duration, but is difference in outcome because of time passing: For scenarios 3: The total vote result can be different due to a large amount of griefed voters during a period of high gas prices. Leading to potentially loss of funds for the whole protocol. For scenarios 1 and 2: The total vote result can be different due to a single last minute vote regardless of gas price. Leading to potentially loss of funds for the whole protocol. While both short and long duration attacks are presented, both are time-sensitive, since both lead to loss of funds due to blocking functionality in a time-sensitive way.

Excuse me for the confusion, let me elaborate on what I mean. The duration of the DOS is only 1 minute 12 seconds. The emergency votings are at least a few hours. In that context, I believe that blocking the several-hour voting for 6 blocks is not time-sensitive. Even if it's last-minute voting, I believe in the context of the vote lasting for several hours it's not time-sensitive.

This claim also contradicts both sources of truth for the contest - Sherlock rules and MakerDAO contest docs

This doesn't contradict any sources of truth, none sources of truth claim that griefing for gas, i.e. making the victim pay for the failed transaction is a valid loss of funds. Gas is the essential part of making transactions and each user has to pay the gas, even if the transaction fails. I believe paying the gas is not a medium-severity loss of funds. Especially, in the case that all other funds of the user are safe.

About Sherlock rules and MakerDAO contest docs, I may not elaborated well enough last time, griefing the transaction to fail can be valid not based on the gas paid, but on the larger impact it has, i.e. trading, trading DOSed for even 1 block can be significant. In this case, it's not, even though we consider Ethereum mainnet, the loss of gas fees is not considered a valid loss of funds.

Additionally, while not official sources of truth, these past findings further show the claim is not only wrong in principle, but also in practice. In all these issues the only impact was loss of gas fees, some issues were even resolved by you personally as loss of funds

Most of the findings you shared were not judged by me, but from what I see the problem there is not it gas paid, but in the fact that it should've been repaid to the keeper (e.g. in Elfi protocol) but it wasn't. In this case, the gas fees shouldn't be repaid to anyone, hence, it's not a loss of funds but how the blockchain works. About the 2 findings that were judged by me (2 escalations from Elfi), they do not talk about loss of gas fees, but about lost keeper fees that should've been paid. Hence, they're irrelevant in this discussion. Additionally, historical decisions are not sources of truth.

This scenario was overlooked - and not addressed still, despite the apology

Indeed, reflecting on it now: Even if the attacker targets the specific voters, the DOS is only 1 minute 12 seconds for several hours of voting. Hence, I believe we cannot claim it's time-sensitive because users have several hours to vote. The repeatability of the attack is not considered because for this to have a constant impact, the DOS has to be repeated. Hence, it's considered a one-time DOS. The users losing gas fees are also not considered here as discussed above.

The decision remains as it is, planning to reject the escalation and leave the issue as it is.

mrhudson890 commented 3 weeks ago

@WangSecurity Thank you for your detailed explanation. Your position is much clearer now. Some crucial points given your clarifications:

Misinterpretation of time-sensitivity:

While the voting period is longer, users do not expect to be blocked, so will vote at different times during the period - this is the assumption. Users cannot be assumed to vote ONLY before the last minute. Thus, a 1m12s block at the end of voting is time-sensitive, and could change the outcome and cause a large loss to the protocol - because those users cannot re-vote. In governance, last minute votes can be crucial, especially for close decisions - this is the essence of time-sensitivity. The impact is that the bug causes time-sensitive voting to be maliciously manipulated, without the ability to re-vote.

Cumulative impact and repeatability - a malicious way to gain "extra voting power":

While each instance is short, the attack is repeated throughout the voting period, leading to cumulative impact. This could significantly sway voting results, especially in close decisions, as shown in the attacks on Compound, Balancer, and SushiSwap. The repeatability and the effect on many voters shouldn't be dismissed as it's key to the increased impact of Scenario 3. The users deterred by the asymmetric (targeted) blocking - allow one side to maliciously influence the voting. This allows an attacker to essentially gain extra voting power, due to the ability to deter the opposing voters.

Larger impact consideration:

You mentioned griefing can be valid "based on the larger impact it has". Influencing governance decisions, controlling the entire protocol, is one of the largest impacts possible, potentially affecting millions in protocol funds. This aligns with this criteria for validity.

In this respect Scenario 3 is not griefing, but a direct, and profitable loss of protocol funds.

Loss of funds rule interpretation:

The rules don't explicitly exclude gas fees as a valid loss. It is arbitrary to decide this isn't a "loss of funds" when users are losing real ETH due to a vulnerability. From a user's perspective, this is a financial loss due to a protocol vulnerability - allowing an attacker to block their transactions. This is not just "how blockchain works" if the specific vulnerability allows the loss.

For example, in the Elfi findings you mentioned, there is no difference between a keeper losing gas fees due to a revert (like in this report), or them being not sufficiently refunded for paying gas fees. It is an arbitrary distinction - like saying that "-10" is a loss, but "0-10" is not a loss:

The result for the keeper in Elfi is that they paid gas fees, and received no refund, so lost those fees.
The result for the voter in MakerDAO is that they paid gas fees and got no vote, so lost those fees.

It is the exact same loss of funds, the only difference is that gas is either lost with a revert or without a revert.

Your interpretation of loss of funds being excluded is not consistent with all of:

Logical equivalence to your own reasoning differentiating lack of refund from loss (shown above).
Rules - there's no exclusion. Which explains the prevalence of so many accepted issues.
Previous judging decisions, including by yourself, but not only.

Again, I cite previous accepted findings for this loss not as sources of truth, but as evidence for the argument that your current interpretation of the rules is inconsistent with the rules and how they are interpreted by other watsons and judges (including yourself).

This distinction between gas fees lost to revert due to a bug and gas fees "not refunded" due to a bug is arbitrary, and is not even semantic - it's 100% same in every way.

I respectfully request a reconsideration based on these arguments. Thank you for your time.

WangSecurity commented 3 weeks ago

While the voting period is longer, users do not expect to be blocked, so will vote at different times during the period - this is the assumption. Users cannot be assumed to vote ONLY before the last minute. Thus, a 1m12s block at the end of voting is time-sensitive, and could change the outcome and cause a large loss to the protocol - because those users cannot re-vote. In governance, last minute votes can be crucial, especially for close decisions - this is the essence of time-sensitivity. The impact is that the bug causes time-sensitive voting to be maliciously manipulated, without the ability to re-vote.

Still, my decision remains. If the voting is DOSed for 1 minute 12 seconds out of several hours of voting, it's not time-sensitive, even if it's the last minute, it's not time-sensitive.

While each instance is short, the attack is repeated throughout the voting period, leading to cumulative impact. This could significantly sway voting results, especially in close decisions, as shown in the attacks on Compound, Balancer, and SushiSwap. The repeatability and the effect on many voters shouldn't be dismissed as it's key to the increased impact of Scenario 3. The users deterred by the asymmetric (targeted) blocking - allow one side to maliciously influence the voting. This allows an attacker to essentially gain extra voting power, due to the ability to deter the opposing voters.

The impact of this attack is DOS of voting, for this to have a large impact, the attack has to be repeated perpetually. Hence, we should consider one instance of the attack based on the rules I've already quoted above, i.e. we've already discussed the repeatability above.

You mentioned griefing can be valid "based on the larger impact it has". Influencing governance decisions, controlling the entire protocol, is one of the largest impacts possible, potentially affecting millions in protocol funds. This aligns with this criteria for validity. In this respect Scenario 3 is not griefing, but a direct, and profitable loss of protocol funds.

The above paragraph about repeatability also applies here. With one instance we have only 1 minute and 12 second DOS of voting, which I believe doesn't influence the governance decisions or control the entire protocol because even the emergency voting will last for at least several hours. Moreover, I don't see where scenario 3 in the report mentions a loss of funds and DOSing the voting doesn't directly lead to a loss of funds, especially, when the DOS is 1m12s long.

Yes, it's not directly said the loss of gas fees is excluded, but the decision is still that the loss of gas fees is low severity because all the funds of the user remain untouched. Paying gas for a failed transaction is how the blockchain works even if the user is forced to. Also, making the user pay for 2 transactions instead of 1 is not a medium-severity loss of funds.

About comparison with Elfi, it's the last time that I respond to these arguments because historical decisions are not sources of truth as said in our rules and have no effect on the judging outcome.

For example, in the Elfi findings you mentioned, there is no difference between a keeper losing gas fees due to a revert (like in this report), or them being not sufficiently refunded for paying gas fees. It is an arbitrary distinction - like saying that "-10" is a loss, but "0-10" is not a loss

That's the reason why the historical decisions are not sources of truth. Elfi and Maker are completely different protocols, with completely different designs, contracts, functionalities, etc. That's why "similar" issues may be judged differently in different contests. Elfi's design was to refund the gas paid by the keeper for completing the transaction, i.e. keeper made a transaction and they must receive back the gas fee they paid, but they didn't. That's the reason why it was valid. I don't see how it's similar to the issue here, they are completely different. Hence, the following statement is incorrect:

It is the exact same loss of funds, the only difference is that gas is either lost with a revert or without a revert

Logical equivalence to your own reasoning differentiating lack of refund from loss (shown above).

Hope the above answers how this line is irrelevant here and why the issues on Elfi were judged the way they were.

Rules - there's no exclusion. Which explains the prevalence of so many accepted issues

Note that based on the rules not every loss of funds warrants M/H severity.

Previous judging decisions, including by yourself, but not only

These are not sources of truth. Above, I explained why on Elfi they were valid. Also, I think you missed the part of my previous comment. The only issues from Elfi that you mentioned and I judged didn't mention loss of gas fees in any place and the impact wasn't related to gas fees in any way.

Again, I cite previous accepted findings for this loss not as sources of truth, but as evidence for the argument that your current interpretation of the rules is inconsistent with the rules and how they are interpreted by other watsons and judges (including yourself).

Again, the protocols have different designs. The current interpretation is consistent with the rules and how they were always interpreted. The issues on Elfi are valid NOT because it was a loss of gas for a failed transaction as in this case.

This distinction between gas fees lost to revert due to a bug and gas fees "not refunded" due to a bug is arbitrary, and is not even semantic - it's 100% same in every way.

It's not even close to being "100% same in every way". Elfi protocol had a broken functionality which resulted in users not getting the funds they must have got. Here, it's nothing close to it.

Again, the decisions on any other contest are not a source of truth and don't affect judging on any of the other contests. Refrain from using it as an argument. But, even if you do, please check the designs of the protocol you want to reference and check how they should and shouldn't work. The rule "Design decisions are not a source of truth" is implemented due to this exact reason of different codebases having different designs, hence, on one contest it's a valid issue, on the other it's an invalid issue and on the third, it's a design decision.

The decision remains the same, reject the escalation and leave the issue as it is.

mrhudson890 commented 3 weeks ago

@WangSecurity

Thank you for your response. However, I find that it largely repeats earlier claims without addressing the core counter-arguments I presented or adding any supporting evidence to your claims. Specifically:

Time-sensitivity: You state, "If the voting is DOSed for 1 minute 12 seconds out of several hours of voting, it's not time-sensitive, even if it's the last minute, it's not time-sensitive." This assertion only repeats the claim. It also contradicts itself - if a crucial last minute vote is blocked, and changes the outcome, how is the impact not time-sensitive?
Cumulative impact: The response doesn't address the cumulative effect of repeated attacks throughout the voting period. You state: "...for this to have a large impact, the attack has to be repeated perpetually." You ignore the argument that the cumulative effect of attacking many voters, some of whom won't re-vote due to time-sensitivity, and others due to the additional cost, is that the attacker has a much higher chance to win the vote.
Governance manipulation and loss of funds: The potential for an attacker to gain "extra voting power" by selectively deterring voters is not addressed, despite its impact on voting outcomes. You state: "Moreover, I don't see where scenario 3 in the report mentions a loss of funds", but in the report this is mentioned "increasing the chance of the attacker influencing the vote", and also "swaying governance decisions", and "leading to more vulnerable governance". The loss of governance control directly translates loss of funds, just like gaining the private key to an EOA directly translates to loss of funds, without the need to spell it out in either cases.
Loss of funds interpretation: The distinction between gas fees lost due to this vulnerability and other financial losses remains arbitrary and unsupported by you. You state: "Yes, it's not directly said the loss of gas fees is excluded, but the decision is still that the loss of gas fees is low severity because all the funds of the user remain untouched." You agree that it's not actually excluded, and then contradict yourself by downgrading the severity. It is a loss of funds by the rules (since not excluded), and the amount satisfies both the 0.5% threshold and even the 5% threshold as shown in the POC. There is no justification to downgrade this loss of funds. Additionally, "all the funds of the user remain untouched" is wrong, since if they succeed to re-vote they must pay additional ETH, more than 5% of their total value, as shown.
Misinterpretation of argumentation based on prior findings: first, you keep mentioning "not source of truth", which I also made sure to say every single time I mentioned them - there is no reason to keep mentioning this. I do not bring those up as sources of truth, but as examples of the application of the rules. You state: "Elfi and Maker are completely different protocols, with completely different designs, contracts, functionalities, etc. That's why "similar" issues may be judged differently in different contests." However, unless changed in the README or other sources of truth, the rules are the same rules. There is nothing I see in Elfi README that changes the Sherlock rules for loss funds due to gas in any way. Thus, you unfairly ignore the fact that loss of ETH for a keeper and loss of ETH to the voter are same. Not receiving a refund (-X in ETH), or having to pay for another transaction due to attack (-X in ETH) are the same loss.
Misapplication of design decision rule: you state: "on one contest it's a valid issue, on the other it's an invalid issue and on the third, it's a design decision". Again this contradicts the rules - if it is a design decision, the difference should be mentioned (or obvious) in the sources of truth - nowhere is it mentioned that users losing ETH due to malicious MMEV griefing is an accepted loss by design, or anything that can be interpreted this way.

These points align with medium severity criteria for three distinct vulnerabilities of medium severity:

Loss of funds for the users (gas fees)
Potential loss of funds for the protocol (governance manipulation)
Time-sensitive DoS.

I respectfully request to engage with my arguments and to provide support for yours, instead of repeating the disputed claims. In my opinion, if no such arguments can be found, it is highly likely that your conviction is misplaced, and the issue should be accepted based on either or all of the three distinct categories.

WangSecurity commented 3 weeks ago

Thank you for your response. However, I find that it largely repeats earlier claims without addressing the core counter-arguments I presented or adding any supporting evidence to your claims

However, you also repeat the very same arguments several times. Specifically:

Time-sensitivity: You state, "If the voting is DOSed for 1 minute 12 seconds out of several hours of voting, it's not time-sensitive, even if it's the last minute, it's not time-sensitive." This assertion only repeats the claim. It also contradicts itself - if a crucial last minute vote is blocked, and changes the outcome, how is the impact not time-sensitive?

In this, this, this, this comments you've stated the same idea that the last-minute is time-sensitive, which I disagree with and I believe any one-minute DOS of several-hour voting is not time-sensitive. I repeat it because you repeat the same idea each time. Unfortunately, it's the decision rather than a debatable question.

Cumulative impact: The response doesn't address the cumulative effect of repeated attacks throughout the voting period. You state: "...for this to have a large impact, the attack has to be repeated perpetually." You ignore the argument that the cumulative effect of attacking many voters, some of whom won't re-vote due to time-sensitivity, and others due to the additional cost, is that the attacker has a much higher chance to win the vote.

I believe that blocking any user for 1m12s is not time-sensitive, as stated above, even if it's last minute. About the users not voting if their transaction failed once is a speculation on how the user would behave. If you have a large voting power and your decision can influence the voting, you'll be targeted. But, if you have such high voting power and you don't like the current state of the proposal, why would you decide not to vote after one of your transactions failed? It's also a speculation. Hence, I believe it would be fair if both of us restrained from speculation and used concrete arguments.

Governance manipulation and loss of funds: The potential for an attacker to gain "extra voting power" by selectively deterring voters is not addressed, despite its impact on voting outcomes. You state: "Moreover, I don't see where scenario 3 in the report mentions a loss of funds", but in the report this is mentioned "increasing the chance of the attacker influencing the vote", and also "swaying governance decisions", and "leading to more vulnerable governance". The loss of governance control directly translates loss of funds, just like gaining the private key to an EOA directly translates to loss of funds, without the need to spell it out in either cases.

I believe the first part about "selectively deterring voters" is partially answered in the above paragraphs. DOSing voters for 1m12s doesn't guarantee (almost) any extra voting power, it's not time-sensitive in the context of evem emergency votings taking at least several hours. Excuse me, but influencing governance decisions and gaining a private key for the EOA is nowhere near a good comparison. Firstly, this attack of blocking voters for 1m12s doesn't guarantee any impact at all or I believe has a very low impact. Excuse me, but for example, if we have a 4-hour (arbitrary example) vote and the attack blocks voter(s) for 1m12s but they have 3h58m48s to freely vote (assuming one instance of the attack), I cannot agree that it can lead to any impact besides some fractions of a per cent. Moreover, not all governance decisions directly impact any funds.

Loss of funds interpretation: The distinction between gas fees lost due to this vulnerability and other financial losses remains arbitrary and unsupported by you. You state: "Yes, it's not directly said the loss of gas fees is excluded, but the decision is still that the loss of gas fees is low severity because all the funds of the user remain untouched." You agree that it's not actually excluded, and then contradict yourself by downgrading the severity. It is a loss of funds by the rules (since not excluded), and the amount satisfies both the 0.5% threshold and even the 5% threshold as shown in the POC. There is no justification to downgrade this loss of funds. Additionally, "all the funds of the user remain untouched" is wrong, since if they succeed to re-vote they must pay additional ETH, more than 5% of their total value, as shown.

Here, again, I can quote the very same comments where you use the same arguments that loss of gas is medium severity, basically repeating yourself. Excuse me, but I never treated loss of gas due to griefed transaction as an M/H severity as well as Sherlock in general. Even though the user has to pay gas for 2 transactions instead of 1, is not considered an M/H severity as it always has been. Griefing for gas (as we quoted several times) is valid only if it's clearly time-sensitive, which I've already said several times is not the case here. Unfortunately, it's the decision and not a debatable question.

Misinterpretation of argumentation based on prior findings: first, you keep mentioning "not source of truth", which I also made sure to say every single time I mentioned them - there is no reason to keep mentioning this. I do not bring those up as sources of truth, but as examples of the application of the rules. You state: "Elfi and Maker are completely different protocols, with completely different designs, contracts, functionalities, etc. That's why "similar" issues may be judged differently in different contests." However, unless changed in the README or other sources of truth, the rules are the same rules. There is nothing I see in Elfi README that changes the Sherlock rules for loss funds due to gas in any way. Thus, you unfairly ignore the fact that loss of ETH for a keeper and loss of ETH to the voter are same. Not receiving a refund (-X in ETH), or having to pay for another transaction due to attack (-X in ETH) are the same loss.

Yes, I keep mentioning that it's not a sources of truth and doesn't affect anything because you continue to use them as arguments to change my decision here. Moreover, it seems like you're not trying to hear my arguments in take into account that these are completely different codebases with completely different functionalities. Hence, the issues are different and you're not trying to get any context. I've explained it in my previous comments and you just repeat the same arguments you've had stated previously.

Misapplication of design decision rule: you state: "on one contest it's a valid issue, on the other it's an invalid issue and on the third, it's a design decision". Again this contradicts the rules - if it is a design decision, the difference should be mentioned (or obvious) in the sources of truth - nowhere is it mentioned that users losing ETH due to malicious MMEV griefing is an accepted loss by design, or anything that can be interpreted this way.

Unfortunately, the README is not a place where the protocol has to document the entire protocol and each functionality. Hence, even if something is not stated in the README, it doesn't mean there are no intended design and design decisions. That's where getting context on the codebase is necessary to understand what the intended design is.

These points align with medium severity criteria for three distinct vulnerabilities of medium severity

Again, I disagree it's medium severity:

Loss of funds for the users (gas fees)

Lost gas fees due to a griefed transaction are not a valid loss of funds and never were.

Potential loss of funds for the protocol (governance manipulation)

I disagree this DOS of 1m12s is sufficient to have a governance manipulation.

Time-sensitive DoS

It's not a time-sensitive DOS.

Repeating to finally settle the discussion, lost gas fees due to griefed transactions and this situation not being time-sensitive is the decision, not a debatable question. Also, before accusing me of repeating the same arguments, please refrain from repeating your arguments as well. The decision remains the same, reject the escalation and leave the issue as it is.

mrhudson890 commented 3 weeks ago

@WangSecurity

Also, before accusing me of repeating the same arguments, please refrain from repeating your arguments as well.

I apologise for causing you this frustration. The problem I was trying to resolve was repeating claims (as opposed to arguments), and not addressing or countering my arguments. Which is why I felt I had to repeat them, to try to get to why you disagree and how it aligns with the rules. See this pattern that I was seeing:

You: claim A
- I counter: A is wrong because it is against rule B and because of logic C
  - You repeat: I still think claim A.
    - I see no option other than to repeat: you ignored my argument regarding B and C, can you please address it? Then I try to restate B and C more clearly.
      - We both think the other is just repeating.

However, I'd still like to clarify one part, that appears to be an important misunderstanding.

About the users not voting if their transaction failed once is a speculation on how the user would behave. If you have a large voting power and your decision can influence the voting, you'll be targeted. But, if you have such high voting power and you don't like the current state of the proposal, why would you decide not to vote after one of your transactions failed?

I am not assuming only high voting-power users are blocked. This is a misunderstanding of the cumulative effect - the source of the "extra voting power":

I assume non-negligible voting power is distributed between many users.
Yes, if Rune is blocked, he will re-vote, because he knows it's crucial. I agree.
But as I described above, imagine 1000 users, with varying stake amounts.
Everyone BUT Rune will consider themselves non-crucial, and so may be deterred. NOT on the basis of speculation, but on the assumption of rational behaviour.
The user has to pay X for voting (gas fees, time). Unexpected blocking forced that cost to 2X for all voters opposing the attacker. If they are rational, some will not-revote, having "spent" their X, unless willing to spend 2X. Clearly every rational user has their limit, for some it's below 2X.
Additionally, depending on various factors, some fraction of the users will vote near the end. It is not a uniform distribution for the period, it is more towards the end. This is not speculation, this is just typical crowd dynamics fact of life. These are the last minute voters, but their amount, and not the time is what makes the difference.

For the protocol's loss of funds, the amount of "extra voting power" the attacker gains, is not time dependent in this case. It depends on the fraction of the users that are deterred from re-voting with blocking + the fraction of voters that vote in the last minute and so cannot physically re-vote. If 1000 users vote, and 300 are deterred + blocked, if together they control 10% of the relevant voting power - this is a highly significant fraction for the many close-call governance attacks I cited.

Looking at criteria for Medium again (not a High!), I think it's fully justified:

Is this intended design? I can't see how.
Can this cause a "loss of funds but requires certain external conditions or specific states". I think yes. Are the constraints implausible or incredibly rare? I don't think so: high gas prices and non-whale users participating in governance are the constraints.
Does it satisfy the loss amount criteria? A governance vote has no impact limit, so yes.
Is this impact claimed in the report. Yes, as shown, multiple times.
Is time-sensitivity a criteria here? No, it's not the griefing impact, it's the loss impact (attacker profits).
Is gas fees as loss required? No, the loss due to the governance attack taking control of all or some protocol controlled funds or contracts.

WangSecurity commented 3 weeks ago

You: claim A I counter: A is wrong because it is against rule B and because of logic C You repeat: I still think claim A. I see no option other than to repeat: you ignored my argument regarding B and C, can you please address it? Then I try to restate B and C more clearly. We both think the other is just repeating.

Oh I see, excuse me, then yep highlight what hasn’t been answered clarified.

I am not assuming only high voting-power users are blocked. This is a misunderstanding of the cumulative effect - the source of the "extra voting power"

I understand it’s not exactly what I meant by targeting specific voters. I was just trying to show that even if one user is targeted, why they would give up their vote after 1m12s DOS. And I mentioned high-voting power users, because they’re better to DOS, i.e. it’s better to DOS someone with 100 voting power than someone with 1.

• The user has to pay X for voting (gas fees, time). Unexpected blocking forced that cost to 2X for all voters opposing the attacker. If they are rational, some will not-revote, having "spent" their X, unless willing to spend 2X. Clearly every rational user has their limit, for some it's below 2X.

Yep, it’s fair to say that someone won’t vote in that case, but it would a speculation to say who exactly and how many people. Hence, we cannot quantify the effect here, even though I agree it’s fair to assume someone won’t vote.

• Additionally, depending on various factors, some fraction of the users will vote near the end. It is not a uniform distribution for the period, it is more towards the end. This is not speculation, this is just typical crowd dynamics fact of life. These are the last minute voters, but their amount, and not the time is what makes the difference.

Yep, I agree that there will be voting at the very end. Excuse me if my previous messages sounded like there won’t be such a case. The problem is that even in that case I consider “voting” as time-sensitive. It sounds like a repeat myself, and I’ll try to elaborate on why I do it. It’s my decision “it’s not time-sensitive” and you bring new arguments, but my decision remains as it is, hence, I repeat myself, because this is still my decision. Hope that makes sense.

For the protocol's loss of funds, the amount of "extra voting power" the attacker gains, is not time dependent in this case. It depends on the fraction of the users that are deterred from re-voting with blocking + the fraction of voters that vote in the last minute and so cannot physically re-vote. If 1000 users vote, and 300 are deterred + blocked, if together they control 10% of the relevant voting power - this is a highly significant fraction for the many close-call governance attacks I cited.

Yes, I understand, but I still stand by my initial decision that the DOS of 1m12s for several-hour voting, even for the last-minute, is not time-sensitive and low-severity.

1 Is this intended design? I can't see how.

Excuse me if my messages seemed to claim its design decision, I don’t.

1 Can this cause a "loss of funds but requires certain external conditions or specific states". I think yes. Are the constraints implausible or incredibly rare? I don't think so: high gas prices and non-whale users participating in governance are the constraints. 2 Does it satisfy the loss amount criteria? A governance vote has no impact limit, so yes.

Still, I believe the DOS of several-hour voting for a little longer than a minute doesn’t pose medium-severity outcome and it’s my decision.

1 Is this impact claimed in the report. Yes, as shown, multiple times.

Agree

1 Is time-sensitivity a criteria here? No, it's notthe griefing impact, it's the loss impact (attacker profits).

Excuse me, then it was the misunderstanding from my side and I thought you claim it to be time-sensitive.

1 Is gas fees as loss required? No, the loss due to the governance attack taking control of all or some protocol controlled funds or contracts.

Again, my thought was that you claim the loss of gas fees as medium impact.

Still, my decision remains as it is, I believe DOSing at least several-hour voting for 1m12s doesn’t pose a Medium severity impact on the protocol and should remain as it is. Planning to reject the escalation.

Note: I didn’t quote and respond a couple of your paragraphs, not because I overlooked or don’t want to address them. I don’t have anything to add to them and they look like a scenario explaining a possible outcome. It’s viable and I don’t say “it won’t happen”.

mrhudson890 commented 3 weeks ago

@WangSecurity thank you, I think we have a much better shared understanding now. However, I think there was confusion about my claim and which impact I was arguing for.

Previously I was claiming all three impacts:

Time-sensitive griefing (DoS)
Loss of funds for users due to gas
Loss of funds for protocol due to voting manipulation - Scenario 3 (Humpy)

In my previous comment I've stopped arguing for 1 and 2 due to our disagreement about time sensitivity and gas fee loss. I focused only on impact 3 - medium severity loss of funds for the protocol, for which we don't need to agree on time-sensitivity, or gas fees loss. This is because for impact 3, time-sensitivity is not a validity criteria (it's not DoS or griefing) - it's a profitable attack.

Excuse me, then it was the misunderstanding from my side and I thought you claim it to be time-sensitive. .. Again, my thought was that you claim the loss of gas fees as medium impact.

I think that quote shows it did become clear only at the end of my comment. However, the misunderstanding made most of your responses before that irrelevant. You responded to most of my arguments in terms of validity of DOS and time-sensitivity, instead of loss of funds:

WangSecurity:

The problem is that even in that case I consider “voting” as time-sensitive. .. It’s my decision “it’s not time-sensitive” and you bring new arguments, but my decision remains as it is .. I still stand by my initial decision that the DOS of 1m12s for several-hour voting, even for the last-minute, is not time-sensitive and low-severity. ... Still, I believe the DOS of several-hour voting for a little longer than a minute doesn’t pose medium-severity outcome and it’s my decision.

Finally, in the context of this mismatch, your last paragraph, reiterating your decision, seems incorrect:

Still, my decision remains as it is, I believe DOSing at least several-hour voting for 1m12s doesn’t pose a Medium severity impact on the protocol and should remain as it is. Planning to reject the escalation.

I'll now address the comments that are relevant to that impact, add data, and restate the case for clarity.

You agree with the following scenario arguments:

my comment: - The user has to pay X for voting (gas fees, time). Unexpected blocking forced that cost to 2X for all voters opposing the attacker. If they are rational, some will not-revote, having "spent" their X, unless willing to spend 2X. Clearly every rational user has their limit, for some it's below 2X.

WangSecurity: Yep, it’s fair to say that someone won’t vote in that case, ... I agree it’s fair to assume someone won’t vote

my comment: - Additionally, depending on various factors, some fraction of the users will vote near the end. It is not a uniform distribution for the period, it is more towards the end. This is not speculation, this is just typical crowd dynamics fact of life. These are the last minute voters, but their amount, and not the time is what makes the difference.

WangSecurity: Yep, I agree that there will be voting at the very end

The only disagreement I see:

.. it would a speculation to say who exactly and how many people. Hence, we cannot quantify the effect here

To address this, I'll both quantify the effect, and quote some relevant available numbers.

The gained "extra voting power":

n users with N voting power are blocked because they voted at the end
m users with M voting power were deterred from re-voting a second time
A is the attacker's voting power, and O is opposition's total voting power
without the vulnerability, attacker wins if A > O
with the vulnerability, attacker wins if A > O - M - N.
This is equivalent to A + M + N > O. In other words, the attacker "gains" the voting power N + M (matching the total blocked users).

How large do N + M need to be for flipping the result?

For example, in Compound's Humpy case all 3 attacks (https://rekt.news/the-humpy-dance/) had vastly different A: 95K, 118K, 682K, and O: 710K, 578K, 633K. In the final (successful) attack, the difference was only 7% in voting power. This shows that even single digit percent differences have been sufficient in the past.
In Maker's specific case, we can see this delegate , which has accounts with 2.4K to 5.5K MKR (Metrics page). These shows that relatively small accounts participate. If blocked, such accounts may be a part of M if vote early, and part of N if vote late. In past votes (with ~113K MKR total), each such account would add roughly 2%-5% to the attacker's voting power. Cumulatively the gained voting power can be easily more than 10%.
Other smaller examples: 2.7K, 900, 500 are examples that again show that users with small, but cumulatively significant voting power, also participate.

The validity criteria for medium loss of funds:

"loss of funds but requires certain external conditions or specific states". I think the arguments above satisfy this criteria. Under plausible demonstrated conditions: high gas fees, non-whale users, and a close vote, the attacker gains a malicious advantage allowing them to pass a malicious proposal.
Does it satisfy the loss amount criteria of 0.5%? Since protocol's reserves are assumed at 100M, even a proposal that moves just 250 MKR would satisfy the 0.5% threshold required for medium severity. Taking over the protocol is possible but harder, but confusing proposals siphoning smaller amounts of funds are highly likely, as shown in persistent and repeated Humpy attacks.

For these reasons, I think the validity criteria for this impact, at medium severity, are thoroughly satisfied.

WangSecurity commented 2 weeks ago

Thank you for this thorough analysis and excuse me that I didn't reflect on the scenario 3 clear enough.

n users with N voting power are blocked because they voted at the end m users with M voting power were deterred from re-voting a second time A is the attacker's voting power, and O is opposition's total voting power without the vulnerability, attacker wins if A > O with the vulnerability, attacker wins if A > O - M - N. This is equivalent to A + M + N > O. In other words, the attacker "gains" the voting power N + M (matching the total blocked users).

That's fair, but we still have to remember that M is not completely blocked from Voting and they can freely vote all the other time out of the DOS.

For example, in Compound's Humpy case all 3 attacks (https://rekt.news/the-humpy-dance/) had vastly different A: 95K, 118K, 682K, and O: 710K, 578K, 633K. In the final (successful) attack, the difference was only 7% in voting power. This shows that even single digit percent differences have been sufficient in the past.

But in case of Humpy, the page you linked several times doesn't mention the scenario from this report. In case of humpy it's (simply put) is a whale that can acquire large share of the governance token and influence the voting. This itself is not an issue and a trade-off of governance system as a whole. I think we both understand this part, I just wanted to make it clear.

As I understand, you use this "attack" to amplify yours that this whale can even DOS other voters and get even more advantage. Also, some votes have a very little number of holders competing. And for these small votes, you could be just a whale (as Humpy) and you would be free to influence the final decision.

As for points about it being a Medium severity issue, I see that basically getting the control of the voting can easily make a large loss of funds. The problem is that from my point of view, the DOS of 1m12s of even several-hour voting, i.e. the way that the attacker gains the additional voting power, is not sufficient to gain the control of that vote and influence the voting outcome. Hence, I believe it's low-severity because it allows to only DOS voting for 1m12s, which is basically no impact if it's not the last minute, and the user can freely vote after it, and even if it's the last-minute voting, unfortunately, I still don't believe it's sufficient to influence and gain any control of the governance.

Hence, the decision remains the same, reject the escalation and leave the issue as it is.

mrhudson890 commented 2 weeks ago

M is not completely blocked from Voting and they can freely vote all the other time out of the DOS.

Sorry for repeating, but the DOS itself is not important for this impact. The fact that users have the ability, but will rationally not all use it doesn't help the protocol's loss. We already agreed on that I think:

WangSecurity: Yep, it’s fair to say that someone won’t vote in that case I agree it’s fair to assume someone won’t vote.

M are users that will be rationally priced out of voting a second time, and it's only important in so far as it impacts the protocol's loss, and not the users' DOS.

But in case of Humpy, the page you linked several times doesn't mention the scenario from this report

I think you misunderstand what I'm establishing with citing Humpy's attacks. I'm establishing the likelihood of the preconditions of my scenario:

That close-call governance attacks are common. In those even single digit percent differences in voting power are crucial for the protocol.
That confusing proposals causing limited but large losses are used, such that voters are not strongly compelled to act. 24M loss in that case, but for 0.5% loss for medium severity in this contest a mere 0.5M is required.
That these attacks are repeated (therefore common), because the attacker pays little cost by owning the tokens needed (since they can sell them).
That opposition's voting power is highly variable. Again making close call votes probable.

a whale that can acquire large share of the governance token and influence the voting

Crucially, buying voting power is not an attack. Voting power gained by buying tokens is 100% legitimate. However, voting power gained by blocking opposing votes, is an attack. Again, the DOS itself is NOT the impact. Only its cumulative effect on the vote - the maliciously gained voting power.

As for points about it being a Medium severity issue, I see that basically getting the control of the voting can easily make a large loss of funds.

Thank you. I think this is the most important point.

.. is not sufficient to gain the control of that vote and influence the voting outcome

This contradicts the evidence I cited above in this and previous comments:

Gained voting power (M + N).
Even small but malicious gains being crucial in some (not theoretical!) attacks.
Proof of participation of sufficient smaller voters with cumulatively significant voting power in Maker governance. These are likely to be part of M (not re-voting due to being rationally priced out and not rationally considering the additional waste non-crucial). Our previous agreement on the fact that such users will exist and M is not 0 in many cases.
Our common agreement that last minutes voters are plausible, and will be part of N and that N is not 0 in many cases.

In my understanding, the bar for medium in this case is:

Specific plausible preconditions.
Specific loss amount threshold.

@WangSecurity since both criteria are satisfied, I think your reluctance to accept as medium severity is not a correct application of the rules.

The attack allows otherwise impossible and illegitimate voting outcome manipulation, that under constraints that are shown to repeatedly occur, will result in protocol losses above the 0.5% threshold. I do not understand how this can be judged below medium according to the rules of the contest.

IllIllI000 commented 2 weeks ago

In my previous comment I've stopped arguing for 1 and 2 due to our disagreement about time sensitivity and gas fee loss. I focused only on impact 3 - medium severity loss of funds for the protocol, for which we don't need to agree on time-sensitivity, or gas fees loss. This is because for impact 3, time-sensitivity is not a validity criteria (it's not DoS or griefing) - it's a profitable attack.

Impact 3 is the same as 3.1.1 from https://github.com/sherlock-audit/2024-06-makerdao-endgame/blob/main/vote-delegate/audit/20240703-cantina-report-maker-vote-delegate.pdf which is a Low, where it says An opposing vote delegate might be incentivized to block a certain percentage of locks to a vote delegate with opposing views and is thus a known and acknowledged issue, and is invalid for this contest

Note:

All previous audits content should be taken into account for known issues.

https://github.com/makerdao/sherlock-contest/blob/9a01337e8f82acdf699a5c1c54233636c640ca89/README.md?plain=1#L9

WangSecurity commented 2 weeks ago

@IllIllI000 thank you very much, as I understand, not only impact is the same, the issue is the same. Both this and 3.1.1 from Cantina report talk about the attacker calling reserveHatch to block users from calling lock, basically preventing from voting for the opposing vote delegate.

Hence, it's not only low-severity, but also a known issue from the previous audit. The decision remains the same, reject the escalation and leave the issue as it is.

@mrhudson890 if you want to further discuss this issue and me to elaborate on your last comment, contact me on Discord, since it's a known issue and cannot be accepted here.

mrhudson890 commented 2 weeks ago

@IllIllI000 That issue describes a completely different scenario and impact - statistically spamming reserveHatche on an unbounded amount of VDs, for possibly a prolonged period of time in the hopes of blocking some users, in an ineffective way.

It is indeed low severity:

Its has an enormous cost, making it highly unlikely (blocking all opposing VDs for a long period of time).
The attack is completely ineffective, since once the attacker starts blocking a user's VD, the user will not submit a reverting transaction (their wallet will prevent them), and so it doesn't increase the cost for the users.
No ability for the attacker to know whether they're wasting their spam on delegations with negligible voting power, or no incoming delegations at all.

The scenario I'm describing allows targeted blocking of only the specific transactions, via MMEV, that users have already signed and sent:

Targeting via MMEV allows this attack to be effective and cheap - only a specific user's VD, at the specific time of their delegation, only for users with non-negligible amount of voting power.
The user loses their gas fees, causing the effect of pricing out some voters, making the attack effective in that regard.

Not only are impacts different, the root cause and mitigations are different:

Root cause difference: in my issue - the problem is when the hatch starts - next block - allowing MMEV blocking. In their issue it doesn't matter when the hatch starts, and their root cause is the general mechanism (which I don't propose to change).
Mitigation difference: I offer concrete mitigation - moving the hatch later in the cycle. There is no mitigation offerred in that issue.

@WangSecurity So while the issues sound related, they don't share a root cause, impact, or mitigation, therefore that issue cannot invalidate this issue. Another way to see it is that if my issue is mitigated - their issue remains unchanged.

I am hoping you will respond also to my previous comment that addresses your latest arguments and decision on severity.

WangSecurity commented 2 weeks ago

I believe in both cases the root cause is the same, which is calling reserveHatch blocks the subsequent calls to lock for 5 blocks or HATCH_SIZE. Hence, I believe it's a known issue and should be invalid. The decision remains the same, reject the escalation and leave the issue as it is.

mrhudson890 commented 2 weeks ago

@WangSecurity I'm sad to see that you have ignored the arguments entirely, and only reiterated the claim these arguments refuted. Do you find my arguments incorrect or irrelevant? If so, why?

I believe in both cases the root cause is the same, which is calling reserveHatch blocks the subsequent calls to lock for 5 blocks or HATCH_SIZE.

This is not the root cause in 3.1.1. These two findings have different fixable flaws (root cause) they describe and recommend to fix.

3.1.1 does not care about the timing of the hatch window within the cycle. Instead it recommends: "A reasonable HATCH_SIZE should be chosen that gives liquidators enough time to perform the liquidation, as well as a reasonable HATCH_COOLDOWN such that the lock-blocking ratio remains low while still allowing frequent liquidations". Their root cause is HATCH_SIZE to HATCH_COOLDOWN ratio, they only point out the spam attack blocks a certain ratio (20%), their "flaw" is that this ratio should not be too high. Their scenario is statistical spam, which is not timed in any way relatively to the users' actions: "Assuming locks are randomly distributed over time", which means that the timing of the hatch within the cycle makes no difference in that issue.

In contrast, my finding's recommendation is "introduce a longer delay before the hatch takes effect. Instead of starting from the next block, it should start after M blocks, where M is a governance-controlled parameter.". This finding's root cause is that the hatch window starts immediately, I do not recommend any changes to HATCH_SIZE or HATCH_COOLDOWN. My finding's flaw is the timing, not the ratio.

Another way to see the findings do not share a root cause is that if mitigated, the the two findings are entirely independent from one another:

One can choose any HATCH_SIZE and HATCH_COOLDOWN (3.1.1 recommendation), and have my finding remain just as exploitable - because the hatch starts on block 1 always.
One can move the start of the hatch to the middle, fixing the MMEV issue entirely, and have 3.1.1 be exactly the same - because the exact same percentage of blocks is blocked as before.

If mitigation of one finding does not mitigate the other - they do not have the same root cause.

The additional arguments for why these findings are independent (not duplicates), as explained in the comment above that are also important:

Different scenarios for exploitation:
- 3.1.1: spamming an unbounded amount of contracts, for prolonged time, BEFORE any user calls them.
- this issue: blocking specific transactions, for specific users, AFTER they have submitted them.
Different costs: 3.1.1 - enormous costs, this issue - minimal cost.
Different protocol impact: 3.1.1 - no impact shown, this issue - ability to gain extra voting power cheaply and effectively.
Different user impact:
- 3.1.1 - user needs to wait in 20% of cases and will not waste any gas (wallet will fail to estimate gas).
- this issue - user's gas is wasted in 100% of cases, they need to wait in 100% of cases, they need to submit another transaction.

This report is not a known issue: different root cause, mitigation, scenarios, impacts.

WangSecurity commented 2 weeks ago

I still believe they have the same root cause. If we look at the description of the Cantina's finding:

When calling reserveHatch, lock calls happening within the next HATCH_SIZE blocks will revert. (This functionality exists to mitigate a liquidation-blocking issue.) Assuming locks are randomly distributed over time, spamming reserveHatch as soon as the cooldown passed allows blocking HATCH - SIZE / (HATCH_COOLDOWN + HATCH_SIZE + 1) = 19.23% of all locks. An opposing vote delegate might be incentivized to block a certain percentage of locks to a vote delegate with opposing views.

I believe the issue has the exact root cause as I've said in the previous message:

calling reserveHatch blocks the subsequent calls to lock for 5 blocks or HATCH_SIZE.

And:

When calling reserveHatch, lock calls happening within the next HATCH_SIZE blocks will revert.

The latter part of the Cantina's finding description is the impact and how many locks would be blocked.

The root cause of this issue:

The VoteDelegate contract includes a reserveHatch (RH) mechanism that, when called, prevents locking (entering) for a set number of blocks (5), starting from the next block The LockstakeEngine uses this lock function in its _selectVoteDelegate internal method, that's used throughout selectVoteDelegate, lock, and lockNgt user facing functions This design allows an attacker to exploit the reserveHatch mechanism to prevent users from executing these functions, particularly during critical voting periods

It's not as short as my description of the root cause or Cantina's but I believe these 3 sentences from the report can be rephrased in my definition of the root cause:

calling reserveHatch blocks the subsequent calls to lock for 5 blocks or HATCH_SIZE

Hence, the two issues have the same root cause. Yes, they show different scenarios and thus different impact, but still the root cause is still the same. Hence, it's a known issue, planning to reject the escalation and leave the issue as it is.

IllIllI000 commented 2 weeks ago

This is not the root cause in 3.1.1. These two findings have different fixable flaws (root cause) they describe and recommend to fix.

Also note that the different fixable flaws that you've pointed to, according to Sherlock's definitions, are attack paths and not root causes. There can be multiple attack paths for the same root cause.

mrhudson890 commented 2 weeks ago

@WangSecurity You are confusing context description and root cause. You cite descriptions of the reserveHatch mechanism, and say that it is the root cause of both issues.

Both findings must describe the original reserveHatch mechanism - this is the mechanism they both identify flaws in. However, these flaws are different and independent from each other as I explained in the arguments above (that you haven't addressed yet).

While Sherlock's definition is a bit vague, it is clear that a root cause should be uniquely identifiable (there should be one, "root"), therefore "primary" or "fundamental" must mean the most specific of all.

Root Cause: The primary factor or a fundamental reason that imposes an unwanted outcome

Your version of the root cause neither is the most fundamental, nor uniquely matches the unwanted outcome in either of the findings.

In particular 3.1.1 discusses the primary factor of the of HATCH_SIZE to HATCH_COOLDOWN ratio, which imposes the unwanted outcome of being able to statistically block 20% of blocks. They propose to choose that ratio carefully, without altering anything else (like hatch delay). The fundamental factor and the outcome are clearly matching.

My finding discusses the primary factor of the hatch delay being too small (=1) after the call, leading to the unwanted outcome of being able to frontrun transactions via MMEV and cause them to fail after being submitted. I propose to fix it by moving the hatch start, without altering anything else (like size, cooldown, and total cycle length). The fundamental factor and the outcome are clearly matching.

From broad to specific properties of each described system:

3.1.1: Ethereum > Maker system > VodeDelegate contract > reserveHatch mechanism > ratio of hatch/cooldown property > hatch_size=5,hatch_cooldown=15 > ... It's clear that "Maker" is too broad, so is "reserveHatch mechanism", because while they cover the finding, they aren't the most specific. Similarly hatch_size=5 is too specific and doesn't cover the full issue. The only "primary" or "fundamental" factor, that is uniquely identifiable, and both captures and covers the unwanted outcomes is ratio of hatch/cooldown property.
My finding: Ethereum > Maker system > VodeDelegate contract > reserveHatch mechanism > hatch_delay > hatch_delay = 1 block > ... the most fundamental property is the hatch_delay in general. Anything above that is too broad (not unique, not primary), anything below that ("hatch_delay = 1 block") doesn't cover all variations (like MMEV of 3 blocks). This is why I recommend that parameter to be settable by admin.

This understanding of what can be primary or fundamental is equivalent to "fixable flaw" definition I used, which is more easily operationalized (used). Clearly, considering VoteDelagate contract as the flaw is too broad, and hatch_delay=1 is too narrow.

These two definitions result in the same conclusion, of the two findings' root causes being separate.

fixable flaws that you've pointed to, according to Sherlock's definitions, are attack paths

Attack Path: A sequence of steps or actions a malicious actor takes to cause losses or grief to the protocol or users and/or gain value or profit.

The flaws I point out are not attack paths, because you can't fix attack paths: In my finding the hatch delay is the flaw, and the fix. It is not the attack path - MMEV is the attack path - I do not propose to "fix" MMEV. In 3.1.1 the size/cooldown ratio is the flaw, and the recommendation. It is not an attack path - spamming RH on all VDs is the attack path - they do not propose to fix "ability to spamming".

WangSecurity commented 2 weeks ago

Thank you for this extensive description. But, I still believe what you describe is the attack paths and the actual root cause of both findings is what I've described in my previous https://github.com/sherlock-audit/2024-06-makerdao-endgame-judging/issues/95#issuecomment-2320662063 .

It's the primary factor of both issues, what you describe is different attack paths that lead to different outcomes. I see you've said I didn't address each of your points separately, because I don't see how it adds value to the discussion if I just quote your comment and say "it's incorrect, this is no the root cause, this is different scenarios triggered by the same root cause".

The decision remains the same, reject the escalation and leave the issue as it is.

mrhudson890 commented 2 weeks ago

@WangSecurity

I still believe what you describe is the attack paths

what you describe is different attack paths that lead to different outcomes

You seem to have missed my whole response to the "the attack path" argument above.

Here it is again, in more detail. Sherlock's definitions:

Root Cause: The primary factor or a fundamental reason that imposes an unwanted outcome

Attack Path: A sequence of steps or actions a malicious actor takes to cause losses or grief to the protocol or users and/or gain value or profit.

Now please see what I say the root causes are:

3.1.1:

fundamental reason: HATCH_SIZE to HATCH_COOLDOWN ratio.
unwanted outcome: hatch can be closed for 20% of blocks.
attack path: spamming all VDs for indefinite time.
mitigation: set HATCH_SIZE to HATCH_COOLDOWN carefully, without changing anything else.

My finding:

fundamental reason: hatch DELAY is too short
unwanted outcome: allowing frontrunning via MMEV to block
attack path: frontrunning via MMEV
mitigation: increase hatch DELAY without changing anything else.

The unwanted outcome is the attack or related to it, but the root cause is the reason the outcome is possible. It's the "fundamental reason that imposes an unwanted outcome".

How can one possibly confuse the "bug" - part of logic or code, with the attack path - what the attacker does?

How can one possibly interpret "hatch delay is too short" to be "the attacker's actions"? It's in the code. And it's the only thing that allows my issue's attack path.

Or "hatch size / cooldown ratio" to be "the attacker's actions"? It's in the code. And it's the only thing that allows 3.1.1's attack path.

IllIllI000 commented 2 weeks ago

unwanted outcome: hatch can be closed for 20% of blocks.

There's nothing that says 20% of blocks not allowing locking is not wanted. That in and of itself is not a vulnerability. The unwanted outcome is being able to block other people's votes

unwanted outcome: allowing frontrunning via MMEV to block attack path: frontrunning via MMEV

Based on the above, you've written unwanted outcome: allowing <attack-path> to block [other people's votes]. They're both the same unwanted outcome: blocking people's votes.

In both cases, given the code as written, isn't the fact that calling reserveHatch() prevents calling lock() for a certain number of blocks, the reason both attacks are possible? Would either attack be possible if it didn't block calls to lock()?

Root Cause: [The primary factor or a fundamental reason] that imposes [an unwanted outcome] -> [Calling reserveHatch() prevents calling lock() for a certain number of blocks] that imposes [blocking other people's votes]. You yourself have that in your root cause in the submission: This design allows an attacker to exploit the reserveHatch mechanism to prevent users from executing these functions, particularly during critical voting periods.

WangSecurity commented 2 weeks ago

Here's how I see these 2 issues.

Root cause: The same: calling reserveHacth blocks the calls to lock (and other functions) for HATCH_SIZE, i.e. 5 blocks. *Note: you say that your report's root cause is short hatch delay, but there's no delay in the current code and you suggest implementing it. That's why I cannot agree it's the root cause.

The attack path:

This report: front-running calls to lock with calling reserveHatch
Cantina's report: spamming the call to reserveHatch Note: different attack paths don't mean these are different issues

The impact:

This report: prevent users from voting against your position.
Cantina's report: blocking 20% of the locks and blocking a certain percentage of locks to a voting delegate with opposing views. Note: these impacts are extremely similar and ultimately about blocking users from voting against the attacker.

The mitigation:

This report: introduces a new variable, hatchDelay, which moves the hatch later in the cycle. For example, if hatchDelay is also 5 blocks, we call reserveHatch at block 15, we can call lock in blocks 15-20, then we cannot call lock in blocks 20-25. This is a partial solution, that prevents front-running, but still allows you to block someone from voting against you.
Cantina's report: their solution focuses on allowing frequent liquidations, but decreasing the lock-blocking ratio.

Indeed, the reports suggest 2 different fixes, but they fix the attack scenarios, not the root cause. Your report fixes the front-running calls to lock to block them, and Cantina's fix fixes the spamming of reserveHatch to prevent users from calling lock.

In other words, both fixes don't mitigate the block calls to lock but mitigate the different scenarios. And as @IllIllI000 said correctly above, both attacks are possible because reserveHatch blocks users from calling lock and to mitigate this, reserveHatch shouldn't block calls to lock.

That's why I believe it's a known issue, even though two reports address 2 different scenarios, the issue is still the same. Planning to reject the escalation and leave the issue as it is.

mrhudson890 commented 2 weeks ago

calling reserveHacth blocks the calls to lock (and other functions) for HATCH_SIZE, i.e. 5 blocks.

@WangSecurity This is NOT the vulnerability, it is the mechanism. Blocking one or the other direction from the chief is the purpose of reserveHatch. Blocking "in general" is not a root cause - there is no way to avoid "some" locking due to cheif's limitations. It is the same as saying that Blockchains, Ethereum, Maker, VodeDelegate contract - are all root causes because without them there is no bug. Each of these, despite being A cause, is NOT the ROOT cause.

There is no way around blocking locks (and therefore votes) to chief for some of the time - this IS the reserveHatch mechanism.

The root cause is:

Root Cause: The primary factor or a fundamental reason that imposes an unwanted outcome

This is the minimal (fundametal) reason that allows the outcome, and this is what needs to be mitigated to prevent it - otherwise it didn't fundamentally impose the unwanted outcome:

This report: the timing of the start of the hatch, with no regard to total size and cooldown.
3.1.1 sufficient total blocked time (for liquidations), sufficient total unblocked time (for voting) - the ratio of these two is the root cause. The start of the hatch is not a consideration.

you say that your report's root cause is short hatch delay, but there's no delay in the current code and you suggest implementing it. That's why I cannot agree it's the root cause.

First, "novelty" doesn't even matter, because the absence of some code can be a root cause. The absence of whitelisting, access control, input validation, slippage protection - are all valid root causes - regardless of being present originally.

Second, this claim is factually incorrect. In the current code there is a hatch delay - it is 1 block. If there was no delay, the hatch would start from the hatch trigger, but it applies from the next block. The reserveHatch block itself is a cooldown block (reserveHatch cannot be called in the same block again).

For clarity, the reason it is done so originally - 1 block delay - is that 0 delay would be vulnerable to regular MEV (frontrunning). The delay is already there to prevent MEV, but it fails for MMEV. Note also in my mitigation diff: bool beforeHatch = block.number < hatchTrigger + hatchDelay;, if the hatchDelay is 1 we get the original's implementation's behavior of starting the hatch from next block. The delay is already there, it's hardcoded to 1. The current delay being insufficient is the root cause.

but they fix the attack scenarios, not the root cause

I think this is an illogical understanding of what a root cause is. You cannot "fix a scenario but not the root cuase" - it makes no sense. If the finding is fixable it's fix removes the root cause, otherwise it was not the "fundamental / primary" property, and it did not impose the unwanted behavior.

both fixes don't mitigate the block calls to lock but mitigate the different scenarios. .. both attacks are possible because reserveHatch blocks users from calling lock and to mitigate this, reserveHatch shouldn't block calls to lock

Again, this misunderstands the design. Blocking calls to locks is a necessary feature of the design. What you suggest - "reserveHatch shouldn't block calls to lock" is not possible - it is its only purpose.

To further drive home the difference between the two issues, both root causes and unwanted outcomes, please see these different hatch blocking cycles ("B" is blocked, "C is open and in cooldown") each starting form the RH call. As 3.1.1 notes <20% ratio as too little, we'll use less than 20% as "bad" in either extreme.

Current (delay 1, size 5, cooldown 20) - note that it starts with cooldown (cannot call RH, and start B on next block):

CBBBBBCCCCCCCCCCCCCCCCCCC: This report: :x: ; Cantina 3.1.1: :heavy_check_mark:

Delay 0:

BBBBBCCCCCCCCCCCCCCCCCCCC: This report: :x: ; Cantina 3.1.1: :heavy_check_mark:

Delay 3:

CCCBBBBBCCCCCCCCCCCCCCCCC: This report: :heavy_check_mark: ; Cantina 3.1.1: :heavy_check_mark:

Size too short, delay ok:

CCCCBBCCCCCCCCCCCCCCCCCCC: This report: :heavy_check_mark: ; Cantina 3.1.1: :x:

Cooldown too short, delay ok:

CCCBBBBBBBBBBBBBBBBBBBBBB: This report: :heavy_check_mark: ; Cantina 3.1.1: :x:

Ratio ok, delay too short:

CBBBCCCC: This report: :x: ; Cantina 3.1.1: :heavy_check_mark:

Cooldown too short, delay ok, mixed close and open:

CCCBBCBBBBBBBBBBBBBBBBBBBBBBBBCBBBBBBBBBBBBB: This report: :heavy_check_mark: ; Cantina 3.1.1: :x:

As you can see the vulnerabilities are completely unrelated beyond the fact that both are concerned with the reserveHatch mechanism.

@IllIllI000 Kindly support and explain your arguments. You've previously insisted on confusing attack path and root cause - one being implemented in code and another one actions done by attacker. Now you're making other confused but confident claims like "You yourself have that in your root cause in the submission" quoting out of context a single sentence out of a whole section. These are troll tactics and unbecoming to 2nd place on the leaderboard.

You also now "mistake" the mechanism for its vulnerabilities:

The unwanted outcome is being able to block other people's votes .. Would either attack be possible if it didn't block calls to lock()?

Again, there is no way around blocking locks (and therefore votes) to chief for some of the time - this IS the reserveHatch mechanism. Maker understands it, Cantina understands it, even I understand that. You are ignoring this. The locking is the design, without it there is no reserveHatch - it's its only purpose. The vulnerabilities are in the details of the blocking cycle. Cantina focuses only on the ratio of size/cooldown - to allow sufficient total time for each one-way traffic, and I speak to the timing of hatch start - to prevent MMEV.

There's nothing that says 20% of blocks not allowing locking is not wanted

It is not said 20% is not-wanted, it is said that the ratio can be too little, and that currently it's slightly below 20%. It says the two sides need to be considered, with one side (liquidations) not being too dominant (over voting):

A reasonable HATCH_SIZE should be chosen... ... as well as a reasonable HATCH_COOLDOWN such that ...

They are NOT saying - "do not block locks" - because it is the core functionality of that mechanism.

IllIllI000 commented 2 weeks ago

WangSecurity has been very patient with you, and has been trying to show you why this is a duplicate and invalid. Where I see that the rules are applicable and have not been mentioned, I try to mention them, since I contributed to writing them along with WangSecurity, and think I can help with that. Despite what you may think, I wasn't trolling you. It's an exact quote from your report, in from the section I mentioned, and my intention was to show you that despite your assertion that it's not a root cause under your definition, it's the one Sherlock uses, and makes sense in the context of the report submission template.

Kindly support and explain your arguments. You've previously insisted on confusing attack path and root cause - one being implemented in code and another one actions done by attacker.

I'm not confusing anything. Take a look here. I believe your claim is that only specific lines of code, or the minimal point fix of an issue, may be root causes (e.g. "These two findings have different fixable flaws (root cause) they describe and recommend to fix" above), but that's not the case. As I've highlighted, conceptual issues, as occur in this bug, can be root causes as well. Here, the {design choice} is to work around the DOS protection, by introducing the reserveHatch() function. The choice to [depend on Protocol X for admin calls] is a mistake as [it] will cause [any call] to revert -> The choice to [introduce reserveHatch() for enabling a window of blocking locks] is a mistake as [calling reservehatch()] will cause [other people's votes] to revert. Do you disagree with that replacement being applicable? The template marks the second half of the sentence as the {root cause}, and WangSecurity and I have been saying this design decision leads to this unwanted outcome.

A summary with the following structure: **{root cause} will cause [a/an] {impact} for {affected party} as {actor} will {attack path}** link -> {calling reserveHatch() will cause other people's votes to revert} which will cause {voting to fail} for {legitimate users of the protocol} as {an attacker} will {use reserveHatch() over many blocks, or in a targeted fashion via MMEV, to block votes}

Given the above summary, both your submission and 3.1.1 are the same issue, just with different attack paths. You seem to be most focused on the MMEV aspect for the recent discussion, but you also list the same attack path as 3.1.1, in your Scenario 3, so you too have said they're the same issue in your submission - or am I misunderstanding?

Again, there is no way around blocking locks (and therefore votes) to chief for some of the time

Just because the DOS protection exists, that doesn't mean that feature can't be modified, or some other new voting mechanism without the limitation couldn't be designed; they just made a design decision not to.

They are NOT saying - "do not block locks" - because it is the core functionality of that mechanism.

They added a mechanism that caused a bug (blocking votes). It doesn't matter that that's its sole purpose.

WangSecurity commented 2 weeks ago

Thanks to both of your comments.

@mrhudson890 excuse me for misunderstanding, what you meant by hatch delay and I see the 1 block hatch delay now. But, still these issues are duplicates.

As @IllIllI000 correctly said above, this is the conceptual issue of blocking the locks after calling reserveHatch. Both scenarios are still possible due to this functionality.

And, I see you disagree that the fixes in two reports fix the scenario. But I still stand by my words. Fix in Cantina’s report fixes the scenario of spamming RH to block locks, while your report fixes the scenario of front-running. Still, none of those fixes mitigate the problem of locks being blocked from voting. Hence, I still stand by my previous decision that the root cause is the same, as correctly said above, it’s the conceptual mistake of blocking locks after the RH call. Without this, both reports wouldn’t be possible and both reports fix the scenarios/attack paths they identified. Hence, the decision remains the same, reject the escalation and leave the issue as it is.

mrhudson890 commented 2 weeks ago

WangSecurity: this is the conceptual issue of blocking the locks after calling reserveHatch .. none of those fixes mitigate the problem of locks being blocked from voting .. it’s the conceptual mistake of blocking locks after the RH cal

IllIllI000: The choice to [introduce reserveHatch() for enabling a window of blocking locks] is a mistake as [calling reservehatch()] will cause [other people's votes] to revert .. They added a mechanism that caused a bug (blocking votes). It doesn't matter that that's its sole purpose.

@WangSecurity Your position is that the reserveHatch mechanism's blocking of locks is the bug / problem / mistake. However, this is a misunderstanding of the system:

The cheif contract is designed to prevent flashloan manipulation. This is done by only allowing free only if no lock was done in the same block from that caller.
VoteDelegate delegates voting power into the chief, it must be subject to that limitation and it must call lock and free to move MKR in and out of chief.
But free s are needed for liquidations (and undelegations).
This is the only reason why reserveHatch exists - to block locks, to ensure frees are not blocked for liquidations.

It's like a traffic light for regulating traffic.

Saying that blocking locks is a "conceptual mistake", "problem", "mistake", "bug" is like saying that a traffic light's blocking of traffic during (red light) is a bug / mistake / problem. But it is ALL it's there to do.

And it must be done, just like blocking conflicting traffic. Otherwise the system is vulnerable in another way. If you remove the chief's flashloan protection - you break voting, if you remove the reserveHatch - you break liquidations. This is the mechanism that balances these constraints, and it can only do it by blocking locks.

You suggest that blocking is avoidable. This is incorrect. Maker's devs, Cantina's and ChainSecurity's auditors, and all the participants in this contest haven't been able to suggest an alternative mechanism.

Given that the need for blocking locks is an necessary goal and purpose, with no safer alternatives, it cannot be a "mistake" or "bug", conceptual or otherwise.

Therefore, I don't think it's intellectually honest to say that the single unavoidable goal and property of the mechanism - blocking locks - can be the root cause of two avoidable different issues with it. This claim could be considered if both issues were unavoidable. However, both are avoidable as shown, by addressing their specific root causes: hatch delay too short in this report, and potential size/cooldown imbalance for 3.1.1. In the examples (previous comment) I also show they are independently avoidable (each can be fixed without the other)

Additionally, tying this back to the definition: the "reason" must impose the unwanted behavior. This factor cannot impose it, since both issues can be resolved with it in place - the locks are still blocked some of the time as needed. In contrast, the factors I outlined DO impose them, because they are necessary and sufficient to avoid / fix those issues. Therefore, the root causes, according to Sherlock's definition, are the ones I outlined. Specifically, they are different for the two issues.

IllIllI000 commented 2 weeks ago

The current on-chain vote delegation code does not have a reserveHatch() function. Why? Because only the owner can unlock and claim the tokens that have been delegated to them. The design decision of this update is to allow the original token delegator to claim back their tokens. The maker project has acknowledged the risk that this function causes, rather than design a new system, because it's low risk. Alternatively, if they really wanted to remove the new function but allow reclaiming of tokens, they could do it by wasting gas and creating a separate vote delegate contract per-delegator rather than per-delegatee, sending the mkr tokens to/from the delegatee contract during lock/unlock, and only locking/unlocking at the delegatee contract when there is a vote. Mitigations are not required fields of submission, and usually try to suggest changes that minimally impact the current design. I hope that the two methods that don't require a reserveHatch() are enough to finally close this submission, and that it doesn't instead devolve into no true Scotsman fallacies.

WangSecurity commented 2 weeks ago

@mrhudson890 I understand that it's how the protocol works and it's intended, still it's the block of locks that allow for both attacks. Moreover, the team is aware of the risks that locks can be locked and prevented for locking (for 5 blocks). Hence, my decision remains, I still believe these 2 reports are the same issue and only different attack paths/scenario.

Additionally, it's a common case that there are 2+ scenarios from the same root cause and they all can be fixed individually, but they still come from the same underlying reason, which I believe is the case.

mrhudson890 commented 2 weeks ago

@IllIllI000

I hope that the two methods that don't require a reserveHatch() are enough

Your first method breaks liquidations.

Your second method breaks delegation because for a vote change an unlimited number of per-delegatee contracts will need to be called to update their vote in chief.

You suggest that blocking is avoidable. This is incorrect. Maker's devs, Cantina's and ChainSecurity's auditors, and all the participants in this contest haven't been able to suggest an alternative mechanism.

WangSecurity commented 2 weeks ago

The decision remains, planning to reject the escalation and leave the issue as it is. I'll apply this decision in several hours.

mrhudson890 commented 2 weeks ago

@WangSecurity after careful consideration of your position, I believe there's a crucial distinction being missed.

it's a common case that there are 2+ scenarios from the same root cause and they all can be fixed individually, but they still come from the same underlying reason

While it's true that multiple scenarios can stem from the same root cause (there are 3 scenarios in this report), this is not the case between these two findings. The core issue isn't that blocking occurs, but how it's incorrectly applied in each case. The reserveHatch mechanism must block locks, but the "how" is what these issues address - and they highlight completely different, unrelated, and independent "hows".

Your current position - that the 'block of locks' is the root cause for both attacks - conflates the mechanism's intended function with its specific vulnerabilities. This is like saying that a car's ability to move is the root cause of both speeding and wrong-way driving accidents. While movement enables these incidents, it's not their root cause.

block of locks that allow for both attacks

Furthermore, this interpretation contradicts Sherlock's definition: blocking itself does not "impose an unwanted outcome" as it remains in all cases (so does not impose), and is therefore not fundamental or primary. The specific, incorrect implementations of blocking are what impose the unwanted outcomes and are the fundamental distinct reasons.

the team is aware of the risks that locks can be locked

The team's awareness of some risks due to locking does not include the voting manipulation via MMEV to gain extra voting power, as detailed in this report. This risk is not acknowledged in any of the extensive available sources of truth.

I still believe these 2 reports are the same issue

As shown in my previous comments (1, 2, 3), these issues share almost nothing in common beyond involving blocking locks (which is the unavoidable core of the mechanism).

As an impartial arbiter of Sherlock's rules, I urge you to reconsider your position regarding the issue being a duplicate of 3.1.1, and instead judge it on its own merits.

WangSecurity commented 1 week ago

Thank you for another comment and expressing your position again. But, my decision remains. I still believe that blocking of locks is what allows for both attacks and it's only different scenario's/attack paths as I explained in my several previous comments.

It is blocking that leads to an unwanted outcome, i.e. the attacker blocks users from voting against them. Hence, my decision remains that these 2 reports are duplicates. When I said that this risk is known, I meant that the team is aware that voting can be blocked for certain period of time, which is also the case in this finding. Hence, my decision remains as in several previous comments, planning to reject the escalation and leave the issue as it is. I will apply this decision in several hours.

Note: let's refrain from using analogies(e.g. with a car above, with traffic lights in the previous comment or similar), they're irrelevant to the discussion.

WangSecurity commented 1 week ago

Result: Invalid Unique

sherlock-admin4 commented 1 week ago

Escalations have been resolved successfully!

Escalation status:

zzykxx: rejected

sherlock-audit / 2024-06-makerdao-endgame-judging

mrhudson890 - `VoteDelegate`'s `reserveHatch` allows multi-block MEV to grief LSE users during time-sensitive voting periods #95

VoteDelegate's reserveHatch allows multi-block MEV to grief LSE users during time-sensitive voting periods

Summary

Root Cause

Internal pre-conditions

External pre-conditions

Attack Path

Impact

PoC

Mitigation

Impact unchallenged

Internal pre-conditions

Unreasonable user behavior assumptions

Coordinated occasional attacks

Lack of upgrade path

Detailed scenario for selective governance attack (Scenario 3 in the report)

Rational users

Griefing is in scope

1. Humpy attack in original report

2. DOS duration is 6 blocks

3. Time-sensitivity constraint

4. Gas fees as loss of funds

1. Time-sensitivity

2. Loss of gas fees

3 Scenario 3 - still unaddressed

`VoteDelegate`'s `reserveHatch` allows multi-block MEV to grief LSE users during time-sensitive voting periods