waku-org / nwaku

Waku node and protocol.

bug: RLN inconsistent state rejects valid messages #2928

Open alrevuelta opened 1 month ago

alrevuelta commented 1 month ago

Originally reported in https://github.com/waku-org/nwaku/issues/2870, this issue refers to the same problem but expands on its root cause, with a possible solution to fix it.

In short, valid messages containing valid RLN proofs are rejected by some nodes with `invalid message: provided root does not belong to acceptable window of roots`. This bug is difficult to reproduce, but it shows up more frequently when new RLN memberships are being inserted in the contract. With a static set of memberships it won't show up.

The problem is the following:

As shown in the image, we have a race condition during the interval between a membership being inserted in the contract and each node updating its local Merkle tree. If nwaku1 creates a message at t1 and propagates it to the network, nwaku2 will flag it as invalid while nwaku3 flags it as valid. There is an inconsistency between the nodes: their Merkle trees are not guaranteed to be the same at any given time.

[image: timeline of the race window between nwaku1, nwaku2, and nwaku3 syncing the new membership]

Knowing the issue, there would be a couple of ways to mitigate it (but not completely fix it); one of them is quoted and discussed in the replies below.

So my proposal to fix this would be the following. We need to somehow connect RLN timestamps (aka epochs) to blockchain timestamps (aka blocks or slots). This would allow a proper ordering of events, so we can know whether the RLN message we are receiving is from the future. In other words, if we receive an RLN message with timestamp 10 but we have only synced the blockchain up to timestamp 9, then we know we can't process this message yet, since we are not up to date. In this case, the node would need to sync first and validate after. With this, we would no longer reject a valid message, since we sync first and process after. Of course, the solution has to be well designed to avoid unnecessary delays.

It's up for discussion how "connecting RLN timestamps to blockchain timestamps" would work. If using Ethereum L1, slots could be an option, since they happen at fixed 12-second intervals. It's unclear how this would work on L2s, and as a requirement we need a solution that is portable across chains/layers.
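For illustration, a minimal sketch in Python of what pinning RLN epochs to L1 slots could look like (the beacon-chain constants are real; `sync_to` and `validate_proof` are hypothetical stand-ins, not nwaku APIs):

```python
# Ethereum mainnet beacon-chain constants
GENESIS_TIME = 1606824023  # beacon-chain genesis, 2020-12-01 12:00:23 UTC
SECONDS_PER_SLOT = 12

def slot_for(timestamp: int) -> int:
    """Map a unix timestamp to the Ethereum slot that contains it."""
    return (timestamp - GENESIS_TIME) // SECONDS_PER_SLOT

def handle_rln_message(msg, synced_slot: int) -> bool:
    """'Sync first, validate after': if the message's epoch maps to a slot
    beyond our synced chain view, we are not up to date, so we sync before
    judging the proof instead of rejecting the message."""
    msg_slot = slot_for(msg.timestamp)
    if msg_slot > synced_slot:
        synced_slot = sync_to(msg_slot)  # hypothetical chain-sync call
    return validate_proof(msg)           # hypothetical RLN proof check
```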

rymnc commented 1 month ago

> Use an event-based approach (eg websockets). Subscribe to the tip of the blockchain and, for every new block head event, update the tree. This reduces the "ambiguity window", but in theory the problem can still happen. And syncing state based on websocket-like events showed clear problems in the past, as afaik there are no guarantees that you won't miss an event.

this approach is probably less robust than our current implementation
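For reference, a minimal sketch of the event-driven idea using web3.py's block filter API (the endpoint and `update_tree` handler are hypothetical; a polling filter is shown as the simplest variant, and the same missed-event caveat applies):

```python
import time
from web3 import Web3  # assumes a web3.py environment

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))  # hypothetical endpoint

def update_tree(block):
    """Hypothetical handler: fetch the membership insertions included in
    `block` and apply them to the local Merkle tree."""
    ...

# a 'latest' filter notifies us of new block hashes at the chain tip
new_blocks = w3.eth.filter("latest")
while True:
    for block_hash in new_blocks.get_new_entries():
        update_tree(w3.eth.get_block(block_hash))
    time.sleep(1)
    # caveat from the quote above: if the connection drops or the remote
    # node prunes the filter, notifications can be missed, so a periodic
    # reconciliation against block numbers is still needed
```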

rymnc commented 1 month ago

As discussed in our call, the option of linking RLN epochs to the underlying blockchain's slots is a promising idea, with some noteworthy tradeoffs:

  1. the slot time will become our epoch size, i.e. with Ethereum mainnet it's 12 seconds, and therefore the RLN epoch size should be aligned to that
  2. it's unclear if this solution will work with chains that do not have slot-based block production / fixed-time intervals for block production
  3. it will need a reconciliation mechanism for missed block proposals

Here's what the pseudocode would look like:

```
[cached] lastProcessedBlockNumber
[cached] lastProcessedBlockTimestamp
[constant] blockSlotSize

func onNewBlock(block) ->
  # apply the membership changes in this block to the local Merkle tree
  processBlock(block)

  lastProcessedBlockNumber = block.number
  lastProcessedBlockTimestamp = block.timestamp

func validateMessage(wakuMessage) ->
  validateEpoch(wakuMessage)

  if wakuMessage.timestamp > lastProcessedBlockTimestamp + blockSlotSize:
    # we may be lagging behind the block that the sender has processed,
    # so we force-process the latest block before verifying the proof.
    # note: this leads to a probable DoS vulnerability, where the sender
    # intentionally spams a node with this condition and causes
    # unnecessary calls to the RPC provider / Ethereum node
    forceProcessLatestBlock()

  # verify the proof against the (now up-to-date) window of roots
  verifyProof(wakuMessage.proof)
```

We will have to do some preliminary analysis on the variance of block times of the L2s we may be interested in for mainnet deployment, to estimate actual values of blockSlotSize and, subsequently, the epochSize.
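As a starting point for that analysis, a minimal sketch of estimating blockSlotSize from sampled block timestamps (the k-sigma margin is an assumption for illustration, not an agreed rule):

```python
import statistics

def estimate_block_slot_size(block_timestamps: list[int], k: float = 3.0) -> float:
    """Given timestamps of consecutive blocks sampled from a chain, pick a
    blockSlotSize covering the mean inter-block time plus k standard
    deviations, so the 'message from the future' branch rarely fires on
    honest traffic."""
    deltas = [b - a for a, b in zip(block_timestamps, block_timestamps[1:])]
    return statistics.mean(deltas) + k * statistics.stdev(deltas)

# e.g. on a chain with ~2 s blocks and some jitter:
# estimate_block_slot_size([0, 2, 4, 7, 9, 11, 14]) -> ~3.9 s
```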

cc: @alrevuelta

alrevuelta commented 1 month ago

@rymnc thanks! In general I like the idea, but perhaps there is an attack vector here.

wakuMessage.timestamp is set by the sender. An attacker could artificially set a timestamp in the future, which would force forceProcessLatestBlock to be executed. Perhaps it's not a big deal, because if the node is already in sync, forceProcessLatestBlock will return without any onchain call? But perhaps we need to add some checks here to prevent that. For example, reject timestamps greater than the lastHeadBlockTimestamp, but this has other problems: if the clock of the sender is off by a bit, we may reject valid messages.
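For illustration, one possible guard, combining a bounded clock-drift check with throttling of the forced sync (the constants and names here are assumptions, not an agreed design):

```python
import time

MAX_CLOCK_DRIFT = 20.0    # seconds of sender clock skew we tolerate (assumption)
MIN_FORCE_SYNC_GAP = 2.0  # minimum seconds between forced syncs (assumption)
_last_force_sync = 0.0

def should_force_sync(msg_timestamp: float,
                      last_processed_block_timestamp: float,
                      block_slot_size: float) -> bool:
    global _last_force_sync
    now = time.time()
    # reject timestamps too far in the future outright, but allow a bounded
    # drift so that a slightly-off sender clock doesn't invalidate messages
    if msg_timestamp > now + MAX_CLOCK_DRIFT:
        raise ValueError("timestamp too far in the future")
    # only force-process the latest block if the message is ahead of our
    # chain view AND we haven't just done so, limiting the DoS surface on
    # the RPC provider / Ethereum node
    if msg_timestamp > last_processed_block_timestamp + block_slot_size:
        if now - _last_force_sync >= MIN_FORCE_SYNC_GAP:
            _last_force_sync = now
            return True
    return False
```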

On the other hand, I'm concerned that this still doesn't deterministically ensure messages are processed correctly, since it relies on the local timestamp of each node. The shorter the block time, the more impact a biased clock will have.

But in any case, I think it's a great way to start.