bug: invalid RLN messages when other nodes are registering memberships

stubbsta commented 3 weeks ago

Problem

During RLN testing with different network sizes and RLN config, it was noticed that sometimes when there are many consecutive RLN membership registrations occurring and many messages published by other already registered nodes, then some messages in the network are invalid. Logging shows: invalid message: provided root does not belong to acceptable window of roots

Impact

If these messages are not invalid they should be relayed.

To reproduce

Using the waku-simulator https://github.com/waku-org/waku-simulator:

env. config:

NWAKU_IMAGE=harbor.status.im/wakuorg/nwaku:v0.30.0-rc.2
NUM_NWAKU_NODES=2
RLN_RELAY_EPOCH_SEC=60
RLN_RELAY_MSG_LIMIT=200
TRAFFIC_DELAY_SECONDS=0
MSG_SIZE_KBYTES=3
MAX_MESSAGE_LIMIT=300

Run the simulator and wait for the network to stabilise and start publishing messages
Get the multi-address of one of the nwaku nodes

Use the multiaddress in this script:

for i in {1..10}; do   
docker run --rm --network waku-simulator_simulation alrevuelta/go-waku-light:4fabb22 \
--eth-endpoint=http://foundry:8545 \
--contract-address=0xCf7Ed3AccA5a467e9e704C703E8D87F634fB0Fc9 \
send-messages-loop \
--priv-key=ac0974bec39a17e36ba4a6b4d238ff944bacb478cbed5efcae784d7bf4f2ff80 \
--user-message-limit=200 \
--message="light client sending a rln message" \
--content-topic=/basic2/1/test/proto \
--pubsub-topic=/waku/2/rs/66/0 \
--cluster-id=66 \
--lightpush-peer=/ip4/10.2.0.14/tcp/60000/p2p/16Uiu2HAmGwLbAZEyiUo5CJFej8i6e7mGvfZCwSWsagPfvgRin4F3 \
--message-every-secs=1 \
--epoch-size-secs=60 \
--amount-message-to-send=1

if [ $? -ne 0 ]; then
    echo "Command failed at iteration $i"
    break
fi
done

Check nwaku node logs for invalid messages or check the grafana RLN Invalid messages graph

Expected behavior

These messages should be valid.

nwaku version/commit hash

v0.30.0

alrevuelta commented 3 weeks ago

Could you try to reproduce it with just registering new memberships? Would be great to reproduce it with less moving parts.

./go-waku-light \ --eth-endpoint=http://foundry:8545 \ --contract-address=0xCf7Ed3AccA5a467e9e704C703E8D87F634fB0Fc9 \ register \ --priv-key=REPLACE_YOUR_PRIV_KEY \ --user-message-limit=200

stubbsta commented 3 weeks ago

Just registering one account at a time doesn't reliably induce the issue, but adding --amount=10 to the register command does make invalid messages appear. I tested using: -harbor.status.im/wakuorg/nwaku:v0.30.0 -block-time=3

I will do some more tests using quay.io/wakuorg/nwaku-pr:2871 and different block-times

alrevuelta commented 3 weeks ago

does make invalid messages appear.

Is the invalid reason the same as before? "invalid message: provided root does not belong to acceptable window of roots"?

stubbsta commented 3 weeks ago

yes, same reason

stubbsta commented 3 weeks ago

I also get the same issue using quay.io/wakuorg/nwaku-pr:2871 and block-time=3 I haven't seen it using that same image and block-time=60, but then I often get this error when the running the registration:

time="2024-07-04T10:22:59Z" level=fatal msg="timeout expired waiting for tx to be validated txHash: 0xaa137f6d0833f2f1cc03d546df2277435ccaa58121bb4cac574b0fde0e42cdb3: context deadline exceeded"

alrevuelta commented 3 weeks ago

I haven't seen it using that same image and block-time=60, but then I often get this error when the running the registration:

this is ok. i) its a go-waku-light error not nwaku and ii) that's a block time that i don't expect to see in reality. can be trivially fixed (increase timeout) but not worth.

stubbsta commented 2 weeks ago

Running this config on the wakudev machine reliably reproduces the issue. The average proof generation time on that machine is ~250ms
On another machine where the average proof generation time was 100ms, this config did not produce any invalid messages. The issue was only observed with NUM_NWAKU_NODES=30 on this machine

In both cases the anvil block-time was 12s.

rymnc commented 1 week ago

On another machine where the average proof generation time was 100ms, this config did not produce any invalid messages.

This is interesting. perhaps we should increase the delay between each message being sent? will reproduce

stubbsta commented 1 week ago

I think we discussed the message rate somewhere, but as a reminder in case it is relevant, the message rate using the waku-simulator rest-traffic script is the time taken to send the message (including proof generation) for each node plus the TRAFFIC_DELAY_SECONDS. In the case of 1 node and TRAFFIC_DELAY_SECONDS=0 the delay in publishing is roughly equal to the proof generation duration

alrevuelta commented 1 week ago

so the problem here is that some nodes are processing the new membership faster than others? so during this short time there is some kind of fork in the network? so the sender picked up the new membership (root_a) but the received rejects it because it didnt pick up it yet (root_b)

it will happen rarely, but we have to properly fix this, perhaps non trivial? as a random idea, we could use the underlying blockchain as a clock, that relate to rln epochs. eg rln-epoch 1000 maps to slot 1000. if you get a new message with epoch 1000 but you haven't yet processed slot 1000 (which may contain insertions) you put it on hold, first process that slot and then continue the validation of that message. with that you ensure that you have always processed the blockchain event of the rln epoch you are at.

perhaps the "onchain merkle tree" (once we remove the event sync mechanisim) this would be fixed as well?

alrevuelta commented 3 days ago

Opened this with the rootcause: https://github.com/waku-org/nwaku/issues/2928

waku-org / nwaku