Omarabdul3ziz opened 1 month ago
after debugging with @AhmedHanafy725, this is what we found:
the relay cache was not updating properly because the chain event listener had failed, leaving outdated relay info in the cache.
what should normally happen is that the relay updates its cache in two cases.
we noticed that if a twin on relay 1 sends to a twin on relay 1/2, the response is delivered successfully when it comes back through relay 1. but if it goes through relay 2, and relay 2 has an outdated cache entry for the destination twin, it will neither federate the message nor succeed in sending the response.
the relay's chain listener stopped without any notification, so the cached twin relays were never invalidated, and some relays got stuck sending to twins that are no longer connected to them. this likely happened during the recent chain node update: the downtime broke the connection and the listener couldn't reconnect.
restarting the relays makes the cache mechanism work fine again.
we should monitor the chain listener health in the relay, or create a separate service on the stack that gets restarted with any chain update.
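for the health-check part, something like this could work: keep a timestamp of the last chain event the listener processed and have a background task warn (or flip a health endpoint) when it goes stale. this is only a sketch; the names (`mark_event_seen`, `watch_listener_health`), the 60s check interval, and the threshold are assumptions, not taken from the rmb code.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// unix timestamp of the last chain event the listener handled
static LAST_EVENT: AtomicU64 = AtomicU64::new(0);

// call this from the event listener whenever a chain event is processed
fn mark_event_seen() {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_secs();
    LAST_EVENT.store(now, Ordering::Relaxed);
}

// background task: warn if no event has been seen for `max_silence`
async fn watch_listener_health(max_silence: Duration) {
    loop {
        tokio::time::sleep(Duration::from_secs(60)).await;
        let last = LAST_EVENT.load(Ordering::Relaxed);
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_secs();
        if now.saturating_sub(last) > max_silence.as_secs() {
            log::warn!("chain listener looks dead: no events for {}s", now - last);
            // here the relay could trigger a reconnect or report unhealthy
        }
    }
}
```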
As a suggestion, we can use the graphql processor to update the relay's redis cache.
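as a rough sketch of that idea, whatever service watches twin updates (the graphql processor or anything else on the stack) only needs to drop the cached key in the relay's redis so the relay refetches it from the chain on the next message. the key format `twin.<id>` below is an assumption, not necessarily what the relay actually stores:

```rust
use redis::Commands;

// drop the cached relay info for a twin so the relay has to refetch it
fn invalidate_twin(redis_url: &str, twin_id: u32) -> redis::RedisResult<()> {
    let client = redis::Client::open(redis_url)?;
    let mut con = client.get_connection()?;
    // key format is an assumption; adjust to whatever the relay uses
    let _: () = con.del(format!("twin.{}", twin_id))?;
    Ok(())
}
```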
we investigated the rmb code and ways of solving this issue:
worked on fixing the event listener and adding a backoff strategy
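the backoff part looks roughly like this: wrap the listener in a loop and retry with an exponential, capped delay instead of dying silently. `subscribe_and_listen` here is a hypothetical stand-in for the actual substrate events subscription in the relay, not the real function name:

```rust
use std::time::Duration;

// keep the chain listener alive: reconnect with exponential backoff on failure
async fn run_listener_with_backoff() {
    let mut delay = Duration::from_secs(1);
    let max_delay = Duration::from_secs(60);
    loop {
        match subscribe_and_listen().await {
            // listener exited cleanly (e.g. shutdown requested)
            Ok(()) => break,
            Err(err) => {
                log::error!("chain listener failed: {err}, retrying in {delay:?}");
                tokio::time::sleep(delay).await;
                // double the delay, capped, so we keep retrying forever
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}

// hypothetical listener body; in the relay this would be the chain events
// subscription that invalidates cached twins on twin-update events
async fn subscribe_and_listen() -> Result<(), Box<dyn std::error::Error>> {
    // ... connect to the chain and process events until the connection drops ...
    Ok(())
}
```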
some twins on devnet (e.g., twin 29) are facing random timeouts when making calls to devnet nodes. the issue occurs inconsistently, with some calls succeeding while others fail.
here are some findings: