Omarabdul3ziz opened 1 month ago
after debugging with @AhmedHanafy725, this is what we found:
the relay cache was not updating properly because the chain event listener had failed, leaving outdated relay info in the cache.
what should normally happen is that the relay updates its cache in two cases.
we noticed that if a twin on relay 1 sends to a twin on relay 1/2, the response is delivered successfully when it comes back through relay 1. but if it goes through relay 2, and relay 2 has an outdated cache entry for the destination twin, it will neither federate the message nor succeed in sending the response.
the relay's chain listener stopped without any notification, so the cached twin relays were never invalidated, and some relays got stuck sending to twins that are no longer connected to them. this likely happened during the recent chain node update: the downtime broke the connection and the listener couldn't reconnect.
restarting the relays makes the cache mechanism work fine again.
we should monitor the chain listener health in the relay, or create a separate service on the stack that gets restarted with any chain update.
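for the health-check part, something like this could work: keep a timestamp of the last chain event the listener processed and have a background task warn (or flip a health endpoint) when it goes stale. this is only a sketch; the names (`mark_event_seen`, `watch_listener_health`), the 60s check interval, and the threshold are assumptions, not taken from the rmb code.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// unix timestamp of the last chain event the listener handled
static LAST_EVENT: AtomicU64 = AtomicU64::new(0);

// call this from the event listener whenever a chain event is processed
fn mark_event_seen() {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_secs();
    LAST_EVENT.store(now, Ordering::Relaxed);
}

// background task: warn if no event has been seen for `max_silence`
async fn watch_listener_health(max_silence: Duration) {
    loop {
        tokio::time::sleep(Duration::from_secs(60)).await;
        let last = LAST_EVENT.load(Ordering::Relaxed);
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_secs();
        if now.saturating_sub(last) > max_silence.as_secs() {
            log::warn!("chain listener looks dead: no events for {}s", now - last);
            // here the relay could trigger a reconnect or report unhealthy
        }
    }
}
```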
As a suggestion, we can use the graphql processor to update the relay's redis cache.
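as a rough sketch of that idea, whatever service watches twin updates (the graphql processor or anything else on the stack) only needs to drop the cached key in the relay's redis so the relay refetches it from the chain on the next message. the key format `twin.<id>` below is an assumption, not necessarily what the relay actually stores:

```rust
use redis::Commands;

// drop the cached relay info for a twin so the relay has to refetch it
fn invalidate_twin(redis_url: &str, twin_id: u32) -> redis::RedisResult<()> {
    let client = redis::Client::open(redis_url)?;
    let mut con = client.get_connection()?;
    // key format is an assumption; adjust to whatever the relay uses
    let _: () = con.del(format!("twin.{}", twin_id))?;
    Ok(())
}
```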
we investigated the rmb code and ways of solving this issue:
worked on fixing the event listener and adding a backoff strategy
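the backoff part looks roughly like this: wrap the listener in a loop and retry with an exponential, capped delay instead of dying silently. `subscribe_and_listen` here is a hypothetical stand-in for the actual substrate events subscription in the relay, not the real function name:

```rust
use std::time::Duration;

// keep the chain listener alive: reconnect with exponential backoff on failure
async fn run_listener_with_backoff() {
    let mut delay = Duration::from_secs(1);
    let max_delay = Duration::from_secs(60);
    loop {
        match subscribe_and_listen().await {
            // listener exited cleanly (e.g. shutdown requested)
            Ok(()) => break,
            Err(err) => {
                log::error!("chain listener failed: {err}, retrying in {delay:?}");
                tokio::time::sleep(delay).await;
                // double the delay, capped, so we keep retrying forever
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}

// hypothetical listener body; in the relay this would be the chain events
// subscription that invalidates cached twins on twin-update events
async fn subscribe_and_listen() -> Result<(), Box<dyn std::error::Error>> {
    // ... connect to the chain and process events until the connection drops ...
    Ok(())
}
```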
some twins on devnet (e.g., twin 29) are facing random timeouts when making calls to devnet nodes. the issue occurs inconsistently, with some calls succeeding while others fail.
here are some findings: