Describe the bug
The missing messages fetch from store node and send message confirmations using store node logic is in go-waku API layer.
As per @NagyZoltanPeter , when store-node rate-limits were enabled in status.staging fleet node, the rejected traffic was not migrated to a different storenode (neither in staging nor prod fleets).
This will cause a bad user-experience to status app users as send message confirmations will fail and missing message fetch would also keep failing.
Root-cause seems to be the fact that mailserver/storeNode selection is triggered from status-go mailserver cycle which uses RTT and failedRequests to switch storeNodes. But since failed periodic store queries such as missing messages fetch and send confirmations data is not passed back to mailserver selection logic, if a storeNode is configured with rate-limits and rejects requests failover may not happen to different storenode causing reliability issues.
Note that Ping might still be successful to the storeNode as that doesn't fall under rate-limit of store requests which might make mailserver cycle logic think that active storenode can still server requests.
Solution:
Best approach/fix is to use the peer-selection logic within go-waku for store node selection and not sure storenode selected by mailserver cycle logic.
Expected behavior
In case requests fail to a storenode (either due to rate-limit or some other application error such as timeout), the periodic requests should switch to a different store node.
@richard-ramos i remember you were working on mailserver refactor, looks like we need the logic to be moved to peer-selection sooner now in order for this to be addressed.
Describe the bug The missing messages fetch from store node and send message confirmations using store node logic is in go-waku API layer. As per @NagyZoltanPeter , when store-node rate-limits were enabled in
status.staging
fleet node, the rejected traffic was not migrated to a different storenode (neither instaging
norprod
fleets).This will cause a bad user-experience to status app users as send message confirmations will fail and missing message fetch would also keep failing.
Root-cause seems to be the fact that mailserver/storeNode selection is triggered from status-go mailserver cycle which uses RTT and failedRequests to switch storeNodes. But since failed periodic store queries such as missing messages fetch and send confirmations data is not passed back to mailserver selection logic, if a storeNode is configured with rate-limits and rejects requests failover may not happen to different storenode causing reliability issues.
Note that Ping might still be successful to the storeNode as that doesn't fall under rate-limit of store requests which might make mailserver cycle logic think that active storenode can still server requests.
Solution: Best approach/fix is to use the peer-selection logic within go-waku for store node selection and not sure storenode selected by mailserver cycle logic.
Expected behavior In case requests fail to a storenode (either due to rate-limit or some other application error such as timeout), the periodic requests should switch to a different store node.
Screenshots
Ref convo on discord https://discord.com/channels/1110799176264056863/1267617790832021545/1283126578699567196
@richard-ramos i remember you were working on mailserver refactor, looks like we need the logic to be moved to peer-selection sooner now in order for this to be addressed.