@BigLep All tasks are done here, specifically:
The next steps would be:
Thanks @petar. Let's track the storetheindex production deployment in https://github.com/filecoin-project/storetheindex/issues/251. The Hydra production deployment will be tracked here.
To be clear, has https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&var-flight=hydra-test- been updated to use the go-delegated-routing metrics rather than the existing sti-provider metrics?
Good stuff - almost there!
@petar : following up, has https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&var-flight=hydra-test- been updated to use the go-delegated-routing metrics rather than the existing sti-provider metrics?
Yes. The delegated routing code replaces the STI code and uses the same metric names. So the dashboard should work unchanged.
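For reference, what makes the dashboard work unchanged is measure-name stability. A minimal sketch, assuming OpenCensus-style metrics (which hydra-booster uses); the measure name below is hypothetical, not the real one:

```go
package main

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// Hypothetical measure; the real hydra-booster metric names differ. The
// point is that the delegated-routing client records under the same name
// the STI client used, so existing Grafana queries keep matching.
var providerLookups = stats.Int64(
	"hydra/provider_lookups", // unchanged name => unchanged dashboard
	"Provider lookups sent to the indexer",
	stats.UnitDimensionless,
)

func main() {
	// With an empty Name, the view defaults to the measure's name.
	if err := view.Register(&view.View{
		Measure:     providerLookups,
		Aggregation: view.Count(),
	}); err != nil {
		log.Fatal(err)
	}
	// Old and new clients both record against the same measure:
	stats.Record(context.Background(), providerLookups.M(1))
}
```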
@petar : last thing for closing this out. Has the custom storetheindex code in libp2p/hydra-booster been removed?
Yes.
@thattommyhall I believe you pinged that we're still using the older protocol for all but the test instance, and that's what I see on the dashboard as well.
I didn't see the removal of the HTTP indexer code go by on GitHub, but note that this means we should probably coordinate a broader deployment of Reframe before we end up with a deployment that doesn't support the current setup.
Discussion about this effort is currently happening in the #reframe channel: https://filecoinproject.slack.com/archives/C03RQFURZLM
Per 2022-08-12 verbal conversations, @guseggert is going to drive this effort to close and will consult with @petar as needed.
So the logs originally showed that the timeouts occurred while reading the response body:
2022-08-14T09:37:46.803Z ERROR service/client/delegatedrouting proto/proto_edelweiss.go:1234 client received error response (context deadline exceeded (Client.Timeout or context cancellation while reading body))
This morning I deployed https://github.com/libp2p/hydra-booster/commit/37dda2209d91ad0dd213534248382e0f144d1991 to the test flight, which publishes more detailed error metrics for Reframe and also upgrades to go-libp2p@v0.21 and Go 1.18. After deployment the timeouts basically disappeared; it's been baking for a few hours and traffic is back at similar levels without timeouts. This leads me to believe the issue is probably that the Hydra node was taking too long to read the response body due to some environmental issue (overload from other work it was doing). Something in the go-libp2p or Go upgrades might also have alleviated the bottleneck, e.g. the libp2p resource manager. We see similar timeouts in prod with the non-Reframe StoreTheIndex client, although not nearly at the same rate, but the test flight could have gotten unlucky and been placed in a hot partition, so it could still be the same issue.
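For context on the error text: in Go's net/http, Client.Timeout covers the whole exchange, including reading the response body, so a node too overloaded to drain the body in time surfaces exactly this error. A minimal stdlib sketch (the URL is a placeholder):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Client.Timeout bounds the entire request: dialing, writing, waiting
	// for headers, AND reading the response body.
	client := &http.Client{Timeout: 1 * time.Second}

	resp, err := client.Get("https://example.com/") // placeholder URL
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()

	// The deadline keeps ticking here. If the process is too overloaded
	// (or the server too slow) to drain the body before the deadline, this
	// read fails with "... (Client.Timeout or context cancellation while
	// reading body)" -- the error seen in the Hydra logs above.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("body read error:", err)
		return
	}
	fmt.Println("read", len(body), "bytes")
}
```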
My next step is to get some libp2p Resource Manager metrics into the dashboard to see if it's throttling anything, understand the impact of that on the network, and see if we need to tweak limits, etc. I'm guessing that RM is throttling because the AddProvider rate is much lower, while the STI rate is the same.
Enumerating the options to mitigate overloading:
I've integrated Resource Manager, added RM metrics, added them to the dashboard, and tweaked the RM limits to be low enough to throttle spikes but to generally allow most traffic. The test node is now operating at the same capacity as before, but with minimal timeouts. There are still occasional timeouts (about 0.3% of reqs). These are timeouts reading response headers, so this may be a server-side thing, although I will increase the client-side timeout to 3 seconds to allow for e.g. GC to run w/o causing timeouts.
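A rough sketch of the Resource Manager wiring described above; this is not the exact hydra-booster change, and import paths and type names vary across go-libp2p versions (this follows the newer in-repo layout, whereas the v0.21-era code used the separate go-libp2p-resource-manager module):

```go
package main

import (
	"log"

	"github.com/libp2p/go-libp2p"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

func main() {
	// Start from the library's default scaling limits and auto-scale them
	// to the machine's resources; individual scopes could then be tightened
	// so spikes get throttled while steady-state traffic passes.
	limits := rcmgr.DefaultLimits.AutoScale()

	rm, err := rcmgr.NewResourceManager(rcmgr.NewFixedLimiter(limits))
	if err != nil {
		log.Fatal(err)
	}

	// Attach the resource manager to the libp2p host.
	h, err := libp2p.New(libp2p.ResourceManager(rm))
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()
}
```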
I'm working on this branch: https://github.com/libp2p/hydra-booster/tree/feat/reframe-metrics
I'll get a PR worked up, and continue to let this bake today. If it looks okay tomorrow morning, I'll roll it out to the rest of the fleet.
2022-08-19 conversation:
@guseggert : other thoughts from looking at this afterwards:
1) is done.
I did 2) a couple weeks ago, see https://github.com/ipfs/specs/commit/0b123fa5c3ddf126f55bcec2aa67b2b91247d959
3) done in https://github.com/libp2p/hydra-booster/commit/6967a65c3f6945c40a6cc1ac8d378bbf173e539d
4) done in https://github.com/libp2p/hydra-booster/commit/d48d8984c19c125c55c199a5e790bed93d9ed2b7
Update: last week I deployed Reframe to the full Hydra fleet, but almost all reqs started timing out so I rolled it back. Have been debugging w/ @masih in between traveling.
Yesterday there was an STI event that caused the HTTP endpoint to behave like the Reframe timeouts, so I'm working with @masih to understand the root cause. If that doesn't rule out Reframe, then I'll wait for the root-cause fix and redeploy to see if it also works for Reframe; if it doesn't, we'll need to do some request tracing through the infrastructure to see where exactly the timeouts are occurring. This might require adding request IDs to the requests, passing those through LBs, proxies, etc., and adding them to log messages on the STI side.
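A sketch of what that request-ID tracing could look like on the client side; the X-Request-ID header name is a conventional choice, not something the infrastructure is known to use:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
)

// requestIDTransport stamps every outgoing request with an ID that LBs,
// proxies, and the STI server could echo into their logs, letting us see
// where along the path a given request stalled.
type requestIDTransport struct {
	next http.RoundTripper
}

func (t requestIDTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		return nil, err
	}
	// Clone before mutating: a RoundTripper must not modify the caller's request.
	req = req.Clone(req.Context())
	req.Header.Set("X-Request-ID", hex.EncodeToString(buf)) // hypothetical header name
	return t.next.RoundTrip(req)
}

func main() {
	client := &http.Client{Transport: requestIDTransport{next: http.DefaultTransport}}
	resp, err := client.Get("https://example.com/") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}
```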
All fixes for the storetheindex outage yesterday are now deployed. At this time it is unclear whether those fixes would also resolve the timeouts observed when Reframe was deployed. We can try a deployment and find out, if that's not too disruptive to users in case it doesn't.
Thanks for the updates, guys. I'll keep following - let me know if anything is needed.
This is now deployed and operational, so closing.
Done Criteria
Updated 2022-08-11 to capture the latest state:
Why Important
Provides first production validation of delegated routing, giving us the confidence to add it to Kubo as part of https://github.com/ipfs/go-ipfs/issues/8775
Notes
Estimation Notes
2022-08-19 estimates of work remaining: