@BigLep All tasks are done here, specifically:
The next steps would be:
Thanks @petar. Let's track the storetheindex production deployment in https://github.com/filecoin-project/storetheindex/issues/251. The Hydra production deployment will be tracked here.
To be clear, has https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&var-flight=hydra-test- been updated to use the go-delegated-routing metrics rather than the existing sti-provider metrics?
Good stuff - almost there!
@petar : following up, has https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&var-flight=hydra-test- been updated to use the go-delegated-routing metrics rather than the existing sti-provider metrics?
Yes. The delegated routing code replaces the STI code and uses the same metric names. So the dashboard should work unchanged.
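For reference, what makes the dashboard work unchanged is measure-name stability. A minimal sketch, assuming OpenCensus-style metrics (which hydra-booster uses); the measure name below is hypothetical, not the real one:

```go
package main

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// Hypothetical measure; the real hydra-booster metric names differ. The
// point is that the delegated-routing client records under the same name
// the STI client used, so existing Grafana queries keep matching.
var providerLookups = stats.Int64(
	"hydra/provider_lookups", // unchanged name => unchanged dashboard
	"Provider lookups sent to the indexer",
	stats.UnitDimensionless,
)

func main() {
	// With an empty Name, the view defaults to the measure's name.
	if err := view.Register(&view.View{
		Measure:     providerLookups,
		Aggregation: view.Count(),
	}); err != nil {
		log.Fatal(err)
	}
	// Old and new clients both record against the same measure:
	stats.Record(context.Background(), providerLookups.M(1))
}
```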
@petar : last thing for closing this out. Has the custom storetheindex code in libp2p/hydra-booster been removed?
Yes.
@thattommyhall I believe you pinged that we're still using the older protocol for all but the test instance, and that's what I see on the dashboard as well.
I didn't see the removal of the HTTP indexer code go by on GitHub, but note that this means we should probably coordinate a broader deployment of Reframe before we end up with a deployment that doesn't support the current setup.
Discussion about this effort is currently happening in the #reframe channel: https://filecoinproject.slack.com/archives/C03RQFURZLM
Per 2022-08-12 verbal conversations, @guseggert is going to drive this effort to close and will consult with @petar as needed.
So the logs originally showed that the timeouts occurred while reading the response body:
2022-08-14T09:37:46.803Z ERROR service/client/delegatedrouting proto/proto_edelweiss.go:1234 client received error response (context deadline exceeded (Client.Timeout or context cancellation while reading body))
This morning I deployed https://github.com/libp2p/hydra-booster/commit/37dda2209d91ad0dd213534248382e0f144d1991 to the test flight, which publishes more detailed error metrics for Reframe and also upgrades to go-libp2p@v0.21 and Go 1.18. After deployment the timeouts basically disappeared; it's been baking for a few hours and traffic is back at similar levels without timeouts. This leads me to believe the issue is probably that the Hydra node was taking too long to read the response body due to some environmental issue (overload from other work it was doing). Something in the go-libp2p or Go upgrades might also have alleviated the bottleneck, e.g. the libp2p resource manager. We see similar timeouts in prod with the non-Reframe StoreTheIndex client, although not nearly at the same rate, but the test flight could have gotten unlucky and been placed in a hot partition, so it could still be the same issue.
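For context on the error text: in Go's net/http, Client.Timeout covers the whole exchange, including reading the response body, so a node too overloaded to drain the body in time surfaces exactly this error. A minimal stdlib sketch (the URL is a placeholder):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Client.Timeout bounds the entire request: dialing, writing, waiting
	// for headers, AND reading the response body.
	client := &http.Client{Timeout: 1 * time.Second}

	resp, err := client.Get("https://example.com/") // placeholder URL
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()

	// The deadline keeps ticking here. If the process is too overloaded
	// (or the server too slow) to drain the body before the deadline, this
	// read fails with "... (Client.Timeout or context cancellation while
	// reading body)" -- the error seen in the Hydra logs above.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("body read error:", err)
		return
	}
	fmt.Println("read", len(body), "bytes")
}
```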
My next step is to get some libp2p Resource Manager metrics into the dashboard to see if it's throttling anything, understand the impact of that on the network, and see if we need to tweak limits, etc. I'm guessing that RM is throttling because the AddProvider rate is much lower, while the STI rate is the same.
Enumerating the options to mitigate overloading:
I've integrated Resource Manager, added RM metrics, added them to the dashboard, and tweaked the RM limits to be low enough to throttle spikes but to generally allow most traffic. The test node is now operating at the same capacity as before, but with minimal timeouts. There are still occasional timeouts (about 0.3% of reqs). These are timeouts reading response headers, so this may be a server-side thing, although I will increase the client-side timeout to 3 seconds to allow for e.g. GC to run w/o causing timeouts.
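A rough sketch of the Resource Manager wiring described above; this is not the exact hydra-booster change, and import paths and type names vary across go-libp2p versions (this follows the newer in-repo layout, whereas the v0.21-era code used the separate go-libp2p-resource-manager module):

```go
package main

import (
	"log"

	"github.com/libp2p/go-libp2p"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

func main() {
	// Start from the library's default scaling limits and auto-scale them
	// to the machine's resources; individual scopes could then be tightened
	// so spikes get throttled while steady-state traffic passes.
	limits := rcmgr.DefaultLimits.AutoScale()

	rm, err := rcmgr.NewResourceManager(rcmgr.NewFixedLimiter(limits))
	if err != nil {
		log.Fatal(err)
	}

	// Attach the resource manager to the libp2p host.
	h, err := libp2p.New(libp2p.ResourceManager(rm))
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()
}
```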
I'm working on this branch: https://github.com/libp2p/hydra-booster/tree/feat/reframe-metrics
I'll get a PR worked up, and continue to let this bake today. If it looks okay tomorrow morning, I'll roll it out to the rest of the fleet.
2022-08-19 conversation:
@guseggert : other thoughts from looking at this afterwards:
1) is done.
I did 2) a couple weeks ago, see https://github.com/ipfs/specs/commit/0b123fa5c3ddf126f55bcec2aa67b2b91247d959
3) done in https://github.com/libp2p/hydra-booster/commit/6967a65c3f6945c40a6cc1ac8d378bbf173e539d
4) done in https://github.com/libp2p/hydra-booster/commit/d48d8984c19c125c55c199a5e790bed93d9ed2b7
Update: last week I deployed Reframe to the full Hydra fleet, but almost all reqs started timing out so I rolled it back. Have been debugging w/ @masih in between traveling.
Yesterday there was an STI event that caused the HTTP endpoint to behave like the Reframe timeouts, so I'm working with @masih to understand the root cause. If that doesn't rule out Reframe, then I'll wait for the root-cause fix and redeploy to see if it also works for Reframe; if it doesn't, we'll need to do some request tracing through the infrastructure to see where exactly the timeouts are occurring. This might require adding request IDs to the requests, passing those through LBs, proxies, etc., and adding them to log messages on the STI side.
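A sketch of what that request-ID tracing could look like on the client side; the X-Request-ID header name is a conventional choice, not something the infrastructure is known to use:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
)

// requestIDTransport stamps every outgoing request with an ID that LBs,
// proxies, and the STI server could echo into their logs, letting us see
// where along the path a given request stalled.
type requestIDTransport struct {
	next http.RoundTripper
}

func (t requestIDTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		return nil, err
	}
	// Clone before mutating: a RoundTripper must not modify the caller's request.
	req = req.Clone(req.Context())
	req.Header.Set("X-Request-ID", hex.EncodeToString(buf)) // hypothetical header name
	return t.next.RoundTrip(req)
}

func main() {
	client := &http.Client{Transport: requestIDTransport{next: http.DefaultTransport}}
	resp, err := client.Get("https://example.com/") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}
```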
All fixes for the storetheindex outage yesterday are now deployed. At this time it is unclear whether those fixes would also resolve the timeouts observed when Reframe was deployed. We can try a deployment and find out, if that's not too disruptive to users in case it doesn't.
Thanks for the updates, guys. I'll keep following - let me know if anything is needed.
This is now deployed and operational, so closing.
Done Criteria
Updated 2022-08-11 to capture the latest state:
Why Important
Provides first production validation of delegated routing, giving us the confidence to add it to Kubo as part of https://github.com/ipfs/go-ipfs/issues/8775
Notes
Estimation Notes
2022-08-19 estimates of work remaining: