libp2p / hydra-booster

A DHT Indexer node & Peer Router

Update Hydras to new HTTP Delegated Routing #180

Closed · BigLep closed 1 year ago

BigLep commented 1 year ago

Done Criteria

Hydras are running an HTTP Delegated Routing implementation compatible with https://github.com/ipfs/specs/pull/337 in production.
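
For reference, the spec change above defines provider lookup as a plain HTTP GET against `/routing/v1/providers/{cid}`. Below is a minimal sketch of querying cid.contact's deployment of that endpoint in Go; the CID is just an example value, and the response is decoded loosely since the schema was still under review at the time.

```go
// Sketch only: query the HTTP delegated routing endpoint proposed in
// ipfs/specs#337 for the providers of a CID.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Example CID; any valid CID works here.
	cid := "bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi"

	resp, err := http.Get("https://cid.contact/routing/v1/providers/" + cid)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Decode loosely; the exact response schema is defined by the spec PR.
	var out map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", out)
}
```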

Why Important

See motivation in https://github.com/ipfs/specs/pull/337

Notes

guseggert commented 1 year ago

Main code change in https://github.com/libp2p/hydra-booster/pull/185

I have also turned off OpenSSL in the Docker build since it keeps causing problems; the build now uses Go's standard crypto. I'll monitor performance around that change.

I've deployed this to the test instance; see https://github.com/libp2p/hydra-booster-infra/pull/14. I've also updated the dashboards with the new metrics.

I'll let it bake overnight; if everything looks good tomorrow, I'll deploy to the whole fleet.

BigLep commented 1 year ago

Hi @guseggert. Did the prod deployment happen? Are there client-side (Hydra) and server-side (cid.contact) graphs you're monitoring?

guseggert commented 1 year ago

No, not yet. It was getting late on Friday and I didn't want to deploy late on a Friday. Today I looked into why CPU usage was much higher than expected (almost 2x). I suspected something related to disabling OpenSSL, but CPU profiles showed most of the time spent in GC, and allocation profiles showed that the top allocations were in the libp2p resource manager's metric publishing, which generates a lot of garbage in the tags it adds to metrics. So I disabled that; we don't use it anyway, since Hydra calculates its own resource manager metrics. That's now deployed to test, and CPU usage looks much better, as does long-tail latency on cid.contact requests.
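
To make "disabled that" concrete, here is a hedged sketch of constructing the resource manager with limits but without any metrics/trace reporter attached, so the hot path allocates no metric tags. The rcmgr names (NewResourceManager, NewFixedLimiter, DefaultLimits) are go-libp2p APIs of that era; whether hydra-booster wires it exactly this way is an assumption.

```go
// Sketch only: build the go-libp2p resource manager without attaching its
// metric publishing; limits are still enforced, but no per-event stats tags
// are allocated. Hydra computes its own resource-manager metrics instead.
package main

import (
	"fmt"

	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

func main() {
	// Scale the library's default limits to this machine.
	limiter := rcmgr.NewFixedLimiter(rcmgr.DefaultLimits.AutoScale())

	// Passing no reporter options here is the whole point: nothing publishes
	// per-event stats, so nothing allocates metric tags on the hot path.
	rm, err := rcmgr.NewResourceManager(limiter)
	if err != nil {
		panic(err)
	}
	defer rm.Close()

	fmt.Println("resource manager created without metric publishing")
}
```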

This became an issue now because I also upgraded libp2p to the latest version to pick up all the security updates.

Letting this bake again tonight and will take a look in the AM. I'll also open an issue with go-libp2p to reduce the garbage generated by the resource manager metrics.

BigLep commented 1 year ago

@guseggert : how is this looking?

Also, please share the issue with go-libp2p when you have it.

guseggert commented 1 year ago

I was able to grab another profile showing the OpenCensus tag allocations, and opened an issue with go-libp2p here: https://github.com/libp2p/go-libp2p/issues/1955
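
As an aside for readers following along: profiles like the ones referenced in this thread are typically pulled from Go's built-in pprof endpoints. A minimal sketch, assuming the process exposes net/http/pprof; the listen address below is made up.

```go
// Sketch only: expose Go's pprof endpoints so CPU and allocation profiles can
// be pulled from a running node, e.g.:
//   go tool pprof http://localhost:6060/debug/pprof/allocs
//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	// The listen address is illustrative; use whatever the deployment exposes.
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```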

I've been fighting with the resource manager, and I have given up on it and turned it off; things are looking better now. Every time I fixed one limit, another would pop up and cause degenerate behavior somewhere else, and chasing down the root cause of the throttling is non-trivial. We need to move forward here, so I am just disabling the resource manager for now.
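
For illustration, one way to "turn it off" on a go-libp2p host of this era is to build the resource manager from rcmgr.InfiniteLimits, so nothing is ever throttled. This is a hedged sketch, not necessarily the exact mechanism used in hydra-booster.

```go
// Sketch only: effectively disable resource manager throttling by building the
// manager from infinite limits, so no connection/stream/memory cap is enforced.
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

func main() {
	// InfiniteLimits places no caps on connections, streams, memory, or FDs.
	limiter := rcmgr.NewFixedLimiter(rcmgr.InfiniteLimits)

	rm, err := rcmgr.NewResourceManager(limiter)
	if err != nil {
		panic(err)
	}

	h, err := libp2p.New(libp2p.ResourceManager(rm))
	if err != nil {
		panic(err)
	}
	defer h.Close()

	fmt.Println("host with unlimited resource manager:", h.ID())
}
```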

BigLep commented 1 year ago

@guseggert : can you also point to how you were configuring the resource manager? (I'm asking so I can learn what pain another integrator experienced.) I would have expected us to only have limits, like Kubo's strategy.

guseggert commented 1 year ago

Each Hydra host is effectively running many Kubo nodes at the same time, and they also don't handle Bitswap traffic, so the traffic pattern is pretty different from a single Kubo node. We have high-traffic gateway hosts to compare with, but they are even more different (e.g., they use the accelerated DHT client).

The RM config currently deployed to prod Hydras is here: https://github.com/libp2p/hydra-booster/blob/master/head/head.go#L82 (note that those are per-head limits). After upgrading from go-libp2p v0.21 to v0.24 there was significantly more throttling, so I've been tweaking the limits locally and in a branch. As part of that, I pulled the resource manager and connection manager out to be shared across heads instead, which makes reasoning about limits easier. When RM throttling was interfering, the DHT processed far fewer requests while memory usage and goroutine counts were much higher, with most goroutines stuck on the identify handshake. I didn't trace through the code, but I suspect they were stuck due to RM throttling, since everything runs fine now with RM off.
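
A rough sketch of the "shared across heads" arrangement described above: construct one connection manager and one resource manager, and hand the same instances to every head's libp2p host. The head count and connection-manager watermarks below are made-up illustrations, and hydra-booster's actual head constructor differs; libp2p.ConnectionManager, libp2p.ResourceManager, and connmgr.NewConnManager are standard go-libp2p APIs.

```go
// Sketch only: one resource manager and one connection manager shared by all
// heads, so limits apply to the whole process rather than per head.
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
	"github.com/libp2p/go-libp2p/p2p/net/connmgr"
)

const nHeads = 4 // illustrative; real deployments run many more heads per host

func main() {
	// One connection manager for the whole process (watermarks are examples).
	cm, err := connmgr.NewConnManager(2000, 3000)
	if err != nil {
		panic(err)
	}

	// One resource manager for the whole process, scaled from the defaults.
	rm, err := rcmgr.NewResourceManager(rcmgr.NewFixedLimiter(rcmgr.DefaultLimits.AutoScale()))
	if err != nil {
		panic(err)
	}

	heads := make([]host.Host, 0, nHeads)
	for i := 0; i < nHeads; i++ {
		// Every head (libp2p host) shares the same managers, which makes the
		// process-wide limits easier to reason about than per-head ones.
		h, err := libp2p.New(
			libp2p.ConnectionManager(cm),
			libp2p.ResourceManager(rm),
		)
		if err != nil {
			panic(err)
		}
		heads = append(heads, h)
	}
	fmt.Println("started", len(heads), "heads with shared managers")
}
```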

guseggert commented 1 year ago

Coordinated with @masih this morning to flip the full Hydra fleet over to the HTTP API. Things are looking fine. The p50 cid.contact latency has dropped from ~36 ms (via Reframe) to ~18 ms (via the HTTP API).

BigLep commented 1 year ago

Resolving since the done criteria are satisfied.