[Milestone] Operational resilience from recent events

eta: 2023-06-30

description: We have graduated from the "school of hard knocks" of the last 6 months. Production operators avoid known failure modes and are more self-service in identifying issues that affect their users and potentially the whole network. This includes keeping routing tables resilient and free of non-responsive nodes, guardrails that keep providers from unknowingly falling behind in providing, and improved bitswap with backpressure, timeouts, and metrics.

Notes: This is a Starmap "child" issue.

These all relate to operational pain from the last 6 months. They can be delivered independently.

Resilient routing tables: Related event: https://github.com/protocol/ipfs-vulnerabilities/issues/25 https://github.com/libp2p/go-libp2p-kad-dht/issues/811

Providing guardrails: https://github.com/ipfs/kubo/issues/9703 https://github.com/ipfs/kubo/issues/9702 https://github.com/ipfs/kubo/issues/9704

Improved bitswap: TODO: create a better issue or repurpose https://github.com/ipfs/go-bitswap/issues/560

ipfs / kubo

[Milestone] Operational resilience from recent events #9820