hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

RFC: Use QUIC for RPC transport #23848

Open schmichael opened 3 weeks ago

schmichael commented 3 weeks ago

Background

Nomad, like Consul, uses yamux for its RPC layer's underlying network transport. Yamux is based on SPDY, which has been obsolete since 2015, although its ideas form the basis of HTTP/2's transport layer.

Yamux has proven powerful and reliable, needing and receiving very little maintenance over its 10-year lifespan. However, this means that few people have deep expertise in the codebase when issues do arise, and the code does not adhere to modern Go idioms.

Proposal

Replace Nomad's use of Yamux with QUIC. QUIC is the basis for HTTP/3, but unlike SPDY+HTTP/2, QUIC has been intentionally standardized independently (RFC 9000), and is being adopted for more widespread use such as DNS-over-QUIC (RFC 9250).

UDP

QUIC runs over UDP instead of TCP, which poses both an opportunity and a risk for Nomad:

  1. Opportunity: Since Nomad's existing Yamux-based RPC mechanism uses TCP, QUIC support could be added on the same port but listening for UDP packets. This provides a way to transition protocols that maintains backward compatibility and requires no user intervention (potentially, see below).
  2. Risk: UDP is notoriously mangled by middleboxes the world over. Until QUIC gains widespread adoption, Nomad might be a nonstarter in many environments that haven't validated QUIC yet.

Sharing a port does allow Nomad to add QUIC support at any time and to implement a Happy Eyeballs-style algorithm (as used for choosing between IPv6 and IPv4) for determining whether to use the TCP/Yamux or UDP/QUIC transport.
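A minimal sketch of what such a Happy Eyeballs-style selection could look like in Go. Everything named here (`dialFunc`, `pickTransport`, `simulate`, the 50ms head start) is hypothetical and chosen for illustration; the dialers are fakes, not real QUIC or Yamux code. The idea is to give QUIC a short head start and fall back to Yamux if QUIC fails or does not connect in time.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// dialFunc is a hypothetical stand-in for a real transport dialer. It returns
// nil once a connection is established and must honor ctx cancellation.
type dialFunc func(ctx context.Context) error

// pickTransport tries the preferred transport (UDP/QUIC) first and starts the
// fallback (TCP/Yamux) once headStart elapses or the QUIC attempt fails,
// whichever happens first. It returns the name of the first transport to
// connect.
func pickTransport(ctx context.Context, quic, yamux dialFunc, headStart time.Duration) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stop the losing attempt once a winner is chosen

	type result struct {
		name string
		err  error
	}
	results := make(chan result, 2) // buffered so neither goroutine leaks
	quicFailed := make(chan struct{})

	go func() {
		err := quic(ctx)
		if err != nil {
			close(quicFailed) // let the fallback start without waiting
		}
		results <- result{"quic", err}
	}()

	go func() {
		timer := time.NewTimer(headStart)
		defer timer.Stop()
		select {
		case <-timer.C:
		case <-quicFailed:
		case <-ctx.Done():
			results <- result{"yamux", ctx.Err()}
			return
		}
		results <- result{"yamux", yamux(ctx)}
	}()

	var errs []error
	for i := 0; i < 2; i++ {
		r := <-results
		if r.err == nil {
			return r.name, nil
		}
		errs = append(errs, fmt.Errorf("%s: %w", r.name, r.err))
	}
	return "", errors.Join(errs...)
}

// simulate runs pickTransport against fake dialers. udpBlocked models a
// middlebox that silently drops UDP, so the QUIC dial hangs until canceled.
func simulate(udpBlocked bool) (string, error) {
	quic := func(ctx context.Context) error {
		if udpBlocked {
			<-ctx.Done()
			return ctx.Err()
		}
		return nil
	}
	yamux := func(ctx context.Context) error { return nil }
	return pickTransport(context.Background(), quic, yamux, 50*time.Millisecond)
}

func main() {
	for _, blocked := range []bool{false, true} {
		name, err := simulate(blocked)
		fmt.Printf("udpBlocked=%v transport=%q err=%v\n", blocked, name, err)
	}
}
```

With these fake dialers, QUIC wins when UDP gets through, and the Yamux fallback wins after the head start when UDP is silently dropped. A real implementation would probably also want to cache the outcome per peer so that every RPC does not pay the head-start delay.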

TLS

QUIC mandates TLS (specifically TLS 1.3, per RFC 9001), so adopting it would require Nomad to mandate TLS as well, which poses a significant upgrade hurdle. Implementing something like Consul's auto config would be necessary to ease the transition, although there's likely no way to upgrade to TLS without forcing some user intervention.

  1. Opportunity: Make TLS the default in Nomad and therefore more secure by default.
  2. Risk: There is not only initial operator toil in setting up TLS, but also an ongoing maintenance cost in rotating TLS certificates.
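To make the rotation cost concrete, here is a hedged sketch of one common Go pattern for swapping certificates without restarting listeners: a `tls.Config` whose `GetCertificate` callback reads from an atomically updated store. The names `certStore`, `newTLSConfig`, `selfSigned`, and `rotateDemo`, as well as the certificate subject, are hypothetical, and this is not a description of how Nomad handles certificates today.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
	"math/big"
	"sync/atomic"
	"time"
)

// certStore holds the certificate currently being served so that it can be
// swapped at runtime without restarting listeners.
type certStore struct {
	current atomic.Pointer[tls.Certificate]
}

// Set installs a new certificate; later handshakes pick it up.
func (s *certStore) Set(c *tls.Certificate) { s.current.Store(c) }

// GetCertificate has the signature expected by tls.Config.GetCertificate.
func (s *certStore) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	if c := s.current.Load(); c != nil {
		return c, nil
	}
	return nil, errors.New("no certificate loaded")
}

// newTLSConfig pins TLS 1.3, the version QUIC requires, and defers
// certificate selection to the store.
func newTLSConfig(s *certStore) *tls.Config {
	return &tls.Config{MinVersion: tls.VersionTLS13, GetCertificate: s.GetCertificate}
}

// selfSigned creates a throwaway certificate for the demo. A real cluster
// would load certificates issued by its own CA instead.
func selfSigned(serial int64) (*tls.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: "rpc.example.internal"}, // hypothetical name
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	leaf, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, err
	}
	return &tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key, Leaf: leaf}, nil
}

// rotateDemo returns the serial number served before and after a rotation.
func rotateDemo() (before, after int64, err error) {
	store := &certStore{}
	cfg := newTLSConfig(store)
	for i, out := range []*int64{&before, &after} {
		cert, err := selfSigned(int64(i + 1))
		if err != nil {
			return 0, 0, err
		}
		store.Set(cert) // the "rotation": no listener restart needed
		served, err := cfg.GetCertificate(&tls.ClientHelloInfo{})
		if err != nil {
			return 0, 0, err
		}
		*out = served.Leaf.SerialNumber.Int64()
	}
	return before, after, nil
}

func main() {
	before, after, err := rotateDemo()
	fmt.Println(before, after, err)
}
```

The same config serves serial 1 before the swap and serial 2 after it. This only covers the in-process half of rotation; issuing and distributing the new certificates is the part that remains operator toil.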

Go Implementations

QUIC is not officially supported by the Go standard library as of Go 1.23. The crypto/tls package exposes some QUIC plumbing, but it is meant for QUIC implementations rather than for direct use by applications. golang/go#58547 tracks QUIC's inclusion in Go's stdlib. golang.org/x/net contains the WIP implementation that is intended to be the basis of Go's future HTTP/3 support.

Multiple third-party QUIC implementations exist as well, although quic-go seems to be the dominant implementation:

  1. https://github.com/quic-go/quic-go
  2. https://github.com/goburrow/quic

The proposed choice for Nomad would be to use a stdlib implementation to ensure the widest compatibility and most support.

Alternative: libp2p

libp2p forked yamux and has done quite a bit more maintenance on it. Switching to or merging their fork would be a far less significant change than switching protocols.

Alternative: HTTP/3

Instead of switching yamux->quic, Nomad could switch from rpc->http/3. This could entail dropping the entire RPC subsystem (which itself is quite antiquated and lacks basic features such as context cancellation). All RPCs without a corresponding HTTP API would need to have an HTTP API implemented. Raft currently uses its own TCP connection and would need special consideration when moving to HTTP.

This would be a huge undertaking, and there's no reason to do it at the same time as moving from yamux->quic. Upgrading our RPC implementation can be done independently of choosing an underlying transport.
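To illustrate the context cancellation gap mentioned above, here is a small sketch that uses Go's standard net/rpc package as a stand-in (an assumption; the text above does not name the RPC package). The `Sleeper` service, `callWithContext`, and `demo` are hypothetical. The caller can stop waiting when its context ends, but the handler never finds out and runs to completion.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/rpc"
	"time"
)

// Sleeper is a hypothetical RPC service used only for this demonstration.
type Sleeper struct {
	finished chan struct{} // closed when the handler runs to completion
}

// Sleep blocks for ms milliseconds. net/rpc handlers receive no context, so
// the handler cannot learn that the caller has given up. It is called once
// in this demo, so closing the channel here is safe.
func (s *Sleeper) Sleep(ms int, done *bool) error {
	time.Sleep(time.Duration(ms) * time.Millisecond)
	*done = true
	close(s.finished)
	return nil
}

// callWithContext lets the caller stop waiting when ctx ends. It does not
// cancel the work on the server, and reply may still be written later.
func callWithContext(ctx context.Context, c *rpc.Client, method string, args, reply any) error {
	call := c.Go(method, args, reply, make(chan *rpc.Call, 1))
	select {
	case <-call.Done:
		return call.Error
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo reports whether the client timed out and whether the server handler
// nevertheless ran to completion.
func demo() (clientTimedOut, serverFinished bool) {
	svc := &Sleeper{finished: make(chan struct{})}
	srv := rpc.NewServer()
	if err := srv.Register(svc); err != nil {
		panic(err)
	}
	clientConn, serverConn := net.Pipe() // in-memory connection, no sockets
	go srv.ServeConn(serverConn)
	client := rpc.NewClient(clientConn)
	defer client.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	var reply bool
	err := callWithContext(ctx, client, "Sleeper.Sleep", 200, &reply)
	clientTimedOut = errors.Is(err, context.DeadlineExceeded)

	select {
	case <-svc.finished:
		serverFinished = true
	case <-time.After(2 * time.Second):
	}
	return clientTimedOut, serverFinished
}

func main() {
	timedOut, finished := demo()
	fmt.Printf("client timed out: %v, server still finished: %v\n", timedOut, finished)
}
```

Both values come back true here: the client gives up after 20ms while the handler keeps sleeping for the full 200ms. Propagating cancellation to the server needs support in the RPC protocol itself, which is independent of whether the bytes travel over Yamux or QUIC.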

Roadmap

There is no roadmap for implementing QUIC in Nomad.

Please leave feedback in the form of:

  1. A simple :+1: or :-1: reactimoji if you feel like QUIC would be beneficial/negative for your use case.
  2. A comment if you have particular ideas, questions, concerns, or proposals.

lattwood commented 3 weeks ago

A specific concern wrt UDP mangling on the internet: I'm pretty sure that fly.io isn't the only company that was/is operating a single Nomad control plane for a global cluster over the public internet. (proof: https://fly.io/blog/carving-the-scheduler-out-of-our-orchestrator/)

lattwood commented 6 days ago

https://dl.acm.org/doi/10.1145/3589334.3645323

Another nail in QUIC's coffin