Closed klizhentas closed 11 months ago
Duplicate of https://github.com/gravitational/teleport/issues/3211 and https://github.com/gravitational/teleport/issues/3972
Related to #4141
gRPC tunnels don't have to be persistent. They are necessary only when a certificate authority is updated or a connection is active. Most of the time the tunnels are idle, yet they still consume goroutines and file handles.
What if, instead, we used a UDP-based packet exchange as an "idle mode" and established TLS on demand for active operations?
We can also look into using https://noiseprotocol.org/noise.html and maybe even parts of https://github.com/WireGuard/wireguard-go.
@knisbet what do you think about this? In your IOT experience, what did you folks use for large-scale stateless connections between devices?
So if I'm reading this correctly, this tends to be a very deep question, which goes a lot deeper than just some protocol choices. And I probably don't grasp all the considerations specific to Teleport; at the IOT company we spent weeks iterating on the various options, with lots of deliberation, and evaluated several options seriously.
At the scale Teleport operates at, I don't think the choice of protocol is really the most important one. In our case, we were interested in Noise for low-power, battery-operated CPUs with a 1-year target battery lifespan and low-bandwidth 900 MHz radios. Noise had lots of properties we liked for that environment.
For fully network-connected and powered IOT (WiFi, LTE, wired Ethernet, etc.) there isn't a huge advantage in applying Noise and then having to solve key distribution problems, build the protocol on top of the encryption, etc.
For the IOT space with stable networking and the device constraints we had, I was an advocate for basing the protocol on HTTP/2 and using protocol buffers or an alternative for message encoding. So basically gRPC without the libs, since our embedded team didn't want the libs on the device. The actual implementation was a custom protocol, which in my opinion was a political choice, not a technical one.
A nice property of WireGuard/Noise, though, is that it allows a port to remain passive toward unauthenticated traffic. However, anyone able to observe traffic will know the port is in use, and defaults can often be guessed anyway if, for example, another port is serving the Teleport UI.
Within HTTP/2 or gRPC, the way to handle server push is to set up a long-polling-style operation. In HTTP/1.1 this would need a separate connection, but HTTP/2 multiplexes it into the existing socket; it's basically the old-school concept of long polling, with HTTP/2 folding it into the one socket. In gRPC this would be something like a streaming RPC that makes a request and just sits waiting for response messages.
Going with a custom protocol on Noise, for example, means the server can just send a message at any time, but it also implies that all protocol semantics need to go through a design process: how to avoid head-of-line blocking, message interleaving and prioritization, how to ack a message, backoff standards, error handling, message ordering and buffering, flow control, etc.
There was also a serious look at MQTT, but the available brokers weren't any good, and I'm not a huge fan of overlaying broker semantics on an edge protocol. Handling millions of topics and subscriptions and coordinating them through a series of brokers is something I'm sure someone has done, but when we surveyed the industry it was suspected to be a big pain point.
Specific to Teleport, this gets a bit more complicated, because I don't think the reverse tunnel carries just messages, but also connections like an SSH session. This could be mapped to messaging, or the client could probably just be told to keep a number of long-polling channels open and open more if all of them are in use. Anyway, it might be complicated to figure out how to map this onto HTTP semantics if going this route.
In my experience the architecture of the processes leads to the biggest gains in raw connection limits for lots of mostly idle devices. The route I was promoting was multiple instances of smaller processes. In Rust/C this would be a single-threaded event loop, with an instance launched per CPU core. Even in Go/Java I would try this route, though maybe with pairs of cores or something; also make sure the deployment is NUMA-aware when running on metal, etc.
For GCed languages, this gave the advantage of many instances with smaller memory spaces, shorter GC pauses, etc. It also means a smaller impact if a process crashes or needs to be rotated for an upgrade. Basically the strategy was to minimize shared state, whether that was GC workload, the number of connections to deal with, etc.
I have no idea how Teleport handles this currently, but coordinating the connection locations of a couple million devices takes a decent amount of work. That is, if I have 100 instances an IOT device can be connected to, how is that device found when routing a message?
This all depends on the message rate from the server to IOT devices. At most scales this probably doesn't need more than a core or two: when a device connects to a process, the event is sent to the connection tracker, and when a device needs to be found, another transaction goes through the connection tracker. For redundancy you can run two trackers, and if a tracker crashes or restarts, each node just resends all its connections. A single instance should be expected to handle a couple hundred thousand queries per second served from memory.
This architecture bugs some engineers since it doesn't scale on its own, so the next step up is to overlay a consistent hash across a pool of connection trackers. The downside of a consistent hash is that it doesn't balance perfectly, but it makes it easier to scale up/down on the fly.
One of the semantics we designed into the protocol's evolution was a backoff mechanism for when the backend came back after an outage. Devices had a tendency to overload the backend with all their queued metrics data.
This was actually relatively simple: we created a rule that for any endpoint, the maximum number of outstanding requests is 1. So for uploading metrics data, only one request could be in flight at a time. This allowed the backend to operate on a priority queue and delay processing low-priority messages like the queued metrics data.
We also built in a Retry-After header semantic that the device would follow if we needed it, but the one-outstanding-request rule was expected to be sufficient.
In our case we controlled all the infra and ensured we had a complete TCP/IP connection to the device. This allowed us to move keepalives to the TCP layer and avoid context switches just to keep connections alive.
This doesn't work if there are L4 or L7 devices in the path of the TCP/IP connection; in that case you need to move to an L7 keepalive, for example if support for an HTTP proxy is required.
One thing we didn't do, but Apple was capable of doing with the iMessage protocol, was continually probing the NAT/firewall timeouts on the telecom networks and automatically adapting to any timer changes. This must've been fairly complicated, but it helps save battery on something like an LTE device and always ensured liveness to the backend regardless of network changes.
Since the connections were mostly expected to stay open, we didn't work at the time on anything like TLS fast resume or session tickets. But when going with TLS, it is something that could help speed up handshakes on resumed connections.
Some TCP/IP tuning tends to come into play, but I can't remember all the tweaks we did offhand. In general, projects doing c10m-style testing often include the sysctl tweaks they had to make.
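For reference, the sysctl knobs that commonly show up in those c10m-style writeups include the ones below; the values are illustrative starting points to benchmark against your own workload, not recommendations:

```shell
# /etc/sysctl.d/99-connection-tuning.conf (illustrative values only)
fs.file-max = 2097152                      # raise the global open-file limit
net.core.somaxconn = 65535                 # larger accept() backlog
net.ipv4.tcp_max_syn_backlog = 65535       # allow more half-open connections
net.ipv4.ip_local_port_range = 1024 65535  # widen the ephemeral port range
net.core.netdev_max_backlog = 250000       # queue more packets per NIC rx
```

Per-process file-descriptor limits (`ulimit -n` / systemd `LimitNOFILE`) usually need raising alongside these.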
Anyways, probably tons of stuff I'm forgetting... but to try and summarize.
FYI @rosstimothy ^
We've solved the number of ports required with TLS routing, and your work on using gRPC instead of SSH to the proxy solves the net.Dial-over-a-gRPC-stream request.
Closing this one. We've addressed significant portions of it already, and it's becoming increasingly clear that the next generation tunnel system for Teleport should be QUIC-based.
Use gRPC instead of SSH for reverse tunnels and `tsh`. This will allow reducing the number of ports to just one. SNI and client certificates can be used for switching. The SSH port can be kept for SSH protocol compatibility.
This will require work to implement `net.Dial` over a gRPC stream protobuf.