Closed klizhentas closed 11 months ago
Duplicate of https://github.com/gravitational/teleport/issues/3211 and https://github.com/gravitational/teleport/issues/3972
Related to #4141
gRPC tunnels don't have to be persistent. They are necessary only when a certificate authority is updated or a connection is active. Most of the time the tunnels are idle, yet they still consume goroutines and file handles.
What if, instead, we used a UDP-based packet exchange as an "idle mode" and established TLS on demand for active operations?
We can also look into using https://noiseprotocol.org/noise.html and maybe even parts of https://github.com/WireGuard/wireguard-go.
@knisbet what do you think about this? In your IOT experience, what did you folks use for large-scale stateless connections between devices?
So if I'm reading this correctly, this tends to be a very deep question, which goes a lot deeper than just some protocol choices. And I probably don't grasp all the considerations specific to Teleport; at the IOT company we spent weeks iterating on the various options, with lots of deliberation, and evaluated several options seriously.
At the scale Teleport operates at, I don't think the choice of protocol is really the most important one. In our case, we were interested in Noise for low-power, battery-operated CPUs with a 1-year target battery lifespan and low-bandwidth 900 MHz radios. Noise had lots of properties we liked for that environment.
For fully network-connected and powered IOT (WiFi, LTE, wired Ethernet, etc.) there isn't a huge advantage in applying Noise and then having to solve key distribution problems, build the protocol on top of the encryption, etc.
For the IOT space with stable networking and the device constraints we had, I was an advocate for basing the protocol on HTTP/2 and using protocol buffers or an alternative for message encoding. So basically gRPC without the libs, since our embedded team didn't want the libs on the device. The actual implementation was a custom protocol, which in my opinion was a political choice, not a technical one.
A nice property of WireGuard/Noise, though, is that it allows a port to remain passive toward unauthenticated traffic. However, anyone able to observe traffic will know the port is in use, and defaults can often be guessed anyway if, for example, another port is serving the Teleport UI.
Within HTTP/2 or gRPC, the way to handle server push is to set up a long-polling-style operation. In HTTP/1.1 this would need a separate connection, but HTTP/2 multiplexes it into the existing socket; it's basically the old-school concept of long polling, with HTTP/2 folding it into the one socket. In gRPC this would be something like a streaming RPC that makes a request and just sits waiting for response messages.
Going with a custom protocol on Noise, for example, means the server can just send a message at any time, but it also implies that all protocol semantics need to go through a design process: how to avoid head-of-line blocking, message interleaving and prioritization, how to ack a message, backoff standards, error handling, message ordering and buffering, flow control, etc.
There was also a serious look at MQTT, but the available brokers weren't any good, and I'm not a huge fan of overlaying broker semantics on an edge protocol. Handling millions of topics and subscriptions and coordinating them through a series of brokers is something I'm sure someone has done, but when we surveyed the industry it was suspected to be a big pain point.
Specific to Teleport, this gets a bit more complicated, because I don't think the reverse tunnel carries just messages, but also connections like an SSH session. This could be mapped to messaging, or the client could probably just be told to keep a number of long-polling channels open and open more if all of them are in use. Anyway, it might be complicated to figure out how to map this onto HTTP semantics if going this route.
In my experience the architecture of the processes leads to the biggest gains in raw connection limits for lots of mostly idle devices. The route I was promoting was multiple instances of smaller processes. In Rust/C this would be a single-threaded event loop, with an instance launched per CPU core. Even in Go/Java I would try this route, though maybe with pairs of cores or something; also make sure the deployment is NUMA-aware when running on metal, etc.
For GCed languages, this gave the advantage of many instances with smaller memory spaces, shorter GC pauses, etc. It also means a smaller impact if a process crashes or needs to be rotated for an upgrade. Basically the strategy was to minimize shared state, whether that was GC workload, the number of connections to deal with, etc.
I have no idea how Teleport handles this currently, but coordinating the connection locations of a couple million devices takes a decent amount of work. That is, if I have 100 instances an IOT device can be connected to, how is that device found when routing a message?
This all depends on the message rate from the server to IOT devices. At most scales this probably doesn't need more than a core or two: when a device connects to a process, the event is sent to the connection tracker, and when a device needs to be found, another transaction goes through the connection tracker. For redundancy you can run two trackers, and if a tracker crashes or restarts, each node just resends all its connections. A single instance should be expected to handle a couple hundred thousand queries per second served from memory.
This architecture bugs some engineers since it doesn't scale on its own, so the next step up is to overlay a consistent hash across a pool of connection trackers. The downside of a consistent hash is that it doesn't balance perfectly, but it makes it easier to scale up/down on the fly.
One of the semantics we designed into the protocol's evolution was a backoff mechanism for when the backend came back after an outage. Devices had a tendency to overload the backend with all their queued metrics data.
This was actually relatively simple: we created a rule that for any endpoint, the maximum number of outstanding requests is 1. So for uploading metrics data, only one request could be in flight at a time. This allowed the backend to operate on a priority queue and delay processing low-priority messages like the queued metrics data.
We also built in a Retry-After header semantic that the device would follow if we needed it, but the one-outstanding-request rule was expected to be sufficient.
In our case we controlled all the infra and ensured we had a complete TCP/IP connection to the device. This allowed us to move keepalives to the TCP layer and avoid context switches just to keep connections alive.
This doesn't work if there are L4 or L7 devices in the path of the TCP/IP connection; in that case you need to move to an L7 keepalive, for example if support for an HTTP proxy is required.
One thing we didn't do, but Apple was capable of doing with the iMessage protocol, was continually probing the NAT/firewall timeouts on the telecom networks and automatically adapting to any timer changes. This must've been fairly complicated, but it helps save battery on something like an LTE device and always ensured liveness to the backend regardless of network changes.
Since the connections were mostly expected to stay open, we didn't work at the time on anything like TLS fast resume or session tickets. But when going with TLS, it is something that could help speed up handshakes on resumed connections.
Some TCP/IP tuning tends to come into play, but I can't remember all the tweaks we did offhand. In general, projects doing c10m-style testing often include the sysctl tweaks they had to make.
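For reference, the sysctl knobs that commonly show up in those c10m-style writeups include the ones below; the values are illustrative starting points to benchmark against your own workload, not recommendations:

```shell
# /etc/sysctl.d/99-connection-tuning.conf (illustrative values only)
fs.file-max = 2097152                      # raise the global open-file limit
net.core.somaxconn = 65535                 # larger accept() backlog
net.ipv4.tcp_max_syn_backlog = 65535       # allow more half-open connections
net.ipv4.ip_local_port_range = 1024 65535  # widen the ephemeral port range
net.core.netdev_max_backlog = 250000       # queue more packets per NIC rx
```

Per-process file-descriptor limits (`ulimit -n` / systemd `LimitNOFILE`) usually need raising alongside these.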
Anyways, probably tons of stuff I'm forgetting... but to try and summarize.
FYI @rosstimothy ^
We've solved the number of ports required with TLS routing, and your work on using gRPC instead of SSH to the proxy solves the net.Dial-over-a-gRPC-stream request.
Closing this one. We've addressed significant portions of it already, and it's becoming increasingly clear that the next generation tunnel system for Teleport should be QUIC-based.
Use gRPC instead of SSH for reverse tunnels and `tsh`. This will allow reducing the number of ports to just one. SNI and client certificates can be used for switching. The SSH port can be kept for SSH protocol compatibility.
This will require work to implement `net.Dial` over a gRPC stream protobuf.