vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
17.59k stars 1.54k forks source link

TCP load balancing & failover #288

Open binarylogic opened 5 years ago

binarylogic commented 5 years ago

A common feature across log forwards is the ability to load balance data across a static number of addresses. For example:

[sink.out]
  type = "tcp"
  addresses = ["1.1.1.1:9000", "1.1.1.1:90001"]

Alternatively, DNS could be used for this when support DNS lookup (https://github.com/timberio/vector/issues/252)

Requirements

Open questions

  1. Does this belong in Vector or is it better solved with DNS or another external system? To me, it seems like a light weight feature that would be useful.
  2. Anything else I'm missing.
binarylogic commented 5 years ago

@lukesteensen do you mind reviewing this before work begins? Just want to make sure we actually want to do this.

lukesteensen commented 5 years ago

So there is a bit of complexity here. The current TCP sink is built around the state machine of a single connection, which can be connected, connecting, disconnected, or waiting on a backoff. There are a few options for how that'd could be adjusted to handle multiple connections. (We also have no concept of a TTL like you mentioned, so I'm not even considering that for now.)

The simplest would be to just keep a list of these state machines and work through them fully one at a time. When they're all connected, this would pretty much be the round-robin you're looking for. It would break down pretty quickly for other cases though since you'd block progress on other connections during reconnects or backoffs of a single broken connection. Probably don't want to do this.

A somewhat fancier version would basically layer the state machines. You'd keep a list of roughly Either<Sink, Future<Item = Sink>>, where each represents a connection that's either ready to go or in some "not ready" state. For each incoming event, you'd iterate through the list and send to the sink if it's ready, otherwise poll to see if it has become ready. The futures would need to run in their own task to make progress, so we'd also need to deal with spawning those when a connection moves out of the ready state for whatever reason.

A different approach would be to basically spin up a copy of the existing sink per connection and then load balance over their input channels. My main concern with this approach would be that you're not actually dealing with the state of each connection, so you'd have to rely on backpressure to know when one is not available. This could lead to events getting stuck in the channel of a broken connection and potentially not being delivered.

At a higher level, this is definitely something we can do, but it's not trivial. I'd just want to make sure there's adequate demand for it before sinking in too much effort. It does make sense for some use cases (e.g. fanning out from one vector instance to multiple downstreams), but there are others (e.g. aggregating from many vectors into a few downstreams) where a connection-oriented TCP load balancer would be much simpler.

binarylogic commented 5 years ago

I agree, this definitely requires more thought and detail. I am going to unassign @bruceg until we can do that.

kirillt commented 5 years ago

For me it sounds like it should be abstracted to the topology level (balancing any sink via virtual copy of it), because generally balancing could make sense with HTTP/UDP/whatever. But agree with @lukesteensen's concern about deliverability. With implementing the feature at connection level we can rebalance events from dropped connection to any of living ones.

Probably, possible to implement a trait BalancerSupport for sinks with TCP implementation for the start, and later implement it for other sinks.

lukesteensen commented 5 years ago

Yeah, it's interesting to think about what it could look like as a topology feature instead of a sink feature. I think that'd require really stretching our TOML schemas, but it's an interesting data point for if/when we start thinking about our own fancier custom config language :smile:

LucioFranco commented 5 years ago

@lukesteensen do you think it would make sense to implement TCP sink based around tower, with an immediate response type? This would allow us to use different middleware that tower provides.

lukesteensen commented 5 years ago

I'm not sure how awkward that would be vs just doing a simple round-robin ourselves. I'm sure we could get it working, but I'd want to be sure it's a reasonable match.