Large nflogs/silences can never be gossiped

stuartnelson3 commented 6 years ago

One issue we need to deal with:

The internal queue of messages to gossip is currently unbounded, and only messages under a certain size can be sent (currently 1400 bytes (configurable) - msg_overhead (memberlist messages, usually a few byte)). Any messages that push the total gossip size past this "gossip size limit" remain in the queue and are attempted to be sent next time ... but they can never be sent, because they're too large.

Looking at our setup, we have some alerts that can greatly exceed this max size for a gossip message:

2018-06-12_12:02:03.97148 [DEBUG] memberlist: message not sent because over limit size_total=156746 overhead=3 size_used=0 size_msg=156743 limit=1398 position=1/2

These are just some internal log lines I added, but the important part is that overhead + size_used + size_msg needs to be LESS than limit. Because size_msg is way larger than limit, this will sit in the queue forever.

The messages slowly pile up:

2018-06-12_12:16:20.77173 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=12/13
2018-06-12_12:16:20.77174 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=11/13
2018-06-12_12:16:20.77175 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=10/13
2018-06-12_12:16:20.77176 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=9/13
2018-06-12_12:16:20.77177 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=8/13
2018-06-12_12:16:20.77178 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=7/13
2018-06-12_12:16:20.77179 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=6/13
2018-06-12_12:16:20.77185 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=5/13
2018-06-12_12:16:20.77186 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=4/13
2018-06-12_12:16:20.77189 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=3/13
2018-06-12_12:16:20.77190 [DEBUG] memberlist: message not sent because over limit size_total=156746 overhead=3 size_used=0 size_msg=156743 limit=1398 position=2/13
2018-06-12_12:16:20.77191 [DEBUG] memberlist: message not sent because over limit size_total=156746 overhead=3 size_used=0 size_msg=156743 limit=1398 position=1/13

Now the hard part ... how do we solve this? Do we not gossip messages that are too large, and raise the limit? Do we hack up any messages that exceed our byte limit and send them piecemeal? Do we attempt to make a hash of hashes for "large" messages, and fall back to just doing a direct comparison? What do we set the limit to?

@brancz @fabxc @simonpasquier @mxinden @brian-brazil

Cross posted from #1340

stuartnelson3 commented 6 years ago

@brancz

Are we sure this is actually a single message that we're talking about? To me this seems like the Alertmanager instances are trying to gossip the FullState at once, and that's simply too large.

stuartnelson3 commented 6 years ago

@simonpasquier

Indeed I can reproduce the problem (for instance by creating a silence with a very long comment). Eventually the silence gets replicated when the nodes do a full-state sync but the queue doesn't drain the messages which are too large. I can't see an easy solution for this except prune the queue regularly?

stuartnelson3 commented 6 years ago

@brancz

How about a reasonable limit for comments and extract comments out of the silencelog entry, and the same for anything that's currently not limited. If the message size limit is way to small we should of course adapt that as well.

stuartnelson3 commented 6 years ago

@stuartnelson3

The logs above were for a single nflog that had a couple hundred alerts in it. I don't think there's a good way to limit the number of alerts we gossip, otherwise we're building a gossip that will always double-notify for large nflogs. I'm thinking a best-effort hashing-the-hashes might be the best option ..?

mxinden commented 6 years ago

extract comments out of the silencelog entry

This would mean silence comments would not be gossiped among Alertmanagers, right @brancz ? :confused:

I'm thinking a best-effort hashing-the-hashes might be the best option ..?

With my understanding of the code, a notification Entry's main portion consists of the hashed alerts. Thereby hashing the hash sounds dangerous to me.

mxinden commented 6 years ago

As far as I understand the above mentioned limit (1398) originates from the UDP buffer size (1400). According to the memberlist docs, a single gossip translates exactly into a single UDP package. Thereby increasing the limit by default would not be an option, as 1400 bytes is the industry standard as an MTU.

    // Maximum number of bytes that memberlist will put in a packet (this
    // will be for UDP packets by default with a NetTransport). A safe value
    // for this is typically 1400 bytes (which is the default). However,
    // depending on your network's MTU (Maximum Transmission Unit) you may
    // be able to increase this to get more content into each gossip packet.
    // This is a legacy name for backward compatibility but should really be
    // called PacketBufferSize now that we have generalized the transport.
    UDPBufferSize int

stuartnelson3 commented 6 years ago

This would mean silence comments would not be gossiped among Alertmanagers, right @brancz ?

They would have to be merged on a fullstate sync, which happens every 60 seconds. But that would be confusing for users.

stuartnelson3 commented 6 years ago

With my understanding of the code, a notification Entry's main portion consists of the hashed alerts. Thereby hashing the hash sounds dangerous to me.

What part of this sounds dangerous?

mxinden commented 6 years ago

What part of this sounds dangerous?

Doesn't hashing a hash increase the risk of hash collisions a lot?

stuartnelson3 commented 6 years ago

I'll be honest, I only read this response here: https://crypto.stackexchange.com/questions/12021/can-the-xor-of-two-non-collision-resistant-hashes-be-collision-resistant, from which it seems it's "probably ok".

Given that this entry is looked up based on groupkey/receiver, the issue would be if two groups of non-identical alerts for a single groupkey/receiver resulted in the same hash. So, in addition to the hash collision having to occur, the sample size of gossip messages that are this large and require the additional XOR'ing would hopefully be small enough that the odds of it happening are very small.

I'm not an expert on this, so anyone is welcome to chime in and correct me.

mxinden commented 6 years ago

As far as I understand the above mentioned limit (1398) originates from the UDP buffer size (1400). According to the memberlist docs, a single gossip translates exactly into a single UDP package. Thereby increasing the limit by default would not be an option, as 1400 bytes is the industry standard as an MTU.
    // Maximum number of bytes that memberlist will put in a packet (this
    // will be for UDP packets by default with a NetTransport). A safe value
    // for this is typically 1400 bytes (which is the default). However,
    // depending on your network's MTU (Maximum Transmission Unit) you may
    // be able to increase this to get more content into each gossip packet.
    // This is a legacy name for backward compatibility but should really be
    // called PacketBufferSize now that we have generalized the transport.
    UDPBufferSize int

Sorry, a UDP package greater than the MTU would be fragmented. This leaves us with a maximum UDP package size of 65,535 bytes (8 byte header + 65,527 bytes of data). Packet fragmentation might still be something we would like to avoid.

mxinden commented 6 years ago

Talking to @brancz in regards to double hashing: We had been facing issues gossiping only a single hash per notification, which resulted in duplicate notifications. This was fixed in https://github.com/prometheus/alertmanager/pull/703.

This PR introduces a fundamental change to the protobufs gossiped, because the previously optimization we implemented for the 0.5 release did not turn out to work as expected. Instead we now gossip the entire list of previously firing and resolved alerts, so if the currently firing alerts are a subset of the alerts that were firing at the last notification, then we can deduplicate. It works the same way for resolved notifications, if the currently resolved alerts are a subset of those that have already been notified about then they are deduplcated. Consequently, if currently firing or resolved alerts are not a subset of the previous sets, then they need to be notified about (after the group_interval).

stuartnelson3 commented 6 years ago

Hm, how do we fix the issue of gossiping a long list of alerts?

brian-brazil commented 6 years ago

Packet fragmentation might still be something we would like to avoid.

We're going over the internet, so that would be best. Is TCP an option?

stuartnelson3 commented 6 years ago

We're going over the internet, so that would be best. Is TCP an option?

The gossip messages (currently happening every 200ms) are over UDP, but the fullstate sync (every 60sec) use tcp.

stuartnelson3 commented 6 years ago

We could implement memberlist.Transport and use TCP instead of UDP for the gossip messages. With keep-alive, and the messages being sent every 200ms, we shouldn't have to re-establish the connection. Then we can raise the "packet" limit to something arbitrarily high.

https://github.com/hashicorp/memberlist/blob/master/transport.go https://github.com/hashicorp/memberlist/blob/master/net_transport.go

stuartnelson3 commented 6 years ago

Talking to @brancz in regards to double hashing: We had been facing issues gossiping only a single hash per notification, which resulted in duplicate notifications. This was fixed in #703.

We would be XOR'ing on top of a different hashing algorithm, I wonder if that would have an effect on collisions.

simonpasquier commented 6 years ago

We could implement memberlist.Transport and use TCP instead of UDP for the gossip messages. With keep-alive, and the messages being sent every 200ms, we shouldn't have to re-establish the connection. Then we can raise the "packet" limit to something arbitrarily high.

From my POV, this would be a good alternative. Another one would be to send messages larger than a predefined size using memberlist.SendReliable(). This would keep the existing gossip processing for smaller messages.

brancz commented 6 years ago

Implementing transport via TCP sounds like a good solution to me. I'd rather start optimizing for performance/throughput when we can actually quantify and justify it. Until then it's better to keep it simple stupid.

stuartnelson3 commented 6 years ago

I'd rather start optimizing for performance/throughput when we can actually quantify and justify it.

Which optimization are you referencing? I thought we were trying to figure out ways to gossip large messages since it currently isn't working.

brancz commented 6 years ago

Sorry I wasn't clear enough. I was commenting on @simonpasquier's suggestion to "keep the existing gossip processing for smaller messages". I think we're better off fully understanding and controlling one way of gossiping that we can improve when we can actually measure it, and my suggestion is to implement the memberlist.Transport with TCP for this.

stuartnelson3 commented 6 years ago

K, looking at this now.

stuartnelson3 commented 6 years ago

Alright, @grobie and I looked at the memberlist.Transport interface and its default implementation, memberlist.NetTransport.

The way memberlist uses the Transport interface assumes two different types of connections: one that's short-lived (in the default implementation, these are the TCP connections), and one that is a persistent stream (the UDP connection).

The short-lived connection is used whenever a user attempts to send a message reliably, or during the push-pull full state sync (both sending and receiving the sync). The methods for this are:

    // DialTimeout is used to create a connection that allows us to perform
    // two-way communication with a peer. This is generally more expensive
    // than packet connections so is used for more infrequent operations
    // such as anti-entropy or fallback probes if the packet-oriented probe
    // failed.
    DialTimeout(addr string, timeout time.Duration) (net.Conn, error)

    // StreamCh returns a channel that can be read to handle incoming stream
    // connections from other peers. How this is set up for listening is
    // left as an exercise for the concrete transport implementations.
    StreamCh() <-chan net.Conn

The persistent connection is handled by:

    // PacketCh returns a channel that can be read to receive incoming
    // packets from other peers. How this is set up for listening is left as
    // an exercise for the concrete transport implementations.
    PacketCh() <-chan *Packet

It's just used as a data stream that processes incoming Packets whenever they're received.

With the default implementation, both TCP and UDP listeners are bound to the same port, so it's very clear which data should be sent to which interface channels. The issue that we're facing because of only using TCP is how to differentiate between incoming message types. Options seem to be:

Have two ports: one port listening for the short-lived, net.Conn returning functions, and one port listening for gossip-level messages (which are then parsed, put into memberlist.Packets, and passed on). Alertmanager would then be using 3 ports in total on a host system.
Have one port: introspect incoming requests and then send them to the correct Stream or Packet channels. This will probably require a non-trivial amount of code to read the connection, parse what sort of message it is, and send it on appropriately.
Look into replacing memberlist with a higher-level library more suited to our needs. This would be a longer discussion, but I'm seeing a growing mismatch between the needs of alertmanager for HA mode, and memberlist. Memberlist is a low level building block and, as its name implies, seems to be focused on maintaining membership (hence its default use of small UDP packets for gossiping), vs. our needs of reconnecting, forwarding messages of arbitrary size, and sending messages to each peer (vs. the current probabilistic sending).
Split the difference between {1,2} and 3, and use memberlist.SendReliable() + memberlist.Members() to manually send messages to all cluster members. It would probably also make sense to aggregate and send messages every gossip-interval.

stuartnelson3 commented 6 years ago

After talking with grobie, the simplest way forward seems to be setting a limit on which messages we gossip (based on size), and anything above that gets sent via memberlist.SendReliable().

brancz commented 6 years ago

After talking with grobie, the simplest way forward seems to be setting a limit on which messages we gossip (based on size), and anything above that gets sent via memberlist.SendReliable().

I get it for moving forward and getting a stable v0.15 out, let's do it this way.

I still think we should do the transport implementation. Yes I think memberlist is a lot of work, but my feeling is that it is actually giving us all the control and understanding of how everything works, this is necessary for us to ever ship an actually stable version of Alertmanager with a stable wire protocol for example. I think a higher level framework would just drive us into the same problems where we don't know and can't control what's happening, and when it's breaking, well then we're breaking. I admit that I did not foresee the amount of customization that we are finding ourselves having to do with memberlist, but I feel that we actually have a solid understanding now of how everything works. In fact in my opinion it would be great if there wasn't a library at all and we would have all of the domain knowledge to produce this distributed system, but at the same time there is still a lot of functionality of memberlist that we actually are using (like all of the membership handling).

TCP for transport with a single port and deciding on message content based on magic bytes and headers is what the typical solution in a case like this would be and I think it's a good choice here as well, a higher level library would after all do something similar, just hide it and not give us control over it.

I don't think we're in a hurry with any of this, not even the v0.15 release, I think we're creating some pressure here ourselves (this is my point of view, other's might be different), it's ok that we are and we should be taking our time for such a drastic change in the system.

grobie commented 6 years ago

After having been pulled into this discussion and shown a lot of the interfaces and implementation in memberlist, I don't come to the same conclusion @brancz. Unfortunately memberlist is not that much of a low-level library, it is quite tailored for the specific needs of serf/consul, and the transport interface appears to be extracted after the fact. Memberlist has some very concrete asumptions about how the implementation behaves based on their own concrete UDP implementation.

Given the length we'd need to go to write a long-lived, single-port TCP transport against the given interface, I don't agree that memberlist provides enough control or allows for a simple and clean implementation in its current state.

Questions remain whether there is any other library which is better suited for our needs, or whether we're better of writing memberlist from scratch or refactoring memberlist (at this risk of creating a fork).

prometheus / alertmanager

Large nflogs/silences can never be gossiped #1412