Garbage Collection of old Rumors

habitat-sh / habitat

Modern applications with built-in automation

https://www.habitat.sh

Apache License 2.0

2.61k stars 316 forks source link

Garbage Collection of old Rumors #5761

Open christophermaier opened 6 years ago

christophermaier commented 6 years ago

The Supervisor currently remembers rumors about every other Supervisor it has ever been in contact with, even if those Supervisors have left the network and will never return. This is particularly troublesome in automated environments in which servers may be provisioned and destroyed with some regularity. Not only will the destroyed Supervisors never again participate in the network, but the creation/destruction cycle will create many such Supervisors over the lifetime of a Habitat network. And because if even a single Supervisor knows a rumor, it will rapidly be shared across the network. This means that the only way to currently purge knowledge of long-dead Supervisors is to shut down all Supervisors, delete their /hab/sup/default/data/*.rst files, and then restart them all (the *.rst file is a binary serialization of the rumors the Supervisor has heard).

The problem of keeping old rumors around isn't necessarily in the storage space they consume (it's not especially large, even for large networks... we've seen data files of 30MB for networks of 3500 Superviors), it's in the data structures we fill based on these rumors and the other reachable Supervisors that we send them to.

https://github.com/habitat-sh/habitat/pull/5744 went a long way toward reducing this overhead, by removing many of the entries in this data structure pertaining to Supervisors that have departed (whether by gracefully shutting down, timing out, or being manually departed using hab sup depart). It did not remove everything, though, and more importantly, did not remove any of the underlying rumors that get persisted in the *.rst file. When this file is read in the next time the Supervisor starts up, we'll still add these rumors back to this data structure (though it won't be as large as it would have been before, since the population of the data structure is partly driven by which Supervisors are currently reachable; departed Supervisors are not reachable, by definition). This also results in additional network traffic, as we'll tell everyone in the network about these departed Supervisors, even though they likely already know.

Though we need to keep information about departed Supervisors around for some amount of time, in order to ensure that the information has made it to other members of the network, we should be able to intelligently dispose of it after a sufficient amount of time. After all, in automated environments as described above, there is no way these Supervisors are ever coming back.

We already wait 3 days before automatically declaring a non-contactable Confirmed Supervisor Departed; we could simply wait longer (not necessarily 3 days... just longer) before removing pertinent rumors from our internal lists of rumor, which would result in persisting a smaller amount of rumors to disk. The amount of time to wait could also be configurable with environment variables / options, etc.

By taking care to prune our rumors, we can keep our memory consumption proportional to only what Supervisors are currently active in a network, rather to how many Supervisors have ever been in that network.

There are also other rumors we might clean up, such as configuration rumors and file rumors, but from a space/memory consumption standpoint, these pale in comparison to rumors about Supervisors.

thomascate commented 5 years ago

Another feature that could help quite a bit with this would be to add the ability to remove a supervisor from the rumor data via CLI or API. This way if a user regularly decommissions boxes in say a red/green deployment, they could add automation to clean up nodes that they know aren't coming back.

echohack commented 5 years ago

Though we need to keep information about departed Supervisors around for some amount of time, in order to ensure that the information has made it to other members of the network, we should be able to intelligently dispose of it after a sufficient amount of time.

In the worst case scenario, departing a node in the network could cause another node to go offline (though unlikely, someone COULD set an event to fire on departure...). You would then need to wait 72 hours for this node to depart in case it needs to receive the rumor if it comes back before then. So the minimum time to remove rumors would be 72 hours * 2... yes?

thomascate commented 5 years ago

What's the worst case scenario here if you remove a nodes rumor prematurely? If it comes back it will feed it's UUID and information back in moving it to active. If it was kept around in departed and it came back, it would have the exact same end state, but with more data kept around in the interim.

raskchanky commented 5 years ago

After studying this code for a bit and chatting with @christophermaier, here are my thoughts on how to proceed:

We have a process that checks for members that have been Departed for X amount of time. X will be configurable via an env var, but will likely default to something like 4 hours.
For each of the members found, purge all rumors in all rumor stores.
There's no need to send out any rumors as part of this process to let other supervisors know what we're doing. If each supervisor runs this process, then (in theory anyway), all of them will converge on the same set of rumors for removal.
We already have a dedicated butterfly thread for transitioning nodes from Suspect -> Confirmed and Confirmed -> Departed, and this is the natural extension of that: Departed ->. It makes sense to keep this additional capability in that thread.

Am I missing anything?

thomascate commented 5 years ago

My only concern is that you could end up with different members transitioning nodes past the departed status at different times. Due to members not having consistent environment variable settings.

If that's a concern I can think of a couple ways around it

Make a cluster wide variable that is spread through the gossip ring
Attach the environment var to the nodes gossip data and use it to inform other nodes when it's safe to reap this member. This way you can have different polices say for ephemeral app servers and database servers.

raskchanky commented 5 years ago

In order for the transition timing to be different on different nodes, someone would have to manually make it that way, by intentionally setting said environment variable to different values for different nodes. That strikes me as unlikely, although even if it were to occur, I can't think of any issues that it would cause off the top of my head.

christophermaier commented 5 years ago

The butterfly thread that keeps track of the transition timings does not persist timing information, so you could theoretically get into a situation where a Supervisor that restarts more frequently than every 4 hours could keep departed rumor information "forever". Now that I think about it, I'm not sure if we do anything right now that, say, populates that timing data for members that are Suspect or Confirmed at startup... 🤔 It may be that we could just say "if I start up and you're already departed, I'm deleting you" (alternatively, "if I'm shutting down and you're departed, I'm deleting you"), though I think everyone else might just tell us about them again.

(I still think it's fine to use that expiration thread, and I'm not sure that persisting that timing data to disk is worthwhile at this point, but the behavior is something to consider.)

I think for an initial pass, keeping rumor deletion a purely local thing is the most straightforward. If it proves to be a problem, we could look at modifications like "cluster wide variables", or modifying the rumor itself to include an "deletion timeout" (which is an interesting idea), but both of those would require extra work.

baumanj commented 5 years ago

We have a process that checks for members that have been Departed for X amount of time. X will be configurable via an env var, but will likely default to something like 4 hours.

Why 4 hours? The rumor that a node N is departed should propagate through the supervisor network in much less time than that. Or, if the 4-hour threshold occurs during a network partition, nodes on the other side will not be informed upon a rejoin and may then propagate a bunch of rumors about N that the members who've purged all knowledge of N will be obliged to circulate since the information they need to refute them is gone.

For each of the members found, purge all rumors in all rumor stores.

More detail about what specific rumors we're talking about would help. Does this have any effect on SWIM?

Am I missing anything?

I think this is a good first step, but for a full solution, I think we need something more. Consider a node running a service S that goes away. If the node is still Alive, there's no indication to the rest of the supervisor network that the rumors about service S need to be removed. I think we should consider whether applying expiration times to rumors at creation would address the issue more broadly.

raskchanky commented 5 years ago

@baumanj 4 hours was an arbitrary number I made up, aiming for something that was not too short and not too long. 😁 I'm happy to change it to some other value if that works better. Why did we pick 3 days for a node to be Confirmed before it transitions to Departed? That kind of timing felt too long for this case (it feels too long in that case too, but I don't know the reasoning behind it).

The specific rumors I'm thinking of in this case are everything we push/pull via zmq, concretely this list. It has an effect on SWIM in the sense that members will be getting removed from the MemberList but those members are Departed anyway and shouldn't be participating in SWIM traffic.

Consider a node running a service S that goes away. If the node is still Alive, there's no indication to the rest of the supervisor network that the rumors about service S need to be removed.

This piece confuses me a little bit. I thought this task was about removing rumors for members of the network that are Departed, which by definition means they're not going to return. If the node is still Alive, then we shouldn't be removing anything. Am I misunderstanding your point?

I think applying expiration times to rumors at creation time is a great idea. How do we determine what an appropriate expiration time would be for rumors?

baumanj commented 5 years ago

Why did we pick 3 days for a node to be Confirmed before it transitions to Departed?

I've always assumed it was the amount of time after which the likelihood between accidental net-split and intentional decommissioning of nodes was overwhelmingly in favor of the latter. That seem like a logical criterion for making a Confirmed member Departed: this node is never coming back.

For the garbage collection case, since we're only doing it after the point of no return for node N, what is the wait time criterion? Why not purge the rumors about node N as soon as its health is updated to Departed? My assumption was that the 4 hours was to ensure that the rumor of N's departure had propagated across the entire supervisor network. Before that time, while rumors that refer to N are still circulating, it is useful to have the information that N is a known, Departed entity, so rumors concerning it can be appropriately ignored rather than propagated.

I have more thoughts, but they'll have to wait until next week.

baumanj commented 5 years ago

The specific rumors I'm thinking of in this case are everything we push/pull via zmq, concretely this list.

For each of those kinds, there's some additional thinking to do about when purging should occur:

Departure is pretty simple since it just contains a single member ID
Election is a bit more complicated. Surely we'd remove the rumor that makes a departed member a leader, but what if the departed member is merely one of the votes? Similarly for ElectionUpdate.
Membership is simpler because it only deals with one member
Service is referring to a single member, but is any of the associated configuration of interest to other members following departure?
ServiceConfig and ServiceFile originate with a particular member, but are of interest to the whole service group, so those probably can't be purged when the member that originated them departs. When can they be purged?

Consider a node running a service S that goes away. If the node is still Alive, there's no indication to the rest of the supervisor network that the rumors about service S need to be removed.

This piece confuses me a little bit. I thought this task was about removing rumors for members of the network that are Departed, which by definition means they're not going to return. If the node is still Alive, then we shouldn't be removing anything. Am I misunderstanding your point?

I bring this up as an example of rumors that we would want to be purged that couldn't be purged in the context of a scheme based on node departures. This is just to say that we need to consider how we'll address those cases as well. Maybe it means we do multiple things, or maybe it means we should look for an approach that would be more broadly applicable than purging based on member departure.

I think applying expiration times to rumors at creation time is a great idea. How do we determine what an appropriate expiration time would be for rumors?

I think we have to think through the various rumors and understand the use cases. The tradeoff here is between a too-short time which would require excessive network traffic and a too-long time which would result in excessive wait times to recognize the state of the rumors has changed. For something like service rumors, I think we'd want something between 1 minute and 1 hour, but for configuration or elections, I think they could live much longer. The new fundamental concept is rumors needing to be "refreshed" to indicate they should continue to live. If rumors themselves have IDs, we could probably get away with piggybacking them on SWIM messages and get those refresh semantics essentially for free. If that's the case, we could potentially make the lifetimes quite short. We'd need to think about how this would interact with network partitions, though.

raskchanky commented 5 years ago

So, thinking about this less in terms of node departure and more in terms of expiring rumors, here are some thoughts I had.

We add IDs to rumors. Note that the Rumor trait already has an id function, but that's referring to something else. What I'm referring to here would be more like the UUID that Members have.
We add an expiration date/time to rumors.
The expire thread that we already have adds expiring rumors to its list of duties.
The lifespan of any given rumor would vary by rumor type.
I think that piggybacking rumor IDs onto SWIM messages to indicate which rumors should be refreshed would likely work, provided there's enough room. The reading I did suggested that keeping UDP packets to a maximum of 512 bytes was a smart idea, as that's the size packet that DNS uses. Any larger and you risk packet fragmentation. Assuming the supervisor metrics are accurate, I was seeing SWIM packet sizes around roughly 120 bytes in my tests, which leaves us almost 400 bytes to play with.
Assuming that the SWIM message piggybacking doesn't work, either due to space constraints or something else, I think we can use the same rumor sending mechanism that we have today for sending out refreshed rumors. However, if we go this route, I think the heat mechanism that we have for rumors ends up being useless. The Xerox paper describes the heat as how many times the rumor should be sent to the network at large, but in butterfly today, it represents how many times we send a rumor to each member. If we're going to be sending rumors out periodically anyway, to ensure they're refreshed and not expiring, then I can't see how the heat mechanism helps us at all. It should likely just go away entirely.

Some questions that came up as I thought/read about this:

What do we do with the rumors that already exist in rst files of deployed supervisors, that don't have expiration dates? One thought I had was to treat them as though they had just been received over the network, apply an expiration date/time on them appropriate for their rumor type, and let them expire naturally from there. There might be better ways to deal with this.
What happens to rumors during a network partition? I can imagine a few scenarios here:
- A brief network partition occurs, specifically one that is shorter than the shortest expiration time we have. In this case, nothing bad happens because things heal in time for refreshed rumors to arrive.
- A network partition occurs and it lasts longer than some expiration times we have. In this case, a certain subset of rumors will get purged, specifically ones that originated from supervisors on the other side of the split from us. In this case, we lose all that information. I would imagine that all of it would come back, though, after the partition heals. There are probably some ramifications to this that aren't immediately obvious to me.

I haven't come up with specific expiration times for each rumor type yet, but I'd appreciate any feedback on this approach so far. @baumanj @christophermaier

baumanj commented 5 years ago

I think the heat mechanism that we have for rumors ends up being useless

Agreed. I think we want to move away from that rumor mongering approach to an anit-entropy one anyway. We need that to ensure that we have a guarantee (if asymptotic) on full dissemination of rumors to all members. But that's a separate issue.

I like the idea of SWIM piggybacking for efficiency, but I wonder (even as the person who suggested it originally) if it's premature optimization. Since it is a bit of a layering violation, maybe we should make sure using a regular, separate rumor message for refreshing isn't good enough.

What do we do with the rumors that already exist in rst files of deployed supervisors, that don't have expiration dates?

As you suggest, we could treat them like just received and I think that would work ok. Additionally, we could ignore them completely: if the rumor is still relevant we'll be getting refreshes of it. It occurs to me that we'll need to have a mechanism for when a member receives a refresh for a rumor it doesn't already know about to ask the sender of the refresh message for the full rumor. Additionally, a mechanism to request a full dump of rumor state from a (likely persistent) peer upon joining a new network is something we need, but that's a separate issue that I'll be looking at shortly.

In any case, this issue of a persisted rumor that lacks an expiration will only be an issue for a short period as we transition into this new approach.

A network partition occurs and it lasts longer than some expiration times we have. In this case, a certain subset of rumors will get purged, specifically ones that originated from supervisors on the other side of the split from us. In this case, we lose all that information.

One approach here would be pausing the expiry of rumors that originate with members that are Suspect or Confirmed until they become Departed. At that point, we'd purge all the rumors we have which were originated by that member regardless of their expiration*. Once the partition heals, we can push forward the expiration date on any rumors that originated from members we were split from to ensure there's enough time to see a refresh message. I haven't thought too much about the specific times here, but we'd likely want to keep track of the expiration time in terms of a (time_to_live: Duration, last_refresh: Instant) pair rather than just a single time_to_expire: Instant, so that we can easily extend the lifetime of a rumor by an appropriate amount even if we're partitioned from its originating member. However, we should also be careful that we don't end up with a system that extends the lifetimes of rumors indefinitely in the case where brief partitions are common. That may require keeping track of the difference between a "true" refresh based on a message from the originator and an "artificial refresh" due to our partition logic and limiting the latter.

Also, perhaps I'm overcomplicating this and we should start simple. If we expire a rumor due to a partition, once we come back we'll eventually see the rumor again. The normal dissemination algorithm will apply to the refresh messages, so improvements there will mean improvements to our healing from partition.

I think we need to consider what is the behavior we want to avoid. How bad is it for us to expire a rumor in some parts of the network and then have it come back later? Where's the right tradeoff for keeping a stale rumor around versus expiring one prematurely? If rumor expiration is mostly about preventing our persistence from being unbounded, we can probably keep most rumors around for as long as it takes a member to depart. If we go that route, do we even need to worry about premature expiration on splits? (Technically I think we'd need to keep the rumors around for 2 x departure delay + minimum propagation time).

* Part of this work will mean possibly getting rid of the pattern where a member that is not part of a service group can inject configuration for it before any service group members join. We probably need the concept of a configuration rumor being owned by the entire service group rather than the particular member who originated it. In that case, refreshing it should be the joint responsibility of the service group members, and once they all go away, the relevant rumors will eventually expire.

raskchanky commented 5 years ago

I like the idea of SWIM piggybacking for efficiency, but I wonder (even as the person who suggested it originally) if it's premature optimization. Since it is a bit of a layering violation, maybe we should make sure using a regular, separate rumor message for refreshing isn't good enough.

I think it'd be good to not do the piggybacking straight away, mostly for simplicity's sake, and see how that shakes out.

As you suggest, we could treat them like just received and I think that would work ok. Additionally, we could ignore them completely: if the rumor is still relevant we'll be getting refreshes of it.

Ignoring them completely is definitely easier, although your point about receiving refreshes of them makes me realize that this is less of a big deal than I originally thought.

It occurs to me that we'll need to have a mechanism for when a member receives a refresh for a rumor it doesn't already know about to ask the sender of the refresh message for the full rumor.

What if, instead of sending a new kind of rumor specifically for refreshing an existing one, we just sent the original rumor again? The main disadvantage that occurs to me here is message size, since specialized refresh rumors would be smaller than the actual rumor they represent. I'm guessing most rumors are fairly small, but for things like file uploads, the savings would likely be significant.

Also, perhaps I'm overcomplicating this and we should start simple. If we expire a rumor due to a partition, once we come back we'll eventually see the rumor again. The normal dissemination algorithm will apply to the refresh messages, so improvements there will mean improvements to our healing from partition.

I think it might behoove us to start simple and see if that's good enough. If it's not, and we're seeing obvious issues, then apply more complicated logic to sort it out. I like your suggestions about pausing though.

How bad is it for us to expire a rumor in some parts of the network and then have it come back later?

Off the cuff, this doesn't seem like it would be a big deal.

Where's the right tradeoff for keeping a stale rumor around versus expiring one prematurely?

It's possible I'm thinking about this wrong, but I've always thought of rumors as something that should be ephemeral and easily replaced. It's a mechanism for communicating information between supervisors in a structured way, but I don't think they should be considered sacred. I'd rather see us expire one prematurely than keep a stale one around too long.

If rumor expiration is mostly about preventing our persistence from being unbounded, we can probably keep most rumors around for as long as it takes a member to depart. If we go that route, do we even need to worry about premature expiration on splits? (Technically I think we'd need to keep the rumors around for 2 x departure delay + minimum propagation time).

I think part of it is about preventing our persistence from being unbounded, but another part of it is reducing network storms when new supervisors come and go from the network. It's a waste of CPU time and network bandwidth to be sending rumors that are no longer relevant back and forth.

baumanj commented 5 years ago

What if, instead of sending a new kind of rumor specifically for refreshing an existing one, we just sent the original rumor again? The main disadvantage that occurs to me here is message size, since specialized refresh rumors would be smaller than the actual rumor they represent. I'm guessing most rumors are fairly small, but for things like file uploads, the savings would likely be significant.

Sending the whole rumor again sounds like a good thing to start with for the sake of simplicity to see how it works out. We can always add the optimization later if it's needed.

If rumor expiration is mostly about preventing our persistence from being unbounded, we can probably keep most rumors around for as long as it takes a member to depart. If we go that route, do we even need to worry about premature expiration on splits? (Technically I think we'd need to keep the rumors around for 2 x departure delay + minimum propagation time).

I think part of it is about preventing our persistence from being unbounded, but another part of it is reducing network storms when new supervisors come and go from the network. It's a waste of CPU time and network bandwidth to be sending rumors that are no longer relevant back and forth.

I think the main fix for the network storms is going to be an efficient mechanism for getting a new peer "up to speed". Even if we had a very short timeout that ensured only relevant rumors were extant in the network, the existence of large config file rumors for active services could make this a problem. Why don't we start with using expiration to only target the unbounded storage issue (i.e, use long expiration) and when we implement the other approaches* to addressing the join storms we can see if there's additional value to be gained from a tighter expiration bound.

In general, I really like the direction this is headed. Starting simple should help make this into something we can try out and get more experience with quickly.

* More targeted sending, bulk rumor updates

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.