Open jm-welch opened 8 years ago
Just found another issue with the VO_INCIDENT_KEYS
list - it's not filtering out duplicates. If an incident receives 3 updates, all three of those will be present in the list. This may not fill a disk, but it is taking up a lot of extra space that doesn't need to be taken. Also, since the older message will be removed from the INCIDENT_KEYS list after 24 hours, this would also remove that incident from the main list at that time, even if it had been updated more recently.
I think I have a full-on fix for #14 ready to go - doing final testing now. If it works, this issue won't be needed anymore, as the cleanup will be handled differently.
This one's going to be long, but hopefully, I've got a solution. I was digging around in the depths of redis to figure out how best to store control call data, and I discovered that there are a couple of problems with the current handling of the incident list in the redis brain.
When I looked at the redis data block, I saw the following:
(Except there were lots more incidents tracked). So this means that hubot doesn't use redis at the top level, but rather creates a json block stored as the value for the
hubot:storage
key, and@robot.brain.set
is setting values under_private
within that JSON block.Notice the odd thing there?
VO_INCIDENT_KEYS
is blank. That's not a redaction, it's the observed behavior. And it stayed that way. I ranredis-cli monitor
to keep an eye on it, and through every update, that value stayed a blank list. Commenting out the call tocleanupBrain
"fixed" that, so I figured I'd do some more digging into that method to see what else might be odd there.Here's the current implementation (in case it gets lost to the sands of time):
The
new Date(item.timestamp)
creates a date object, which thegetDate()
method converts to an integer representing the day of the month (1-31) (reference). This means a couple of things.new Date
on the other side of the comparison is being compared to a raw integer. This isn't going to go well. Because of this, every timecleanupBrain()
is called, the entirety ofVO_INCIDENT_KEYS
gets filtered out, since the integer is always less than the Date object.new Date().getDate()
doesn't work. The.getDate()
on the left side means that the +1 isn't adding a day, it's adding a day-of-month, so any items created on the 31st of a given month will never be removed (there will never be a 32nd of the month), and objects created on the 30th won't qualify for removal until the 31st of the following month.The fix for this is to use the
.getTime()
method. This converts the date to an epoch time instead, so instead of adding 1 to stand in for 24 hours, we add 86400000 (milliseconds in a day). This works out to something like:The second issue I found is one that we actually just removed the code for, without actually fixing it. It used to look like this:
Now that I've done some more digging into what's going on in the redis brain, I can see what was going on in the
#@robot.brain.remove item.name
line. Up inreceiveWS
, under theTIMELINE_LIST_REPLY_MESSAGE
block, the incident is being written two places:... and ...
The first call, above, adds the full alert JSON directly under
_private
, while the second call manipulates the contents list stored (at the same level) asVO_INCIDENT_KEYS
.This means that, by dropping the
@robot.brain.remove item.name
line, we rendered the bot incapable of cleaning up the full JSON of any alert it has received, while the list that was intended to be used to prune the larger body of data is being constantly purged (meaning that there is no way short of manual manipulation to remove those entries). This means that the size of the redis brain has the potential to get quite large over time, as no old entries will ever be removed.Based on this, it looks like we really do need the
@robot.brain.remove
line, so I uncommented it, and now that we've got everything else going for us, I think we're in good shape.I'm working on testing this at the moment to make sure that it really does work now, before I submit a PR. I'll be able to change the 'timeout' value from 1 day down to a few hours tomorrow morning to see if my test alert that I just created gets dropped from the brain.