mozilla / self-repair-server

This project is now EOL, replaced by Normandy Recipe Server.

Gentler heartbeat geo rollout #128

Closed: gregglind closed this issue 9 years ago

gregglind commented 9 years ago

Let's get Country Codes into Heartbeat without WRECKING EVERYTHING. (And solve one of glind's Q3 goals! Heroes of Heartbeat Medals for Everyone.)

Some things:

  1. Heartbeat (github.com/mozilla/self-repair-server) wants Geo.
  2. Other services have Geo (snippets etc.)
  3. I done broke things yesterday. (see https://bugzilla.mozilla.org/show_bug.cgi?id=1190500)
  4. Bug filed to turn this into a central service. I don't really want to wait for it. https://bugzilla.mozilla.org/show_bug.cgi?id=1190910
  5. Locale is a terrible proxy for geo. en-US is the worst here: only 1/3 of en-US is actually in the US. Thus the cake heartbeat stats is a lie. Also, PS: 10% of the US geo uses es-MX. So, I want to report about the US.

Current status:

Proposals to get over the day one hump (more welcome!):

  1. I do a phased deploy over 30 days. Unfortunately, it's a static file deploy on AWS, and I don't really have this set up.
  2. "Phase it" in the file... like say "given that it's August, do probability of geo proportional to the day of the month" until it's all rolled out. This is doable, but really gross.
  3. MOAR POWER. A few more VM's or such for the first month, until it all settles down. I like this, because I HAVE TO DO NO WORK.
  4. Wait until it's in firefox central, let them handle it. When I take Dave Camp's job in 2114, this will be my first priority :) So, NO.
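
For concreteness, the "phase it in the file" idea (proposal 2) could look roughly like the sketch below. It is written in Python purely for illustration; the function names `geo_probability` and `should_request_geo` are hypothetical, and it assumes each client makes one independent random draw per check.

```python
import calendar
import datetime
import random


def geo_probability(today: datetime.date) -> float:
    """Chance of doing the geo lookup today: day of month / days in month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return today.day / days_in_month


def should_request_geo(today: datetime.date, rng=random) -> bool:
    """Hypothetical gate: one independent draw against today's probability."""
    # rng.random() is in [0.0, 1.0), so the last day of the month always passes.
    return rng.random() < geo_probability(today)


if __name__ == "__main__":
    for day in (1, 10, 20, 31):
        date = datetime.date(2015, 8, day)
        print(date.isoformat(), round(geo_probability(date), 3))
```

On the last day of the month the probability reaches 1.0, so every client that checks in does the lookup. Clients that check in repeatedly get a fresh draw each time, so the cumulative rollout runs ahead of the nominal day-of-month fraction.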

Asks:

Offers:

Thanks!

Gregg Lind User Advocacy Self-Repair Lead

willkg commented 9 years ago

For the record,

"MOAR POWER. A few more VM's or such for the first month, until it all settles down. I like this, because I HAVE TO DO NO WORK."

Will likely result in another spike 30 days later when the caches run out. I don't suggest doing this unless the idea is to add a few more VMs permanently.

gregglind commented 9 years ago

I agree there will be "rhyming" spikes at 30 and 60 days. They should decrease in amplitude. This patch takes the slow-rollout-over-30-days approach.

floatingatoll commented 9 years ago

I opened issue 129 to deal with the "rhyming" spikes issue. If a rhyming spike causes an outage, it will increase the amplitude of future spikes, rather than decreasing it. We need to randomly fuzz the '30 day' interval to prevent this. (Services Engineering has prior experience doing this with Sync clients and servers, for similar reasons.)
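
A minimal sketch of what fuzzing the interval could mean, in Python for illustration only. The helper name `fuzzed_interval_days` and the 20% spread are assumptions, not values from this thread or from the Sync work mentioned above.

```python
import random


def fuzzed_interval_days(base_days: float = 30.0,
                         fuzz_fraction: float = 0.2,
                         rng=random) -> float:
    """Return the cache lifetime in days, jittered uniformly around base_days.

    With the assumed defaults the result falls between 24 and 36 days, so
    clients that fetched on the same day do not all expire on the same day.
    """
    spread = base_days * fuzz_fraction
    return base_days + rng.uniform(-spread, spread)


if __name__ == "__main__":
    # Each client would draw its own lifetime when it stores the geo result.
    print([round(fuzzed_interval_days(), 1) for _ in range(5)])
```

Because every refresh draws a new lifetime, a cohort that started together spreads out a little more on each cycle instead of returning in lockstep.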

mythmon commented 9 years ago

I guess I'm too late, but I went one step further than @willkg, and plotted the numbers: http://nbviewer.ipython.org/gist/mythmon/10584f8d2c60b05d3627

This probably isn't quite what you had in mind, as it still has a large group of people updating together. I played around with some other probability distributions, but I couldn't get anything that was flat, like we would want.
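
One way to reason about the bunching, sketched in Python under stated assumptions: every client checks once per day, and each client that has not yet enabled geo does so on day d with some probability h(d). This is not taken from the linked notebook or from the patch. The naive schedule h(d) = d/N front-loads the rollout, while h(d) = 1/(N - d + 1) gives each day an equal 1/N share, because the fraction still waiting at the start of day d is (N - d + 1)/N.

```python
def first_enable_distribution(hazards):
    """Share of clients that first enable geo on each day.

    hazards[i] is the assumed probability that a client which has not yet
    enabled geo does so on day i + 1.
    """
    still_waiting = 1.0
    shares = []
    for hazard in hazards:
        shares.append(still_waiting * hazard)
        still_waiting *= 1.0 - hazard
    return shares


N = 30  # assumed rollout length in days

# Naive schedule: probability proportional to the day number.
naive = first_enable_distribution([d / N for d in range(1, N + 1)])

# Flat schedule: probability 1 / (days remaining, including today).
flat = first_enable_distribution([1 / (N - d + 1) for d in range(1, N + 1)])

if __name__ == "__main__":
    print("naive peak share:", round(max(naive), 4))
    print("flat peak share: ", round(max(flat), 4))
```

Under these assumptions the naive schedule peaks within the first week at several times the flat share. The flat schedule only works if clients really do check once per day; clients that check less often would still need something like a per-client random start day.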