The global Facebook outage began around 2021-10-04T15:40 UTC.
In response to this outage, many users sought out the NDT speed test hosted on M-Lab. The resulting surge appears to be the largest flash crowd in our history, peaking at about 1,600 req/sec. During the first hour, this traffic caused our Locate API to reject a significant fraction of requests. Later, smaller events were amplified by the high request rate, producing client-visible failures (and fewer or no user measurements). Ultimately, a default daily App Engine quota was exhausted for about 20 minutes, during which no requests (and therefore no measurements) succeeded.
Four periods of interest:
First hour. Between 15:40 and 16:50 UTC, the Locate API struggled to respond to the largest flash crowd in our history. Logged inbound requests peaked at about 1,600 req/sec. During that first hour, at least 50% of requests received 502 Bad Gateway errors because the backend services could not scale up quickly enough or stabilize under the load (see the retry sketch after this list).
First glitch. Around 18:30 UTC, test rates dropped to near zero for about a minute.
Second glitch. Around 21:45 UTC, test rates dropped to near zero for about 5 minutes.
Quota exceeded. Between 22:45 UTC and 23:05 UTC, the App Engine daily quota was exceeded and 100% of requests failed.
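The client-visible failures in the first hour were mostly transient 502s. As a rough illustration of how a caller could absorb short error bursts like these, here is a minimal sketch of exponential backoff with jitter against the Locate API. This is not M-Lab's actual client code, and the endpoint URL below is an assumption for illustration; the jitter matters because a flash crowd of clients retrying in lockstep would only amplify the overload.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// locateURL is a hypothetical Locate API endpoint used for illustration;
// it is not taken from this report.
const locateURL = "https://locate.measurementlab.net/v2/nearest/ndt/ndt7"

// getWithBackoff retries transient server failures (such as the 502s seen
// during the first hour) with exponential backoff plus random jitter.
func getWithBackoff(url string, maxAttempts int) (*http.Response, error) {
	backoff := 500 * time.Millisecond
	for attempt := 1; ; attempt++ {
		resp, err := http.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // success, or a non-retryable client error
		}
		if err == nil {
			resp.Body.Close() // discard the failed response before retrying
		}
		if attempt == maxAttempts {
			return nil, fmt.Errorf("giving up after %d attempts", maxAttempts)
		}
		// Sleep for the backoff plus up to 50% jitter, then double the backoff.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff/2))))
		backoff *= 2
	}
}

func main() {
	resp, err := getWithBackoff(locateURL, 5)
	if err != nil {
		fmt.Println("locate failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("locate responded:", resp.Status)
}
```

Note that backoff alone would not have helped during the quota-exceeded window, when 100% of requests failed for roughly 20 minutes; no client-side strategy can recover measurements while the service itself is rejecting everything.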