m-lab / locate

M-Lab Locate Service, a load balancer providing consistent “expected measurement quality” using access control
Apache License 2.0

Periodic drops in Locate healthy instance count #149

Open cristinaleonr opened 1 year ago

cristinaleonr commented 1 year ago

On July 20th, we noticed that the Locate's healthy instance count has been showing sharp, periodic drops since about June 6th.

Screenshot 2023-07-21 4 35 50 PM

Screenshot 2023-07-21 4 37 14 PM

The drops affect all experiments equally and are only visible in production (not in sandbox or staging).

Screenshot 2023-07-21 4 41 40 PM

The drops seem to happen on a roughly 3-hour schedule, which lines up with the synchronous clients that target the platform every 3 hours.

Screenshot 2023-07-21 4 45 27 PM

Screenshot 2023-07-21 4 45 38 PM

stephen-soltesz commented 1 year ago

Thinking about this, I realized that in our first investigation we did not look at how the number of App Engine instances responds to these flash crowds. We're clearly hitting the upper limit for instance count (40). I'm again struck by how the total recovery takes about an hour, which is the same as App Engine's longest connection time, and by the fact that the number of reported heartbeat connections doesn't have the same long recovery as health status does. What could cause this mismatch?

count (locate_health_status{experiment=~"$experiment.*"})

Screen Shot 2023-07-31 at 10 42 59 PM

min by (experiment) (locate_health_status{experiment=~"$experiment.*"})

Screen Shot 2023-07-31 at 10 44 39 PM

sum by (experiment) (locate_current_heartbeat_connections{experiment=~"$experiment.*"})

Screen Shot 2023-07-31 at 10 44 28 PM

Zoomed in on one of the spikes, we can see that the healthy count drops before the instance count rises.

Screen Shot 2023-07-31 at 10 48 58 PM

stephen-soltesz commented 1 year ago

I suspect this might just be CPU exhaustion again. The config uses a 2-CPU VM. I recommend estimating the CPU time per request during steady state and seeing whether the request rate during synchronous scheduling exceeds the expected capacity.

Screen Shot 2023-08-01 at 8 46 06 AM
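
For reference, a minimal Go sketch of that back-of-envelope check; both numbers are placeholders, and the per-request CPU cost would need to be measured from steady-state metrics:

package main

import "fmt"

func main() {
	// Illustrative assumptions, not measurements: vCPUs comes from the
	// App Engine config; cpuPerRequest would be estimated as steady-state
	// CPU usage divided by the steady-state request rate.
	const vCPUs = 2.0
	const cpuPerRequest = 0.010 // seconds of CPU per request (hypothetical)

	capacity := vCPUs / cpuPerRequest
	fmt.Printf("max sustainable rate: ~%.0f requests/sec per instance\n", capacity)
}

If the per-instance request rate during the synchronized bursts exceeds that figure, CPU exhaustion alone could explain the drops even before autoscaling catches up.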

stephen-soltesz commented 1 year ago

Roughly 30% of the time (one hour every 3 hours), we see a ~5-10% reduction in reported available servers for up to an hour, with instantaneous availability dropping as low as ~50% over shorter 4-5 minute windows.

Screen Shot 2023-08-01 at 6 17 17 PM

cristinaleonr commented 1 year ago

The CPU utilization shown on App Engine does not seem to go above 75%.

Screenshot 2023-08-02 10 44 50 AM

It does seem to show the periodic scaling events, which would cause a lot of the Heartbeat instances to reconnect to the new Locate instances.

Screenshot 2023-08-02 11 09 52 AM

It also looks like some of the health transmissions from the Heartbeat and the writes to Memorystore from the Locate are taking a bit longer than expected (not sure I see any periodic patterns in the former, though).

Screenshot 2023-08-02 10 48 10 AM

Screenshot 2023-08-02 10 49 51 AM

So, what I think is happening is that high latency for some of these operations is causing the entries to expire, but the connection is not being terminated because the operations are succeeding (no error is returned). This would explain why the number of heartbeat connections recovers quickly, but the number of healthy instances drops and does not recover until the connection is killed and the registration is re-sent.
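
As a rough illustration of that failure mode (a toy sketch, not the actual Locate/Heartbeat code): an entry kept alive by refreshes with a fixed TTL can lapse whenever a single refresh takes longer than the TTL, even though the refresh itself eventually succeeds and returns no error.

package main

import (
	"fmt"
	"time"
)

// refresh simulates a slow health update / Memorystore write: it succeeds,
// just later than expected.
func refresh(latency time.Duration) error {
	time.Sleep(latency)
	return nil
}

func main() {
	const ttl = 2 * time.Second       // hypothetical entry TTL
	const slowWrite = 3 * time.Second // latency larger than the TTL

	expiry := time.Now().Add(ttl)
	if err := refresh(slowWrite); err != nil {
		fmt.Println("refresh failed; the connection would be torn down and re-registered")
		return
	}
	// No error was returned, so the connection stays open, but the entry
	// already expired before the refresh completed.
	if time.Now().After(expiry) {
		fmt.Println("entry expired despite a 'successful' refresh")
	}
}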

stephen-soltesz commented 1 year ago

Thank you for the update! Ooh, okay; so the registration is removed but health updates continue? Is it possible to make health updates without a remaining registration an error case?
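
A hypothetical sketch of what that error case could look like (names and types are illustrative, not the Locate's actual API): reject a health update when no registration entry remains, so the heartbeat connection is closed and the instance re-registers.

package heartbeat

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNotRegistered tells the caller to close the heartbeat connection so the
// instance re-registers before sending further health updates.
var ErrNotRegistered = errors.New("health update without registration")

// tracker is an illustrative stand-in for whatever holds registration and
// health state.
type tracker struct {
	mu            sync.Mutex
	registrations map[string]struct{}
	health        map[string]float64
}

// UpdateHealth rejects updates for instances whose registration has expired
// instead of silently accepting them.
func (t *tracker) UpdateHealth(hostname string, score float64) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.registrations[hostname]; !ok {
		return fmt.Errorf("%w: %s", ErrNotRegistered, hostname)
	}
	t.health[hostname] = score
	return nil
}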

Other questions/thoughts:

stephen-soltesz commented 1 year ago

Also a general thought/observation: the unintended interaction between user requests and platform requests (the latter being vulnerable to the former) seems like an anti-pattern. Ideally, these would be independent of one another. This could be an argument for separating the request handling into different services for the different cases.

cristinaleonr commented 1 year ago

The recovery time has been reduced from ~1 hour to 5 mins.

Screenshot 2023-08-17 9 45 51 AM

sum by(le) (increase(heartbeat_health_transmission_duration_bucket{workload=~"$experiment.*"}[5m]))

I see mostly durations in the 0-2 seconds bucket now (though there are a few in the 30+ one).

Screenshot 2023-08-17 9 46 26 AM

mattmathis commented 1 year ago

Here is a query to identify clients with simple Time-of-Day schedules.
https://console.cloud.google.com/bigquery?sq=754187384106:6748707322b04693a2d74358ea40a7fd

Clients near the top of the list are likely to be causing us harm; clients further down, not so much. The query does not check whether the tests are run in the same minute of each hour. I could add this (and other properties) if that would help.

I excluded from the query a large number (thousands) of Windows clients that run 1 or 2 tests per day against many servers. I am guessing this is a connectivity monitoring tool and that the tests are not synchronized with other clients (I saw this behavior in other beacon data). LMK if there are other scoring properties that might be helpful.

See the last subquery for documentation on the columns.

cristinaleonr commented 1 year ago

The following query joins 3-hour periodic client data from measurement-lab.ndt.ndt7 with user agent data from the Locate for 2023-08-28.

WITH periodic AS (
  SELECT raw.ClientIP, COUNT(*) AS testCount, TIMESTAMP_TRUNC(a.TestTime, HOUR) AS roundTime
  FROM `measurement-lab.ndt.ndt7` 
  WHERE date = "2023-08-28"
  AND (TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 01:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 04:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 07:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 10:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 13:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 16:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 19:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 22:00:00+00", SECOND) BETWEEN 1 AND 600
  )
  GROUP BY roundTime, ClientIP
  ORDER BY testCount DESC
),

windows AS (
  SELECT ClientIP, COUNT(*) AS windowCount, SUM(testCount) AS testSum, ARRAY_AGG(STRUCT(roundTime, testCount)) AS periods
  FROM periodic
  GROUP BY ClientIP
  ORDER BY windowCount DESC
),

locate AS (
  SELECT httpRequest.remoteIp, ARRAY_AGG(DISTINCT httpRequest.userAgent IGNORE NULLS) AS userAgent
  FROM `mlab-ns.locate_exports.appengine_googleapis_com_nginx_request_20230828`
  GROUP BY httpRequest.remoteIp
)

SELECT ClientIP, testSum, userAgent, periods
FROM windows INNER JOIN locate ON (ClientIP = locate.remoteIp)
WHERE windowCount = 8
ORDER BY testSum DESC

cristinaleonr commented 1 year ago

The integration responsible for the 3-hour scheduled requests has been identified as the one using the "ndt7-client-go-cmd/0.5.0 ndt7-client-go/0.5.0" user agent string (graph).

Screenshot 2023-09-07 5 31 48 PM

The following queries show that the requests are coming from 7,876 different IPs, with a maximum of only 28 requests per IP per minute. So, it's not possible to just limit a specific set of IPs using this user agent; a more generic rate-limiting solution will need to be implemented (e.g., setting a limit on the number of requests/minute for any client; see the sketch after the queries below).

SELECT httpRequest.remoteIp AS metric, COUNT(*) AS requests
FROM `mlab-ns.locate_exports.appengine_googleapis_com_nginx_request_20230830`
WHERE httpRequest.userAgent = "ndt7-client-go-cmd/0.5.0 ndt7-client-go/0.5.0"
GROUP BY httpRequest.remoteIp
ORDER BY requests DESC

SELECT TIMESTAMP_TRUNC(timestamp, MINUTE) AS time, httpRequest.remoteIp AS metric, COUNT(*) AS requests
FROM `mlab-ns.locate_exports.appengine_googleapis_com_nginx_request_20230830`
WHERE httpRequest.userAgent = "ndt7-client-go-cmd/0.5.0 ndt7-client-go/0.5.0"
GROUP BY time, httpRequest.remoteIp
ORDER BY requests DESC
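
A minimal sketch of that generic approach, assuming a per-client token bucket keyed by IP via golang.org/x/time/rate; the limit values are placeholders, and this is not the change that was eventually shipped:

package ratelimit

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// ipLimiter keeps one token-bucket limiter per client IP.
type ipLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	perMin   float64 // allowed requests per minute per client (placeholder)
	burst    int
}

func newIPLimiter(perMin float64, burst int) *ipLimiter {
	return &ipLimiter{limiters: map[string]*rate.Limiter{}, perMin: perMin, burst: burst}
}

func (l *ipLimiter) get(ip string) *rate.Limiter {
	l.mu.Lock()
	defer l.mu.Unlock()
	lim, ok := l.limiters[ip]
	if !ok {
		lim = rate.NewLimiter(rate.Limit(l.perMin/60.0), l.burst)
		l.limiters[ip] = lim
	}
	return lim
}

// Middleware returns 429 for clients that exceed the per-minute limit.
// In practice the key should be the real client IP (e.g. from X-Forwarded-For),
// not RemoteAddr, which includes the port.
func (l *ipLimiter) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.get(r.RemoteAddr).Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

Whether a per-IP limit helps here depends on the chosen threshold, since each individual client sends at most ~28 requests/minute.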

cristinaleonr commented 12 months ago

Number of tests sent per day by the integration during the 10-minute slots of periodic requests every 3 hours: 76,497 (~2% of total tests).

SELECT COUNT(*) AS numTests
  FROM `mlab-ns.locate_exports.appengine_googleapis_com_nginx_request_20230927`
  WHERE httpRequest.userAgent = "ndt7-client-go-cmd/0.5.0 ndt7-client-go/0.5.0"
  AND (TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 01:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 04:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 07:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 10:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 13:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 16:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 19:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 22:00:00+00", SECOND) BETWEEN 1 AND 600
  )

cristinaleonr commented 12 months ago

In total, tests with the "ndt7-client-go-cmd/0.5.0 ndt7-client-go/0.5.0" user agent string represent about 80,000 tests per day.

SELECT COUNT(*) AS numTests
  FROM `mlab-ns.locate_exports.appengine_googleapis_com_nginx_request_20230927`
  WHERE httpRequest.userAgent = "ndt7-client-go-cmd/0.5.0 ndt7-client-go/0.5.0"

mattmathis commented 12 months ago

I wrote a query to detect all ndt7 clients that don't seem to be randomizing their run times (e.g., a large fraction of their tests fall in roughly the same minute): BigQuery. The client amd64,ndt7-client-go,0.4.1,ndt7-client-go-cmd,windows,0.4.1 is not just hitting hard, it is also hitting 9-11 different sites. There are lots of other clients that are also using poorly randomized timers.

mattmathis commented 12 months ago

Here is more information on the BigQuery

It is intended to have some additional (manually curated) filtering applied and then be used to generate a pattern matcher to block requests early in the process. The output columns are:

cristinaleonr commented 12 months ago

Change https://github.com/m-lab/locate/pull/162 was released yesterday at EOD. We did not see any periodic drops in the Locate's healthy instance count today.

Screenshot 2023-10-03 3 49 55 PM