hypothesis / viahtml

HTML proxying component for Via
BSD 2-Clause "Simplified" License

Service not scaling up workers to meet demand #696

Closed: robertknight closed this issue 1 month ago

robertknight commented 1 month ago

We're seeing an issue in the lms-viahtml environment where repeatedly reloading LMS assignments can run into a situation where requests start to hang and eventually time out after 60 seconds. If these requests are for resources that block rendering of the page (e.g. non-async scripts, CSS, fonts), this can cause the page content not to be rendered until the 60 second timeout elapses.

From the logs, it appears the issue is that workers are not being scaled up to meet demand.

web (stderr)         | 127.0.0.1 - - [2024-07-18 10:13:21] "POST /proxy/resource/postreq?url=https%3A%2F%2Fbam.nr-data.net%2Fjserrors%2F1%2Ffa4e8c4da8%3Fa%3D76030703%26v%3D1.262.0%26to%3DM1VTNRNXDUtZAkVRCgofdxQPVRdRVw8eVAgXHkcIBEEQFlkRWBYGBV5HABIYE1lfBEICNQVXVBIgZipuUQRGS0sUQl4ZGA%253D%253D%26rst%3D123357188%26ck%3D0%26s%3D098dbaa5812dcd37%26ref%3Dhttps%3A%2F%2Flms.hypothes.is%2Fapi%2Fcanvas%2Fpages%2Fproxy%26ptid%3Da90c938bd9b893f3&closest=now&matchType=exact HTTP/1.1" 200 2962 0.127615
web (stderr)         | [busyness] 1s average busyness is at 83% but we already started maximum number of workers available with current limits (17)
web (stderr)         | [busyness] 1s average busyness is at 82% but we already started maximum number of workers available with current limits (17)
web (stderr)         | [busyness] 1s average busyness is at 82% but we already started maximum number of workers available with current limits (17)

Here the 17 refers to the number of workers. See https://github.com/unbit/uwsgi/blob/52d858af7148a7084628d1241a7597441bb84f95/plugins/cheaper_busyness/cheaper_busyness.c#L249.
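
For reference, this is a minimal sketch of the uWSGI cheaper/busyness options involved here. The option names are real uWSGI settings, but the values are illustrative and not necessarily what t3_small.ini sets:

[uwsgi]
# Upper bound on the number of workers uWSGI will ever run.
workers = 32
# Minimum number of workers kept alive when the service is idle.
cheaper = 2
# Workers started at boot.
cheaper-initial = 2
# Scale workers with the busyness algorithm (the cheaper_busyness plugin linked above).
cheaper-algo = busyness
# Length of a busyness-check cycle in seconds ("1s average busyness" in the logs).
cheaper-overload = 1
# Spawn more workers when average busyness rises above the max,
# stop workers when it stays below the min.
cheaper-busyness-max = 50
cheaper-busyness-min = 25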

Slack thread: https://hypothes-is.slack.com/archives/C2C2U40LW/p1721245479213219

robertknight commented 1 month ago

The configured limits in https://github.com/hypothesis/viahtml/blob/main/conf/uwsgi/t3_small.ini are based on using a t3.small instance, but we are currently using t3.medium instances in production.

robertknight commented 1 month ago

The current conf/uwsgi/t3_small.ini configuration sets a hard limit (cheaper-rss-limit-hard) of 1536MB on the combined RSS usage of all workers.

There are several problems with this:

  1. We moved from t3.small to t3.medium (4GB) since the config was created
  2. Given the actual RSS usage per process, which is ~90MB, this equates to only ~16 workers rather than the configured limit of 32 (see the config sketch after the smaps output below)
  3. RSS is not a great metric for per-worker memory usage, because a significant fraction of the memory is shared with other processes. From smaps_rollup for one of the worker processes in prod:
sh-4.2$ sudo cat /proc/29002/smaps_rollup
55d96f88f000-ffffffffff601000 ---p 00000000 00:00 0                      [rollup]
Rss:               92944 kB
Pss:               50012 kB
Shared_Clean:      10580 kB
Shared_Dirty:      35736 kB
Private_Clean:         0 kB
Private_Dirty:     46628 kB
Referenced:        67828 kB
Anonymous:         81656 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
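
To make points 2 and 3 concrete, this is a sketch of how the memory cap interacts with the worker ceiling. The values are illustrative rather than the actual contents of t3_small.ini; the cheaper-rss-limit-* options take bytes, and I believe they rely on uWSGI's memory reporting being enabled:

[uwsgi]
# Per-worker memory accounting (assumed to be needed for the RSS limits below).
memory-report = true
# Nominal worker ceiling from the config.
workers = 32
# Stop spawning new workers once the combined RSS of all workers exceeds
# this many bytes (illustrative value: 1280MB).
cheaper-rss-limit-soft = 1342177280
# Hard cap: a worker is forcibly stopped once combined RSS exceeds this.
# 1536MB = 1610612736 bytes; at roughly 90MB RSS per worker that allows
# only ~16-17 workers, well below workers = 32.
cheaper-rss-limit-hard = 1610612736
# Note: RSS double-counts pages shared between workers; PSS (~50MB per
# worker here) would suggest a considerably higher safe worker count.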

robertknight commented 1 month ago

Solved in https://github.com/hypothesis/viahtml/pull/697. We also increased the instance size from t3.medium to t3.large, since with the changes in #697 we were just about hitting the memory ceiling of the t3.medium instance.