Closed robertknight closed 3 months ago
The configured limits in https://github.com/hypothesis/viahtml/blob/main/conf/uwsgi/t3_small.ini are based on using a t3.small instance, but we are currently using t3.medium instances in production.
The current conf/uwsgi/t3_small.ini configuration sets a hard limit (cheaper-rss-limit-hard) of 1536MB of total RSS usage across all workers.
There are several problems with this:
smaps_rollup for one of the worker processes in prod:
sh-4.2$ sudo cat /proc/29002/smaps_rollup
55d96f88f000-ffffffffff601000 ---p 00000000 00:00 0 [rollup]
Rss: 92944 kB
Pss: 50012 kB
Shared_Clean: 10580 kB
Shared_Dirty: 35736 kB
Private_Clean: 0 kB
Private_Dirty: 46628 kB
Referenced: 67828 kB
Anonymous: 81656 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
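To make the numbers above easier to reason about, here is a rough sketch that parses the smaps_rollup fields and estimates how many such workers would fit under the 1536MB limit, measured by RSS and by PSS. It assumes every worker looks like PID 29002 and that "1536MB" means 1536 * 1024 kB; both are simplifications, and `parse_smaps_rollup` and `workers_under_limit` are hypothetical helper names, not part of uwsgi or viahtml.

```python
# Sketch only: the sample text is copied from the smaps_rollup output above.
SMAPS_ROLLUP = """\
55d96f88f000-ffffffffff601000 ---p 00000000 00:00 0 [rollup]
Rss: 92944 kB
Pss: 50012 kB
Shared_Clean: 10580 kB
Shared_Dirty: 35736 kB
Private_Clean: 0 kB
Private_Dirty: 46628 kB
"""


def parse_smaps_rollup(text):
    """Return {field: kB} for each 'Name: <n> kB' line; other lines are skipped."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def workers_under_limit(per_worker_kb, limit_mb):
    """Whole number of workers that fit under limit_mb (assumed 1 MB = 1024 kB)."""
    return (limit_mb * 1024) // per_worker_kb


fields = parse_smaps_rollup(SMAPS_ROLLUP)
shared_kb = fields["Shared_Clean"] + fields["Shared_Dirty"]
private_kb = fields["Private_Clean"] + fields["Private_Dirty"]

# RSS counts shared pages in full for every process that maps them,
# so summing RSS across forked workers counts shared memory repeatedly.
print("shared kB:", shared_kb, "private kB:", private_kb)
print("workers fitting by RSS:", workers_under_limit(fields["Rss"], 1536))
print("workers fitting by PSS:", workers_under_limit(fields["Pss"], 1536))
```

Under these assumptions roughly half of this worker's RSS is shared pages, and a limit based on summed RSS is reached at about 16 workers, whereas the same limit measured by PSS would allow about 31.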
Solved in https://github.com/hypothesis/viahtml/pull/697. We also increased the instance size from t3.medium to t3.large. With the changes in #697 we just about hit the memory ceiling of the t3.medium instance.
We're seeing an issue in the lms-viahtml environment where repeatedly reloading LMS assignments can lead to requests hanging and eventually timing out after 60 seconds. If these requests are for resources that block rendering of the page (e.g. non-async scripts, CSS, fonts), the page content is not rendered until the 60-second timeout elapses.
From the logs, it appears the issue is that workers are not being scaled up to meet demand.
Here the 17 refers to the number of workers. See https://github.com/unbit/uwsgi/blob/52d858af7148a7084628d1241a7597441bb84f95/plugins/cheaper_busyness/cheaper_busyness.c#L249.
Slack thread: https://hypothes-is.slack.com/archives/C2C2U40LW/p1721245479213219