openwallet-foundation / acapy

ACA-Py is a foundation for building decentralized identity applications and services running in non-mobile environments.
https://aca-py.org
Apache License 2.0

Excessive throttling when running ACA-py agents in containerized cloud environments #2157

Closed jleach closed 10 months ago

jleach commented 1 year ago

TL;DR

I've noticed that a number of our agents, mediators or otherwise, have a very high amount of throttling when run in a containerized cloud environment. This is OpenShift in our case, but it more broadly impacts any Kubernetes platform. In this context, throttling means the process did not get all the CPU time it needed to complete its work.

Scheduling on Linux and Kubernetes

A process cannot run on a fractional part of a CPU core; it is either on a core or it is not. Modern Linux systems, and by extension Kubernetes, work around this by converting your CPU allocation into time slices. When a container starts it is given an endless series of 100 ms blocks of time (scheduling periods). Your container resource limits, like limit.cpu, are converted into a fractional amount of each 100 ms block. For example, a limit of 100m is 1/10 of a vCPU, which translates to 10 ms of run time. Out of the 100 ms block of time, the container has 10 ms to be running on a physical CPU core. The other 90 ms is waiting time, where it is unscheduled and waiting for the next 100 ms block of time.

Note: This applies to any container runtime. It's all managed by cgroups in Linux; this is not a Kubernetes-specific feature. Containers are just three Linux features working together (cgroups, namespaces, and UFS).
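
To make the arithmetic above concrete, here is a minimal Python sketch of the quota calculation, assuming the default 100 ms scheduling period (the function and names are illustrative only, not part of ACA-Py or Kubernetes):

CFS_PERIOD_MS = 100  # default CFS scheduling period (cfs_period_us = 100000)

def quota_ms(limit_cpu_millicores: int) -> float:
    """Run time a container gets per 100 ms period for a given limit.cpu."""
    cores = limit_cpu_millicores / 1000          # e.g. 100m -> 0.1 of a vCPU
    return cores * CFS_PERIOD_MS                 # e.g. 0.1 * 100 ms = 10 ms

print(quota_ms(100))   # 10.0 ms of run time, 90 ms unscheduled
print(quota_ms(800))   # 80.0 ms
print(quota_ms(1000))  # 100.0 ms -> can run for the whole period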

Throttling

I noted above that throttling means the process needed more time to do its work but was un-scheduled by the Linux scheduler. This situation is tracked by Linux in the following file:

1004450000@aries-mediator-agent-35-77gk5:~$ pwd
/home/indy
1004450000@aries-mediator-agent-35-77gk5:~$ cat /sys/fs/cgroup/cpu/cpu.stat
nr_periods 19040
nr_throttled 4905
throttled_time 100800986818
1004450000@aries-mediator-agent-35-77gk5:~$

Where nr_periods is the total number of 100 ms blocks of time the container was given since it started up, and nr_throttled is the number of those blocks in which the container / process did not get enough time to do its work. To figure out the percentage: 4905 / 19040 * 100 ≈ 26%.
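
As a hedged sketch, the same calculation can be done programmatically by parsing the cgroup v1 file shown above (on cgroup v2 the equivalent counters live in /sys/fs/cgroup/cpu.stat); the function name is just illustrative:

def throttle_percent(path: str = "/sys/fs/cgroup/cpu/cpu.stat") -> float:
    """Return the percentage of scheduling periods in which the cgroup was throttled."""
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    if stats.get("nr_periods", 0) == 0:
        return 0.0
    return stats["nr_throttled"] / stats["nr_periods"] * 100

# With the numbers above: 4905 / 19040 * 100 ≈ 26%
print(f"{throttle_percent():.1f}% of periods were throttled")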

I think some throttling is acceptable, but I regularly see agents run at 50% throttling unless limit.cpu is set to at least 800m (8/10 of a vCPU), and I can get it below 1% if I use 1000m (1 vCPU). Setting the limit to 1 vCPU means the container can run 100% of the time.

Threads

The problem of throttling can be made worse by threads. In the example above, 10 ms of run time might be fine for 1 thread. But if you have 12 threads, each one is in a death match for CPU time and may only get 1-2 ms to do its work; maybe not enough time for a heavy cryptographic workload. I believe ACA-py uses a thread pool whenever run_in_executor is called.
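
As a hedged illustration of that pattern (not ACA-Py's actual code), the sketch below fans CPU-heavy work out to a thread pool via run_in_executor; heavy_crypto is a stand-in for real signing or verification work, and all of the worker threads share the container's CPU quota:

import asyncio
import concurrent.futures
import hashlib

def heavy_crypto(data: bytes) -> str:
    # Stand-in for a CPU-bound cryptographic operation.
    for _ in range(100_000):
        data = hashlib.sha256(data).digest()
    return data.hex()

async def main() -> None:
    loop = asyncio.get_running_loop()
    # Every worker thread below competes for the same per-period quota,
    # e.g. the 10 ms per 100 ms block described earlier.
    with concurrent.futures.ThreadPoolExecutor(max_workers=12) as pool:
        tasks = [loop.run_in_executor(pool, heavy_crypto, b"msg") for _ in range(12)]
        await asyncio.gather(*tasks)

asyncio.run(main())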

Impact

Our specific case was running a mediator on ACA-py 0.8.x with what had been enough CPU request / limit to make things work fine. This update started causing quite a few timeouts in our AFJ clients. When I did a deep dive to triage the issue I noticed the throttling metric. I increased the CPU until I got the percentage below 25 and everything is working as expected (so far). What we saw was that with a few hundred millicores the throttling was about 50%, causing mediation connection (probably WebSocket) timeouts.

Remedy

I think you need to notably over-provision a container's CPU in a containerized cloud environment. It would be worth further evaluation to find out the impacts of running at high levels of CPU throttling. At a minimum, I think it's worth noting in the docs.

swcurran commented 1 year ago

Thanks, @jleach — great write up. So, in your specific case, what parameters did you change — from what to what?

When you say “until I got the percentage below 25” — is that the equivalent of CPU usage — e.g. that you got the percent CPU to below 25? Or for a simple person like me to understand — like calling up the Mac Activity Monitor or Windows Task Manager and looking at the %CPU column? You altered the “limit.cpu” until you got the % CPU to where you wanted it.

jleach commented 1 year ago

@swcurran I had to greatly over-provision the pod. The percentage (25) was the throttling ratio: the fraction of scheduling periods in which it was hungry for more CPU time. When I gave the pod < 800m (8/10 of a CPU core) I would see that it was throttled for about 45% of its run time. We would see erratic behaviour like WebSocket timeouts when AFJ clients were not connecting. When I gave it 800m the throttling percentage dropped to 18% and the problem went away: overnight regression tests passed and team testing all worked. If I give it 1 full CPU core it drops below 1% throttling.

A good first step to resolve this would be, if possible, to make ACA-py log the thread pool count (number of threads or tasks, or whatever the Python equivalent is) when in debug mode. Then we can see how the thread count varies when throttling is high.
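
A minimal sketch of that suggestion (this is not an existing ACA-Py feature; the coroutine name and interval are made up) would periodically emit the thread count at debug level so it can be correlated with the throttling ratio:

import asyncio
import logging
import threading

LOGGER = logging.getLogger(__name__)

async def log_thread_count(interval: float = 30.0) -> None:
    """Periodically log how many threads the process is running."""
    while True:
        if LOGGER.isEnabledFor(logging.DEBUG):
            LOGGER.debug("active threads: %d", threading.active_count())
        await asyncio.sleep(interval)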

KimEbert42 commented 1 year ago

I believe this is more an impact of the Python Global Interpreter Lock, or GIL. Since the Python process is limited to running basically a single thread at a time, it is limited to a single core. If that core has a CPU limit on it, then it simply cannot go faster and will spend a good amount of time in the throttled state.

NodeJS has similar limitations, as it also has a single-threaded CPU design.

The majority, or at least half, of ACA-py's time is spent inside C APIs, which by default hold the GIL. See https://pythonspeed.com/articles/python-gil/
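
A minimal demonstration of that GIL effect (illustrative only; it uses pure-Python work, whereas C extensions may or may not release the GIL): two CPU-bound threads take roughly as long as doing the work twice in one thread, because only one of them can hold the GIL at a time.

import threading
import time

def busy(n: int = 20_000_000) -> None:
    # Pure-Python CPU-bound loop; holds the GIL while it runs.
    total = 0
    for i in range(n):
        total += i

start = time.perf_counter()
threads = [threading.Thread(target=busy) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"two CPU-bound threads took {time.perf_counter() - start:.1f}s (close to serial time, not half)")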

Any service that is blocked may not be able to respond to websocket requests etc, resulting in timeouts.

swcurran commented 1 year ago

Sounds like we need to do some profiling of ACA-Py and AFJ on Kubernetes to understand what is needed when provisioning and scaling environments. Based on the mediator example, we know we can reduce throttling significantly by increasing the provisioning, and in turn we reduce the timeouts we were experiencing.

Top priority is figuring out how to investigate these issues. What tests can we run to understand what is happening and what we can do to influence the behaviours?

And of course, being able to scale mediators is a crucial element of this. How do we scale with web sockets?

jleach commented 1 year ago

@reflectivedevelopment I say threading because it's consistent with the behaviour I see: an ACA-py agent will have very low CPU utilization, typically < 1%, but a very high throttling ratio, typically > 45%. Usually this pattern indicates a problem with threads. I also see this often in Postgres, but in that case I think it's because it has several processes inside the "container"; each process has the same effect as a thread: they fight for CPU time, and if an unimportant process consumes too much, the important ones are starved.

jleach commented 1 year ago
  • Can we monitor timeouts?

These only happen with Websockets when using ACA-py as a mediator. I do have a test script that can easily reproduce and monitor.

  • Can we correlate that with throttling?

I've done this and can reproduce it easily.

  • Can we get some metrics to set provisioning, or is that use case specific?

More details needed.

  • Is it necessary/sufficient to differentiate by mediators, issuers and verifiers?

I can't empirically tell if it's a problem for issuers and verifiers. They seem to work fine in this situation and have for a while now. Maybe they perform slower than they should. It's only the mediators that have serious problems.

  • Can/Should we use throttling as a scaling indicator?
  • Do ACA-Py and AFJ behave differently?

I had a look at the AFJ agent used for our showcase. It does present a similar problem, but it's far less prominent. More research required. One example is just a data point, not a trend.

Top priority is figuring out how to investigate these issues. What tests can we run to understand what is happening and what we can do to influence the behaviours?

I think we need to understand what ACA-py is doing to have low CPU utilization but high throttling. Too many processes in the container? Threading? Is Indy doing something?

And of course, being able to scale mediators is a crucial element of this. How do we scale with web sockets?

I feel like in its current state a container can't deal with even a few WS connections.

KimEbert42 commented 1 year ago

This doesn't sound quite right to me. Is it perhaps related to a bug in the CFS scheduler? See

https://medium.com/omio-engineering/cpu-limits-and-aggressive-throttling-in-kubernetes-c5b20bd8a718 https://github.com/kubernetes/kubernetes/issues/51135#issuecomment-373454012 https://github.com/kubernetes/kubernetes/issues/67577 https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1

jleach commented 1 year ago

@reflectivedevelopment I know of this issue and checked with Red Hat to see if the fix was back-ported to our kernel version; it was. It's a good point though, it would be worth deploying the mediator to different k8s infrastructure to see how it performs there.

As a side note, we run ~10,000 containers on our clusters. If this was a kernel or infra bug I'd expect it to present in a large number of containers.

@swcurran We should rule out infra issues by deploying to a different managed k8s environment.

jleach commented 1 year ago

I added run-2157.sh to test for the Websocket timeout issue. Just run it (in the accompanying .devContainer) and in the background tweak the amount of CPU a pod gets. You'll see the failure rate change accordingly.

swcurran commented 1 year ago

I think that it would be tricky to test like that. We would need to have something reproduce the problem on our k8s implementation, and then run the identical test on another k8s. Do we have such a portable test?

It was a bit hard to follow all of the links to determine what versions of the Linux kernel have the bug and the fix, but from what I can see, it is fixed in 4.18 and later, which was released circa 2020. So I would think the easiest check is to see if our OpenShift deployment is running with a fixed kernel. I assume that the fix is on the infra, not in the containers (so many kernels 😄), so that should be sort of easy to check. Asking Red Hat is probably the right way to go.

KimEbert42 commented 1 year ago

@reflectivedevelopment I say threading because it's consistent with the behaviour I see: an ACA-py agent will have very low CPU utilization, typically < 1%, but a very high throttling ratio, typically > 45%. Usually this pattern indicates a problem with threads. I also see this often in Postgres, but in that case I think it's because it has several processes inside the "container"; each process has the same effect as a thread: they fight for CPU time, and if an unimportant process consumes too much, the important ones are starved.

If we are waiting on other threads, we should still see the CPU time at 45%, just spent in unimportant threads. The threads are either blocked, waiting for resources from Postgres, disk, etc., or are being starved by the CPU scheduler. If we are seeing 100% CPU usage with no quota, but 1% CPU usage with a 90% quota, I'm inclined to think the scheduler is not scheduling the process.

jleach commented 1 year ago

@reflectivedevelopment 🤔 I think waiting on the db would show a process being in the WAIT state; it wouldn't count as a throttled block. The only way I can account for the behaviour I see is with threads (or many processes in the container).

The way I understand the scheduler to work is like this: if a container has a request.cpu of 50m and a limit.cpu of 100m, it gets 10 ms to run per 100 ms period. If you have 10 threads they can all run at the same time if cores are free, but that means you burn through your 10 ms faster; each thread gets 1 ms to run. Because the process consumed its 10 ms of quota in 1 ms of wall-clock time, it gets throttled: nr_throttled is incremented by 1 and the kernel un-schedules the process. Also, because CPU usage is reported against request.cpu, it shows as low CPU usage because 1 ms is < 5 ms (50m).

This meets all the criteria: it's throttled because it used up its quota in the first 1 ms (much faster than the 10 ms the limit allows for), and it has very low CPU utilization because 1 ms is much lower than 5 ms.
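
A worked version of that quota-burn arithmetic, using the same illustrative numbers (this reproduces the reasoning in the comment above, not a measured result):

period_ms = 100                  # CFS scheduling period
quota_per_period_ms = 10         # limit.cpu = 100m -> 10 ms of run time per period
parallel_threads = 10            # runnable threads landing on free cores at once

wall_clock_before_throttle = quota_per_period_ms / parallel_threads    # ~1 ms of wall-clock time
per_thread_runtime = quota_per_period_ms / parallel_threads            # ~1 ms of CPU time each
throttled_remainder = period_ms - wall_clock_before_throttle           # ~99 ms spent unscheduled

print(wall_clock_before_throttle, per_thread_runtime, throttled_remainder)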

KimEbert42 commented 1 year ago

@jleach This would be interesting if Aca-Py used more than one CPU core on a regular basis. It appears from my testing that Aca-Py almost always uses one core and has very little if any parallel processing, and is mostly serialized in its CPU usage.

However, if the statement is true that Aca-Py requires more than one CPU to run, it would be necessary to include additional CPU time to keep it operating correctly. If you don't give a car sufficient gas, it won't propel forward; it'll stall. I imagine some programs behave in a similar manner.

Still, this doesn't explain the 1% CPU utilization. If the scheduler is properly scheduling the threads, then each thread should register CPU time, and we should see that in the CPU utilization. If I have a multi-threaded application using two CPUs, it will register as using 200% of a CPU. So even if Aca-Py has threads running on multiple cores, it should show up as part of the CPU usage.
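
One way to check that directly (a hedged sketch; it requires the third-party psutil package, and the PID is a placeholder for the ACA-Py process) is to sample the process's CPU percentage and see whether it ever climbs above 100%, which would mean threads genuinely running on more than one core at once:

import psutil  # third-party: pip install psutil

proc = psutil.Process(12345)                 # placeholder PID of the ACA-Py process
for _ in range(10):
    usage = proc.cpu_percent(interval=1.0)   # e.g. 200.0 would mean two full cores
    print(f"{usage:.0f}% CPU")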

KimEbert42 commented 1 year ago

@jleach for the websocket disconnects, we may want to adjust the ping and timeout settings.

My recommendation is based on research into the QUIC protocol, used for HTTP/3. QUIC was designed by a Google engineer and makes up half of Google Chrome's traffic. The reason for the selection is to base decisions upon prior art that has been successfully deployed.

RFC 9308: Applicability of the QUIC Transport Protocol.

In RFC 9308, it indicates the following:

"According to a 2010 study ([Hatonen10]), UDP applications can assume that any NAT binding or other state entry can expire after just thirty seconds of inactivity. Section 3.5of [RFC8085] further discusses keep-alive intervals for UDP: it requires that there is a minimum value of 15 seconds, but recommends larger values, or that keep-alive is omitted entirely."

And

"Sending PING frames more frequently than every 30 seconds over long idle periods may result in excessive unproductive traffic in some situations and unacceptable power usage for power-constrained (mobile) devices. Additionally, timeouts shorter than 30 seconds can make it harder to handle transient network interruptions, such as Virtual Machine (VM) migration or coverage loss during mobility. See [RFC8085], especially Section 3.5."

I would therefore recommend setting the keep-alive to 15 seconds, the minimum defined in RFC8085, and setting the timeout to 60 seconds to allow connections to survive coverage loss during mobility.

The current Aca-Py Default timeout is too short, and may result in unnecessary mobile disconnects.

The current Aca-Py Default keepalive is too short and may result in excessive network traffic.

I believe these can be tuned with --ws-heartbeat-interval (3 seconds default) and --ws-timeout-interval (15 seconds default).

WadeBarnes commented 1 year ago

@jleach, We can easily try out the settings by adding ACAPY_WS_HEARTBEAT_INTERVAL=15 and ACAPY_WS_TIMEOUT_INTERVAL=60 to the environment variables of the agent's DC.

cvarjao commented 1 year ago

I've seen this error pop back up again during an onboarding session last Friday (March 17, 2:30 - 3:00 PM PT). Fresh install, first wallet initialization. The workaround was to delete the app and try again. The 2nd time it worked with no problem.

jleach commented 1 year ago

@cvarjao We only rolled out the above changes to test. Also, in prod we have 800m/800m, which drops throttling to about 18%. We probably need to go to 100m/1,000m (assuming the request can be low) to get it below 1%.

WadeBarnes commented 1 year ago

@jleach, I made some of those updates to prod; being tracked here: https://github.com/bcgov/bc-wallet-mobile/issues/990#issuecomment-1478532913

swcurran commented 10 months ago

We think this issue is largely understood and resolved. Closing -- please open other issues with new details.