Gal-Lahat opened 3 months ago:
Here are some high-frequency syscalls observed on the host (running an idle Express.js app with 0 requests; runsc at about 30% CPU):
• sys_enter_write: 6 million calls, which suggests a lot of data writing operations are happening.
• sys_enter_futex: 30,000 calls, indicating heavy use of synchronization primitives like mutexes.
• sys_enter_nanosleep: 4,000 calls, which could imply the program is frequently sleeping for short periods.
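For reference, per-syscall counts like these can be collected on the host with perf's syscall tracepoints; a rough sketch (system-wide, over a 10-second window; exact event names may vary by kernel):

```sh
# Count entries of the three syscalls above across all CPUs for 10 seconds.
sudo perf stat -a \
  -e syscalls:sys_enter_write \
  -e syscalls:sys_enter_futex \
  -e syscalls:sys_enter_nanosleep \
  -- sleep 10
```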
Could you share a reproducer workload (like a Dockerfile or something)? And what environment are you using? What CPU? What Linux version? What runsc platform? (If you are not explicitly setting the --platform flag, then you must be using the Systrap platform.)
Most of my tests are in a Docker Compose environment running a service built from a Dockerfile. The runsc runtime is set as the default on the Docker daemon, and all of this runs on a VPS hosted on Contabo. I am running `docker-compose up --build` as a simple way to execute it.
I’m using Ubuntu 20.04. Here’s a summary of the CPU information from the VPS:
• CPU MHz: 2496.248
• Hypervisor vendor: KVM
• Virtualization type: full
• Caches: L1 - 256 KiB, L2 - 2 MiB, L3 - 16 MiB
I haven’t explicitly set the --platform flag, so I assume it’s using the systrap platform. Below is an example of one of the services defined in my Docker Compose setup:
```yaml
2pls5ib68:
  build:
    context: ./2pls5ib68
    dockerfile: Dockerfile
  ports:
    - 3011:80
  networks:
    - 2pls5ib68
  restart: always
  logging:
    driver: local
    options:
      max-size: 2m
      max-file: "3"
  deploy:
    resources:
      limits:
        memory: 6G
  volumes:
    - /loop-devices-mount/2pls5ib68:/app
```
If you need anything else, like more configuration details or further clarification, please let me know.
I actually experimented with the old legacy platform ptrace, and it seemed to resolve some of the performance issues I was facing. Specifically, the idle CPU usage dropped significantly, from around 30% to 0.5%. This is a noticeable improvement, although the overall performance isn’t yet optimal. I still need to conduct more tests under heavy CPU workloads to determine whether the improvement is limited to idle performance or if it positively affects performance across the board. Based on this, it seems like there might be an issue with the new systrap platform that needs further investigation.
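For anyone else who wants to try the same comparison: the platform can be switched by registering a second Docker runtime rather than changing the default, roughly like this (a sketch assuming runsc is already installed; the runtime name is just illustrative):

```sh
# Register an additional Docker runtime that forces the ptrace platform.
# runsc install edits /etc/docker/daemon.json; restart Docker to pick it up.
sudo runsc install --runtime runsc-ptrace -- --platform=ptrace
sudo systemctl restart docker

# Run a container with it (or set `runtime: runsc-ptrace` on a compose service).
docker run --rm --runtime=runsc-ptrace alpine true
```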
@avagin @konstantin-s-bogom for systrap. Yeah, if switching to ptrace improves things, then it's likely an issue with systrap.
What is the application doing, though? What CPU-intensive workload are you using, so we can reproduce it?
+1 to a reproducer workload; multi-core applications use multiple cores in different ways and Systrap tries to do some heuristics to work well with most of them. So being able to reproduce what this specific application is doing is necessary in order to understand this problem.
I'd also note that Contabo is notorious for highly oversubscribing its machines, and having unreliable and inconsistent performance over time. I've experienced this first-hand; with disk I/O bandwidth I'd get 10x performance difference on some days vs others. You can look up reviews for Contabo online and that's usually the first thing they'll mention. The other thing they'll mention is the low price, not coincidentally.
So I suggest reproducing this on your local machine or on some other dedicated hardware. I'm not putting the blame on Contabo; it's quite likely that there is something suboptimal about the way Systrap uses multiple cores for this particular workload, as it has had this type of problem in the past (see issue #9119). All I'm saying is that Contabo is not a reliable environment to get performance measurements from.
Another thing you may want to try is to build runsc after changing the following line:

to:

```go
neverEnableFastPath = true
```

From the way the variable is named, this sounds like it would hurt performance, and in most cases it should. Setting this to `true` removes the "fast path" feature of Systrap, which involves using spare CPU cores to achieve faster syscall handling performance. But in the case of a very busy system, which I think may be the case here, it might hurt more than it helps. So try to see what happens when you disable the fast path (by setting `neverEnableFastPath` to `true`).
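If it helps, rebuilding and installing runsc after that edit looks roughly like this (a sketch based on the repository's README; check it for the current build steps, and note that only newly started containers pick up the new binary):

```sh
# From a checkout of github.com/google/gvisor, after editing the variable:
mkdir -p bin
make copy TARGETS=runsc DESTINATION=bin/
# Install over the binary Docker's runsc runtime points at (path may differ).
sudo cp ./bin/runsc /usr/local/bin/runsc
```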
Hey, we experienced a similar issue while running a gRPC server. We noticed that no matter how much the load increased, CPU utilization stayed below 20%.
The host system had an Intel(R) Xeon(R) Silver 4114 CPU running at 2.20 GHz and 64 GB of RAM. The operating system was Ubuntu 22.04.3 LTS, with runsc release-20240807.0, using the default Systrap platform.
Description
I’m experiencing significant performance degradation when using gVisor with more than one CPU core. When running a container with a single CPU, everything works as expected. However, when I allocate two or more CPU cores, the container becomes extremely slow and idles at around 20% CPU usage, rendering it nearly unusable.
This issue occurs consistently across all containers I’ve tested, including completely empty containers, which exhibit the same performance degradation. Interestingly, even when a container is only running a single thread, all of the allocated cores (e.g., 4 cores) experience high load, contributing to the overall slowdown.
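A minimal way to compare the two cases might look like this (a sketch; the image and CPU counts are arbitrary, and runsc is assumed to be registered as the `runsc` runtime):

```sh
# One CPU: behaves as expected.
docker run --rm --runtime=runsc --cpus=1 alpine sleep 60

# Two or more CPUs: reportedly idles around 20% CPU and becomes very slow.
docker run --rm --runtime=runsc --cpus=4 alpine sleep 60
```

Watching the runsc processes in `top` on the host during each run should show the difference in idle CPU usage.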
Steps to reproduce
runsc version
docker version (if using docker)
uname
No response
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
No response