google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

Python program running slower inside Gvisor sandbox with ARM64 #10487

Open sfc-gh-jyin opened 1 month ago

sfc-gh-jyin commented 1 month ago

Description

Hello,

We are currently benchmarking the CPU performance of gVisor against plain Docker, and found that the same Python program running in gVisor is consistently slower than when running on the native kernel, or even in Docker.

Note that we are aware of the overhead introduced by the additional syscall interception, but we are testing CPU performance, and our test script does not issue syscalls.

The largest difference we have observed so far is on an AWS c6gd.2xlarge instance. However, when running the same suite on the c7 instance family, gVisor's performance is close to the native kernel's. We are wondering what the root cause of this might be, and how we can configure gVisor to perform better.

Test environment: AWS c6gd.2xlarge instance with an AL2 AMI. Python version: 3.7.16. Test script (a very simple pi calculation):

import time

def calculate_pi(n):
    # Leibniz series: pi = 4/1 - 4/3 + 4/5 - 4/7 + ...
    pi = 0
    sign = 1
    for i in range(1, n * 2, 2):
        pi += sign * (4 / i)
        sign *= -1
    return pi

if __name__ == "__main__":
    iterations = 100000000
    start = time.time() * 1000
    pi_approx = calculate_pi(iterations)
    print(time.time() * 1000 - start)

Running on native kernel:

$ python3 /tmp/pitest.py
16087.6728515625

Running with docker container:

$ sudo docker run -v /tmp/pitest.py:/tmp/pitest.py amazonlinux:2 yum install -y python3 && python3 /tmp/pitest.py
16197.6875

Running with runsc:

$ ./bin/runsc --network=none --rootless --platform=systrap run id-10
17876.38525390625

In all three cases, the process consumes nearly 100% of a CPU the whole time. However, when I use the perf tool to check the stats, it shows that the process started by gVisor runs with around 10–15% lower instructions per cycle:


* Gvisor:
28,392,817,179      cycles                    #    2.482 GHz                      (29.98%)
81,657,764,151      instructions              #    2.88  insn per cycle         
                                              #    0.04  stalled cycles per insn  (30.07%)
     11,440.44 msec cpu-clock                 #    0.953 CPUs utilized          
35,172,201,924      cache-references          # 3074.430 M/sec                    (30.15%)
       922,867      cache-misses              #    0.003 % of all cache refs      (20.11%)
    32,162,856      branch-misses                                                 (20.02%)
   670,976,075      stalled-cycles-frontend   #    2.36% frontend cycles idle     (19.93%)
 3,126,416,518      stalled-cycles-backend    #   11.01% backend cycles idle      (19.93%)
             1      sched:sched_switch        #    0.000 K/sec                  
11,440,714,362      sched:sched_stat_runtime  # 1000.042 M/sec                  
             1      page-faults               #    0.000 K/sec                  
       259,948      L1-dcache-load-misses                                         (19.93%)
             0      cpu-migrations            #    0.000 K/sec                  
     11,440.04 msec task-clock                #    0.952 CPUs utilized          
28,426,171,668      bus-cycles                # 2484.754 M/sec                    (19.93%)
35,158,019,280      mem_access                # 3073.190 M/sec                    (19.93%)
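
The issue does not show the exact perf invocation used to gather these counters; a rough equivalent could look like the sketch below (the event list mirrors the output above, while the script path, lack of `-p` attachment, and privilege handling are assumptions):

```python
import subprocess

# Counters corresponding to the perf output quoted above.
EVENTS = ",".join([
    "cycles", "instructions", "stalled-cycles-frontend",
    "stalled-cycles-backend", "cache-references", "cache-misses",
    "branch-misses", "page-faults", "cpu-migrations", "task-clock",
])

# Run the pi benchmark under `perf stat`; may require root or a relaxed
# kernel.perf_event_paranoid setting.
subprocess.run(["perf", "stat", "-e", EVENTS, "python3", "/tmp/pitest.py"],
               check=True)
```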

We suspect that this could be due to memory access delays, as in the gVisor case `stalled-cycles-backend` is significantly higher than in the other cases.

### Steps to reproduce

1. Create AWS instance with c6 family. eg. `c6gd.2xlarge`
2. Run the script mentioned above in 3 environments

### runsc version

```shell
runsc version 0.0.0
spec: 1.1.0-rc.1
```

### docker version (if using docker)

No response

### uname

5.10.216-204.855.amzn2.aarch64 #1 SMP Sat May 4 16:53:24 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

### kubectl (if using Kubernetes)

No response

### repo state (if built from source)

No response

### runsc debug logs (if available)

No response

sfc-gh-jyin commented 1 month ago

After some investigation, I found that one potential cause is that with gVisor, the Sentry implements its own runsc-memfd-backed memory and maps the application's virtual address space to memfd offsets with its own VMAs. However, even after the initial page fault, memory accesses (especially memory writes) are slower.

This seems not to be directly related to gVisor, since after the initial page fault following mmap, memory access should behave the same as on the native Linux kernel. Rather, the issue seems to be memfd-backed memory itself. For some reason, on the c6gd AWS instance family, memory writes through a memfd-backed mapping consistently tend to be around 5% slower than writes to anonymous memory.
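
A minimal sketch of this kind of comparison, timing steady-state writes to an anonymous mapping versus a memfd-backed mapping, is shown below (this is not the benchmark actually used in the investigation; `os.memfd_create` needs Python 3.8+, and the size and access pattern are arbitrary):

```python
import mmap
import os
import time

SIZE = 256 * 1024 * 1024  # 256 MiB


def timed_second_pass(buf):
    # First pass faults every page in, so the timed pass measures
    # steady-state write cost rather than page-fault handling.
    for off in range(0, SIZE, mmap.PAGESIZE):
        buf[off] = 1
    start = time.time()
    for off in range(0, SIZE, mmap.PAGESIZE):
        buf[off] = 2
    return time.time() - start


# Anonymous mapping, as a plain process mmap would use.
anon = mmap.mmap(-1, SIZE)

# memfd-backed shared mapping, loosely analogous to how the Sentry backs
# application memory with a memfd.
fd = os.memfd_create("membench")
os.ftruncate(fd, SIZE)
memfd_map = mmap.mmap(fd, SIZE)

print("anonymous mapping, second-pass writes: %.3fs" % timed_second_pass(anon))
print("memfd mapping,     second-pass writes: %.3fs" % timed_second_pass(memfd_map))
```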

avagin commented 1 month ago

@sfc-gh-jyin I think I found the real root cause of this issue. gVisor never sets the SSBS bit in pstate. With the following patch, I get the same results with gVisor as without it:

diff --git a/pkg/sentry/arch/arch_aarch64.go b/pkg/sentry/arch/arch_aarch64.go
index 04262f6c5..e4a1d0187 100644
--- a/pkg/sentry/arch/arch_aarch64.go
+++ b/pkg/sentry/arch/arch_aarch64.go
@@ -257,12 +257,14 @@ func (s *State) FullRestore() bool {
 func New(arch Arch) *Context64 {
        switch arch {
        case ARM64:
-               return &Context64{
+               c:= &Context64{
                        State{
                                fpState: fpu.NewState(),
                        },
-                       []fpu.State(nil),
+                       []fpu.State{nil},
                }
+               c.Regs.Pstate |= linux.PSR_SSBS_BIT
+               return c
        }
        panic(fmt.Sprintf("unknown architecture %v", arch))
 }
diff --git a/pkg/sentry/arch/signal_arm64.go b/pkg/sentry/arch/signal_arm64.go
index 1118d6a7f..959d6068b 100644
--- a/pkg/sentry/arch/signal_arm64.go
+++ b/pkg/sentry/arch/signal_arm64.go
@@ -157,7 +157,7 @@ func (regs *Registers) validRegs() bool {
        }

        // Force PSR to a valid 64-bit EL0t
-       regs.Pstate &= linux.PSR_N_BIT | linux.PSR_Z_BIT | linux.PSR_C_BIT | linux.PSR_V_BIT
+       regs.Pstate &= linux.PSR_N_BIT | linux.PSR_Z_BIT | linux.PSR_C_BIT | linux.PSR_V_BIT | linux.PSR_SSBS_BIT
        return false
 }

This isn't a proper fix. We need to figure out when SSBS should be set.

sfc-gh-jyin commented 1 month ago

Thank you @avagin! I tried your patch and it did help! Can we get this fix merged into main? Also, do you know why this issue did not manifest to a similar degree on c7gd instances?

sfc-gh-jyin commented 3 weeks ago

@avagin I have another question... Based on my understanding, PSR_SSBS_BIT enables a mitigation for security vulnerabilities introduced by speculative execution. Can you share some information on why setting this flag would improve performance on gVisor?

jaingaurav commented 3 weeks ago

Further, adding to @sfc-gh-jyin's questions, is there a reason that c7 instances would not experience this slowdown? I believe c6 are Graviton2 (Neoverse N1) and c7 are Graviton3 (Neoverse V1).

avagin commented 3 weeks ago

@sfc-gh-jyin it isn't only about gVisor. When you run your test on Linux, this bit is set in pstate, and this is why you see better performance. If you care about security and want to be safe from SSB, you probably want to disable speculative store bypass (PR_SPEC_STORE_BYPASS), which effectively drops PSR_SSBS_BIT from pstate.

More info about the meaning of this bit can be found here: https://developer.arm.com/documentation/ddi0595/2020-12/AArch64-Registers/SSBS--Speculative-Store-Bypass-Safe.

As the last line of my previous comment says, the patch isn't a fix; it is just there to explain what is going on. We need to figure out when we can/need to set this bit. It should not be set by default, in order to protect against SSB.
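
For illustration only (not part of any proposed fix): on Linux, a task can query or disable speculative store bypass for itself via prctl, which is the per-task control being referred to here. A sketch using ctypes follows, with constants copied from include/uapi/linux/prctl.h; the sysfs path shows the system-wide mitigation status:

```python
import ctypes

# Constants from include/uapi/linux/prctl.h.
PR_GET_SPECULATION_CTRL = 52
PR_SET_SPECULATION_CTRL = 53
PR_SPEC_STORE_BYPASS = 0
PR_SPEC_DISABLE = 1 << 2

libc = ctypes.CDLL("libc.so.6", use_errno=True)

# Query the current speculative-store-bypass state of this task.
state = libc.prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0)
print("PR_SPEC_STORE_BYPASS control bits:", hex(state))

# Disable speculative store bypass for this task; on arm64 this effectively
# drops SSBS from pstate, trading some performance for SSB protection.
ret = libc.prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
                 PR_SPEC_DISABLE, 0, 0)
print("prctl(PR_SET_SPECULATION_CTRL) returned:", ret)

# The system-wide mitigation status is visible in sysfs.
with open("/sys/devices/system/cpu/vulnerabilities/spec_store_bypass") as f:
    print(f.read().strip())
```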

avagin commented 3 weeks ago

@jaingaurav My guess is that they found another way to mitigate SSB in these CPUs.