google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

[Systrap] Syscall return value not properly sent back to user space on AWS c7gd instances with older kernels #10900

Closed sfc-gh-jyin closed 1 month ago

sfc-gh-jyin commented 2 months ago

Description

Hello,

We are currently working on a benchmark to compare gVisor performance on AWS c6gd instances vs c7gd instances. However, when running the same workload, our application works fine on c6gd instances but fails on c7gd instances.

The issue manifests in three ways on c7gd:

1. Getting the status of a bind-mounted file: the log shows the call succeeding, but the client side behaves as if it failed.

On a bad occurrence, the log looks like the following:

I0911 02:18:21.589471    4242 strace.go:570] [   1:   1] MyApp E fstatat(AT_FDCWD /, 0xed98acc2ae60 /usr/lib/python3.8/lib-dynload, 0xefd5ddbd6b28, 0x0)
I0911 02:18:21.589489    4242 strace.go:608] [   1:   1] MyApp X fstatat(AT_FDCWD /, 0xed98acc2ae60 /usr/lib/python3.8/lib-dynload, 0xefd5ddbd6b28 {dev=27, ino=6, mode=S_IFDIR|0o770, nlink=2, uid=65534, gid=0, rdev=0, size=8192, blksize=4096, blocks=16, atime=2024-09-10 22:15:13.216024639 +0000 UTC, mtime=2024-09-10 22:15:13.216024639 +0000 UTC, ctime=2024-09-10 22:15:13.216024639 +0000 UTC}, 0x0) = 0 (0x0) (10.512µs)
I0911 02:18:21.589575    4242 strace.go:564] [   1:   1] MyApp E clock_gettime(0x0, 0xefd5ddbd9fb0)
I0911 02:18:21.589584    4242 strace.go:602] [   1:   1] MyApp X clock_gettime(0x0, 0xefd5ddbd9fb0 {sec=1726021101 nsec=589581481}) = 0 (0x0) (2.861µs)
rv E gettid()
I0911 02:18:21.589628    4242 strace.go:596] [   1:   1] MyApp X gettid() = 1 (0x1) (1.324µs)
I0911 02:18:21.589645    4242 strace.go:567] [   1:   1] MyApp E write(0x2 host:[5], 0xed98ac95f5c4 "E20240911 02:18:21.589581     1 MyApp.cpp:111] Python Interpreter Error:\nModuleNotFoundError: No module named 'math'\n", 0xef)
I0911 02:18:21.589666    4242 strace.go:605] [   1:   1] MyApp X write(0x2 host:[5], ..., 0xef) = 239 (0xef) (14.626µs)

On a good occurrence, the log looks like the following:

I0911 23:59:40.458785   10995 strace.go:570] [   1:   1] MyApp E fstatat(AT_FDCWD /, 0xfba817324950 /usr/lib/python3.8/lib-dynload, 0xf16864654ac8, 0x0)
I0911 23:59:40.458803   10995 strace.go:608] [   1:   1] MyApp X fstatat(AT_FDCWD /, 0xfba817324950 /usr/lib/python3.8/lib-dynload, 0xf16864654ac8 {dev=27, ino=6, mode=S_IFDIR|0o770, nlink=2, uid=65534, gid=0, rdev=0, size=8192, blksize=4096, blocks=16, atime=2024-09-11 23:50:32.106490787 +0000 UTC, mtime=2024-09-11 23:50:32.106490787 +0000 UTC, ctime=2024-09-11 23:50:32.106490787 +0000 UTC}, 0x0) = 0 (0x0) (10.338µs)
I0911 23:59:40.458822   10995 strace.go:570] [   1:   1] MyApp E fstatat(AT_FDCWD /, 0xfba81736e950 /usr/lib/python3.8/lib-dynload/math.cpython-38-aarch64-linux-gnu.so, 0xf16864654748, 0x0)
I0911 23:59:40.458845   10995 strace.go:608] [   1:   1] MyApp X fstatat(AT_FDCWD /, 0xfba81736e950 /usr/lib/python3.8/lib-dynload/math.cpython-38-aarch64-linux-gnu.so, 0xf16864654748 {dev=27, ino=418, mode=S_IFREG|0o775, nlink=2, uid=65534, gid=0, rdev=0, size=83952, blksize=4096, blocks=164, atime=2024-09-11 23:50:31.896489909 +0000 UTC, mtime=2024-03-20 19:55:28.7605355 +0000 UTC, ctime=2024-09-11 23:50:32.106490787 +0000 UTC}, 0x0) = 0 (0x0) (15.4µs)

As we can see above, the fstatat syscall on the /usr/lib/python3.8/lib-dynload directory succeeded in both the good and the bad occurrence; however, in the bad case user space somehow does not receive the correct return value and throws a module-not-found error.

2. Allocating memory for a new stack: the log shows the mmap succeeding, but the caller does not see the result.

I0911 02:18:18.578550    3974 strace.go:576] [  11:  11] MyApp E mmap(0x0, 0x810000, 0x0, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, 0xffffffffffffffff (bad FD), 0x0)
I0911 02:18:18.578560    3974 strace.go:614] [  11:  11] MyApp X mmap(0x0, 0x810000, 0x0, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, 0xffffffffffffffff (bad FD), 0x0) = 276353957404672 (0xfb57ab76a000) (4.147µs)
I0911 02:18:18.578749    3974 strace.go:567] [  11:  11] MyApp E write(0x2 host:[3], 0xfa92f1d297d0 "MyApp: allocatestack.c:379: allocate_stack: Assertion `mem != NULL' failed.\n", 0x58)

The assertion fires at https://github.com/bminor/glibc/blob/master/nptl/allocatestack.c#L380, which makes it pretty clear that user space received 0 as the result of the mmap above instead of the 0xfb57ab76a000 that the sentry reported. We are not sure whether the shared memory region is corrupted or overwritten by another writer.
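
For reference, here is a minimal sketch of the failing pattern, written as a hypothetical Go equivalent of the traced mmap above (the constants mirror the strace line; this is illustrative, not our application's code):

package main

import "golang.org/x/sys/unix"

func main() {
    // Mirror the traced call: mmap(0, 0x810000, PROT_NONE,
    // MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0).
    addr, _, errno := unix.Syscall6(unix.SYS_MMAP,
        0, 0x810000, uintptr(unix.PROT_NONE),
        uintptr(unix.MAP_PRIVATE|unix.MAP_ANONYMOUS|unix.MAP_STACK),
        ^uintptr(0), 0)
    if errno != 0 || addr == 0 {
        // This is the branch the application unexpectedly takes: the
        // sentry reported 0xfb57ab76a000, but user space observed 0.
        panic("allocate_stack: mem != NULL failed")
    }
}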

In both of the cases above, there is a discrepancy between the return value the gVisor sentry reports and the value user space receives.

3. Getting an unhandled user fault right after mmap:

I0913 21:22:02.424703   29442 strace.go:576] [  29:  42] resolver-execut E mmap(0x0, 0x10000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, 0xffffffffffffffff (bad FD), 0x0)
I0913 21:22:02.424734   29442 strace.go:614] [  29:  42] resolver-execut X mmap(0x0, 0x10000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, 0xffffffffffffffff (bad FD), 0x0) = 269091220566016 (0xf4bcae9d3000) (22.269µs)
D0913 21:22:02.424863   29442 task_run.go:313] [  29:  42] Unhandled user fault: addr=0 ip=1066298 access=r-- sig=11 err=bad address
D0913 21:22:02.424883   29442 task_log.go:87] [  29:  42] Registers:
D0913 21:22:02.424900   29442 task_log.go:94] [  29:  42] Pc       = 0000000001066298
D0913 21:22:02.424914   29442 task_log.go:94] [  29:  42] Pstate   = 0000000060001000
D0913 21:22:02.424933   29442 task_log.go:94] [  29:  42] R0       = 000000004c833e95
D0913 21:22:02.424942   29442 task_log.go:94] [  29:  42] R1       = 0000000000000001
D0913 21:22:02.424947   29442 task_log.go:94] [  29:  42] R10      = 00000000040c7000
D0913 21:22:02.424951   29442 task_log.go:94] [  29:  42] R11      = 0000000000000302
D0913 21:22:02.424954   29442 task_log.go:94] [  29:  42] R12      = 0000000000000303
D0913 21:22:02.424957   29442 task_log.go:94] [  29:  42] R13      = 0000000000000405
D0913 21:22:02.424961   29442 task_log.go:94] [  29:  42] R14      = 0000000000000003
D0913 21:22:02.424964   29442 task_log.go:94] [  29:  42] R15      = 0000000000000019
D0913 21:22:02.425081   29442 task_log.go:94] [  29:  42] R16      = 0000000000000000
D0913 21:22:02.425086   29442 task_log.go:94] [  29:  42] R17      = 0000000000000000
D0913 21:22:02.425090   29442 task_log.go:94] [  29:  42] R18      = 0000000002000004
D0913 21:22:02.425113   29442 task_log.go:94] [  29:  42] R19      = 0000000001483d48
D0913 21:22:02.425117   29442 task_log.go:94] [  29:  42] R2       = 0000000001483d48
D0913 21:22:02.425121   29442 task_log.go:94] [  29:  42] R20      = 0000000000000280
D0913 21:22:02.425125   29442 task_log.go:94] [  29:  42] R21      = 000000000000004b
D0913 21:22:02.425129   29442 task_log.go:94] [  29:  42] R22      = 000000000000001d
D0913 21:22:02.425132   29442 task_log.go:94] [  29:  42] R23      = 0000000001483d50
D0913 21:22:02.425135   29442 task_log.go:94] [  29:  42] R24      = ffffffffb37cc16a
D0913 21:22:02.425158   29442 task_log.go:94] [  29:  42] R25      = 000000004c833e95
D0913 21:22:02.425163   29442 task_log.go:94] [  29:  42] R26      = 000000000000004b
D0913 21:22:02.425166   29442 task_log.go:94] [  29:  42] R27      = 0000000000000000
D0913 21:22:02.425169   29442 task_log.go:94] [  29:  42] R28      = 0000000000010000
D0913 21:22:02.425173   29442 task_log.go:94] [  29:  42] R29      = 0000f4bcaf1fe160
D0913 21:22:02.425176   29442 task_log.go:94] [  29:  42] R3       = 0000000000000022
D0913 21:22:02.425182   29442 task_log.go:94] [  29:  42] R30      = 0000000001066290
D0913 21:22:02.425185   29442 task_log.go:94] [  29:  42] R4       = ffffffffffffffff
D0913 21:22:02.425195   29442 task_log.go:94] [  29:  42] R5       = 0000000000000000
D0913 21:22:02.425199   29442 task_log.go:94] [  29:  42] R6       = 0000f4bcaf1fe4a0
D0913 21:22:02.425203   29442 task_log.go:94] [  29:  42] R7       = 0000f4bcaf1fe4c0
D0913 21:22:02.425207   29442 task_log.go:94] [  29:  42] R8       = 00000000000000de
D0913 21:22:02.425211   29442 task_log.go:94] [  29:  42] R9       = 00000000016a7468
D0913 21:22:02.425214   29442 task_log.go:94] [  29:  42] Sp       = 0000f4bcaf1fe160
D0913 21:22:02.425217   29442 task_log.go:94] [  29:  42] Tls      = 0000f4bcaf1ff5c0
D0913 21:22:02.425220   29442 task_log.go:111] [  29:  42] Stack:
D0913 21:22:02.425224   29442 task_log.go:128] [  29:  42] f4bcaf1fe160: 40 e3 1f af bc f4 00 00 94 9b 05 01 00 00 00 00
D0913 21:22:02.425229   29442 task_log.go:128] [  29:  42] f4bcaf1fe170: 00 00 00 00 00 00 00 00 08 3b 48 01 00 00 00 00
D0913 21:22:02.425236   29442 task_log.go:128] [  29:  42] f4bcaf1fe180: 00 40 6f af bc f4 00 00 ff ff ff ff ff ff ff ff
D0913 21:22:02.425245   29442 task_log.go:128] [  29:  42] f4bcaf1fe1a0: 00 00 00 00 00 00 00 00 b8 fe 2b 00 00 00 00 00
D0913 21:22:02.425250   29442 task_log.go:128] [  29:  42] f4bcaf1fe1b0: 08 9d 23 00 00 00 00 00 d8 e4 1f af bc f4 00 00
D0913 21:22:02.425254   29442 task_log.go:128] [  29:  42] f4bcaf1fe1c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425257   29442 task_log.go:128] [  29:  42] f4bcaf1fe1d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425281   29442 task_log.go:128] [  29:  42] f4bcaf1fe1e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425301   29442 task_log.go:128] [  29:  42] f4bcaf1fe1f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425306   29442 task_log.go:128] [  29:  42] f4bcaf1fe200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425310   29442 task_log.go:128] [  29:  42] f4bcaf1fe210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425313   29442 task_log.go:128] [  29:  42] f4bcaf1fe220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425317   29442 task_log.go:128] [  29:  42] f4bcaf1fe230: a0 e2 1f af bc f4 00 00 d8 60 06 01 00 00 00 00
D0913 21:22:02.425320   29442 task_log.go:128] [  29:  42] f4bcaf1fe240: 48 3d 48 01 00 00 00 00 48 3d 48 01 00 00 00 00
D0913 21:22:02.425323   29442 task_log.go:128] [  29:  42] f4bcaf1fe250: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425346   29442 task_log.go:128] [  29:  42] f4bcaf1fe260: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425350   29442 task_log.go:128] [  29:  42] f4bcaf1fe270: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425354   29442 task_log.go:128] [  29:  42] f4bcaf1fe280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425358   29442 task_log.go:128] [  29:  42] f4bcaf1fe290: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425361   29442 task_log.go:128] [  29:  42] f4bcaf1fe2a0: f0 e2 1f af bc f4 00 00 18 2b 9f 00 00 00 00 00
D0913 21:22:02.425365   29442 task_log.go:128] [  29:  42] f4bcaf1fe2b0: 20 41 48 01 00 00 00 00 e4 5f 06 01 00 00 00 00
D0913 21:22:02.425368   29442 task_log.go:128] [  29:  42] f4bcaf1fe2c0: 00 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff
D0913 21:22:02.425371   29442 task_log.go:128] [  29:  42] f4bcaf1fe2d0: 50 1f 23 01 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425374   29442 task_log.go:128] [  29:  42] f4bcaf1fe2e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425377   29442 task_log.go:128] [  29:  42] f4bcaf1fe2f0: 20 e3 1f af bc f4 00 00 9c 78 1f 01 00 00 00 00
D0913 21:22:02.425380   29442 task_log.go:128] [  29:  42] f4bcaf1fe300: 48 3d 48 01 00 00 00 00 5f 02 00 00 00 00 00 00
D0913 21:22:02.425384   29442 task_log.go:128] [  29:  42] f4bcaf1fe310: 00 40 6f af bc f4 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425389   29442 task_log.go:128] [  29:  42] f4bcaf1fe320: 40 e3 1f af bc f4 00 00 94 9b 05 01 00 00 00 00
D0913 21:22:02.425407   29442 task_log.go:128] [  29:  42] f4bcaf1fe330: 00 00 00 00 00 00 00 00 08 3b 48 01 00 00 00 00
D0913 21:22:02.425412   29442 task_log.go:128] [  29:  42] f4bcaf1fe340: 80 e3 1f af bc f4 00 00 4c 86 05 01 00 00 00 00
D0913 21:22:02.425416   29442 task_log.go:128] [  29:  42] f4bcaf1fe350: 98 bf 3b 00 00 00 00 00 18 40 6f af bc f4 00 00
D0913 21:22:02.425420   29442 task_log.go:128] [  29:  42] f4bcaf1fe360: 00 40 6f af bc f4 00 00 ff ff ff ff ff ff ff ff
D0913 21:22:02.425424   29442 task_log.go:128] [  29:  42] f4bcaf1fe370: 50 1f 23 01 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425427   29442 task_log.go:128] [  29:  42] f4bcaf1fe380: 10 e4 1f af bc f4 00 00 38 fe 9e 00 00 00 00 00
D0913 21:22:02.425430   29442 task_log.go:128] [  29:  42] f4bcaf1fe390: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425433   29442 task_log.go:128] [  29:  42] f4bcaf1fe3a0: ff ff ff ff ff ff ff 7f 18 40 6f af bc f4 00 00
D0913 21:22:02.425437   29442 task_log.go:128] [  29:  42] f4bcaf1fe3b0: 00 40 6f af bc f4 00 00 88 32 00 00 00 00 00 00
D0913 21:22:02.425440   29442 task_log.go:128] [  29:  42] f4bcaf1fe3c0: 00 00 00 00 00 00 00 00 b8 fe 2b 00 00 00 00 00
D0913 21:22:02.425444   29442 task_log.go:128] [  29:  42] f4bcaf1fe3d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425447   29442 task_log.go:128] [  29:  42] f4bcaf1fe3e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425450   29442 task_log.go:128] [  29:  42] f4bcaf1fe3f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425453   29442 task_log.go:128] [  29:  42] f4bcaf1fe400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425456   29442 task_log.go:128] [  29:  42] f4bcaf1fe410: 60 e4 1f af bc f4 00 00 ac 68 9e 00 00 00 00 00
D0913 21:22:02.425459   29442 task_log.go:128] [  29:  42] f4bcaf1fe420: 00 40 6f af bc f4 00 00 a0 a4 25 01 00 00 00 00
D0913 21:22:02.425470   29442 task_log.go:128] [  29:  42] f4bcaf1fe430: c0 f5 1f af bc f4 00 00 18 40 6f af bc f4 00 00
D0913 21:22:02.425484   29442 task_log.go:128] [  29:  42] f4bcaf1fe440: 00 00 00 00 00 00 00 00 a0 a4 25 01 00 00 00 00
D0913 21:22:02.425489   29442 task_log.go:128] [  29:  42] f4bcaf1fe450: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00
D0913 21:22:02.425492   29442 task_log.go:128] [  29:  42] f4bcaf1fe460: 20 e5 1f af bc f4 00 00 e8 23 9f 00 00 00 00 00
D0913 21:22:02.425496   29442 task_log.go:128] [  29:  42] f4bcaf1fe470: 40 99 05 b1 bc f4 00 00 48 99 05 b1 bc f4 00 00
D0913 21:22:02.425499   29442 task_log.go:128] [  29:  42] f4bcaf1fe480: 50 99 05 b1 bc f4 00 00 01 00 00 00 00 00 00 00
D0913 21:22:02.425503   29442 task_log.go:128] [  29:  42] f4bcaf1fe490: a0 67 9e 00 00 00 00 00 01 00 00 00 00 00 00 00
D0913 21:22:02.425506   29442 task_log.go:128] [  29:  42] f4bcaf1fe4a0: 00 40 6f af bc f4 00 00 a0 b6 80 00 00 00 00 00
D0913 21:22:02.425509   29442 task_log.go:128] [  29:  42] f4bcaf1fe4b0: 00 30 9f ae bc f4 00 00 c0 f5 1f af bc f4 00 00
D0913 21:22:02.425520   29442 task_log.go:128] [  29:  42] f4bcaf1fe4d0: 40 99 05 b1 bc f4 00 00 b8 fe 2b 00 00 00 00 00
D0913 21:22:02.425524   29442 task_log.go:128] [  29:  42] f4bcaf1fe4e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425529   29442 task_log.go:128] [  29:  42] f4bcaf1fe4f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425533   29442 task_log.go:128] [  29:  42] f4bcaf1fe500: 04 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00
D0913 21:22:02.425536   29442 task_log.go:128] [  29:  42] f4bcaf1fe510: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D0913 21:22:02.425539   29442 task_log.go:128] [  29:  42] f4bcaf1fe520: 80 e5 1f af bc f4 00 00 14 1f 5a b1 bc f4 00 00
D0913 21:22:02.425543   29442 task_log.go:128] [  29:  42] f4bcaf1fe530: 00 00 00 00 00 00 00 00 4c f2 1f af bc f4 00 00
D0913 21:22:02.425546   29442 task_log.go:128] [  29:  42] f4bcaf1fe540: c6 cf 19 ff d8 fa 00 00 a0 b6 80 00 00 00 00 00
D0913 21:22:02.425549   29442 task_log.go:128] [  29:  42] f4bcaf1fe550: 00 30 9f ae bc f4 00 00 c7 cf 19 ff d8 fa 00 00
D0913 21:22:02.425552   29442 task_log.go:149] [  29:  42] Code:
D0913 21:22:02.425563   29442 task_log.go:167] [  29:  42] 1066250: e2 03 1c aa 06 00 80 d2 05 00 80 12 44 04 80 52
D0913 21:22:02.425568   29442 task_log.go:167] [  29:  42] 1066260: 63 00 80 52 01 00 80 d2 c0 1b 80 d2 ad 48 06 94
D0913 21:22:02.425571   29442 task_log.go:167] [  29:  42] 1066270: fb 03 00 aa 7f 07 00 b1 40 12 00 54 60 02 40 b9
D0913 21:22:02.425575   29442 task_log.go:167] [  29:  42] 1066280: c0 1d 00 37 e2 03 13 aa 01 00 00 32 9d 43 06 94
D0913 21:22:02.425578   29442 task_log.go:167] [  29:  42] 1066290: 40 1d 00 37 60 03 19 ca 7c 03 00 a9 e1 03 13 aa
D0913 21:22:02.425581   29442 task_log.go:167] [  29:  42] 10662a0: 73 0b 00 f9 60 83 00 91 f2 fe ff 97 63 9e 40 f9
D0913 21:22:02.425584   29442 task_log.go:167] [  29:  42] 10662b0: 7f 00 14 eb e3 f8 ff 54 20 00 80 d2 e1 03 00 2a
D0913 21:22:02.425771   29442 task_log.go:167] [  29:  42] 10662c0: 1f 00 15 eb c8 f9 ff 54 a0 03 80 52 3f 74 00 71
D0913 21:22:02.425775   29442 task_log.go:71] [  29:  42] Mappings:
D0913 21:22:02.426327   29442 task_log.go:73] [  29:  42] FDTable:
D0913 21:22:02.426363   29442 task_signals.go:470] [  29:  42] Notified of signal 11
D0913 21:22:02.426372   29442 task_signals.go:220] [  29:  42] Signal 11: delivering to handler
I0913 21:22:02.426554   29442 strace.go:559] [  29:  42] resolver-execut E gettid()
I0913 21:22:02.426568   29442 strace.go:596] [  29:  42] resolver-execut X gettid() = 42 (0x2a) (1.058µs)
I0913 21:22:02.427222   29442 strace.go:567] [  29:  42] resolver-execut E getcpu(0xf4bcaf1fdc1c, 0x0, 0x0)
I0913 21:22:02.427232   29442 strace.go:605] [  29:  42] resolver-execut X getcpu(0xf4bcaf1fdc1c, 0x0, 0x0) = 0 (0x0) (1.596µs)
I0913 21:22:02.427600   29442 strace.go:564] [  29:  42] resolver-execut E clock_gettime(0x0, 0xf4bcaf1fdc10)
I0913 21:22:02.427614   29442 strace.go:602] [  29:  42] resolver-execut X clock_gettime(0x0, 0xf4bcaf1fdc10 {sec=1726262522 nsec=427610261}) = 0 (0x0) (2.356µs)
I0913 21:22:02.427672   29442 strace.go:567] [  29:  42] resolver-execut E write(0x2 host:[2], 0xf4bcaf1fdca0 "*** SIGSEGV received at time=1726262522 on cpu 10 ***\n", 0x36)
I0913 21:22:02.427691   29442 strace.go:605] [  29:  42] resolver-execut X write(0x2 host:[2], ..., 0x36) = 54 (0x36) (10.918µs)
I0913 21:22:02.427803   29442 strace.go:570] [  29:  42] resolver-execut E rt_sigprocmask(SIG_SETMASK, 0x1233308 [SIGHUP SIGINT SIGQUIT SIGILL SIGTRAP SIGABRT SIGBUS SIGFPE SIGKILL SIGUSR1 SIGSEGV SIGUSR2 SIGPIPE SIGALRM SIGTERM SIGSTKFLT SIGCHLD SIGCONT SIGSTOP SIGTSTP SIGTTIN SIGTTOU SIGURG SIGXCPU SIGXFSZ SIGVTALRM SIGPROF SIGWINCH SIGIO SIGPWR SIGSYS 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64], 0xf4bcaf1fcda0, 0x8)
I0913 21:22:02.427814   29442 strace.go:608] [  29:  42] resolver-execut X rt_sigprocmask(SIG_SETMASK, 0x1233308 [SIGHUP SIGINT SIGQUIT SIGILL SIGTRAP SIGABRT SIGBUS SIGFPE SIGKILL SIGUSR1 SIGSEGV SIGUSR2 SIGPIPE SIGALRM SIGTERM SIGSTKFLT SIGCHLD SIGCONT SIGSTOP SIGTSTP SIGTTIN SIGTTOU SIGURG SIGXCPU SIGXFSZ SIGVTALRM SIGPROF SIGWINCH SIGIO SIGPWR SIGSYS 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64], 0xf4bcaf1fcda0 [], 0x8) = 0 (0x0) (2.915µs)

Here are some of the investigations we have done so far:

  1. c7gd instances use Neoverse V1, which implements the ARMv8.4-A instruction set plus parts of ARMv8.6-A; c6gd instances use Neoverse N1, which is based on ARMv8.2-A. Neoverse V1 does introduce additional register-related features, such as the Scalable Vector Extension (SVE), but it is unclear how that relates to systrap, since systrap uses general-purpose registers to store the syscall result.
  2. The issue does not always happen; our application succeeds roughly 1 run in 10. But when the issue does happen, it always fails at exactly the same place, i.e. many fstatat syscalls are made on bind-mounted files before the failing point, and they all work fine.
  3. So far the issue seems to occur only on the combination of an older kernel version (e.g. 5.4.181-99.354) and a c7gd instance. We tested a newer kernel version (5.10.215-203.850) on c7gd, and the older kernel version (5.4.181-99.354) on c6gd, and both cases work fine.
  4. The issue only occurs on the systrap platform; everything works fine when we switch to ptrace.
  5. We have tried to eliminate all other variables, such as software versions (glibc etc.), and confirmed this is not an issue in the dependency libraries.

Steps to reproduce

We are still working on a simple program to reproduce this issue. In the meantime, please let us know if there is any obvious cause. Thanks!

runsc version

We are building runsc from the `release-20240807.0` release tag.

docker version (if using docker)

No response

uname

5.4.181-99.354

kubectl (if using Kubernetes)

No response

repo state (if built from source)

No response

runsc debug logs (if available)

No response

sfc-gh-jyin commented 2 months ago

One quick update:

I was doing another comparison on r7gd.4xlarge instances, and the lscpu command returns different results on the two kernel versions:

The instance that has the issue (kernel version 5.4.181-99.354):

Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          1
Model:                 1
BogoMIPS:              2100.00
L1d cache:             64K
L1i cache:             64K
L2 cache:              1024K
L3 cache:              32768K
NUMA node0 CPU(s):     0-15
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp

The instance that does not have the issue (kernel version 5.10.215-203.850):

Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          1
Model:                 1
BogoMIPS:              2100.00
L1d cache:             64K
L1i cache:             64K
L2 cache:              1024K
L3 cache:              32768K
NUMA node0 CPU(s):     0-15
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng

The node that works fine has these additional feature flags: svei8mm svebf16 i8mm bf16 dgh rng. This might be because of the kernel version difference.
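
As a side note, these flags come from the hwcaps the kernel advertises to user space, so a process can check them directly. Here is a small hypothetical diagnostic in Go (not part of our workload) using golang.org/x/sys/cpu:

package main

import (
    "fmt"

    "golang.org/x/sys/cpu"
)

func main() {
    // cpu.ARM64 is populated from the same HWCAP bits that lscpu reads,
    // so this reflects what the kernel exposes, not just the silicon.
    fmt.Println("SVE advertised by kernel:", cpu.ARM64.HasSVE)
}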

sfc-gh-jyin commented 1 month ago

Over the last couple of days I tried to come up with a simple repro program, but the issue seems to require a specific sequence of events, and I could not trigger it with a simple program. However, after some investigation, I believe I have found an interesting cause that might explain why the syscall return value is overwritten.

In some cases, when the read syscall is called to read content from a socket file into a buffer, user space gets a return value of 0 (indicating 0 bytes read) even though the gVisor sentry shows the read succeeding:

kernel:
I0923 16:12:40.334727   10407 strace.go:567] [   1:   1] MyApp E read(0xe host:[4], 0xe48d913949e8, 0x10000)
I0923 16:12:40.335491   10407 strace.go:605] [   1:   1] MyApp X read(0xe host:[4], 0xe48d913949e8 "<content>", 0x10000) = 644 (0x284) (746.795µs)

I0923 16:12:40.335529   10407 subprocess.go:859] [Custom Logs] Encountered page fault. Current reg[0] is 0
D0923 16:12:40.335540   10407 task_run.go:314] [   1:   1] [Custom Logs] Encountered user fault: addr=e48d913948c0 ip=f3728b5e2904 access=r-- sig=11 err=interrupted by signal
I0923 16:12:40.335549   10407 syscalls.go:42] [Custom Logs] Page fault start: 0xe48d91394000; end: 0xe48d91395000. Access type: r--

I0923 16:12:40.336995   10407 strace.go:567] [   1:   1] MyApp E write(0x2 host:[3], 0xf3728b0ce684 "I20240923 16:12:40.336920     1 MyApp.cpp:111] Read returns 0 bytes. \n", 0x61)

The interesting thing is that a page fault occurred immediately after the read syscall. The page fault resets reg[0] to 0, and after the fault is resolved reg[0] stays 0, so the user-space application believes the read syscall returned 0.

For both the read syscall and the page fault, the PC register holds exactly the same value, and it points at the following instructions:

01 00 00 d4 -> SVC instruction that triggered the read syscall
f3 03 00 aa -> MOV instruction the PC points to (it appears to decode to mov x19, x0, which would copy the syscall return value out of reg[0]); this is where the page fault occurs

Here are the register values right after the page fault. As we can see, reg[8] still holds the `read` syscall number, but reg[0] was already reset to 0:

D0923 16:12:40.335586   10407 task_log.go:87] [   1:   1] Registers:
D0923 16:12:40.335598   10407 task_log.go:94] [   1:   1] Pc       = 0000f3728b5e2904
og.go:94] [   1:   1] Pstate   = 0000000080001000
D0923 16:12:40.335611   10407 task_log.go:94] [   1:   1] R0       = 0000000000000000
D0923 16:12:40.335615   10407 task_log.go:94] [   1:   1] R1       = 0000e48d913949e8
D0923 16:12:40.335618   10407 task_log.go:94] [   1:   1] R10      = 0000e48d913919f0
D0923 16:12:40.335621   10407 task_log.go:94] [   1:   1] R11      = 0000000000141530
D0923 16:12:40.335625   10407 task_log.go:94] [   1:   1] R12      = 00000000000000c0
D0923 16:12:40.335628   10407 task_log.go:94] [   1:   1] R13      = 0000000000000008
D0923 16:12:40.335631   10407 task_log.go:94] [   1:   1] R14      = 00000000000002a4
D0923 16:12:40.335634   10407 task_log.go:94] [   1:   1] R15      = 000000000000000a
D0923 16:12:40.335637   10407 task_log.go:94] [   1:   1] R16      = 0000000001232a78
D0923 16:12:40.335640   10407 task_log.go:94] [   1:   1] R17      = 0000f3728b5e2890
D0923 16:12:40.335643   10407 task_log.go:94] [   1:   1] R18      = ffffffffffffffff
D0923 16:12:40.335646   10407 task_log.go:94] [   1:   1] R19      = 000000000000000e
D0923 16:12:40.335650   10407 task_log.go:94] [   1:   1] R2       = 0000000000010000
D0923 16:12:40.335653   10407 task_log.go:94] [   1:   1] R20      = 0000000000010000
D0923 16:12:40.335657   10407 task_log.go:94] [   1:   1] R21      = 0000e48d913949e8
D0923 16:12:40.335663   10407 task_log.go:94] [   1:   1] R22      = 0000f3728bbbd7c0
D0923 16:12:40.335666   10407 task_log.go:94] [   1:   1] R23      = 000000000000000e
D0923 16:12:40.335669   10407 task_log.go:94] [   1:   1] R24      = 0000e48d913a4a50
D0923 16:12:40.335672   10407 task_log.go:94] [   1:   1] R25      = 000000000000000e
D0923 16:12:40.335675   10407 task_log.go:94] [   1:   1] R26      = 0000e48d913a4b90
D0923 16:12:40.335678   10407 task_log.go:94] [   1:   1] R27      = 0000f3728b0d5bb8
D0923 16:12:40.335681   10407 task_log.go:94] [   1:   1] R28      = 0000f3728b008c58
D0923 16:12:40.335684   10407 task_log.go:94] [   1:   1] R29      = 0000e48d913948c0
D0923 16:12:40.335687   10407 task_log.go:94] [   1:   1] R3       = 0000000000000000
D0923 16:12:40.335690   10407 task_log.go:94] [   1:   1] R30      = 0000f3728b5e28ec
D0923 16:12:40.335693   10407 task_log.go:94] [   1:   1] R4       = 0000000000000020
D0923 16:12:40.335696   10407 task_log.go:94] [   1:   1] R5       = 6a736f6e00000000
D0923 16:12:40.335699   10407 task_log.go:94] [   1:   1] R6       = 1f73726474706451
D0923 16:12:40.335702   10407 task_log.go:94] [   1:   1] R7       = 7f7f7f7f7f7f7f7f
D0923 16:12:40.335705   10407 task_log.go:94] [   1:   1] R8       = 000000000000003f
D0923 16:12:40.335708   10407 task_log.go:94] [   1:   1] R9       = 0000e48d913919f0
D0923 16:12:40.335711   10407 task_log.go:94] [   1:   1] Sp       = 0000e48d913948c0
D0923 16:12:40.335714   10407 task_log.go:94] [   1:   1] Tls      = 0000f3728bbbd7c0

@avagin @konstantin-s-bogom I will continue investigating, but based on the above observations I have two quick questions:

  1. The read syscall populates data into the buffer starting at address 0xe48d913949e8. However, immediately after the syscall completes, a page fault occurs at address 0xe48d91394000, which is in the same page as the buffer. Is this normal behavior? My understanding is that the preceding read syscall should already have faulted the page in, so this page fault should never occur. Please correct me if I am wrong.
  2. Usually a page fault does not reset reg[0], since fault handling does not need that value at all. Do you know what might be resetting the register back to 0?

Our current suspicion is that c7 instances (Neoverse V1) introduce some new feature, such as CCIX, that might be causing this issue. But we are still investigating. In the meantime, please let me know if you have any idea what is going on here. Thanks!

sfc-gh-jyin commented 1 month ago

@avagin

Ok, I think I found a quick (hacky) fix for this issue.

From the example above, the page fault happens at the exact location of the stack pointer. This could be a race condition, or an edge case where a page fault lands right after a syscall.

However, I saw that the faulting address was already part of both a VMA and a PMA, which means a previous fault should already have populated the memory. Yet we still get the fault here again, at the awkward moment when reg[0] is reset to 0.

Looking at the code, I think the problem is https://github.com/google/gvisor/blob/master/pkg/sentry/mm/syscalls.go#L68. When we get a user page fault, we populate PMAs for the VMA entry and then call mmap to ask the native kernel to allocate the memory. However, since memmap.PlatformEffectDefault is specified here, this is still a soft allocation: the memory is not populated immediately. That creates the situation where an address already covered by a PMA can still page-fault later.
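
To make the "soft allocation" point concrete, here is a minimal standalone sketch of the difference using an ordinary host mmap (not gVisor internals); MAP_POPULATE is the host-mmap analogue of what memmap.PlatformEffectCommit asks for:

package main

import "golang.org/x/sys/unix"

func main() {
    const length = 1 << 20 // 1 MiB

    // Lazy mapping: the kernel only sets up the range; physical pages
    // are populated on first touch, so a later access can still fault.
    lazy, err := unix.Mmap(-1, 0, length,
        unix.PROT_READ|unix.PROT_WRITE,
        unix.MAP_PRIVATE|unix.MAP_ANONYMOUS)
    if err != nil {
        panic(err)
    }
    defer unix.Munmap(lazy)

    // Committed mapping: MAP_POPULATE pre-faults the pages up front,
    // so no later minor fault is needed to touch them.
    committed, err := unix.Mmap(-1, 0, length,
        unix.PROT_READ|unix.PROT_WRITE,
        unix.MAP_PRIVATE|unix.MAP_ANONYMOUS|unix.MAP_POPULATE)
    if err != nil {
        panic(err)
    }
    defer unix.Munmap(committed)
}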

After hacking the code to change memmap.PlatformEffectDefault to memmap.PlatformEffectCommit, all of the issues mentioned above are gone.

Is it intentional that memmap.PlatformEffectDefault is used here, even though it can cause another page fault at the same address later? My current suspicion is that on c6 instances the kernel populates the memory quickly after an mmap without MAP_POPULATE, while it takes longer on c7 instances. Therefore, on c7 instances these odd page faults are more frequent, and they reset reg[0] to 0, erasing the return value of the previous syscall. Please correct me if my understanding is incorrect. Thanks!

avagin commented 1 month ago

The read syscall populates data into the buffer starting at address 0xe48d913949e8. However, immediately after the syscall completes, a page fault occurs at address 0xe48d91394000, which is in the same page as the buffer. Is this normal behavior? My understanding is that the preceding read syscall should already have faulted the page in, so this page fault should never occur. Please correct me if I am wrong.

Yes, it can be normal. The Sentry and user code are running in different processes on the host.

Usually a page fault does not reset reg[0], since fault handling does not need that value at all. Do you know what might be resetting the register back to 0?

It must not reset reg[0], and honestly I am not sure that is really happening. It is hard to read your logs without knowing all the additional changes you have made. I think you need to print the registers before resuming a stub process and after returning back to the sentry.

sfc-gh-jyin commented 1 month ago

It must not reset reg[0], and honestly I am not sure that is really happening. It is hard to read your logs without knowing all the additional changes you have made. I think you need to print the registers before resuming a stub process and after returning back to the sentry.

Yes, here are the register values before resuming the stub process and after switching back to the sentry process due to the page fault (from another run):

Before

D0923 18:40:32.305743    1783 task_log.go:87] [   1:   1] Registers:
D0923 18:40:32.305756    1783 task_log.go:94] [   1:   1] Pc       = 0000edc9ae92a904
D0923 18:40:32.305761    1783 task_log.go:94] [   1:   1] Pstate   = 0000000080001000
D0923 18:40:32.305765    1783 task_log.go:94] [   1:   1] R0       = 0000000000000284
D0923 18:40:32.305769    1783 task_log.go:94] [   1:   1] R1       = 0000f69268c969e8
D0923 18:40:32.305772    1783 task_log.go:94] [   1:   1] R10      = 0000f69268c939f0
D0923 18:40:32.305775    1783 task_log.go:94] [   1:   1] R11      = 0000000000141530
D0923 18:40:32.305778    1783 task_log.go:94] [   1:   1] R12      = 00000000000000c0
D0923 18:40:32.305781    1783 task_log.go:94] [   1:   1] R13      = 0000000000000008 
D0923 18:40:32.305785    1783 task_log.go:94] [   1:   1] R14      = 00000000000002a4 
D0923 18:40:32.305788    1783 task_log.go:94] [   1:   1] R15      = 000000000000000a
D0923 18:40:32.305791    1783 task_log.go:94] [   1:   1] R16      = 0000000001232a78 
D0923 18:40:32.305794    1783 task_log.go:94] [   1:   1] R17      = 0000edc9ae92a890
D0923 18:40:32.305797    1783 task_log.go:94] [   1:   1] R18      = ffffffffffffffff
D0923 18:40:32.305800    1783 task_log.go:94] [   1:   1] R19      = 000000000000000e
D0923 18:40:32.305803    1783 task_log.go:94] [   1:   1] R2       = 0000000000010000
D0923 18:40:32.305806    1783 task_log.go:94] [   1:   1] R20      = 0000000000010000
D0923 18:40:32.305809    1783 task_log.go:94] [   1:   1] R21      = 0000f69268c969e8
D0923 18:40:32.305812    1783 task_log.go:94] [   1:   1] R22      = 0000edc9aef0e7c0
D0923 18:40:32.305815    1783 task_log.go:94] [   1:   1] R23      = 000000000000000e
D0923 18:40:32.305818    1783 task_log.go:94] [   1:   1] R24      = 0000f69268ca6a50 
D0923 18:40:32.305821    1783 task_log.go:94] [   1:   1] R25      = 000000000000000e
D0923 18:40:32.305824    1783 task_log.go:94] [   1:   1] R26      = 0000f69268ca6b90
D0923 18:40:32.305827    1783 task_log.go:94] [   1:   1] R27      = 0000edc9ae2d5878
D0923 18:40:32.305830    1783 task_log.go:94] [   1:   1] R28      = 0000edc9ae208c58
D0923 18:40:32.305833    1783 task_log.go:94] [   1:   1] R29      = 0000f69268c968c0
D0923 18:40:32.305836    1783 task_log.go:94] [   1:   1] R3       = 0000000000000000 
D0923 18:40:32.305840    1783 task_log.go:94] [   1:   1] R30      = 0000edc9ae92a8ec
D0923 18:40:32.305843    1783 task_log.go:94] [   1:   1] R4       = 0000000000000020
D0923 18:40:32.305846    1783 task_log.go:94] [   1:   1] R5       = 6a736f6e00000000
D0923 18:40:32.305850    1783 task_log.go:94] [   1:   1] R6       = 1f73726474706451
D0923 18:40:32.305853    1783 task_log.go:94] [   1:   1] R7       = 7f7f7f7f7f7f7f7f
D0923 18:40:32.305856    1783 task_log.go:94] [   1:   1] R8       = 000000000000003f
D0923 18:40:32.305859    1783 task_log.go:94] [   1:   1] R9       = 0000f69268c939f0
D0923 18:40:32.305862    1783 task_log.go:94] [   1:   1] Sp       = 0000f69268c968c0
D0923 18:40:32.305865    1783 task_log.go:94] [   1:   1] Tls      = 0000edc9aef0e7c0

After

D0923 18:40:32.306674    1783 task_log.go:87] [   1:   1] Registers:
D0923 18:40:32.306683    1783 task_log.go:94] [   1:   1] Pc       = 0000edc9ae92a904
D0923 18:40:32.306687    1783 task_log.go:94] [   1:   1] Pstate   = 0000000080001000
D0923 18:40:32.306690    1783 task_log.go:94] [   1:   1] R0       = 0000000000000000
D0923 18:40:32.306693    1783 task_log.go:94] [   1:   1] R1       = 0000f69268c969e8
D0923 18:40:32.306696    1783 task_log.go:94] [   1:   1] R10      = 0000f69268c939f0
D0923 18:40:32.306699    1783 task_log.go:94] [   1:   1] R11      = 0000000000141530
D0923 18:40:32.306702    1783 task_log.go:94] [   1:   1] R12      = 00000000000000c0
D0923 18:40:32.306705    1783 task_log.go:94] [   1:   1] R13      = 0000000000000008
D0923 18:40:32.306708    1783 task_log.go:94] [   1:   1] R14      = 00000000000002a4
D0923 18:40:32.306712    1783 task_log.go:94] [   1:   1] R15      = 000000000000000a
D0923 18:40:32.306715    1783 task_log.go:94] [   1:   1] R16      = 0000000001232a78
D0923 18:40:32.306718    1783 task_log.go:94] [   1:   1] R17      = 0000edc9ae92a890
D0923 18:40:32.306721    1783 task_log.go:94] [   1:   1] R18      = ffffffffffffffff
D0923 18:40:32.306724    1783 task_log.go:94] [   1:   1] R19      = 000000000000000e
D0923 18:40:32.306727    1783 task_log.go:94] [   1:   1] R2       = 0000000000010000
D0923 18:40:32.306730    1783 task_log.go:94] [   1:   1] R20      = 0000000000010000
D0923 18:40:32.306733    1783 task_log.go:94] [   1:   1] R21      = 0000f69268c969e8
D0923 18:40:32.306736    1783 task_log.go:94] [   1:   1] R22      = 0000edc9aef0e7c0
D0923 18:40:32.306739    1783 task_log.go:94] [   1:   1] R23      = 000000000000000e
D0923 18:40:32.306742    1783 task_log.go:94] [   1:   1] R24      = 0000f69268ca6a50
D0923 18:40:32.306745    1783 task_log.go:94] [   1:   1] R25      = 000000000000000e
D0923 18:40:32.306748    1783 task_log.go:94] [   1:   1] R26      = 0000f69268ca6b90
D0923 18:40:32.306751    1783 task_log.go:94] [   1:   1] R27      = 0000edc9ae2d5878
D0923 18:40:32.306754    1783 task_log.go:94] [   1:   1] R28      = 0000edc9ae208c58
D0923 18:40:32.306757    1783 task_log.go:94] [   1:   1] R29      = 0000f69268c968c0
D0923 18:40:32.306759    1783 task_log.go:94] [   1:   1] R3       = 0000000000000000
D0923 18:40:32.306763    1783 task_log.go:94] [   1:   1] R30      = 0000edc9ae92a8ec
D0923 18:40:32.306766    1783 task_log.go:94] [   1:   1] R4       = 0000000000000020
D0923 18:40:32.306769    1783 task_log.go:94] [   1:   1] R5       = 6a736f6e00000000
D0923 18:40:32.306771    1783 task_log.go:94] [   1:   1] R6       = 1f73726474706451
D0923 18:40:32.306775    1783 task_log.go:94] [   1:   1] R7       = 7f7f7f7f7f7f7f7f
D0923 18:40:32.306778    1783 task_log.go:94] [   1:   1] R8       = 000000000000003f
D0923 18:40:32.306781    1783 task_log.go:94] [   1:   1] R9       = 0000f69268c939f0
D0923 18:40:32.306784    1783 task_log.go:94] [   1:   1] Sp       = 0000f69268c968c0
Tls      = 0000edc9aef0e7c0

I have not made any changes besides adding logs. As we can see here, the register values are exactly the same before and after; the only difference is reg[0] being reset to 0. I tried to find the exact place where the register reset happens, but failed to do so...

Also, in this case the address that caused the user fault is 0xf69268c968c0, which is exactly the stack pointer address. Please see my comment in https://github.com/google/gvisor/issues/10900#issuecomment-2369268581: this address is already part of existing PMA and VMA entries.

I wonder whether reg[0] is reset to 0 by the CPU or by the systrap signal handler, but I can't find evidence either way.

Please let me know if you need additional information from me. Thanks!

NOTE: I tried running the same workload on a c6 instance with systrap and on a c7 instance with ptrace. Neither case had issues, and I do not see the page fault right after the read syscall on them. I suspect the discrepancy is the page fault itself, but it would still be good to figure out why the register is reset.

avagin commented 1 month ago

@sfc-gh-jyin Can you give me access to a test VM?

sfc-gh-jyin commented 1 month ago

@avagin Unfortunately this issue currently only happens on c7 with our pre-configured 5.4 kernel. However, I would be happy to share all kernel configs and gVisor logs with you if needed. I am also trying to see if it can be reproduced on regular c7 nodes.

I spent some time looking deeper into the page faults mentioned above, and found that those SIGSEGVs are not actually page faults. The signal code for these odd SIGSEGVs is 2, which is SEGV_ACCERR. This means the CPU correctly recognizes the mapping (so it is not a CPU TLB cache issue) but somehow decides the caller does not have sufficient permission to access it. In normal runs I do see SEGV_ACCERR happening, but only immediately following a SEGV_MAPERR, in which case we know it is a write page fault. In this case, SEGV_ACCERR happens without a prior SEGV_MAPERR.
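
For reference, the two si_code values in question, as exposed by golang.org/x/sys/unix:

package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

func main() {
    // SEGV_MAPERR: the address is not mapped at all.
    // SEGV_ACCERR: a mapping exists, but it does not permit the access.
    fmt.Println("SEGV_MAPERR =", unix.SEGV_MAPERR) // 1
    fmt.Println("SEGV_ACCERR =", unix.SEGV_ACCERR) // 2
}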

For example, the page is already in a PMA: f66d82400000-f66d82484000 rw-p 00054000 *pgalloc.MemoryFile, yet a SIGSEGV is raised with SEGV_ACCERR: D0925 00:18:19.049054 6619 task_run.go:328] [ 1: 1] Encountered user fault: addr=f66d82483a20 ip=dfd52fe03904 access=r-- sig=11 err=interrupted by signal code=2. This does not look like a permission issue, since the mapping already has read permission. And if we add the MAP_POPULATE flag to the mmap calls, we no longer see SEGV_ACCERR.

Also, those SEGV_ACCERR faults seem to self-recover soon after. Do you know of any CPU/kernel feature that might result in this behavior?

avagin commented 1 month ago

@sfc-gh-jyin I was able to reproduce the issue. In my case, the root cause is that systrap doesn't save/restore SVE state. Could you try out https://github.com/avagin/gvisor/commit/6927283dc7bf4406d7ecbe0551a9f84c7b473269? It works for me.
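
If you want to confirm that SVE is actually live for a thread on these instances, one hypothetical check (not part of the fix) is to query the thread's SVE vector length via prctl(PR_SVE_GET_VL):

package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

func main() {
    // prctl(PR_SVE_GET_VL) returns the thread's SVE vector length in
    // bytes (ORed with flag bits), or EINVAL if SVE is unsupported.
    vl, _, errno := unix.Syscall(unix.SYS_PRCTL, unix.PR_SVE_GET_VL, 0, 0)
    if errno != 0 {
        fmt.Println("SVE not available:", errno)
        return
    }
    fmt.Printf("SVE vector length: %d bytes\n", vl&0xffff)
}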

sfc-gh-jyin commented 1 month ago

Wow that's great news, thank you, @avagin! Let me try it out today!

Do you mind sharing how you reproduced this issue? For some reason I was still not able to reproduce it with a simple program.

UPDATE: I've tried the patch, and it solves the issue for us! Can we have this change merged to main so we can pull it? Thanks!

avagin commented 1 month ago

It took a while to find a reliable way to reproduce this issue. I ran the following command from the gvisor git directory:

~/git/gvisor$ bazel-bin/runsc/runsc_/runsc --rootless --network none --debug-log /tmp/a --debug do git grep xxxx .github/

sfc-gh-jyin commented 1 month ago

Thanks, @avagin! I can confirm this patch works for us!