One quick update:
I was doing another comparison on r7gd.4xlarge instances. The lscpu command returns different results on different kernel versions:
The one that has the issue (Kernel version: 5.4.181-99.354):
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Model: 1
BogoMIPS: 2100.00
L1d cache: 64K
L1i cache: 64K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-15
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp
The one that does not have the issue (Kernel version: 5.10.215-203.850):
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Model: 1
BogoMIPS: 2100.00
L1d cache: 64K
L1i cache: 64K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-15
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
The node that works fine has these additional feature flags: svei8mm svebf16 i8mm bf16 dgh rng. It might be because of the kernel version difference.
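As a side note, the same feature bits that lscpu reports can also be read from the ELF auxiliary vector. Here is a minimal, illustration-only sketch (not from our workload); the HWCAP2_* names are guarded because headers bundled with older kernels such as 5.4 may not define all of them:

#include <elf.h>
#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>   /* arm64-specific HWCAP/HWCAP2 bit definitions */

int main(void) {
    unsigned long hwcap  = getauxval(AT_HWCAP);
    unsigned long hwcap2 = getauxval(AT_HWCAP2);

    printf("sve:     %d\n", !!(hwcap & HWCAP_SVE));
#ifdef HWCAP2_SVEI8MM
    printf("svei8mm: %d\n", !!(hwcap2 & HWCAP2_SVEI8MM));
#endif
#ifdef HWCAP2_SVEBF16
    printf("svebf16: %d\n", !!(hwcap2 & HWCAP2_SVEBF16));
#endif
#ifdef HWCAP2_RNG
    printf("rng:     %d\n", !!(hwcap2 & HWCAP2_RNG));
#endif
    return 0;
}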
In the last couple of days, I tried to come up with a simple repro program, but it seems this only happens in a specific sequence, and I couldn't reproduce it with a simple program. However, after some investigation, I believe I have found an interesting cause which might explain why the system call return value is overridden.
In some cases, when calling the read system call to read content from a socket file and populate it into a buffer, even though the gVisor sentry shows the read succeeded, user space gets a return value of 0, indicating 0 bytes were read:
kernel:
I0923 16:12:40.334727 10407 strace.go:567] [ 1: 1] MyApp E read(0xe host:[4], 0xe48d913949e8, 0x10000)
I0923 16:12:40.335491 10407 strace.go:605] [ 1: 1] MyApp X read(0xe host:[4], 0xe48d913949e8 "<content>", 0x10000) = 644 (0x284) (746.795µs)
I0923 16:12:40.335529 10407 subprocess.go:859] [Custom Logs] Encountered page fault. Current reg[0] is 0
D0923 16:12:40.335540 10407 task_run.go:314] [ 1: 1] [Custom Logs] Encountered user fault: addr=e48d913948c0 ip=f3728b5e2904 access=r-- sig=11 err=interrupted by signal
I0923 16:12:40.335549 10407 syscalls.go:42] [Custom Logs] Page fault start: 0xe48d91394000; end: 0xe48d91395000. Access type: r--
I0923 16:12:40.336995 10407 strace.go:567] [ 1: 1] MyApp E write(0x2 host:[3], 0xf3728b0ce684 "I20240923 16:12:40.336920 1 MyApp.cpp:111] Read returns 0 bytes. \n", 0x61)
The interesting thing is that right after the read system call, an immediate page fault occurred. The page fault resets reg[0] to 0. After the page fault is resolved, reg[0] remains 0, so the user space application thinks the read syscall returned 0.
For both the read syscall and the page fault, the PC register remains exactly the same, and it points to the following instructions:
01 00 00 d4 -> SVC triggering the read syscall
f3 03 00 aa -> PC pointing to this MOV instruction. This is causing the page fault
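For context on why reg[0] matters here: as I understand the arm64 syscall ABI, the syscall number goes in x8 (read is 63, i.e. the 0x3f visible in reg[8] below) and the return value comes back in x0, which is exactly the register that ends up cleared. A minimal, illustration-only sketch of that convention (not our application code):

#include <stdio.h>

/* Raw read(2) via the arm64 ABI: number in x8, args in x0-x2, result in x0. */
static long raw_read(int fd, void *buf, unsigned long len) {
    register long x0 asm("x0") = fd;
    register long x1 asm("x1") = (long)buf;
    register long x2 asm("x2") = len;
    register long x8 asm("x8") = 63;          /* __NR_read on arm64 */
    asm volatile("svc #0"
                 : "+r"(x0)
                 : "r"(x1), "r"(x2), "r"(x8)
                 : "memory");
    return x0;                                /* result (or -errno) arrives in x0 */
}

int main(void) {
    char buf[64];
    long n = raw_read(0, buf, sizeof(buf));   /* read from stdin */
    printf("read returned %ld\n", n);
    return 0;
}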
Here are the register values right after the page fault. As we can see, reg[8] still shows the read syscall, but reg[0] has already been reset to 0:
D0923 16:12:40.335586 10407 task_log.go:87] [ 1: 1] Registers:
D0923 16:12:40.335598 10407 task_log.go:94] [ 1: 1] Pc = 0000f3728b5e2904
og.go:94] [ 1: 1] Pstate = 0000000080001000
D0923 16:12:40.335611 10407 task_log.go:94] [ 1: 1] R0 = 0000000000000000
D0923 16:12:40.335615 10407 task_log.go:94] [ 1: 1] R1 = 0000e48d913949e8
D0923 16:12:40.335618 10407 task_log.go:94] [ 1: 1] R10 = 0000e48d913919f0
D0923 16:12:40.335621 10407 task_log.go:94] [ 1: 1] R11 = 0000000000141530
D0923 16:12:40.335625 10407 task_log.go:94] [ 1: 1] R12 = 00000000000000c0
D0923 16:12:40.335628 10407 task_log.go:94] [ 1: 1] R13 = 0000000000000008
D0923 16:12:40.335631 10407 task_log.go:94] [ 1: 1] R14 = 00000000000002a4
D0923 16:12:40.335634 10407 task_log.go:94] [ 1: 1] R15 = 000000000000000a
D0923 16:12:40.335637 10407 task_log.go:94] [ 1: 1] R16 = 0000000001232a78
D0923 16:12:40.335640 10407 task_log.go:94] [ 1: 1] R17 = 0000f3728b5e2890
D0923 16:12:40.335643 10407 task_log.go:94] [ 1: 1] R18 = ffffffffffffffff
D0923 16:12:40.335646 10407 task_log.go:94] [ 1: 1] R19 = 000000000000000e
D0923 16:12:40.335650 10407 task_log.go:94] [ 1: 1] R2 = 0000000000010000
D0923 16:12:40.335653 10407 task_log.go:94] [ 1: 1] R20 = 0000000000010000
D0923 16:12:40.335657 10407 task_log.go:94] [ 1: 1] R21 = 0000e48d913949e8
D0923 16:12:40.335663 10407 task_log.go:94] [ 1: 1] R22 = 0000f3728bbbd7c0
D0923 16:12:40.335666 10407 task_log.go:94] [ 1: 1] R23 = 000000000000000e
D0923 16:12:40.335669 10407 task_log.go:94] [ 1: 1] R24 = 0000e48d913a4a50
D0923 16:12:40.335672 10407 task_log.go:94] [ 1: 1] R25 = 000000000000000e
D0923 16:12:40.335675 10407 task_log.go:94] [ 1: 1] R26 = 0000e48d913a4b90
D0923 16:12:40.335678 10407 task_log.go:94] [ 1: 1] R27 = 0000f3728b0d5bb8
D0923 16:12:40.335681 10407 task_log.go:94] [ 1: 1] R28 = 0000f3728b008c58
D0923 16:12:40.335684 10407 task_log.go:94] [ 1: 1] R29 = 0000e48d913948c0
D0923 16:12:40.335687 10407 task_log.go:94] [ 1: 1] R3 = 0000000000000000
D0923 16:12:40.335690 10407 task_log.go:94] [ 1: 1] R30 = 0000f3728b5e28ec
D0923 16:12:40.335693 10407 task_log.go:94] [ 1: 1] R4 = 0000000000000020
D0923 16:12:40.335696 10407 task_log.go:94] [ 1: 1] R5 = 6a736f6e00000000
D0923 16:12:40.335699 10407 task_log.go:94] [ 1: 1] R6 = 1f73726474706451
D0923 16:12:40.335702 10407 task_log.go:94] [ 1: 1] R7 = 7f7f7f7f7f7f7f7f
D0923 16:12:40.335705 10407 task_log.go:94] [ 1: 1] R8 = 000000000000003f
D0923 16:12:40.335708 10407 task_log.go:94] [ 1: 1] R9 = 0000e48d913919f0
D0923 16:12:40.335711 10407 task_log.go:94] [ 1: 1] Sp = 0000e48d913948c0
D0923 16:12:40.335714 10407 task_log.go:94] [ 1: 1] Tls = 0000f3728bbbd7c0
@avagin @konstantin-s-bogom I will continue doing more investigation, but based on the above observations, I have 2 quick questions:
1. The read syscall populates data into the buffer starting at address 0xe48d913949e8. However, immediately after the syscall completes, a page fault occurred at address 0xe48d91394000, which seems to be in the same page as the buffer. Is this normal behavior? According to my understanding, the previous read system call should already have loaded the page into physical memory, so the page fault should never have occurred. Please correct me if I am wrong.
2. Our current suspicion is that c7 (NEOVERSE V1) introduced some new feature such as CCIX which might have caused this issue. But we are still investigating. In the meantime, please let me know if you have any idea what's going on here. Thanks!
@avagin
Ok, I think I found a quick (hacky) fix to this issue.
From the above example, it seems the page fault happens at the exact location of the stack pointer. This could be a race condition, or an edge case where a page fault happens right after a syscall.
However, I saw that this faulting address was already part of both a VMA and a PMA, which means a previous fault should already have loaded the memory from physical memory. Yet we are still getting the fault here again, at an awkward time when reg[0] is reset to 0.
By looking at the code, I think the problem is https://github.com/google/gvisor/blob/master/pkg/sentry/mm/syscalls.go#L68. When we get a user page fault, we populate PMAs for the VMA entry, then call mmap to ask the native kernel to allocate the memory. However, since we are specifying memmap.PlatformEffectDefault here, it is still a soft memory allocation, and it does not populate the memory immediately. This causes the situation where an address already covered by a PMA still causes a page fault later.
After hacking the code to change memmap.PlatformEffectDefault to memmap.PlatformEffectCommit, all issues mentioned above are gone.
Is it intentional that we use memmap.PlatformEffectDefault here, even though it can potentially cause another page fault at the same address at a later point? My current suspicion is that on c6 instances, mmap without the MAP_POPULATE flag takes a shorter time for the kernel to populate the memory, while it takes longer on c7 instances. Therefore, on c7 instances it is more frequent to see such weird page faults, which reset reg[0] to 0, erasing the return value of a previous system call. But please correct me if my understanding is incorrect. Thanks!
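To illustrate the distinction I mean between a lazy and a committed mapping, here is a standalone sketch (not gVisor code; my assumption is that PlatformEffectCommit roughly corresponds to the MAP_POPULATE behavior):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1 << 20;

    /* Lazy: the mapping exists right away, but pages are only populated
     * when first touched, so the first access still takes a minor fault. */
    char *lazy = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Eager: MAP_POPULATE asks the kernel to fault the range in up front. */
    char *eager = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

    if (lazy == MAP_FAILED || eager == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(lazy, 0, len);    /* faults in pages on first touch */
    memset(eager, 0, len);   /* pages are already resident */
    printf("both mappings are usable\n");
    return 0;
}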
The read syscall populates data into the buffer starting at address 0xe48d913949e8. However, immediately after the syscall completes, a page fault occurred at address 0xe48d91394000, which seems to be in the same page as the buffer. Is this normal behavior? According to my understanding, ideally the previous read system call should have already loaded the page from physical memory, so the page fault should have never occurred. Please correct me if I am wrong.
Yes, it can be normal. The Sentry and user code are running in different processes on the host.
Usually a page fault does not reset the reg[0], as it does not need this value at all. Do you know who might be resetting the register value back to 0?
It must not reset the reg[0] and honestly I am not sure it is really happening. It is hard to read your logs without knowing all the additional changes you have made. I think you need to print registers before resuming a stub process and after returning back to the sentry.
It must not reset the reg[0] and honestly I am not sure it is really happening. It is hard to read your logs without knowing all the additional changes you have made. I think you need to print registers before resuming a stub process and after returning back to the sentry.
Yes, here are the register values before returning to the stub process and after switching back to the sentry process due to the page fault (on another run). The first dump is right after t.doSyscall() for the read syscall; the second is from the HandleUserFault call.
D0923 18:40:32.305743 1783 task_log.go:87] [ 1: 1] Registers:
D0923 18:40:32.305756 1783 task_log.go:94] [ 1: 1] Pc = 0000edc9ae92a904
D0923 18:40:32.305761 1783 task_log.go:94] [ 1: 1] Pstate = 0000000080001000
D0923 18:40:32.305765 1783 task_log.go:94] [ 1: 1] R0 = 0000000000000284
D0923 18:40:32.305769 1783 task_log.go:94] [ 1: 1] R1 = 0000f69268c969e8
D0923 18:40:32.305772 1783 task_log.go:94] [ 1: 1] R10 = 0000f69268c939f0
D0923 18:40:32.305775 1783 task_log.go:94] [ 1: 1] R11 = 0000000000141530
D0923 18:40:32.305778 1783 task_log.go:94] [ 1: 1] R12 = 00000000000000c0
D0923 18:40:32.305781 1783 task_log.go:94] [ 1: 1] R13 = 0000000000000008
D0923 18:40:32.305785 1783 task_log.go:94] [ 1: 1] R14 = 00000000000002a4
D0923 18:40:32.305788 1783 task_log.go:94] [ 1: 1] R15 = 000000000000000a
D0923 18:40:32.305791 1783 task_log.go:94] [ 1: 1] R16 = 0000000001232a78
D0923 18:40:32.305794 1783 task_log.go:94] [ 1: 1] R17 = 0000edc9ae92a890
D0923 18:40:32.305797 1783 task_log.go:94] [ 1: 1] R18 = ffffffffffffffff
D0923 18:40:32.305800 1783 task_log.go:94] [ 1: 1] R19 = 000000000000000e
D0923 18:40:32.305803 1783 task_log.go:94] [ 1: 1] R2 = 0000000000010000
D0923 18:40:32.305806 1783 task_log.go:94] [ 1: 1] R20 = 0000000000010000
D0923 18:40:32.305809 1783 task_log.go:94] [ 1: 1] R21 = 0000f69268c969e8
D0923 18:40:32.305812 1783 task_log.go:94] [ 1: 1] R22 = 0000edc9aef0e7c0
D0923 18:40:32.305815 1783 task_log.go:94] [ 1: 1] R23 = 000000000000000e
D0923 18:40:32.305818 1783 task_log.go:94] [ 1: 1] R24 = 0000f69268ca6a50
D0923 18:40:32.305821 1783 task_log.go:94] [ 1: 1] R25 = 000000000000000e
D0923 18:40:32.305824 1783 task_log.go:94] [ 1: 1] R26 = 0000f69268ca6b90
D0923 18:40:32.305827 1783 task_log.go:94] [ 1: 1] R27 = 0000edc9ae2d5878
D0923 18:40:32.305830 1783 task_log.go:94] [ 1: 1] R28 = 0000edc9ae208c58
D0923 18:40:32.305833 1783 task_log.go:94] [ 1: 1] R29 = 0000f69268c968c0
D0923 18:40:32.305836 1783 task_log.go:94] [ 1: 1] R3 = 0000000000000000
D0923 18:40:32.305840 1783 task_log.go:94] [ 1: 1] R30 = 0000edc9ae92a8ec
D0923 18:40:32.305843 1783 task_log.go:94] [ 1: 1] R4 = 0000000000000020
D0923 18:40:32.305846 1783 task_log.go:94] [ 1: 1] R5 = 6a736f6e00000000
D0923 18:40:32.305850 1783 task_log.go:94] [ 1: 1] R6 = 1f73726474706451
D0923 18:40:32.305853 1783 task_log.go:94] [ 1: 1] R7 = 7f7f7f7f7f7f7f7f
D0923 18:40:32.305856 1783 task_log.go:94] [ 1: 1] R8 = 000000000000003f
D0923 18:40:32.305859 1783 task_log.go:94] [ 1: 1] R9 = 0000f69268c939f0
D0923 18:40:32.305862 1783 task_log.go:94] [ 1: 1] Sp = 0000f69268c968c0
D0923 18:40:32.305865 1783 task_log.go:94] [ 1: 1] Tls = 0000edc9aef0e7c0
D0923 18:40:32.306674 1783 task_log.go:87] [ 1: 1] Registers:
D0923 18:40:32.306683 1783 task_log.go:94] [ 1: 1] Pc = 0000edc9ae92a904
D0923 18:40:32.306687 1783 task_log.go:94] [ 1: 1] Pstate = 0000000080001000
D0923 18:40:32.306690 1783 task_log.go:94] [ 1: 1] R0 = 0000000000000000
D0923 18:40:32.306693 1783 task_log.go:94] [ 1: 1] R1 = 0000f69268c969e8
D0923 18:40:32.306696 1783 task_log.go:94] [ 1: 1] R10 = 0000f69268c939f0
D0923 18:40:32.306699 1783 task_log.go:94] [ 1: 1] R11 = 0000000000141530
D0923 18:40:32.306702 1783 task_log.go:94] [ 1: 1] R12 = 00000000000000c0
D0923 18:40:32.306705 1783 task_log.go:94] [ 1: 1] R13 = 0000000000000008
D0923 18:40:32.306708 1783 task_log.go:94] [ 1: 1] R14 = 00000000000002a4
D0923 18:40:32.306712 1783 task_log.go:94] [ 1: 1] R15 = 000000000000000a
D0923 18:40:32.306715 1783 task_log.go:94] [ 1: 1] R16 = 0000000001232a78
D0923 18:40:32.306718 1783 task_log.go:94] [ 1: 1] R17 = 0000edc9ae92a890
D0923 18:40:32.306721 1783 task_log.go:94] [ 1: 1] R18 = ffffffffffffffff
D0923 18:40:32.306724 1783 task_log.go:94] [ 1: 1] R19 = 000000000000000e
D0923 18:40:32.306727 1783 task_log.go:94] [ 1: 1] R2 = 0000000000010000
D0923 18:40:32.306730 1783 task_log.go:94] [ 1: 1] R20 = 0000000000010000
D0923 18:40:32.306733 1783 task_log.go:94] [ 1: 1] R21 = 0000f69268c969e8
D0923 18:40:32.306736 1783 task_log.go:94] [ 1: 1] R22 = 0000edc9aef0e7c0
D0923 18:40:32.306739 1783 task_log.go:94] [ 1: 1] R23 = 000000000000000e
D0923 18:40:32.306742 1783 task_log.go:94] [ 1: 1] R24 = 0000f69268ca6a50
D0923 18:40:32.306745 1783 task_log.go:94] [ 1: 1] R25 = 000000000000000e
D0923 18:40:32.306748 1783 task_log.go:94] [ 1: 1] R26 = 0000f69268ca6b90
D0923 18:40:32.306751 1783 task_log.go:94] [ 1: 1] R27 = 0000edc9ae2d5878
D0923 18:40:32.306754 1783 task_log.go:94] [ 1: 1] R28 = 0000edc9ae208c58
D0923 18:40:32.306757 1783 task_log.go:94] [ 1: 1] R29 = 0000f69268c968c0
D0923 18:40:32.306759 1783 task_log.go:94] [ 1: 1] R3 = 0000000000000000
D0923 18:40:32.306763 1783 task_log.go:94] [ 1: 1] R30 = 0000edc9ae92a8ec
D0923 18:40:32.306766 1783 task_log.go:94] [ 1: 1] R4 = 0000000000000020
D0923 18:40:32.306769 1783 task_log.go:94] [ 1: 1] R5 = 6a736f6e00000000
D0923 18:40:32.306771 1783 task_log.go:94] [ 1: 1] R6 = 1f73726474706451
D0923 18:40:32.306775 1783 task_log.go:94] [ 1: 1] R7 = 7f7f7f7f7f7f7f7f
D0923 18:40:32.306778 1783 task_log.go:94] [ 1: 1] R8 = 000000000000003f
D0923 18:40:32.306781 1783 task_log.go:94] [ 1: 1] R9 = 0000f69268c939f0
D0923 18:40:32.306784 1783 task_log.go:94] [ 1: 1] Sp = 0000f69268c968c0
Tls = 0000edc9aef0e7c0
I have not made any changes besides adding logs. But as we can see here, the register values are exactly the same before and after; the only difference is reg[0] being reset to 0. I tried to find the exact place where the register reset is happening, but failed to do so...
Also, in this case the address that caused the user fault is 0xf69268c968c0, which is exactly the stack pointer address. Please see my comments in https://github.com/google/gvisor/issues/10900#issuecomment-2369268581; this address is already part of existing PMA and VMA entries.
I wonder if reg[0] is reset to 0 by the CPU or by the systrap signal handler, but I can't find any evidence of either.
Please let me know if you need additional information from me. Thanks!
NOTE: I tried to run the same workload on a c6 instance with systrap, and on a c7 instance with ptrace. Both cases had no issues, and I do not see the page fault right after the read syscall on them. I guess the discrepancy is the page fault itself. But it would still be good if we figure out why the register is reset.
@sfc-gh-jyin Can you give access to a test vm?
@avagin Unfortunately this issue currently only happens on c7 with our pre-configured kernel at version 5.4. However, I would be happy to share all kernel configs and gVisor logs with you if needed. I am also trying to see if this can be reproduced on regular c7 nodes.
I spent some time looking deeper into those page faults mentioned above, and found out that those SIGSEGVs are not actually page faults. The signal code for those weird SIGSEGVs is 2, which represents SEGV_ACCERR. This means that the CPU correctly recognizes the mapping (not a CPU TLB cache issue), but somehow it thought the caller did not have sufficient permission to act on it. In normal runs, I do see SEGV_ACCERR happening, but only immediately following a SEGV_MAPERR, so we know that is a write page fault. But in this case SEGV_ACCERR happens without a prior SEGV_MAPERR.
For example, the page is already in the PMA: f66d82400000-f66d82484000 rw-p 00054000 *pgalloc.MemoryFile, but it issues SIGSEGV with SEGV_ACCERR: D0925 00:18:19.049054 6619 task_run.go:328] [ 1: 1] Encountered user fault: addr=f66d82483a20 ip=dfd52fe03904 access=r-- sig=11 err=interrupted by signal code=2. From the look of it, it does not seem to be a permission issue, as the mapping already has read permission. But if we add the MAP_POPULATE flag to the mmap calls, we no longer see SEGV_ACCERR.
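For anyone else debugging this, a quick way to see the si_code distinction outside of gVisor is a small standalone program (illustration only): a PROT_NONE mapping exists but forbids access, so touching it reports SEGV_ACCERR rather than SEGV_MAPERR:

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* SIGSEGV handler that prints whether the fault was SEGV_MAPERR (no
 * mapping) or SEGV_ACCERR (mapping present, access not permitted). */
static void on_segv(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    const char *kind = si->si_code == SEGV_MAPERR ? "SEGV_MAPERR"
                     : si->si_code == SEGV_ACCERR ? "SEGV_ACCERR"
                     : "other";
    /* fprintf is not async-signal-safe, but fine for a throwaway demo. */
    fprintf(stderr, "fault at %p, si_code=%d (%s)\n",
            si->si_addr, si->si_code, kind);
    _exit(1);
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* Mapping exists but has no permissions: reading it yields SEGV_ACCERR. */
    volatile char *p = mmap(NULL, 4096, PROT_NONE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p[0];
}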
Also, it seems those SEGV_ACCERRs self-recover soon after. Do you know of any CPU/kernel feature which might result in this behavior?
@sfc-gh-jyin I was able to reproduce the issue. In my case, the root cause is that systrap doesn't save/restore SVE state. Could you try out https://github.com/avagin/gvisor/commit/6927283dc7bf4406d7ecbe0551a9f84c7b473269? It works for me.
Wow that's great news, thank you, @avagin! Let me try it out today!
Do you mind sharing how you reproduced this issue? For some reason I was still not able to reproduce it with a simple program.
UPDATE: I've tried the patch, and it solves the issue for us! Can we have this change merged to main so we can pull it? Thanks!
It took a while to find a reliable way to reproduce this issue. I run the following command from the gvisor git directory:
~/git/gvisor$ bazel-bin/runsc/runsc_/runsc --rootless --network none --debug-log /tmp/a --debug do git grep xxxx .github/
Thanks @avagin, I can confirm this patch works for us!
Description
Hello,
We are currently working on a benchmark to compare gVisor performance on AWS c6gd instances vs c7gd instances. However, when we run the same workload, our application works fine on a c6gd instance but fails on a c7gd instance.
Basically the issue manifests in two ways on c7gd:
1. Getting file status of a bind-mounted file: the log shows it succeeds, but the client side shows it does not:
On a bad occurrence, the log looks like the following:
On a good occurrence, the log looks like the following:
As we can see from above, the fstatat syscall succeeded in both the good and the bad occurrence for the /usr/lib/python3.8/lib-dynload directory; however, somehow user space is not receiving the correct return value, and throws an error showing the module was not found.
2. Allocating memory for a new stack: the log shows it succeeds, but the caller does not think so:
It is failing at https://github.com/bminor/glibc/blob/master/nptl/allocatestack.c#L380, which makes it pretty clear that user space receives the result of the above mmap as 0 instead of 0xfb57ab76a000, which is what the sentry suggests. Not sure if the shared memory region is corrupted or overridden by another writer.
In both cases, we are seeing a discrepancy between what the gVisor sentry sends back and what user space receives.
3. Getting unhandled user fault right after mmap:
Following are some investigations we have done so far:
- c7 instances support the Scalable Vector Extension, but it is unclear how this is related to systrap, as systrap uses general purpose registers to store the syscall result.
- There were fstatat syscalls made on bind-mounted files prior to this failing point, and they all work fine.
- The same workload does not hit this issue with the ptrace platform.
Steps to reproduce
We are still working on a simple program to reproduce this issue. But in the meantime, please let us know if there is any obvious cause of this issue. Thanks!
runsc version
docker version (if using docker)
No response
uname
5.4.181-99.354
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
No response