Valgrind always crashes Rust programs on FreeBSD with "failed to allocate a guard page"

asomers commented 3 years ago

Every Rust program crashes on FreeBSD when run under Valgrind, failing with the error "failed to allocate a guard page". ripgrep is one example. Every Valgrind tool is affected: memcheck, cachegrind, callgrind, helgrind, drd, massif, lackey, exp-bbv, and even none.

STEPS TO REPRODUCE

pkg install ripgrep
valgrind --tool=callgrind /usr/local/bin/rg

OBSERVED RESULT

$ valgrind --tool=memcheck /usr/local/bin/rg
==10062== Memcheck, a memory error detector
==10062== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==10062== Using Valgrind-3.17.0.GIT and LibVEX; rerun with -h for copyright info
==10062== Command: /usr/local/bin/rg
==10062==
thread '' panicked at 'failed to allocate a guard page', library/std/src/sys/unix/thread.rs:364:17
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
==10062==
==10062== Process terminating with default action of signal 6 (SIGABRT): dumping core

EXPECTED RESULT

The program should run normally.

SOFTWARE/OS VERSIONS

FreeBSD. Reproduced on 14.0-CURRENT, 12.2-RELEASE, 11.2-RELEASE, and 11.4-RELEASE amd64. Reproduced with Valgrind 3.10.1 and 3.17.0.GIT.

ADDITIONAL INFORMATION

Rust bug entry. The Rust team believes this to be a Valgrind bug, however. https://github.com/rust-lang/rust/issues/67153

Rust code that allocates the guard page on startup of every program. https://doc.rust-lang.org/src/std/sys/unix/thread.rs.html#346

paulfloyd commented 3 years ago

Here are some traces

SYSCALL6127,1 sys___sysctlbyname ( 0x491113d(kern.sched.cpusetsize), 0x15, 0x4935740, 0x7fc0002f8, 0 )[sync] --> Success(0x0)
SYSCALL6127,1 sys_mmap ( 0x0, 4096, 3, 4098, 4294967295, 0x0)
--6127-- di_notify_mmap-0:
--6127-- di_notify_mmap-1: 0x4b20000-0x4d5dfff rw-
 --> [pre-success] Success(0x4d5d000)
SYSCALL6127,1 sys_cpuset_getaffinity ( 3, 1, 101876, 32, 0x4d5d000 )[sync] --> Success(0x0)
SYSCALL6127,1 sys_mmap ( 0x7fffdffff000, 4096, 3, 4114, 4294967295, 0x0) --> [pre-fail] Failure(0x16)
SYSCALL6127,1 sys_mmap ( 0x0, 131072, 3, 4098, 4294967295, 0x0)
--6127-- di_notify_mmap-0:
--6127-- di_notify_mmap-1: 0x4b20000-0x4d7dfff rw-
 --> [pre-success] Success(0x4d5e000)
SYSCALL6127,1 sys_mmap ( 0x0, 12288, 3, 4098, 4294967295, 0x0)
--6127-- di_notify_mmap-0:
--6127-- di_notify_mmap-1: 0x4b20000-0x4d80fff rw-
 --> [pre-success] Success(0x4d7e000)
SYSCALL6127,1 sys_mmap ( 0x0, 4096, 3, 4098, 4294967295, 0x0)
--6127-- di_notify_mmap-0:
--6127-- di_notify_mmap-1: 0x4b20000-0x4d81fff rw-
 --> [pre-success] Success(0x4d81000)
SYSCALL6127,1 sys_mmap ( 0x0, 20480, 3, 4098, 4294967295, 0x0)
--6127-- di_notify_mmap-0:
--6127-- di_notify_mmap-1: 0x4b20000-0x4d86fff rw-
 --> [pre-success] Success(0x4d82000)

The failure is the sys_mmap call at the fixed address 0x7fffdffff000, which returns Failure(0x16).

It will take me a while to figure out what the Rust application is mmapping and how that differs from C/C++ applications.

The error code looks like EINVAL (0x16 is 22 decimal):

SYSCALL6127,1 sys_mmap ( 0x7fffdffff000, 4096, 3, 4114, 4294967295, 0x0) --> [pre-fail] Failure(0x16)
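The numeric arguments in that call are easier to read once decoded. The sketch below is not part of Valgrind. It hard-codes the FreeBSD values of the `MAP_*` flags from `<sys/mman.h>` and of `EINVAL` from `<errno.h>` so that it gives the same answer on any host. The helper names `append_flag` and `decode_mmap_flags` are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* FreeBSD constants, hard-coded on purpose so the decoding does not
   depend on the headers of the machine compiling this sketch. */
enum {
    FBSD_MAP_SHARED  = 0x0001,
    FBSD_MAP_PRIVATE = 0x0002,
    FBSD_MAP_FIXED   = 0x0010,
    FBSD_MAP_ANON    = 0x1000,
    FBSD_EINVAL      = 22
};

/* Append `name` to `buf`, inserting '|' between entries. */
void append_flag(char *buf, size_t len, const char *name)
{
    if (buf[0] != '\0')
        strncat(buf, "|", len - strlen(buf) - 1);
    strncat(buf, name, len - strlen(buf) - 1);
}

/* Hypothetical helper: render the flags argument of a traced mmap call.
   Bits other than the four listed above are ignored. `len` must be > 0. */
const char *decode_mmap_flags(int flags, char *buf, size_t len)
{
    buf[0] = '\0';
    if (flags & FBSD_MAP_SHARED)  append_flag(buf, len, "MAP_SHARED");
    if (flags & FBSD_MAP_PRIVATE) append_flag(buf, len, "MAP_PRIVATE");
    if (flags & FBSD_MAP_FIXED)   append_flag(buf, len, "MAP_FIXED");
    if (flags & FBSD_MAP_ANON)    append_flag(buf, len, "MAP_ANON");
    return buf;
}
```

With the values from the trace, 4114 (0x1012) decodes to MAP_PRIVATE|MAP_FIXED|MAP_ANON, while the successful calls pass 4098 (0x1002), which is MAP_PRIVATE|MAP_ANON. The failing call is the only one that asks for a fixed address. The third argument, 3, is PROT_READ|PROT_WRITE, and 4294967295 is the file descriptor -1 printed as an unsigned 32-bit value.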

Other than demangling, there is little in the way of Rust-specific code in Valgrind.

Looking a bit at the VG code, the failure is here:

   if (forClient && req->rkind == MFixed) {
      Int  iLo   = find_nsegment_idx(reqStart);
      Int  iHi   = find_nsegment_idx(reqEnd);
      Bool allow = True;
      for (i = iLo; i <= iHi; i++) {
         if (nsegments[i].kind == SkFree
             || nsegments[i].kind == SkFileC
             || nsegments[i].kind == SkAnonC
             || nsegments[i].kind == SkShmC
             || nsegments[i].kind == SkResvn) {
            /* ok */
         } else {
            allow = False;
            VG_(printf)("in advisory about to go bad, kind %d\n", (int)nsegments[i].kind );
            break;
         }
      }
      if (allow) {
         /* Acceptable.  Granted. */
         *ok = True;
         return reqStart;
      }
      /* Not acceptable.  Fail. */
      VG_(printf)("in advisory bad 0\n");
      *ok = False;
      return 0;
   }

with a few added printfs. The kind is 4, which is SkAnonV:

   SkAnonV = 0x04, // anonymous mapping belonging to valgrind

So the guest is trying to mmap to anon space reserved for the host.
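To make the rule concrete, here is a stripped-down, standalone model of that check. It is only a sketch: the real code walks Valgrind's `nsegments` array, whereas this version takes a plain array of kinds. Only `SkAnonV = 0x04` is confirmed in this thread; the other enum values are my recollection of the Valgrind header and should be treated as assumptions. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Segment kinds. Only SkAnonV = 0x04 is confirmed above; the other
   values are assumed from memory of Valgrind's aspacemgr header. */
typedef enum {
    SkFree  = 0x01,  /* unmapped space */
    SkAnonC = 0x02,  /* anonymous mapping belonging to the client */
    SkAnonV = 0x04,  /* anonymous mapping belonging to valgrind */
    SkFileC = 0x08,  /* file mapping belonging to the client */
    SkFileV = 0x10,  /* file mapping belonging to valgrind */
    SkShmC  = 0x20,  /* shared memory segment belonging to the client */
    SkResvn = 0x40   /* reservation */
} SegKind;

/* True if a client MAP_FIXED request may cover a segment of this kind. */
bool client_may_map_over(SegKind kind)
{
    return kind == SkFree || kind == SkFileC || kind == SkAnonC
        || kind == SkShmC || kind == SkResvn;
}

/* True only if every segment in kinds[lo..hi] (inclusive) is acceptable,
   mirroring the loop in the excerpt above. */
bool fixed_request_allowed(const SegKind *kinds, size_t lo, size_t hi)
{
    for (size_t i = lo; i <= hi; i++) {
        if (!client_may_map_over(kinds[i]))
            return false;
    }
    return true;
}
```

A request that touches even one SkAnonV or SkFileV segment is refused, which is what happens to the guard page request here.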

paulfloyd commented 3 years ago

Not surprisingly, there is no useful stack info from the guest stack:

(gdb) p vgPlain_get_and_pp_StackTrace(0, 6)
==58190==    at 0x0: ???
$4 = void
(gdb) p vgPlain_get_and_pp_StackTrace(1, 6)
==58190==    at 0x4B06C2A: thr_kill (in /lib/libc.so.7)
==58190==    by 0x4B05083: raise (in /lib/libc.so.7)
==58190==    by 0x4A7B278: abort (in /lib/libc.so.7)
==58190==    by 0x521D19: ??? (in /usr/local/bin/rg)
==58190==    by 0x510B3F: ??? (in /usr/local/bin/rg)
==58190==    by 0x51AA73: ??? (in /usr/local/bin/rg)

paulfloyd commented 3 years ago

Installing the Rust package and building hello world with debug info gives:

==59160== Process terminating with default action of signal 6 (SIGABRT): dumping core
==59160==    at 0x4A4AC2A: thr_kill (in /lib/libc.so.7)
==59160==    by 0x49BF278: abort (in /lib/libc.so.7)
==59160==    by 0x13C599: std::sys::unix::abort_internal (in /usr/home/paulf/scratch/vg_rust/hello)
==59160==    by 0x13562F: std::sys_common::util::abort (in /usr/home/paulf/scratch/vg_rust/hello)
==59160==    by 0x1393A3: rust_panic (in /usr/home/paulf/scratch/vg_rust/hello)
==59160==    by 0x13930B: std::panicking::rust_panic_with_hook (in /usr/home/paulf/scratch/vg_rust/hello)
==59160==    by 0x129115: std::panicking::begin_panic::{{closure}} (in /usr/home/paulf/scratch/vg_rust/hello)
==59160==    by 0x128A3F: std::sys_common::backtrace::__rust_end_short_backtrace (in /usr/home/paulf/scratch/vg_rust/hello)
==59160==    by 0x1390EE: std::panicking::begin_panic (in /usr/home/paulf/scratch/vg_rust/hello)
==59160==    by 0x131AA6: std::sys::unix::thread::guard::init (in /usr/home/paulf/scratch/vg_rust/hello)
==59160==    by 0x1393DA: std::rt::lang_start_internal (in /usr/home/paulf/scratch/vg_rust/hello)
==59160==    by 0x11CA91: std::rt::lang_start (rt.rs:65)

paulfloyd commented 3 years ago

This doesn't look like it will be that easy to fix. The problem is that we have two functions:

    pthread_attr_get_np(pthread_self(), &attr);
    pthread_attr_getstack(&attr, &stackaddr, &stacksize);

In the first function we know the tid, so we can tell whether it is the main thread, but we don't want to mess with attr. In the second we can't tell whether it is the primary thread.

asomers commented 3 years ago

Do you know why it's an issue for Rust but not for C?

paulfloyd commented 3 years ago

Probably because the C startup code isn't trying to add a guard page (or at least doing so differently).

paulfloyd commented 3 years ago

Fixed with the latest push:

commit 592323706b66dbf73e739d89da2d52cd65c0a34f
Author: Paul Floyd pjfloyd@wanadoo.fr
Date: Wed Apr 7 08:37:20 2021 +0200

Modify the value returned by the kern.usrstack sysctl to reflect the
user stack that Valgrind synthesizes for the guest. Without this change
the sysctl will return the stack of the Valgrind host. This manifested itself
as a problem on rust compiled binaries, which were trying to add an extra
guard page but were failing since Valgrind refused guest mmaps into what it
considered to be its own memory space.
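
The address in the failing mmap is consistent with this explanation. The arithmetic below is a sketch under two assumptions that do not appear in the thread: that kern.usrstack on FreeBSD amd64 is 0x7ffffffff000, and that the stack size limit is the 512 MiB default. It also assumes the lowest address of the main thread's stack is derived as the stack top minus the stack size. The function name is hypothetical.

```c
#include <stdint.h>

/* Lowest address of a downward-growing stack, given its top address
   (for example the value of kern.usrstack) and its maximum size. */
uint64_t stack_lowest_addr(uint64_t usrstack, uint64_t stack_size)
{
    return usrstack - stack_size;
}
```

With those assumed values, `stack_lowest_addr(0x7ffffffff000, 512 MiB)` gives 0x7fffdffff000, the same address the guest passed to mmap. That address is derived from the host's stack, not from the stack Valgrind synthesizes for the guest, so it falls in memory Valgrind treats as its own.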

asomers commented 3 years ago

Thanks Paul!

paulfloyd / freebsd_valgrind

Valgrind always crashes Rust programs on FreeBSD with "failed to allocate a guard page" #154