rust-lang / cargo

`cargo test` command hangs in build stage on AArch64 #10007

Open michael2012z opened 3 years ago

michael2012z commented 3 years ago

Problem

`cargo test` randomly hangs in the build stage of the Cloud Hypervisor AArch64 unit tests.

Here is a recorded failing job in our CI: https://cloud-hypervisor-jenkins.westus.cloudapp.azure.com/blue/organizations/jenkins/cloud-hypervisor/detail/PR-3236/4/pipeline/248/

cargo version: cargo 1.55.0 (32da73ab1 2021-08-23)

kernel version: 5.10.0

The problem is highly intermittent, and I have seen it on only one AArch64 server.

I reproduced the problem and debugged it with GDB. Here are some observations: gdb.txt

##################### Backtrace of cargo process: #####################

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable.  Connect to gdbserver inside the container.
__libc_write (nbytes=1, buf=0xfffffcee5a7c, fd=<optimized out>) at ../sysdeps/unix/sysv/linux/write.c:26
26  ../sysdeps/unix/sysv/linux/write.c: No such file or directory.
(gdb) info threads
  Id   Target Id                                  Frame 
* 1    Thread 0xffffa2264ac0 (LWP 665544) "cargo" __libc_write (nbytes=1, buf=0xfffffcee5a7c, fd=<optimized out>) at ../sysdeps/unix/sysv/linux/write.c:26
  2    Thread 0xffff9dea8030 (LWP 666949) "cargo" futex_wait_cancelable (private=0, expected=0, futex_word=0xaaab143ff6dc) at ../sysdeps/nptl/futex-internal.h:183
  3    Thread 0xfffeb5dce030 (LWP 671236) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb5dccfb0, nfds=2, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  4    Thread 0xfffeb0da6030 (LWP 673261) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb0da4fb0, nfds=2, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  5    Thread 0xfffeb05a2030 (LWP 673755) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb05a0fb0, nfds=2, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  6    Thread 0xfffeb03a1030 (LWP 673756) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb039ffb0, nfds=2, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  7    Thread 0xfffeb01a0030 (LWP 673852) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb019efb0, nfds=2, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  8    Thread 0xfffeaff9f030 (LWP 673932) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeaff9dfb0, nfds=2, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
  9    Thread 0xfffeafd9e030 (LWP 673992) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeafd9cfb0, nfds=2, timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:41
(gdb) bt
#0  __libc_write (nbytes=1, buf=0xfffffcee5a7c, fd=<optimized out>) at ../sysdeps/unix/sysv/linux/write.c:26
#1  __libc_write (fd=<optimized out>, buf=0xfffffcee5a7c, nbytes=1) at ../sysdeps/unix/sysv/linux/write.c:24
#2  0x0000aaaad282df14 in std::sys::unix::fd::FileDesc::write () at library/std/src/sys/unix/fd.rs:146
#3  std::sys::unix::fs::File::write () at library/std/src/sys/unix/fs.rs:845
#4  <&std::fs::File as std::io::Write>::write () at library/std/src/fs.rs:680
#5  0x0000aaaad2813b34 in jobserver::imp::Client::release () at library/std/src/panicking.rs:541
#6  0x0000aaaad2814dd0 in <jobserver::Acquired as core::ops::drop::Drop>::drop () at library/std/src/panicking.rs:541
#7  0x0000aaaad21b9da0 in cargo::core::compiler::job_queue::DrainState::drain_the_queue () at library/std/src/panicking.rs:541
#8  0x0000aaaad21f9460 in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () at library/std/src/panicking.rs:541
#9  0x0000aaaad21fe4cc in crossbeam_utils::thread::scope () at library/std/src/panicking.rs:541
#10 0x0000aaaad21b8c9c in cargo::core::compiler::job_queue::JobQueue::execute () at library/std/src/panicking.rs:541
#11 0x0000aaaad2411b3c in cargo::core::compiler::context::Context::compile () at library/std/src/panicking.rs:541
#12 0x0000aaaad23692e8 in cargo::ops::cargo_compile::compile_ws () at library/std/src/panicking.rs:541
#13 0x0000aaaad23690f4 in cargo::ops::cargo_compile::compile () at library/std/src/panicking.rs:541
#14 0x0000aaaad23f44a0 in cargo::ops::cargo_test::run_tests () at library/std/src/panicking.rs:541
#15 0x0000aaaad208a1cc in cargo::commands::test::exec () at library/std/src/panicking.rs:541
#16 0x0000aaaad2081abc in cargo::cli::main () at library/std/src/panicking.rs:541
#17 0x0000aaaad208ab6c in cargo::main () at library/std/src/panicking.rs:541
#18 0x0000aaaad20d26f4 in std::sys_common::backtrace::__rust_begin_short_backtrace () at library/std/src/panicking.rs:541
#19 0x0000aaaad208c860 in std::rt::lang_start::{{closure}} () at library/std/src/panicking.rs:541
#20 0x0000aaaad283ca7c in core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once ()
    at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ops/function.rs:259
#21 std::panicking::try::do_call () at library/std/src/panicking.rs:401
#22 std::panicking::try () at library/std/src/panicking.rs:365
#23 std::panic::catch_unwind () at library/std/src/panic.rs:434
#24 std::rt::lang_start_internal () at library/std/src/rt.rs:34
#25 0x0000aaaad208c608 in main () at library/std/src/panicking.rs:541
(gdb) l
21  in ../sysdeps/unix/sysv/linux/write.c
(gdb) 

############# Some rustc processes were also seen: ##########################

root@45e3b95cdbdc:/cloud-hypervisor# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   3612  2564 ?        Ss   03:21   0:00 /bin/bash ./scripts/run_unit_tests.sh --hypervisor kvm
root           9  0.6  0.0 4017272 73332 ?       Sl   03:21   0:14 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/cargo test --target aarch64-unknown-linux-g
root        2986  0.4  0.1 897372 263324 ?       Sl   03:22   0:08 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name pnet_packet /usr/local/r
root        4463  0.1  0.0 899068 169872 ?       Sl   03:22   0:02 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name arch --edition=2018 arch
root        4909  0.0  0.0 585264 177472 ?       Sl   03:22   0:00 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name vm_allocator --edition=2
root        4910  0.0  0.0 584320 200320 ?       Sl   03:22   0:00 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name devices --edition=2018 d
root        4987  0.0  0.0 763788 157652 ?       Sl   03:22   0:00 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name pnet_transport /usr/loca
root        5071  0.0  0.0 591956 230064 ?       Sl   03:22   0:01 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name pci --edition=2018 pci/s
root        5115  0.0  0.0 407344 75080 ?        Sl   03:22   0:00 /usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/rustc --crate-name pnet /usr/local/rust/reg
r

################## I checked the backtraces of some of them; they all look like this: ###################

(gdb) attach 673941
Attaching to process 673941
[New LWP 673988]
[New LWP 674062]
[New LWP 674063]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable.  Connect to gdbserver inside the container.
__pthread_clockjoin_ex (threadid=281473285418112, thread_return=0x0, clockid=0, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
145 pthread_join_common.c: No such file or directory.
(gdb) bt
#0  __pthread_clockjoin_ex (threadid=281473285418112, thread_return=0x0, clockid=0, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
#1  0x0000ffff9b63832c in std::sys::unix::thread::Thread::join () at library/std/src/sys/unix/thread.rs:177
#2  0x0000ffff9c2284bc in std::thread::JoinHandle<T>::join ()
   from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#3  0x0000ffff9c241688 in rustc_interface::util::setup_callbacks_and_run_in_thread_pool_with_globals ()
   from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#4  0x0000ffff9c295d3c in rustc_driver::RunCompiler::run ()
   from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#5  0x0000ffff9c21fce8 in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once ()
   from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#6  0x0000ffff9c29bb0c in rustc_driver::main () from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#7  0x0000aaaaba0e0c30 in rustc_main::main ()
#8  0x0000aaaaba0e0be4 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#9  0x0000aaaaba0e0bfc in std::rt::lang_start::{{closure}} ()
#10 0x0000ffff9b62da04 in core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once ()
    at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ops/function.rs:259
#11 std::panicking::try::do_call () at library/std/src/panicking.rs:401
#12 std::panicking::try () at library/std/src/panicking.rs:365
#13 std::panic::catch_unwind () at library/std/src/panic.rs:434
#14 std::rt::lang_start_internal () at library/std/src/rt.rs:34
#15 0x0000aaaaba0e0c5c in main ()
(gdb)

Steps

  1. git clone https://github.com/cloud-hypervisor/cloud-hypervisor.git
  2. cd cloud-hypervisor
  3. cargo test --target aarch64-unknown-linux-gnu --workspace --no-default-features --features kvm

Possible Solution(s)

No response

Notes

No response

Version

cargo 1.55.0 (32da73ab1 2021-08-23)

alexcrichton commented 3 years ago

This looks like it's blocked in `write`, which suggests it's running into https://github.com/rust-lang/cargo/issues/9739. The summary there is that, due to kernel behavior, Cargo (and `make`, and other processes that share a jobserver) can deadlock under high load when the kernel has a lot of in-memory pipes. This should be fixed in upstream Linux itself at this point, but that fix will likely take a long time to propagate. I believe another fix is to increase the user pipe limits on your build machine, if you can.
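
For context, the backtrace above is stuck inside the `jobserver` crate's token release. A minimal standalone sketch of that token flow, using the `jobserver` crate's public API (an illustration only, not Cargo's actual code):

```rust
// Illustrative sketch of the make-style jobserver token flow, not Cargo's code.
use jobserver::Client;

fn main() -> std::io::Result<()> {
    // Cargo creates one client per build and shares its pipe with every rustc
    // it spawns; each token is a single byte sitting in that pipe.
    let client = Client::new(4)?; // 4 parallel job slots, arbitrary for this sketch

    // Acquiring a slot reads one byte from the pipe (blocking until one is free).
    let token = client.acquire()?;

    // Dropping the token writes the byte back. That write is the
    // `jobserver::imp::Client::release` -> `__libc_write` frame the hung cargo
    // process is sitting in: if the kernel decides the pipe cannot accept the
    // byte, this one-byte write never returns and the whole build stalls.
    drop(token);
    Ok(())
}
```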

michael2012z commented 3 years ago

Thanks @alexcrichton .

I tried the Rust source code you shared in #9739; the buffer size it reported was "init buffer: 65536". And the content of /proc/sys/fs/pipe-user-pages-soft is "16384". So the pipe size does not seem to be very small.
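
For reference, a probe along those lines can be written with the `libc` crate and the Linux-specific F_GETPIPE_SZ fcntl. This is a minimal sketch and may differ from the exact program in #9739:

```rust
// Minimal sketch (may differ from the program in #9739): create a fresh pipe
// and ask the kernel how big its buffer is. Requires the `libc` crate;
// F_GETPIPE_SZ is Linux-specific.
fn main() {
    let mut fds = [0i32; 2];
    unsafe {
        assert_eq!(libc::pipe(fds.as_mut_ptr()), 0, "pipe() failed");
        // F_GETPIPE_SZ returns the pipe buffer capacity in bytes.
        let size = libc::fcntl(fds[0], libc::F_GETPIPE_SZ);
        println!("init buffer: {}", size);
        libc::close(fds[0]);
        libc::close(fds[1]);
    }
}
```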

Do you think the patch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=46c4c9d1beb7f5b4cec4dd90e7728720583ee348 would help in my case?

alexcrichton commented 3 years ago

Yes, that kernel patch should fix the issue if it's the one I'm thinking of (I'm not sure what else would cause Cargo to block in write on the jobserver pipe). The program from #9739 may not be an accurate test here, because the bug you're seeing only shows up when the system is under heavy load and there are lots of pipes in the system. There was a lot of discussion and several programs in #9739 as well, though, so it sort of depends...

If you can increase /proc/sys/fs/pipe-user-pages-soft (or a similar setting) and see whether that fixes the issue, that would likely pinpoint this, because there's nothing Cargo itself can do about it; it's kernel-level behavior we don't control. You could also try using -j1, which may avoid the problem, but that's a bit of a bummer too.
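
(Concretely, and as an assumption about this particular CI machine rather than something verified in this thread, that would mean something like `sysctl -w fs.pipe-user-pages-soft=<larger value>` on the build host, or writing the value into /proc/sys/fs/pipe-user-pages-soft directly, or passing `-j 1` to the `cargo test` invocation from the Steps above.)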