michael2012z opened 3 years ago
This looks like it's blocked in write, which looks like it's running into https://github.com/rust-lang/cargo/issues/9739. That issue can be summarized as: due to kernel behavior, Cargo (and make and/or other processes) can deadlock under high load when there are a lot of in-memory pipes created by the kernel. This should be fixed in upstream Linux itself at this point, but the fix will likely take a long time to propagate. I believe another fix is to increase the user pipe limits on your build machine, if you can.
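Roughly, Cargo's jobserver hands out parallelism tokens as bytes written to a pipe, and pipe(7) documents that once a user crosses the pipe-user-pages-soft limit, newly created pipes can be shrunk to a single page. The following is a minimal sketch of that failure mode (it assumes the libc crate and is not taken from Cargo's jobserver code): a write on a full pipe simply blocks until something reads from the other end, which is what a process "blocked in write" looks like in a backtrace.

```rust
// Minimal sketch (assumes the `libc` crate in Cargo.toml); NOT Cargo's
// jobserver implementation. It demonstrates that write(2) on a pipe blocks
// once the pipe's kernel buffer is full.
fn main() {
    let mut fds = [0i32; 2];
    unsafe {
        assert_eq!(libc::pipe(fds.as_mut_ptr()), 0);
        let token = [b'|'; 1];
        let mut written: usize = 0;
        loop {
            // Each iteration writes one "token" byte. Once the buffer is
            // full (often 64 KiB, but as little as one page under the
            // pipe-user-pages-soft limit), this call blocks forever,
            // because nothing ever reads from fds[0].
            let n = libc::write(fds[1], token.as_ptr() as *const _, token.len());
            assert_eq!(n, 1);
            written += 1;
            if written % 4096 == 0 {
                println!("wrote {} tokens without blocking", written);
            }
        }
    }
}
```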
Thanks @alexcrichton.
I tried the Rust source code you shared in #9739; the buffer size it reported was "init buffer: 65536", and the file /proc/sys/fs/pipe-user-pages-soft contains "16384". So the pipe size does not seem to be particularly small.
Do you think the patch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=46c4c9d1beb7f5b4cec4dd90e7728720583ee348 would help in my case?
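For reference, both numbers quoted above can be checked directly; here is a minimal sketch (assuming the libc crate, and not the exact program from #9739) that prints the buffer size of a freshly created pipe via F_GETPIPE_SZ together with the current per-user soft limit:

```rust
use std::fs;

fn main() {
    unsafe {
        let mut fds = [0i32; 2];
        assert_eq!(libc::pipe(fds.as_mut_ptr()), 0);
        // Capacity of this pipe's kernel buffer. 65536 is the usual
        // default; a value of 4096 (one page) suggests the per-user soft
        // limit has already been exceeded.
        let size = libc::fcntl(fds[0], libc::F_GETPIPE_SZ);
        println!("init buffer: {}", size);
    }
    // Per-user soft limit, counted in pages (16384 pages = 64 MiB).
    let soft = fs::read_to_string("/proc/sys/fs/pipe-user-pages-soft")
        .expect("failed to read pipe-user-pages-soft");
    println!("pipe-user-pages-soft: {}", soft.trim());
}
```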
Yes, that kernel patch should fix the issue if it's the one I'm thinking of (I'm not sure what else would cause Cargo to be blocked in write on the jobserver pipe). Running the program from #9739 may not be conclusive, though, because the bug you're seeing only happens when the system is under heavy load and there are lots of pipes in the system; there was a lot of discussion, and several test programs, in #9739 as well, so it sort of depends.
If you can increase /proc/sys/fs/pipe-user-pages-soft or a similar setting and see whether that fixes the issue, that would likely pinpoint this, because there's nothing Cargo itself can really do about it; it's a kernel-level thing we don't have control over. You could also try -j1, and that may fix things, but that's a bit of a bummer too.
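As a concrete illustration of that workaround, the soft limit can be raised by writing to the same procfs file. This is only a sketch: root is required, and the value used below is an example, not a recommendation.

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Writing this file requires root; it is the procfs equivalent of
    // `sysctl -w fs.pipe-user-pages-soft=65536`. The value 65536 is only
    // an illustration; pick something suited to your build machine
    // (pipe(7) documents 0 as disabling the soft limit entirely).
    fs::write("/proc/sys/fs/pipe-user-pages-soft", "65536\n")?;

    let now = fs::read_to_string("/proc/sys/fs/pipe-user-pages-soft")?;
    println!("pipe-user-pages-soft is now {}", now.trim());
    Ok(())
}
```

A change made this way does not persist across reboots; a sysctl.d entry would be needed to make it permanent.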
Problem
cargo test randomly hangs in the Cloud Hypervisor AArch64 unit-test building stage. Here is a recorded failing job in our CI: https://cloud-hypervisor-jenkins.westus.cloudapp.azure.com/blue/organizations/jenkins/cloud-hypervisor/detail/PR-3236/4/pipeline/248/
cargo version: cargo 1.55.0 (32da73ab1 2021-08-23)
kernel version: 5.10.0
The problem is highly random, and I have seen it on only one AArch64 server.
I reproduced the problem and debugged it with GDB; some observations are recorded in the attached gdb.txt: a backtrace of the cargo process, and backtraces of some of the rustc processes that were also seen (the ones I checked all look alike).
Steps
Possible Solution(s)
No response
Notes
No response