rust-lang / rust

Panic in `thread::sleep()` on arm32 Debian 10 #95661

Open and-rewsmith opened 2 years ago

and-rewsmith commented 2 years ago

This issue reproduces consistently on multiple arm32 Debian 10 (Buster) devices (Raspberry Pi 2B). Running a Docker container with an Ubuntu 20.04 base image and calling `thread::sleep()` results in the following panic:

thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
left: `1`,
right: `4`', library/std/src/sys/unix/thread.rs:217:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I tried this code (simplified):

const PROCESS_POLL_INTERVAL_SECS: Duration = Duration::from_secs(1);
thread::sleep(PROCESS_POLL_INTERVAL_SECS);

Full source code: https://github.com/Azure/iotedge/blob/main/edge-hub/watchdog/src/child.rs#L103

I expected to see this happen: no panic, instead a proper sleep.

Instead, this happened: Attempting to sleep panicked.

Meta

Will fill in the below information in a bit.

rustc --version --verbose:

rustc 1.58.1 (db9d1b20b 2022-01-20)
binary: rustc
commit-hash: db9d1b20bba1968c1ec1fc49616d4742c1725b4b
commit-date: 2022-01-20
host: x86_64-unknown-linux-gnu
release: 1.58.1
LLVM version: 13.0.0

saethlin commented 2 years ago

The panicking line is here: https://github.com/rust-lang/rust/blob/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys/unix/thread.rs#L217

From some searching, it looks like real-time OSes can return EPERM from a sleep. You said you're running an Ubuntu image, but the point of containers is to share a kernel with the host. What's the host OS here?
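To make the panic message easier to read: on Linux, errno `1` is EPERM and errno `4` is EINTR, so the assertion at the linked line appears to expect that a failed sleep can only have been interrupted. The following is a minimal sketch of that retry pattern, not the actual std source. The `sleep_model` function and the closure standing in for the `nanosleep` call are hypothetical names, and the sketch returns the unexpected errno instead of asserting.

```rust
// Linux errno values; they match `left: 1` and `right: 4` in the panic message.
const EPERM: i32 = 1; // "Operation not permitted"
const EINTR: i32 = 4; // "Interrupted system call"

/// Hypothetical stand-in for the outcome of one `nanosleep` call:
/// `Ok(())` when the sleep completed, `Err(errno)` when the call failed.
type SleepResult = Result<(), i32>;

/// Models the retry loop: EINTR means "try again", any other errno is
/// handed back to the caller. (An assertion in this position is what
/// turns an unexpected EPERM into a panic.)
/// Returns the number of retries that were needed.
fn sleep_model(mut nanosleep: impl FnMut() -> SleepResult) -> Result<u32, i32> {
    let mut retries = 0;
    loop {
        match nanosleep() {
            Ok(()) => return Ok(retries),
            Err(EINTR) => retries += 1,
            Err(errno) => return Err(errno),
        }
    }
}
```

With a fake call that fails once with EINTR and then succeeds, `sleep_model` reports one retry; with a fake call that always fails with EPERM, it returns `Err(EPERM)` where an assertion would have panicked.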

and-rewsmith commented 2 years ago

Host OS: Debian 10 Buster

This doesn't happen with Debian 9 or Debian 11.

and-rewsmith commented 2 years ago

@saethlin

> You said you're running an Ubuntu image, but the point of containers is to share a kernel with the host.

For the last 1-2 years we have been running Ubuntu 18 base image containers on Debian 9, 10, and 11 without issue. I know kernels provide backwards compatibility of syscalls across versions, but I am thinking it could be going the wrong way here (trying to run a container built for a newer kernel, Ubuntu 20, on the older Debian 10 host kernel).

There may be a workaround, though, since the panic comes from the assert you mentioned rather than from libc itself (although os::errno() is involved). For anyone interested in mitigating this quickly, shelling out to bash sleep will work. Calling the nanosleep syscall directly may also work, but I didn't go down this path.
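The shell-out mitigation mentioned above could look roughly like the sketch below. It assumes a `sleep` executable is on the `PATH` inside the container and invokes it directly rather than through bash. The helper names `sleep_command` and `shell_sleep` are made up for illustration, and the duration is rounded up to whole seconds because not every `sleep` implementation accepts fractions.

```rust
use std::io;
use std::process::Command;
use std::time::Duration;

/// Builds a `sleep <secs>` command, rounding any sub-second part up
/// to the next whole second.
fn sleep_command(dur: Duration) -> Command {
    let secs = dur.as_secs() + u64::from(dur.subsec_nanos() > 0);
    let mut cmd = Command::new("sleep");
    cmd.arg(secs.to_string());
    cmd
}

/// Sleeps by spawning the external `sleep` program and waiting for it.
/// Returns an error if the program cannot be started or exits non-zero.
fn shell_sleep(dur: Duration) -> io::Result<()> {
    let status = sleep_command(dur).status()?;
    if status.success() {
        Ok(())
    } else {
        Err(io::Error::new(
            io::ErrorKind::Other,
            format!("sleep exited with {}", status),
        ))
    }
}
```

A call such as `shell_sleep(Duration::from_secs(1))?` would then stand in for `thread::sleep(PROCESS_POLL_INTERVAL_SECS)`, at the cost of spawning a process per sleep.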

Helpful article: https://forums.docker.com/t/libc-incompatibilities-when-will-they-emerge/9895

the8472 commented 2 years ago

It would be quite silly, but maybe the container has a syscall filter that blocks nanosleep but not clock_nanosleep?

saethlin commented 2 years ago

Interesting! Can you share the output of an strace -f of sleeping via the shell so that we can see if there's some interesting sequence of syscalls or errors there?

MinnDevelopment commented 2 years ago

Could this be an issue with Docker filtering nanosleep calls? I found this potentially related Stack Overflow post: https://stackoverflow.com/questions/67117744/32-bit-executables-not-sleeping-inside-a-docker-container

If that is the case, then running the container with --privileged or --security-opt=seccomp:unconfined should solve the problem.