Open tiif opened 1 week ago
@rustbot claim
A quick fix would be removing the unwrap and only use the epoll fd if it still exists, but I think it is better to investigate why the assumption is violated.
This is the test in tokio that ICE'd:
#[test]
fn fs_shutdown_before_started() {
let rt = rt();
let _enter = rt.enter();
rt.shutdown_timeout(Duration::from_secs(1000));
let err: std::io::Error = Handle::current()
.block_on(fs::read_to_string("Cargo.toml"))
.unwrap_err();
assert_eq!(err.kind(), std::io::ErrorKind::Other);
let inner_err = err.get_ref().expect("no inner error");
assert_eq!(inner_err.to_string(), "background task failed");
}
This is the reproducible I wrote to be tested in miri:
use tokio::fs;
use tokio::runtime::Handle;
use tokio::time::Duration;
fn main() {
let rt = tokio::runtime::Builder::new_current_thread().enable_all().build().unwrap();
let _enter = rt.enter();
rt.shutdown_timeout(Duration::from_secs(1000));
let err: std::io::Error =
Handle::current().block_on(fs::read_to_string("Cargo.toml")).unwrap_err();
assert_eq!(err.kind(), std::io::ErrorKind::Other);
let inner_err = err.get_ref().expect("no inner error");
assert_eq!(inner_err.to_string(), "background task failed");
}
The tokio test still ICE on latest nightly rustc 1.83.0-nightly (12b26c13f 2024-09-07)
, but the test I have written passed miri on commit 335c91c
. Does anyone have any idea what could possibly caused this?
@Darksonn Could you help to take a look at this? I am not sure if my reproducible is equivalent to the original tokio test.
It might be kind of random if it depends on the exact scheduler behavior. A particular Miri version is deterministic, but it is not deterministic across multiple versions. Different nightlies means you explore different interleavings. On the nightly where your example managed to cause the ICE, does it also cause the ICE with -Zmiri-preemption-rate=0.0
?
You can try running cargo miri test --many-seeds
to explore a bunch of different interleavings.
The goal should be to get a reproducer that doesn't use tokio, but directly uses epoll. But that is probably best done by actually understanding what happens, and then writing code to specifically hit that corner case.
This is a Miri issue so I doubt tokio people will have a good idea of how to get to the bottom of this (but I may be wrong ofc :D ). It needs someone who understands the internal Miri implementation. Try adding more and more tracing to Miri until you understand where that invariant gets broken.
Why does EpollEventInterest even store an FD number to begin with? That's a relatively fragile identifier that we should generally never use inside Miri. Why not store a (Weak)FileDescriptionRef?
For instance, one could use dup
to give a second FD number to an epoll instance, and then close the original epoll instance -- then the FD number in EpollEventInterest would be outdated but the epoll instance is still perfectly alive and running.
Thanks for the suggestions.
On the nightly where your example managed to cause the ICE, does it also cause the ICE with -Zmiri-preemption-rate=0.0?
Yes, the ICE still happened after I added that flag.
You can try running cargo miri test --many-seeds to explore a bunch of different interleavings.
For the test I have written, all seeds range from 0 to 63 won't ICE.
I think it might help to check if I somewhat got the rt()
runtime configuration wrong, or tokio implicitly uses epoll
somewhere in the test configuration. But I will also try to add more tracing in miri. :)
For instance, one could use dup to give a second FD number to an epoll instance, and then close the original epoll instance -- then the FD number in EpollEventInterest would be outdated but the epoll instance is still perfectly alive and running.
Good point, it is better to use WeakFileDescriptionRef
so we don't need to retrieve Epoll
file description from fd table again and this might also fix the ICE. I think initially fd number is used to prevent circular reference, but this shouldn't happen if we use WeakFileDescriptionRef
.
I willl add a test for this.
I see there are two fd numbers in EpollEventInterest
. Unless there's a really good reason to do something else, they should both be changed to WeakFileDescriptionRef
IMO.
I think the file_descriptor
field can't be replaced. That value is only used as the key of ready_list
, which is a BTreeMap (the only line that uses that value is here). The entry of ready_list needs to be differentiated using both the fd number and FdId. Also, the key of BTreeMap needs to implement PartialOrd
, Ord
, PartialEq
and Eq
, so WeakFileDescriptionRef
can't be used as the key.
Okay, that field should then have a comment saying that is is only used to report the FD number back to the user in case of an event.
Ok, now the test can ICE by adding test_util
to the tokio
dependency. Thanks @oli-obk for discovering this! :)
Miri ICE on this Tokio test: https://github.com/tokio-rs/tokio/blob/27539ae3bdfd08b7503f4919fe13299537cc16fc/tokio/tests/rt_handle_block_on.rs#L146. It panics on the unwrap in this line.
I am currently working on getting a reproducible. This
unwrap
will fail when we try to retrieve an epoll instance that is closed from fd table. It supposed to be guarded by theupgrade
ofweak_epoll_interest
, because epoll file description is the only one that holds a strong ref toEpollInterest
, so when the epoll instance is closed, the upgrade ofweak_epoll_interest
should fail. This ICE is a case whereweak_epoll_interest
upgrade succeed, but the epoll file descriptor is closed.Full trace
```rust test current_thread_scheduler::fs_shutdown_before_started ... thread 'rustc' panicked at src/tools/miri/src/shims/unix/linux/epoll.rs:556:87: called `Option::unwrap()` on a `None` value stack backtrace: 0: 0x73b696bfed9a -Version: