tokio-rs / loom

Concurrency permutation testing tool for Rust.

Unclear "Concurrent load and mut accesses" with an atomic counter #260

Open · kvark opened this issue 2 years ago

kvark commented 2 years ago

Disclaimer: I think this project is amazingly cool! I'd love to ask this question in a discussion or a Matrix room, but the project links to neither, so I'm filing it here.

Loom complains about this line - https://github.com/kvark/choir/blob/0c7af40b88e3613520eba52e4993000ee5ca2fae/tests/basic.rs#L34

```
Causality violation: Concurrent load and mut accesses.
    created: tests/basic.rs:27:26
       load: thread #1 @ tests/basic.rs:34:15
', /Users/kvark/.cargo/git/checkouts/loom-1f4be9fd68e6b59b/d0c1e71/src/rt/location.rs:115:9
```

The first problem is that it lists where the atomic was "created" and where the "load" happened, but not where the "mut" access referenced in the error occurred.

I assume it doesn't like this fetch_add, but I don't know why. It's just an atomic counter.
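
For reference, the shape of the code is roughly this (a minimal sketch, not the actual linked test): two threads bump a shared counter with fetch_add, and the result is read back afterwards.

```rust
// Minimal sketch of the pattern under discussion (hypothetical; the real
// test lives in the linked repo): purely atomic ops on a shared counter.
#[test]
fn atomic_counter_smoke() {
    loom::model(|| {
        use loom::sync::{atomic::{AtomicUsize, Ordering::SeqCst}, Arc};

        let counter = Arc::new(AtomicUsize::new(0));
        let c1 = Arc::clone(&counter);

        let t1 = loom::thread::spawn(move || {
            c1.fetch_add(1, SeqCst);
        });
        counter.fetch_add(1, SeqCst);
        t1.join().unwrap();

        // Both increments are atomic, so this holds in every interleaving.
        assert_eq!(counter.load(SeqCst), 2);
    });
}
```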

hawkw commented 2 years ago

Hmm, this certainly looks off... that error should only be emitted if the atomic's get_mut method is used, and it appears your test never uses it. This could be a bug!

kvark commented 2 years ago

@hawkw could you elaborate on how this should ideally work?

It's not clear to me what Loom is supposed to do here. On one hand, this is an atomic doing atomic ops. It's totally thread safe. On the other hand, if an atomic can end up with a different value based on the flow of the program, it's quite likely a bug.

I hope Loom follows the second line of thought. If so, the current modelling of atomics is too simplistic: the code simply considers fetch_add to be a generic read-modify-write operation. This means Loom would treat a fetch_add colliding with a fetch_and as being just as dangerous as a fetch_add colliding with another fetch_add.

Perhaps Loom needs to be taught that a fetch_add can happen in any order relative to another fetch_add or fetch_sub (as long as there are no overflows/underflows)?

hawkw commented 2 years ago

Sorry if my first comment wasn't clear! What I meant is, loom only emits "Causality violation" errors when an access would violate the C++11 memory model — that error doesn't mean that there was any kind of bug in the program, it specifically means that a data race would occur.

In particular, a causality violation can only occur when a value is accessed non-atomically. In the case of an AtomicUsize, this specifically means either the with_mut or unsync_load methods. Since a causality violation indicates a data race was detected, it should never be possible to cause a causality violation in purely safe Rust — only unsafe code can potentially violate the memory model.
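
To illustrate, a quick sketch (assuming loom's current API, where with_mut takes &mut self and unsync_load is an unsafe fn): these are the only AtomicUsize accesses loom models as non-atomic.

```rust
use loom::sync::atomic::AtomicUsize;

// Sketch: the only AtomicUsize operations loom treats as non-atomic
// accesses. Everything else (load/store/fetch_add/...) is a real atomic
// op and can never produce a causality violation.
fn non_atomic_accesses(a: &mut AtomicUsize) {
    // Like std's `get_mut`: requires exclusive access, so safe Rust
    // cannot race on it.
    a.with_mut(|v| *v += 1);

    // Unsynchronized load: the caller must guarantee there are no
    // concurrent writes.
    let _v = unsafe { a.unsync_load() };
}
```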

> It's not clear to me what Loom is supposed to do here. On one hand, this is an atomic doing atomic ops. It's totally thread safe. On the other hand, if an atomic can end up with a different value based on the flow of the program, it's quite likely a bug.

Loom does also help you catch bugs other than data races. It does this by simulating all the possible flows through the program where the atomic may have different values. If a race condition on an atomic (i.e., not a data race) would cause incorrect behavior, this would be caught by assertions in the test or in the code under test, by Loom's deadlock and memory leak checking, or by causing a causality violation if the atomic controls access to an UnsafeCell or similar. Interacting with an atomic using only atomic operations can still have bugs that Loom can find, but those bugs are not causality violations.
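
For example (a minimal sketch), here is an atomics-only bug that loom reports through a failing assertion rather than a causality violation: a lost update caused by performing an increment as a separate load and store.

```rust
// Sketch: a race condition (not a data race) that loom reports via the
// failing assertion below, never as a causality violation.
#[test]
fn lost_update() {
    loom::model(|| {
        use loom::sync::{atomic::{AtomicUsize, Ordering::SeqCst}, Arc};

        let counter = Arc::new(AtomicUsize::new(0));
        let c1 = Arc::clone(&counter);

        let t1 = loom::thread::spawn(move || {
            // Increment done as two separate atomic ops...
            let v = c1.load(SeqCst);
            c1.store(v + 1, SeqCst);
        });
        let v = counter.load(SeqCst);
        counter.store(v + 1, SeqCst);
        t1.join().unwrap();

        // ...so loom finds the interleaving where both threads read 0,
        // one increment is lost, and this assertion fails.
        assert_eq!(counter.load(SeqCst), 2);
    });
}
```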

Therefore, since you only call fetch_add and load on the atomic, this code should never result in a causality violation error, so the error is definitely incorrect.

Does that help to clarify what I meant?

hawkw commented 2 years ago

Also, looking at the code, the only place where an atomic will print the specific error message "concurrent load and mut accesses"[^1] is here: https://github.com/tokio-rs/loom/blob/6d159963a1dafe961478b549a71d1409af712485/src/rt/atomic.rs#L568-L582, which is only called from inside the with_mut method. In addition, the error message should include the location of the with_mut call that caused the mutable access. So it really looks like this particular error was emitted incorrectly.

@kvark What version of loom are you using? Is it possible your dependency is out of date? Edit: ah, I see that you're using the latest HEAD commit: https://github.com/kvark/choir/commit/0c7af40b88e3613520eba52e4993000ee5ca2fae#diff-2e9d962a08321605940b5a657135052fbcef87b5e360662bb527c96d9a615542R24 ... so that's even weirder; I don't see anywhere in the current HEAD where that specific panic should be emitted without a with_mut call.

[^1]: other causality violation messages are also printed by atomics but they have slightly different text.

kvark commented 2 years ago

@hawkw please accept my greatest appreciation for the detail in which you described the issue. I'll try to debug it a bit more and see where Loom may behave erroneously.

kvark commented 2 years ago

So track_load is indeed the function that fires this error. However, it's not only called from get_mut: it's also called from rmw(), which is called from Atomic::try_rmw, which in turn is reached by essentially all atomic operations.

So, from the code standpoint, it doesn't look like an oversight, but rather an architectural bug/feature.

hawkw commented 2 years ago

Hmm, it's strange that your panic is missing a with_mut location, which the code in track_load() should add to the error message. That's what's confusing me about this...

hawkw commented 2 years ago

As a side note, I recently hit the same error in some code of my own. I think what's happening here is that loom is (perhaps incorrectly?) treating the creation of the atomic as a mutable access?

hawkw commented 2 years ago

Oh, okay, I've figured something out. It looks like one of the two locations is not being printed because no location was captured for a write on that thread...which means that thread never actually started a write. But, somehow, the model believes that thread has accessed the atomic mutably. I think there's definitely some kind of bug.

RustyYato commented 1 year ago

I'm running into this bug as well. I'm not using with_mut or unsync_load anywhere in my code base, so it's clearly not that. When I turn on locations, I see that it reports Atomic*::new and Atomic*::load as conflicting, which makes no sense to me. Is there some way I can help fix this bug? Or is there a workaround I can use to get around it (like skipping that iteration)?

loyd commented 5 months ago

Has anyone found a workaround for this problem?

loyd commented 5 months ago

I tried to put together a minimal example for my case:

```rust
#[test]
fn loom_260() {
    use loom::sync::{
        atomic::{AtomicPtr, AtomicU32, Ordering::*},
        Arc, Mutex,
    };
    use loom::thread;

    // A naive concurrent slab
    struct Slots {
        head: AtomicU32,
        slots: AtomicPtr<Vec</* next */ AtomicU32>>, // lazily allocated
        lock: Mutex<()>,
    }

    impl Slots {
        fn push(&self) -> u32 {
            let mut slots = self.slots.load(Relaxed); // `Acquire` fixes the issue <<<
            if slots.is_null() {
                self.try_alloc();
                slots = self.slots.load(Acquire);
            }

            let slots = unsafe { slots.as_ref().unwrap() };

            let mut head = self.head.load(Acquire);
            loop {
                let next_head = slots[head as usize].load(Acquire);
                match self.head.compare_exchange(head, next_head, AcqRel, Acquire) {
                    Ok(_) => return head,
                    Err(new) => head = new,
                }
            }
        }

        fn try_alloc(&self) {
            let _guard = self.lock.lock().unwrap();
            if !self.slots.load(Acquire).is_null() {
                return;
            }

            let slots = (0..4).map(|i| AtomicU32::new(i + 1)).collect();
            let ptr = Box::into_raw(Box::new(slots));
            self.slots.store(ptr, Release);
        }
    }

    loom::model(|| {
        let slots = Arc::new(Slots {
            head: AtomicU32::new(0),
            slots: AtomicPtr::new(std::ptr::null_mut()),
            lock: Mutex::new(()),
        });

        let slots1 = slots.clone();
        let t1 = thread::spawn(move || slots1.push());
        slots.push();
        t1.join().unwrap();
    });
}
```

Although the message is unclear, loom does actually "detect" something in this case. I wonder why this code is considered incorrect. Is this an imperfection of loom, or is it a real problem according to the atomic memory model?

RustyYato commented 5 months ago

I think I'll try my hand at this, since there hasn't been any response in months. I can reproduce the bug on main, and the conflict is between AtomicXXX::new and AtomicXXX::load, which should never conflict (especially because there is an Acquire/Release happens-before relation via Slots::slots).

Let me see if I can figure out where loom is going wrong :smile:

NOTE: adding `fence(Acquire)` after the load of `slots` and `fence(Release)` before the store of `slots` does fix the bug, which shows that loom isn't correctly handling loads/stores here.
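
For reference, the Acquire side of that workaround looks like this (a sketch; by the C++11 fence rules, a Relaxed load followed by `fence(Acquire)` synchronizes with the Release store whose value the load observed). The Release side is symmetric: `fence(Release)` immediately before the store.

```rust
use loom::sync::atomic::{fence, AtomicPtr, Ordering::*};

// Sketch of the fence workaround: the Acquire fence "upgrades" the
// earlier Relaxed load, establishing happens-before with the Release
// store that published the pointer.
fn load_published<T>(slots: &AtomicPtr<T>) -> *mut T {
    let ptr = slots.load(Relaxed);
    fence(Acquire);
    ptr
}
```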

RustyYato commented 5 months ago

Ignore this; see my next comment. It was based on the assumption that the example above was correct, which it isn't.

I think tracking `AtomicXXX::new` as an `unsync_mut` is wrong, or this join is wrong: https://github.com/tokio-rs/loom/blob/6a03bf052ad074bd13552c1654c012f9f4072b6e/src/rt/atomic.rs#L711

Removing that one line, all of loom's tests still pass... but that doesn't seem right, since this is basically just disabling the check altogether.

RustyYato commented 5 months ago

@loyd Wait a minute,

```rust
let mut slots = self.slots.load(Relaxed); // `Acquire` fixes the issue <<<
```

This load does need to be Acquire so that it synchronizes with the store in try_alloc. So it looks like loom did catch a real bug here. It looks like you tried to implement double-checked locking but got the orderings flipped around (see the C++11 reference implementation on Wikipedia).

The load under the mutex can be Relaxed, but the one outside the mutex must be Acquire.

Not sure why I didn't see that sooner :upside_down_face:
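
Concretely, the two loads with corrected orderings would look like this (a fragment of the Slots impl from the example above, not a standalone compiling example):

```rust
// In `push`: the unguarded load must be Acquire so that, when it observes
// the pointer published by another thread, it synchronizes with the
// Release store in `try_alloc`.
let mut slots = self.slots.load(Acquire); // was Relaxed: the bug

// In `try_alloc`: the load under the mutex can be Relaxed, because
// locking the mutex already orders it after any store performed under
// the same mutex.
if !self.slots.load(Relaxed).is_null() {
    return;
}
```

Per the `// Acquire fixes the issue` comment in the original example, the first change alone makes the reported violation go away.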

loyd commented 5 months ago

@RustyYato, thanks for the quick response, and I apologize for the time you spent on this.

I'm not sure why this is the case or what could actually go wrong with the code as written, but apparently this is a real bug, so thanks to Loom for catching it. Anyway, it would be nice to have a better message instead of only `created` and `load`.