rrnewton / haskell-lockfree

A collection of different packages for CAS based data structures.

CAS very rarely fails, not due to a race, in GHC >= 8.2 #77

Open jberryman opened 5 years ago

jberryman commented 5 years ago

This may be the same as https://github.com/rrnewton/haskell-lockfree/issues/69 but I decided it would be best to report a new bug since my test case is different.

In https://github.com/jberryman/unagi-chan I have some quite paranoid and specific tests for properties I care about in atomic-primops. I've come back to the library after a while and found that one of those tests failed on CI (which I just set up).

It's here and looks like:

-- Imports this excerpt needs, for reference (forkSync is a local test
-- helper; a sketch of it follows the test below):
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM, replicateM, when)
import Data.Atomics (readForCAS, casIORef, peekTicket)
import Data.IORef (newIORef)
import Data.List (partition)
import qualified Data.Set as Set

-- Test these assumptions:
--   1) If a CAS fails in thread 1 then another CAS (in thread 2, say) succeeded; i.e. no false negatives
--   2) In the case that thread 1's CAS failed, the ticket returned with (False,tk) will contain that newly-written value from thread 2
testConsistentSuccessFailure :: IO ()
testConsistentSuccessFailure = do
    var <- newIORef "0"

    sem <- newIORef (0::Int)
    outs <- replicateM 2 newEmptyMVar 

    _ <- forkSync sem 2 $ test "a" var (outs!!0)
    _ <- forkSync sem 2 $ test "b" var (outs!!1)

    mapM takeMVar outs >>= examine
       -- w/r/t (2) above: we only try to find an element read along with False
       -- which wasn't sent by another thread, which isn't ideal
 where attempts = 100000
       test tag var out = do

         res <- forM [(1::Int)..attempts] $ \x-> do
                    let str = (tag++(show x))
                    tk <- readForCAS var
                    (b,tk') <- casIORef var tk str
                    return (if b then str else peekTicket tk' , b)
         putMVar out res

       examine [res1, res2] = do
         -- any failures in either should be marked as successes in the other
         let (successes1,failures1) = (\(x,y)-> (Set.fromList $ map fst x, map fst y)) $ partition snd res1
             (successes2,failures2) = (\(x,y)-> (Set.fromList $ map fst x, map fst y)) $ partition snd res2
             ok1 = all (flip Set.member successes2) failures1
             ok2 = all (flip Set.member successes1) failures2
         if ok1 && ok2
             then when (length failures1 < (attempts `div` 6) || length failures2 < (attempts `div` 6)) $
                    error "There was not enough contention to trust test. Please retry."
             else do print res1
                     print res2
                     error "FAILURE!"
       examine _ = error "Fix testConsistentSuccessFailure"
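
(forkSync is a small helper from the test module: each forked thread bumps the shared counter and spins until all participants have arrived, so the two CAS loops start together and actually contend. Roughly, it looks like the following sketch, which is illustrative rather than the exact code:)

import Control.Concurrent (ThreadId, forkIO)
import Control.Monad (when)
import Data.IORef (IORef, readIORef, atomicModifyIORef')

-- Barrier-style fork (sketch of the assumed behavior, not the real helper):
forkSync :: IORef Int -> Int -> IO () -> IO ThreadId
forkSync sem n act = forkIO $ do
    atomicModifyIORef' sem (\x -> (x + 1, ()))  -- signal arrival
    let spin = do arrived <- readIORef sem      -- busy-wait until all n
                  when (arrived < n) spin       -- threads have arrived
    spin
    act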

The idea is to let two threads race, record what happens, and make sure the history comports with our stated properties in the comment above.
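
For anyone unfamiliar with the ticket-based API from atomic-primops (Data.Atomics): readForCAS takes a snapshot Ticket, casIORef succeeds only if the ref still holds the ticketed value, and the Ticket it returns reflects whatever is in the ref afterwards. A minimal single-threaded sketch of the expected uncontended behavior, with the expected results in comments:

import Data.Atomics (readForCAS, casIORef, peekTicket)
import Data.IORef (newIORef)

main :: IO ()
main = do
    ref <- newIORef "old"
    tk  <- readForCAS ref
    (ok1, tk1) <- casIORef ref tk "new"    -- ticket is current: should succeed
    print (ok1, peekTicket tk1)            -- expected: (True,"new")
    (ok2, tk2) <- casIORef ref tk "other"  -- tk is now stale: should fail
    print (ok2, peekTicket tk2)            -- expected: (False,"new"): a failed CAS
                                           -- hands back a ticket for the live value

The bug reported here is that the first kind of CAS, done with a ticket that should still be current, can very rarely come back False on GHC >= 8.2.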

Changing attempts to 1,000,000 and running repeatedly until failure, I can reliably produce an error on GHC 8.6.3 within a few seconds. Here's an example history from one such run:

(27476,("a27936",False))
(27477,("b27477",True))
(27478,("b27478",True))
(27479,("b27479",True))
                             (28246,("a28246",True))
                             (28247,("a28247",True))
(27480,("b27480",True))
                             (28248,("a28248",True))
(27481,("b27481",True))
                          ?  (28249,("a28248",False))
(27482,("b27482",True))
                             (28250,("a28250",True))

(27483,("a28251",False))  x  (28251,("a28251",True)) 
                             (28252,("a28252",True))
(27484,("b27484",True))   x  (28253,("b27484",False))
(27485,("a28254",False))  x  (28254,("a28254",True)) 

Each column is a separate thread, with time moving down, and each tuple means: (attempt_number, (value_written_if_success_else_value_peeked_from_returned_ticket, succeeded?)).

The x's between the columns mark a couple of races (which the test is deliberately trying to generate). The ? is the problem: it seems to show that, with no contention, the right-hand thread performed a CAS which succeeded and then a CAS which failed unexpectedly (according to the returned ticket, the previous value was the one we'd just written).
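
Concretely, reading the ? line together with the successful CAS just above it, thread A appears to have executed the equivalent of the following, with no intervening write from thread B (values taken from the trace; this replays the step in isolation):

import Data.Atomics (readForCAS, casIORef, peekTicket)
import Data.IORef (newIORef)

main :: IO ()
main = do
    var <- newIORef "a28248"          -- the value thread A itself just wrote
    tk  <- readForCAS var             -- fresh ticket, and nobody else writes
    (b, tk') <- casIORef var tk "a28249"
    print (b, peekTicket tk')
    -- A correct CAS must succeed here; in the failing run it came back as
    -- (False,"a28248"), i.e. it failed although, per the returned ticket,
    -- the value never changed out from under us.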

I did some testing with atomic-primops 0.8.3 and got failures on GHC versions 8.2 and later.

GHC 8.0.1 never exhibited the failure, even after running for 2 hours and 300 million test cases. So this may be a regression of some kind.