haskell-effectful / effectful

An easy to use, fast extensible effects library with seamless integration with the existing Haskell ecosystem.
BSD 3-Clause "New" or "Revised" License
369 stars 28 forks source link

cloneEnv intermittently tries to relink a deleted effect #243

Open michaelpj opened 4 weeks ago

michaelpj commented 4 weeks ago

We've been running with master for a bit in order to get the fix for https://github.com/haskell-effectful/effectful/issues/237. I think we may have hit another bug?

I don't have a reproducer, but we're getting the following error:

undefined data
CallStack (from HasCallStack):
  error, called at src/Effectful/Internal/Env.hs:428:17 in effectful-core-2.4.0.0-BiuUQUZzpHcJ16lkaBGrSU:Effectful.Internal.Env
  undefinedData, called at src/Effectful/Internal/Env.hs:413:26 in effectful-core-2.4.0.0-BiuUQUZzpHcJ16lkaBGrSU:Effectful.Internal.Env

From a very cursory look, I guess this could occur if some parts of the environment are not initialized properly, which could conceivably have been introduced with the recent fix PRs? I can try and debug more, but it's not easy to reproduce. Perhaps a good step would just to be to add some more HasCallStacks?

arybczak commented 4 weeks ago

Hmm, this is very weird. undefinedData, called at src/Effectful/Internal/Env.hs:413:26 is in deleteEffect. So that means that you're trying to access an effect that went out of scope. However, there are additional version checks (in getLocation) that should fail before.

Can you try with https://github.com/haskell-effectful/effectful/tree/hascallstack-runners? This should at least give you a location which deleted effect it tries to use.

michaelpj commented 4 weeks ago

Great, that's helpful. I'll try and do it but it might be a bit since it only happens in production (:see_no_evil: ).

In case it helps, here are a few possibly relevant facts:

  1. The function that triggers this is being run for the second time. So it worked once without failure; I think something must have modified its environment?
  2. The function that triggers this is being run in parallel with a second thread. Both of them are running with cloned environments from the parent thread.

So conceivably something is going wrong with cloning?

Anyway, I'll try and run with your branch and report back.

arybczak commented 4 weeks ago

since it only happens in production (:see_no_evil: ).

Ugh, my condolences.

Ok, if this is a concurrent code and is hard to reproduce, that explains it. It looks like you have two threads using Env that hasn't been cloned and thus uses the same underlying Storage :confused:

import Control.Concurrent
import Control.Concurrent.Async
import Effectful
import Effectful.Reader.Static
import Effectful.Dispatch.Static.Unsafe

works :: IO ()
works = runEff $ do
  x <- runReader () $ do
    withEffToIO (ConcUnlift Ephemeral $ Limited 1) $ \unlift -> do
      x <- liftIO . async $ do
        threadDelay 1000000
        unlift ask
      threadDelay 500000
      pure x
  liftIO $ wait x

fails :: IO ()
fails = runEff $ do
  x <- runReader () $ do
    reallyUnsafeUnliftIO $ \unlift -> do
      x <- liftIO . async $ do
        threadDelay 1000000
        unlift ask
      threadDelay 500000
      pure x
  liftIO $ wait x
>>> works
>>> fails
*** Exception: version (2) /= storageVersion (0)
CallStack (from HasCallStack):
  error, called at src/Effectful/Internal/Env.hs:360:5 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Env

fails tries to use the Reader after it was deleted by the parent thread and it fails, but it still fails on the version check (because it's reset to 0 when an effect is deleted). What I think is happening in your case is there is a race between a thread accessing the effect and the one deleting it:

  1. T1 tries to access the effect E, effectful does version check, it succeeds.
  2. T2 deletes E.
  3. T1 gets the effect from the Storage and it's no longer there since it was deleted by T2.

Very unfortunate. It's likely you use a static effect that wraps an IO-based concurrent API and one of the wrapper functions has a bug, i.e. doesn't properly clone Env when spawning a thread :thinking:

arybczak commented 3 weeks ago

Ok, I merged #244 with a bunch of HasCallStack constraint for better call stacks.

I also pushed https://github.com/haskell-effectful/effectful/tree/thread-sanitizer with a PoC of a primitive thread sanitizer.

master:

>>> fails
*** Exception: version (2) /= storageVersion (0)
CallStack (from HasCallStack):
  error, called at src/Effectful/Internal/Env.hs:360:5 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Env
  getLocation, called at src/Effectful/Internal/Env.hs:309:14 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Env
  getEnv, called at src/Effectful/Internal/Monad.hs:600:35 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Monad
  getStaticRep, called at src/Effectful/Reader/Static.hs:53:15 in effectful-core-2.4.0.0-inplace:Effectful.Reader.Static
  ask, called at tests/Main.hs:53:16 in effectful-2.4.0.0-inplace-test:Main

thread-sanitizer:

>>> fails
*** Exception: Storage used from multiple threads: owner (54) /= current (55)
CallStack (from HasCallStack):
  error, called at src/Effectful/Internal/Env.hs:456:14 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Env
  readStorage, called at src/Effectful/Internal/Env.hs:356:28 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Env
  getLocation, called at src/Effectful/Internal/Env.hs:311:14 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Env
  getEnv, called at src/Effectful/Internal/Monad.hs:600:35 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Monad
  getStaticRep, called at src/Effectful/Reader/Static.hs:53:15 in effectful-core-2.4.0.0-inplace:Effectful.Reader.Static
  ask, called at tests/Main.hs:53:16 in effectful-2.4.0.0-inplace-test:Main

Basically, whenever you access a fresh Storage, it gets bound to the current thread, so if it ever gets accessed from a different thread, you'll immediately know. If my hypothesis from above post is correct, this should let you identify the problem quickly :thinking:

michaelpj commented 3 weeks ago

Hmmmmm. I don't think we're doing anything weird here. The two threads run together and the main thread waits for both, and I don't think anything changes the environment. But I think it must be something like that.

That thread sanitizer stuff looks amazing! I will try that out too!

michaelpj commented 3 weeks ago

The thread sanitizer did not trigger. Here's a stack trace:

undefined data
CallStack (from HasCallStack):
  error, called at src/Effectful/Internal/Env.hs:433:17 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Env
  undefinedData, called at src/Effectful/Internal/Env.hs:418:26 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Env
  deleteEffect, called at src/Effectful/Internal/Env.hs:224:3 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Env
  unconsEnv, called at src/Effectful/Internal/Monad.hs:526:5 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Monad
  runHandler, called at src/Effectful/Dispatch/Dynamic.hs:465:20 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Dispatch.Dynamic
  reinterpret, called at src/Effectful/NonDet.hs:78:21 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.NonDet
  runNonDetRollback, called at src/Effectful/NonDet.hs:60:22 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.NonDet
  runNonDet, called at <our code>

So possibly a bug or gotcha in runNonDetRollback. I think I don't fundamentally need that, so I'm probably going to just try and get rid of it and see if that helps. But maybe somewhere to look? I think it's likely that this is happening when we hit an empty and trigger a rollback.

arybczak commented 3 weeks ago

The thread sanitizer did not trigger

Well, that's one good news at least.

So possibly a bug or gotcha in runNonDetRollback.

Well, it turns out that stack traces that come from undefinedData are deceptive. Apparently, due to how HasCallStack on values works, it points to the place in the code where undefinedData was inserted into the Storage, not where it was accessed :confused: So looks like your code tries to access the NonDet effect that is not there, sigh.

However, I previously misread the trace. The code doesn't try to access the effect data, it tries to access its relinker. So looks like this is triggered inside cloneEnv somewhere that tries to relink a NonDet effect that's not there anymore. I still don't understand how this can happen however :disappointed:

I'm going to merge #247 that will start producing expected call stacks when undefined relinker is called, hopefully this sheds more light on the situation.

michaelpj commented 3 weeks ago

Okay, I'll try running with https://github.com/haskell-effectful/effectful/pull/247 too.

michaelpj commented 3 weeks ago

New day, new stack trace!

undefined relinker
CallStack (from HasCallStack):
  error, called at src/Effectful/Internal/Env.hs:459:56 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Env
  relinker, called at src/Effectful/Internal/Env.hs:179:17 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Env
  cloneEnv, called at src/Effectful/Internal/Unlift.hs:148:17 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Unlift
  ephemeralConcUnlift, called at src/Effectful/Internal/Unlift.hs:122:3 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Unlift
  concUnlift, called at src/Effectful/Internal/Monad.hs:242:39 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Monad
  concUnliftIO, called at src/Effectful/Internal/Monad.hs:205:40 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Monad
  withEffToIO, called at <our code>
michaelpj commented 3 weeks ago

So we call an unlifting function, but when we copy our environment, some of it is missing?

michaelpj commented 3 weeks ago

I am also somewhat baffled to say that switching OnEmptyRollback to OnEmptyKeep seems to have made the problem go away. But it might well just be lurking, and we just don't happen to hit it.

arybczak commented 3 weeks ago

So we call an unlifting function, but when we copy our environment, some of it is missing?

Yeah, basically. cloneEnv goes through all effects in Storage and relinks them. Apparently in this case there is an effect there that was already deleted, but I don't understand at all how this could've happened since deleteEffect always deletes from the top and decreases the size of Storage.

It would be interesting if this indeed disappeared with switching to OnEmptyKeep, because it might be some weird interaction caused by restoreEnv (thatOnEmptyRollback uses) which overwrites the Storage, so in principle the size of the storage and its contents might diverge when this function is used, but of course it has a sanity check that relevant Storage sizes are the same -_- So I genuinely have no idea what's going on.

I fixed a bug in OnEmptyRollback that wouldn't roll back when the whole action failed in 1da88766ea52d28dbd711155639711b1c979c1ec, but that's unrelated.

I don't think I can do much more without a reproducer and no access to your code :(

Now that you have a location of the NonDet effect that went out of scope and the location of withEffToIO that attempts to call its relinker, how do these locations relate to each other? Perhaps with this information you could try to write a reproducer?

FWIW, I also pushed d97c9dbd519b818acb1887fa937cbec77cfe36bc so that when an undefined relinker is called, it will show both call stacks of its creation and its call site (since both are important), thought we already got them.

michaelpj commented 3 weeks ago

Hmm okay. I will see if I can make a reproducer.

Another question is: are there any functions exposed from effectful that don't do the right checks? We do have a bunch of effect definitions and it's possible that one of them does something unsafe.

arybczak commented 3 weeks ago

Apart from stuff in Effectful.Dispatch.Static.Unsafe, no.

Of course you can always just pass an Env to a different thread such that there is more than one thread operating on the same Env simultaneously, but then presumably you would've got errors from the thread-sanitizer branch.

Btw, did you stop getting this error with OnEmptyKeep?

michaelpj commented 3 weeks ago

Btw, did you stop getting this error with OnEmptyKeep?

Yes I did. Although to be honest, that may just mean that I've peturbed the situation enough that we happen not to trip over the problem, not that it's necessarily related. I'm planning to roll back that change occasionally to keep debugging :sweat_smile: