`resume`/`callback` in `cont`, `async`, etc. should return a `Boolean`

armanbilge commented 1 year ago

This Boolean should be similar to the Boolean returned when completing a Deferred i.e. it should indicate that this invocation of the resume/callback is the one and only one that actually completed the callback. Everyone else should get false.

This information about whether the callback succeeded would help avoid leaks in various concurrent state machines. Here are two that I encountered recently:

durban commented 1 year ago

I think this is a good idea. I suspect it would('ve) make fixing #3603 almost trivial. (I assume the semantics would be that if a cancel won, the callback would return false.)

Something worth thinking about is how would this affect Deferred#complete: it already returns a bool, but that means a different thing.

durban commented 11 months ago

To answer my own question: I don't think this should affect Deferred#complete. That is a similar, but different thing. With a Deferred, there is 0 or more subscribers, and it is "normal" not to wake up some of them. With async/cont/etc., there is one "subscriber": the fiber itself; and it is important whether we woke it up.

durban commented 11 months ago

I've started working on this (#3917), and the desired semantics are not entirely clear.

What should resume return (true or false), if resume wins the race? get did not run (probably yet), and resume sets the result. This seems like it should be true, but I think nothing stops the fiber from being cancelled right after this. So it's not really true or false, but "we don't know (yet)". Waiting in resume is obviously not an option. (Also note, that get may not even be sequenced, so it may never run.)

Any ideas?

armanbilge commented 11 months ago

but I think nothing stops the fiber from being cancelled right after this

Really? I would say that's a bug. If resume completes with a value (and returns true), then get must resume with that value. If you can cancel between these steps, then you can lose values.

But it depends what you mean by "right after this". If you mean steps after the get, that is fine.

durban commented 11 months ago

Well, how about something like this:

      new Cont[IO, Int, String] {
        def apply[F[_]](implicit F: Cancelable[F]) = { (resume, get, lift) =>
          lift(IO(resume(Right(42)))) >> F.canceled >> get.map(_.toString)
        }
      }

With the current implementation this self-cancels (I think this is expected). But I think it could be cancelled at any of the >>s also. Which also seems "fine" (if a bit strange).

I think the point is, that currently when resume executes, it may have no way of knowing what will happen with the value it writes. E.g., because get did not run yet, or in fact it may never will run.

armanbilge commented 11 months ago

Ok, thanks, I see your point now.

I think the point is, that currently when resume executes, it may have no way of knowing what will happen with the value it writes.

I think semantically the true means that if get runs, then this will be the value it returns.

durban commented 11 months ago

I'm not sure that helps with the "lost value" problem. For example, let's see asyncCheckAttempt (async is of course similar):

      def apply[G[_]](implicit G: MonadCancel[G, Throwable]) = { (resume, get, lift) =>
        G.uncancelable { poll =>
          lift(k(resume)) flatMap {
            case Right(a) => G.pure(a)
            case Left(Some(fin)) => G.onCancel(poll(/*HERE*/get), lift(fin))
            case Left(None) => get
          }
        }
      }

Let's say, that resume runs first, writes the result, gets a true (if get runs, it will return that result). But then the fiber gets cancelled right before get is sequenced (marked with HERE in the code above)... and the result is lost. (But whoever called resume thinks it isn't lost.)

Arguably, this is a "bug" in asyncCheckAttempt. (Well, not in the current version, but in the version where cb returns a Boolean.) I'm still thinking how to fix this bug, but I'm also not sure the bug is really in asyncCheckAttempt...

armanbilge commented 11 months ago

Hmm, thanks for writing that out. Actually if I understand correctly, this seems to be a very general problem (not specific to asyncCheckAttempt).

to avoid leaking values, we must be able to guarantee that get is sequenced
to support cancelation, get must be cancelable

Currently there is no way to have a cancelable get that is guaranteed to be sequenced 🤔 I'm not even sure if such a thing is possible, without a tryGet and a combinator like onCancelMasked from https://github.com/typelevel/cats-effect/issues/3474#issuecomment-1509872654 to "defer" a cancelation signal to avoid leaking values.

armanbilge commented 11 months ago

dumb-idea-of-the-day

poll(/*HERE*/get)

what if we just disallowed cancelation there (i.e. change the implementation of uncancelable/poll)?

durban commented 11 months ago

this seems to be a very general problem (not specific to asyncCheckAttempt).

Yeah. Although, I suspect that if we can solve it for asyncCheckAttempt, we can probably solve it for others too. The typical usages of cont seem to be quite similar, and asyncCheckAttempt seems to be the most general one.

Also worth noting, that this whole problem only occurs, if resume wins the race. If get runs first and suspends the fiber, that's fine. If the fiber is cancelled before resume runs, I think that's also fine (there is a cancellation check in resume).

what if we just disallowed cancelation there (i.e. change the implementation of uncancelable/poll)?

Yeah, I thought about that, but that has... consequences. If we look at this:

poll(/*1*/ x /*2*/)

The point 2 is already not a cancellation point. If we also make 1 not a cancellation point, that makes poll a no-op in certain cases (e.g., if x is pure, or a single delay, or something like that). This might not be a deal breaker, but I'm not sure.

armanbilge commented 11 months ago

that makes poll a no-op in certain cases (e.g., if x is pure, or a single delay, or something like that). This might not be a deal breaker, but I'm not sure.

Yeah, honestly that seems fine to me 🤷 but I won't claim to foresee all the consequences :)

djspiewak commented 11 months ago

Let's say, that resume runs first, writes the result, gets a true (if get runs, it will return that result). But then the fiber gets cancelled right before get is sequenced (marked with HERE in the code above)... and the result is lost. (But whoever called resume thinks it isn't lost.)

Bolded the relevant bit here.

So in this scenario, whoever called resume definitely knows the value is lost, because their finalizer is getting called. Consider this from the user side. I'll use async instead since it's a simpler signature:

IO.async(cb => IO(/* things that call cb */ Some(IO(/* finalizer */))))

Okay so if the code at /* finalizer */ runs in the above, we know definitively that any value (or error) which had been passed to cb is lost. At this point, we're responsible for sorting that out, either by recovering it somehow or getting an upstream to retry, etc. Whatever it takes really.

In some cases, this isn't reliably possible without sacrificing other system properties, so instead what often happens is people treat this as a nondeterminism (even though it isn't one) and dodge the recovery by instantiating async to Unit and solely using it as a latching mechanism (see: the new Queue implementations, for example). This avoids having to create weird recovery protocols.

But overall, this is something that users can cleanly respond to, unlike the cancelable gap at the end of poll regions, to which they have no response.

durban commented 10 months ago

whoever called resume definitely knows the value is lost, because their finalizer is getting called

That is a very good point. (It would be "nicer", if the return value could give this information directly, but I'm beginning to think that might not be possible.)

Now the question is: could we build something (useful) with both the Boolean and the finalizer, which we couldn't already build with just the finalizer? (Because if not, then this issue is not really needed.)

durban commented 9 months ago

So, about never cancelling (observing cancellation?) right inside poll, i.e.,

poll(/* we never cancel HERE */ x)

It is possible. In my draft PR (https://github.com/typelevel/cats-effect/pull/3917), I did it (not in a very elegant way, as it's just an experiment for now). It seems to work. A few things needed fixing (notably in Resource), but nothing too terrible.

Edit: of course, this is yet another violation of the functor laws I think. The existing one is for poll(x /* HERE */), so it's not surprising.

typelevel / cats-effect

`resume`/`callback` in `cont`, `async`, etc. should return a `Boolean` #3553