microsoft / vs-threading

The Microsoft.VisualStudio.Threading is a xplat library that provides many threading and synchronization primitives used in Visual Studio and other applications.
Other
992 stars 147 forks source link

Always throw from SwitchToMainThreadAsync if the token is canceled #434

Closed CyrusNajmabadi closed 4 years ago

CyrusNajmabadi commented 5 years ago

For more context, see thread here: https://github.com/dotnet/roslyn/pull/31787

Specifically starting here: https://github.com/dotnet/roslyn/pull/31787#issuecomment-447502124

TLDR: Roslyn switched some explicit usages of TPL+SyncContexts to use JTF for switching to the UI thread. This ended up causing crashes due to Roslyn depending on TPL behavior that JTF doesn't provide. Specifically, roslyn often 'chains' tasks along. For example, it might have the following sort of chain:

Task1(ui thread) -> ContinuesWith -> Task2 (expensive BG code) -> ContinuesWith -> Task3 (ui thread)

This is common for us so that we can update shared state and often let teh rest of VS know about something important in a STA manner.

While these tasks are running we are still processing inputs and, on the UI thread, we may decide to cancel this chain of work. For example, while Task2 is executing, we may end up cancelling on hte UI thread because of something the user did. With TPL, we had behavior whereby even if TPL decided Task3 was to run (because, say, Task2 had completed and things had been scheduled), once Task3 actually executed, it would always see the cancellation made on the UI thread. This happened automatically by TPL. Before starting to run Task3, it would do a final cancellation check and would then throw in this case. In essence, we depended on the TPL behavior that any changes made on one thread would be seen by later tasks that would run on that thread. Since we canceled on hte UI thread, it was certain that any other tasks intended ot run on the UI thread would always see that cancellation.

JTF does not have this behavior. Even if we cancel on the UI thread, JTF will allow a descendant task to both switch to the UI thread and continue running. This broke the expected behavior we god from TPL.

Fortunately, in the linked issue, this caused a crash, helping to track this down. But, far more often, when roslyn has encountered some sort of race like this, it has led to data corruption which often just leads to broken behavior for the user. Usually because data is updated that should not be updated.

--

The ask here is for JTF to have an option on these methods to behave like TPL. Specifically, one should be able to ask it to switch to the UI thread, and have it check+throw once it gets to the UI thread if previous code on teh UI thread canceled that exception.

While i would like this to be the default, it is recognized that this could be considered a fairly drastic change in behavior that would negatively affect others.

AArnott commented 5 years ago

Why would throwing from the main thread be a breaking change?

Because STMTA never throws OperationCanceledException on the main thread. If it could, it did its job and should report successful completion. So we're throwing in a case where we didn't before.

Isn't htat what most code that follows stmta will do when it itself uses that cancellation token?

Maybe, maybe not. That entirely depends on the STMTA caller.

Do we have an expectation that that would actually break people?

It's unlikely, IMO. But possible. Some code is adverse to running on the UI thread. But that's what we need to investigate before making such a change.

CyrusNajmabadi commented 5 years ago

If it could, it did its job and should report successful completion

I uess i view it exactly like '.ContinueWith'. .ContinueWith could be 'successfully' doing it's job by executing the continuation code on the requested thread if it had gotten to that point and was ready to go.

To me, i don't really see why STMTA is any different. Both are just requests to "get to this point, and then do followup work". I just see 'cancellation' as indicating "and don't bother doing this if things were canceled, because i indicated that i don't care."

weltkante commented 5 years ago

As a user of JTF outside VS I don't have a strong opinion about whether to allow for adjusting the behavior of STMTA, but introducing subtile semantic differences is something which should be weighted carefully against the utility in Roslyn. Maybe its better to have a dedicated (Roslyn specific) helper method for switching to the UI thread, which provides the desired semantics and use analyzers to control the TPL-to-JTF transition?

I guess i view it exactly like '.ContinueWith'

I don't think this mindset is scalable to async Tasks vs. TPL tasks, or are you assuming this just for STMTA?

For async Tasks in general I don't see any way this point of view can be correct, the Tasks generated by the compiler for async methods don't know anything about cancellation tokens you passed in, and nobody forces async method implementations to check their tokens on each returning branch, so cancellation is inherently lazy by design and you shouldn't assume anything of that kind. If you care about cancellation in async Tasks you need to check it yourself if you want to write robust code.

CyrusNajmabadi commented 5 years ago

or are you assuming this just for STMTA?

The question was about how STMTA should behve. I was saying that i view it similarly to .ContinueWith, and i would prefer cancellation behave as similarly as possible.

CyrusNajmabadi commented 5 years ago

so cancellation is inherently lazy by design and you shouldn't assume anything of that kind.

AFAIK, it's not an assumption. We discussed hte behavior of TPL in depth with the TPL authors as it was being created. We were heavy consumers and were communicating with them regularly about what semantics we could expect and what was considered undefined behavior. From all my recollections, this very much was by design.

If you care about cancellation in async Tasks you need to check it yourself if you want to write robust code.

Why? This is like saying that to be robust, i also should not assume that TPL sets up the right fences across tasks, and one should do that themselves to ensure that a descendent task sees the writes of an antecedent task.

In reality this is unnecessary, and we've at least had verbal confirmation that that is the case (though trying to find official documentation for this sort of thing is painful). So, should we then go update all our code accordingly because we don't trust these sorts of behaviors that have been baked into TPL for ages?

weltkante commented 5 years ago

You seem to have misread my comment. I was talking about cancellation of async Tasks (e.g. those generated by the compiler when writing async methods), not challenging the behavior of TPL. It's unfortunate that both use the same Task class.

Changing semantics for STMTA doesn't help reviewing any other async Task usage, cancellation is still lazy and the callers should check their tokens if early cancellation is important for them instead of lazy cancellation. As far as I'm concerned I'm reviewing STMTA cancellation like every other awaitable method. If you are interacting with async methods from external APIs you can't guarantee that they have checked their cancellation tokens on every returning branch, and even in Roslyn I assume going forward with the new completion APIs you have to consume async methods on interfaces implemented somewhere else in VS or even by VS extensions you can't control.

I understand that with TPL you didn't have to be explicit about cancellation checks, but if you transition away I'm arguing that you can't keep that property forever because it was dropped with the introduction of async Tasks alongside TPL tasks. So instead of special casing certain methods IMHO it scales better to just be explicit about cancellation everywhere and make the async/await assumption of lazy cancellation.

(And yes I acknowledge that for transitioning away from TPL you may want to bridge the semantic gap with things like STMTA but it may be a bit dangerous to trade one subtle semantic problem in Roslyn with adding a subtle semantic change for every consumer of the API. I don't know how likely it is to accidentally depend on not throwing when already being on the main thread, but it might be better to not require other consumers in VS to have to think about that edge case where they didn't have to before.)

Thats just my outside view though, as said I don't have a strong opinion and could deal with the change as far as our own codebase is concerned, just being a bit wary about the subtle semantic change against previous behavior.

CyrusNajmabadi commented 5 years ago

You seem to have misread my comment. I was talking about cancellation of async Tasks (e.g. those generated by the compiler when writing async methods),

I guess i'm not sure what you mean by that then. The compiler doesn't create any sort of special tasks. It just uses whatever Task type you return (normally System.Threading.Task or ValueTask). This is normally created with AsyncTaskMethodBuilder, with a bunch of callbacks indicating to which state things should transition as intermediary tasks complete. But the operation of those tasks still operates as per how the TPL operates...

Changing semantics for STMTA doesn't help reviewing any other async Task usage

The difference here, for me, is that the use of JTF here in roslyn was specifically is to supplant the usage of existing TPL mechanisms for moving to the UI threads. i.e. this isn't "any other async Task usage". This is specifically the "schedule the remainder of this work to happen over here".

I do acknowledge that that means this method is more specialized in my mind. And i do understand the argument that one can look at this and just think of this method as if it was just any random async-method. I do get that argument. I just feel that this is special enough to behave differently.

I don't know how likely it is to accidentally depend on not throwing when already being on the main thread

To me, this is the important question. I have a feeling that it would be pretty unlikely to hit anything problematic here. Anything using cancellation here for the switching seems like it will just continue using that cancellation after that point, which means cancellation coudl happen on that thread without any problem.

But i do also acknowledge this could be considered a change in behavior that might have negative results for some clients as well.

weltkante commented 5 years ago

Clarifying my point in case you care about what I meant, I agree with the rest of your response.

This is normally created with AsyncTaskMethodBuilder, with a bunch of callbacks indicating to which state things should transition as intermediary tasks complete. But the operation of those tasks still operates as per how the TPL operates

But that task created through AsyncTaskMethodBuilder (which I'm calling async Task) doesn't check any cancellation tokens, unless you do it explicitly in the method body. What you probably mean is that when you ContinueWith on an async Task that the returned task operates under TPL, but thats of course because ContinueWith produces a new Task according to TPL rules. You have to pass the cancellation tokens again to ContinueWith, TPL will not magically respect tokens passed to the original async Task, either.

So when you say you expected STMTA via await/OnCompleted to behave like ContinueWith I say thats a dangerous/unfounded assumption to take because any other async Task/Awaitable someone writes will by default not have that behavior either. (That doesn't argue against the rest of the discussion about making STMTA a special case where your assumption works.)

So if you have an async API consumed by Roslyn but implemented by VS or even an extension you can't just assume the Tasks returned by that API will respect your cancellation tokens the way ContinueWith does, you always have to check the cancellation tokens again (or forcefully switch to TPL by calling ContinueWith and pass the tokens again for implicit checking). Either way you need a second token check to have eager cancellation instead of lazy cancellation.

AArnott commented 5 years ago

Thanks, @weltkante. What I get from your argument is:

  1. In general, async methods that accept CancellationToken are generally free to honor or not honor that token. They certainly may not check them as the last thing they do before each of their exit paths. And therefore, it's not a good idea to rely on async methods to check the token on your behalf if you rely on the token being honored.
  2. It is a good idea to explicitly check the token yourself if you care about it.
  3. We should not take subtle behavior changes in a platform assembly lightly, because it's hard to anticipate its full impact on existing users and we don't want to cause app failures nor violate customer trust.

And from Cyrus's response:

  1. The STMTA method is not your typical async method, but because of its very special purpose and because it's one particular one that we can control. Therefore it is possible, justifiable, and convenient to ensure that it honors the token. This brings it inline with guarantees that TPL makes for scheduled tasks. This is useful for someone migrating from TPL scheduled tasks, and also probably everyone who calls STMTA. The risk of this subtle change is low because these callers are presumably prepared to deal with cancellation, even on the main thread.

I think all your asserted points are good ones.

FWIW, we haven't yet shipped a 16.0 versioned library, but we're about to. Making a behavioral change such as the one Cyrus is proposing would best be done at a major version increment so it aligns with customer expectations (per semver rules) that major version increments may include breaking changes. Our window for making changes to 16.0 is short.

At this point, I'm leaning toward the behavioral change, given the evidence @jasonmalinowski has found so far, and the point @CyrusNajmabadi has made that customers want to cancel if the token they passed in is canceled.

I'll discuss with the folks I'd have to convince to share the risk of regressions to see how they feel and will report back here. But further comments are of course welcome.

CyrusNajmabadi commented 5 years ago

What you probably mean is that when you ContinueWith on an async Task that the returned task operates under TPL, but thats of course because ContinueWith produces a new Task according to TPL rules

Yes. I thought that was clear from what i was saying, but if not i apologize. It's been a long thread, and there's been a lot stated. But i tried to make it clear that i felt STMTA was morally close enough to ContinueWith (since that is how you today run continuations on another scheduler) that it should behave the same way.

I def acknowledge that there's no statement anywhere that that needs to be so. And i'm def aware that changing to have that behavior is a subtle change and that i'm not in a position to make claims about.

It's just what i would have preferred :)

CyrusNajmabadi commented 5 years ago

At this point, I'm leaning toward the behavioral change, given the evidence @jasonmalinowski has found so far, and the point @CyrusNajmabadi has made that customers want to cancel if the token they passed in is canceled.

Another point of note: when we pointed out a subtle and potentially unintuitive part of TPL 5+ years ago, they attempted to go this path. They felt the change would be more intuitive as well. But, as it turned out, it led to breaks down the line. So the change was reverted and the new behavior was brought in through an option.

So i'm extremely sympathetic to the concern that this could break people. So much so that i have zero problem with that potential problem being a deal-breaker. So, please, only make the change if you truly think it is the right thing to do. If you think it isn't (possibly because it's too late now and we have to live with it), then i think it's 100% fine for Roslyn to be the one that has to change here. Not my personal pref. But this is not our library, and you're in the best position to make this call for your entire userbase.

Thanks!

CyrusNajmabadi commented 5 years ago

I'll discuss with the folks I'd have to convince to share the risk of regressions to see how they feel and will report back here. But further comments are of course welcome.

Maybe touch base with TPL and see if they ever had a similar sort of concern. i.e. did they ever decide to change around where an existing cancellation might get thrown, and did that ever cause problems. If they've done the same, and it wasn't an issue, that's good to know. And, if they did this and it was a problem, that would be pretty definitive as a major reason to not do this.

lifengl commented 5 years ago

I saw the proposal in the email thread. I don't think that I agree with the second half of the proposal. If the cancellationToken is triggered in the middle of the transition, and the continuation reaches the UI thread first, I don't think it should throw the CancellationException on the UI thread. Basically, it is always a race condition when the cancellationToken is triggered in the middle of a task, and there is no reason to ensure it will be honored. We can consider it is similar to the case that the task has already been completed before the CancellationToken is triggered, but the task continuation hasn't been executed. (Even after the change, the CancellationToken can always triggered right after we check it on the UI thread, and before it executes the continuation code.) It sounds strange, and logically incorrect to me that it will throw the Cancellation exception on the UI thread at that point. Just like Andrew pointed out, it would run the exception logic (and rest logic) in the UI thread (instead of the original context), which is unexpected.

CyrusNajmabadi commented 5 years ago

Basically, it is always a race condition when the cancellationToken is triggered in the middle of a task, and there is no reason to ensure it will be honored

There is no race condition. The cancelling has hapened on the thread that we are transitioning to. Since two tasks cannot execute on the same thread at the same time, then there is a strict ordering here of when things can happen. Either the cancel happened before, and thus we can read and safely cancel when we transition to that thread. Or, it happens afterwards... in which case, this is moot.

Just like Andrew pointed out, it would run the exception logic (and rest logic) in the UI thread (instead of the original context), which is unexpected.

Why is that unexpected? It's certainly how TPL works. Indeed, as mentioned, you had to always assume such a thing was possible since there is no guarantee of when a called async method might cancel. i.e. it might cancel as the very last thing it does. So, in that case, this would happen on that thread anyways.

lifengl commented 5 years ago

Yeah, there is no race condition, if the CancellationToken is triggered on the UI thread.

But on the other hand, in the normal await pattern, await MyAsyncMethod(), will always throw the exception in the original context. It never throws and runs the continuation in a random context.

CyrusNajmabadi commented 5 years ago

Yeah, there is no race condition, if the CancellationToken is triggered on the UI thread.

Right, and that's the particular case of interest, especially because it seems as if TPL made sure this works. So if we think of JTF as a suitable replacement to move from TPL-oriented patterns to, it would be nice for us (but certainly not necessary, or mandatory) if it followed these same sorts of intuitive behaviors. If i ask to run code on another thread, but i say i don't want to run if this token is canceled. And i then cancel that token on the very thread that that code would run on, it seems utterly sensible that that code not run. There is no race. There are clear semantics for what it means. And, if this was V1 of the library, i would have pushed for that had i even been aware of this difference.

The important question for me now is: what's the right path forward given that it already shipped. Based on the convos so far, and on how subtle concurrency/async can be, i'm leaning toward just doc'ing this behavior fully, having an analyzer, and saying that it's on the consumptive side to understand and deal with.

But on the other hand, in the normal await pattern, await MyAsyncMethod(), will always throw the exception in the original context. It never throws and runs the continuation in a random context.

It hardly seems random here :) As stated a few times (and well put by Andrew here https://github.com/Microsoft/vs-threading/issues/434#issuecomment-448647136), this is particularly about STMTA and the semantics specifically as a thread-swapping mechanism, much more akin to .ContinueWith(.. Main-Thread-Scheduler..).

I don't think I (or anyone else) is particularly asking for this to be any sort of defined behavior for all other Task-methods. Only ones that morally act in the stead of the pre-existing TaskScheduler oriented operations.

lifengl commented 5 years ago

It may or may not impact some of try/retry to switch to the UI thread logic. Some of our logic is done in the pattern like, SwitchToUIThread with a time-out (using a cancellationToken). If it works, we finish the work on the UI thread. If it doesn't get into UI thread on time, we run a different set of logic. That includes the logic to scan task dependencies to verify whether we are in a critical task chain (like blocking the UI), if that is true, we will try to switch without a time-out. If it is not true, we will delay processing it, and try it later. The reason we didn't do the dependencies check at the beginning is that it is an overhead, and might be slow. The change would mean that this logic might run in the UI thread, when the time-out happens to be triggered. Of course, this can easily be updated. But my point is that throwing on the UI thread may impact some existing logic.

CyrusNajmabadi commented 5 years ago

But my point is that throwing on the UI thread may impact some existing logic.

Understood. Thanks for the example.

AArnott commented 5 years ago

Given 16.0 is closing down and there wasn't enough appetite among other stake holders to take the risk of changing this behavior, I'm closing this as Won't Fix. This doesn't mean it wasn't a good idea and perhaps even the right choice, but it just means our window for making such a behaviorally significant change was a major version increment and as 16.0 is behind us now, that leaves #435 to mitigate this.

AArnott commented 4 years ago

More evidence from other VS code suggests that always throwing when the token is canceled may be the right policy. I'm preparing a change to do this.