microsoft / vs-threading

Microsoft.VisualStudio.Threading is a cross-platform library that provides many threading and synchronization primitives used in Visual Studio and other applications.

Always throw from SwitchToMainThreadAsync if the token is canceled #434

Closed CyrusNajmabadi closed 4 years ago

CyrusNajmabadi commented 5 years ago

For more context, see thread here: https://github.com/dotnet/roslyn/pull/31787

Specifically starting here: https://github.com/dotnet/roslyn/pull/31787#issuecomment-447502124

TLDR: Roslyn switched some explicit usages of TPL+SyncContexts to use JTF for switching to the UI thread. This ended up causing crashes due to Roslyn depending on TPL behavior that JTF doesn't provide. Specifically, Roslyn often 'chains' tasks along. For example, it might have the following sort of chain:

Task1 (UI thread) -> ContinueWith -> Task2 (expensive BG code) -> ContinueWith -> Task3 (UI thread)

This is common for us so that we can update shared state and often let the rest of VS know about something important in an STA manner.

While these tasks are running we are still processing inputs and, on the UI thread, we may decide to cancel this chain of work. For example, while Task2 is executing, we may end up cancelling on the UI thread because of something the user did. With TPL, we had behavior whereby even if TPL decided Task3 was to run (because, say, Task2 had completed and things had been scheduled), once Task3 actually executed, it would always see the cancellation made on the UI thread. TPL did this automatically. Before starting to run Task3, it would do a final cancellation check and would then throw in this case. In essence, we depended on the TPL behavior that any changes made on one thread would be seen by later tasks that would run on that thread. Since we canceled on the UI thread, it was certain that any other tasks intended to run on the UI thread would always see that cancellation.
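
To make the TPL side of this concrete, here is a minimal sketch using only BCL types. It is an illustration under assumptions, not Roslyn's code: the class and method names are hypothetical, a `TaskCompletionSource` stands in for the expensive Task2, and `TaskScheduler.Default` replaces the UI-thread scheduler, so it shows only the token check on the continuation, not the UI-thread affinity.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: a continuation created with a token does not run
// its delegate once that token has been canceled.
public static class ContinuationCancellationSketch
{
    public static (bool Canceled, bool BodyRan) Run()
    {
        using (var cts = new CancellationTokenSource())
        {
            // Stands in for "Task2", the expensive background step.
            var task2 = new TaskCompletionSource<int>();
            bool bodyRan = false;

            // "Task3": a continuation that was handed the cancellation token.
            Task task3 = task2.Task.ContinueWith(
                _ => { bodyRan = true; },
                cts.Token,
                TaskContinuationOptions.None,
                TaskScheduler.Default); // assumption: the real chain would target the UI scheduler

            cts.Cancel();        // cancellation arrives while "Task2" is still running
            task2.SetResult(0);  // "Task2" finishes afterwards

            try
            {
                task3.Wait();
            }
            catch (AggregateException ex) when (ex.InnerException is TaskCanceledException)
            {
                // Expected: the continuation ended in the Canceled state.
            }

            return (task3.IsCanceled, bodyRan);
        }
    }
}
```

Calling `ContinuationCancellationSketch.Run()` should report that the continuation was canceled and that its body never ran, which is the property the chain above relied on.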

JTF does not have this behavior. Even if we cancel on the UI thread, JTF will allow a descendant task to both switch to the UI thread and continue running. This broke the expected behavior we got from TPL.

Fortunately, in the linked issue, this caused a crash, helping to track this down. But, far more often, when roslyn has encountered some sort of race like this, it has led to data corruption which often just leads to broken behavior for the user. Usually because data is updated that should not be updated.

--

The ask here is for JTF to have an option on these methods to behave like TPL. Specifically, one should be able to ask it to switch to the UI thread, and have it check+throw once it gets to the UI thread if previous code on the UI thread canceled that token.

While i would like this to be the default, it is recognized that this could be considered a fairly drastic change in behavior that would negatively affect others.

AArnott commented 5 years ago

(I wrote this up for your original PR, but now that we have an issue dedicated to this discussion, I moved it here)

Good points, @CyrusNajmabadi.

Not to take away from anything you said, since from where you're coming from it all sounds perfectly reasonable, but I tend to go a different way.

TPL is largely dead. I say that because the general consensus is that folks should typically not use Task.Factory.StartNew or Task.ContinueWith any more, but rather use async/await patterns. There are a lot of reasons for this, and I'll assume you're familiar with them. As such, JTF was very specifically written to appeal to the async/await coding patterns rather than TPL. You'll notice we didn't even write a JTF-aware TaskScheduler so that folks who use TPL patterns could avoid deadlocks. That's partly because TPL TaskScheduler is so archaic that it doesn't actually support async tasks. I'm not knocking technology here, to be clear. It was written before async/await, and some of async/await is even built on TaskScheduler. But JTF SwitchToMainThreadAsync is a method to be called by async methods -- not by delegates queued with TPL. So they really are two very different worlds (async/await vs. TPL scheduled tasks). And the rules don't have to be the same. If we modeled JTF after 4.0-era TPL instead of after async/await, we'd be designing for the rocket scientist crowd (e.g. @CyrusNajmabadi) instead of for the average engineer, who has to work just to grasp how async/await work.

So if we are targeting folks who first learned async/await and never knew about .NET 4.0-era scheduled Tasks, then when TPL precedents and async/await precedents conflict, we should model STMTA the way other async methods tend to work to fit the expectations of the typical consumer. And in the async/await world, I don't think I've ever seen an async method that was wise to call with a CancellationToken and assume that the method's successful completion means the token wasn't canceled. So such a guarantee on JTF's STMTA didn't seem to have an interested audience (till now, at least).

I acknowledge that for the Roslyn engineering team, following TPL rules would make your migration easier given your background, but with respect, you're not the average customer. :wink:

sharwell commented 5 years ago

The ask here is for JTF to have an option on these methods to behave like TPL. Specifically, one should be able to ask it to switch to the UI thread, and have it check+throw once it gets to the UI thread if previous code on the UI thread canceled that token.

I do not believe this is a correct representation of the way the TPL behaves. More specifically, the TPL does not provide any equivalent to SwitchToMainThreadAsync, which means transitioning to JTF puts us in a place where the code is easier to read but there is no direct behavior mapping.

AArnott commented 5 years ago

So exploring your API request, @CyrusNajmabadi, there are two possibilities that occur to me (don't sweat the names I'm using... it's the shape I mean to discuss at this stage):

  1. Add a new method: JTF.SwitchToMainThreadThrowOnCancellationAsync(CancellationToken token)
  2. Add an overload to our existing one: JTF.SwitchToMainThreadAsync(token, alwaysThrowOnCancellation: true)

Either way, it requires some explicit opt-in. For the change in behavior to be clear, the naming has to be fairly verbose. And someone who has never seen it before has to study and think about it before realizing the nuanced impact it has. The "normal" behavior (given the original method/overload) still looks correct to the casual code reviewer, so where an opt-in to the new behavior is important, it could easily be missed.

Still a 3rd possibility is that you have a specifically configured JoinableTaskFactory instance. We already have a precedent for JoinableTaskFactory derived types that have specific priorities for scheduling work. We could allow a JTF instance to influence the behavior you're looking for. That would allow Roslyn to automatically pick up the desired behavior everywhere -- provided you're using the right JTF instance. So that's good in that it's easy to do the right thing (for your code base). But it may be harder to audit when you're not sure which JTF instance you're calling, or even aware that you're supposed to be calling a particular one.

I'm curious which one of these appeals to you folks.

I'm not a fan of the first two options especially, since today we already have a STMTA+ThrowIfCancellationRequested pattern that folks who have your requirement can already use. Adding another way that is almost as tedious to code up (we're just translating a method call into a parameter or method name) wouldn't pass the .NET Framework API cost test, IMO.
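
The switch+check pattern referred to here can be sketched roughly as follows. This is a hedged illustration, not the library's code: all names are hypothetical, and `Task.Yield()` is only a stand-in for `await jtf.SwitchToMainThreadAsync(token)` that never observes the token, chosen so the sample depends on nothing but the BCL.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of "await the switch, then check the token explicitly".
public static class SwitchThenCheckSketch
{
    public static async Task<string> UpdateWithoutCheckAsync(CancellationToken token)
    {
        // Stand-in for: await jtf.SwitchToMainThreadAsync(token);
        await Task.Yield();
        // Nothing re-checks the token, so this line runs even after cancellation.
        return "state updated";
    }

    public static async Task<string> UpdateWithCheckAsync(CancellationToken token)
    {
        // Stand-in for: await jtf.SwitchToMainThreadAsync(token);
        await Task.Yield();
        // Explicit opt-in: refuse to continue if cancellation was requested.
        token.ThrowIfCancellationRequested();
        return "state updated";
    }

    public static (string WithoutCheck, string WithCheck) Run()
    {
        using (var cts = new CancellationTokenSource())
        {
            cts.Cancel(); // cancellation was requested before the work resumes

            string withoutCheck = UpdateWithoutCheckAsync(cts.Token).GetAwaiter().GetResult();

            string withCheck;
            try
            {
                withCheck = UpdateWithCheckAsync(cts.Token).GetAwaiter().GetResult();
            }
            catch (OperationCanceledException)
            {
                withCheck = "canceled";
            }

            return (withoutCheck, withCheck);
        }
    }
}
```

In this sketch the first method still reports "state updated" after cancellation, while the second surfaces an `OperationCanceledException`. That one-line difference is what the proposed method or overload would fold into the switch itself.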

sharwell commented 5 years ago

I'm less interested in the pattern we have now (referring to the switch+check pattern in dotnet/roslyn#31787), and more interested in the subset case where we avoid executing the continuation on the main thread if cancellation occurs on the main thread.

If cancellation occurs on a different thread, we have a race condition anyway and the cancellation check would only provide a false sense of security.

AArnott commented 5 years ago

more interested in the subset case where we avoid executing the continuation on the main thread if cancellation occurs on the main thread.

Can you explain this more? I'm not sure what you're arguing for. The subset you describe, from what I've heard on your PR and by email, is what Roslyn's case is in the first place.

If cancellation occurs on a different thread, we have a race condition anyway and the cancellation check would only provide a false sense of security.

See, I've never seen cancellation as a way to protect data. Even the naming around a cancellation token ("is cancellation requested") indicates to me that it's merely a request, and that it may or may not be fulfilled, sooner or later. IMO if I had data to protect, I would be very explicit about it, and consider anything else (like cancellation handling) to be an optimization on top of it.

CyrusNajmabadi commented 5 years ago

I do not believe this is a correct representation of the way the TPL behaves. More specifically, the TPL does not provide any equivalent to SwitchToMainThreadAsync

isn't that equivalent to scheduling a task on the appropriate sync context?

CyrusNajmabadi commented 5 years ago

See, I've never seen cancellation as a way to protect data. Even the naming around a cancellation token ("is cancellation requested") indicates to me that it's merely a request, and that it may or may not be fulfilled,

It may not be fulfilled in that you might write code to not respect it. However, we've written code to respect it. Note: none of this was strange when we did any of our discussions with the TPL people about this. Why would we need anything else when Cancellation already behaved this way?

i.e. we could have had another mechanism. one where an antecedent task set some bit, and a later task checked that bit. But we'd also want that later task to throw... so that we wouldn't have to check at every layer. And we'd then have to catch and filter out that particular exception.

Except... now we've reinvented cancellation tokens :)
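
For readers less familiar with that shape, here is a small sketch of "throw once deep down, filter once at the top" using `CancellationToken` itself. The layer names are hypothetical and the work is synchronous only to keep the sample short.

```csharp
using System;
using System.Threading;

// Hypothetical sketch: the innermost layer throws, intermediate layers do
// nothing special, and the top layer filters out exactly this cancellation.
public static class CooperativeCancellationSketch
{
    private static void InnerStep(CancellationToken token)
    {
        // The "later task checks the bit and throws" part.
        token.ThrowIfCancellationRequested();
        // Shared state would be mutated here.
    }

    // No cancellation handling needed at this layer.
    private static void MiddleLayer(CancellationToken token) => InnerStep(token);

    public static string Run(bool cancelFirst)
    {
        using (var cts = new CancellationTokenSource())
        {
            if (cancelFirst)
            {
                cts.Cancel(); // the "antecedent sets the bit" part
            }

            try
            {
                MiddleLayer(cts.Token);
                return "completed";
            }
            catch (OperationCanceledException ex) when (ex.CancellationToken == cts.Token)
            {
                // Filter: only swallow cancellation that came from our own token.
                return "canceled";
            }
        }
    }
}
```

`Run(false)` returns "completed" and `Run(true)` returns "canceled", without any of the intermediate layers having to inspect the token.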

IMO if I had data to protect, I would be very explicit about it, and consider anything else (like cancellation handling) to be an optimization on top of it.

But, to me, we are being explicit. I am saying "all future work is canceled". We've designed our systems such that data mutation can only happen on tasks that will, by design, see that notification. So we've been explicit, and those descendent tasks must see that request because they are going to run on the same thread where we made that statement.

It's a 'request' in that there's no way to 'force' that thread (other than by aborting) to listen. But that would be the same with any other mechanism we made. If we set another boolean somewhere, that would still be a request, because we'd still depend on that other thread respecting the boolean.

It's all ultimately cooperative. Except that we leveraged the existing mechanism that was already well baked into TPL instead of going NIH and building our own :D

AArnott commented 5 years ago

isn't that equivalent to scheduling a task on the appropriate sync context?

Yes, roughly. But that's only equivalent at the highest level. Beneath that conceptual equivalence, almost all the mechanics of how it works underneath are different. And it looks completely different even at the surface when you consume the API. So aside from "they both get you to the UI thread", they're actually quite different.

CyrusNajmabadi commented 5 years ago

Either way, it requires some explicit opt-in. For the change in behavior to be clear, the naming has to be fairly verbose. And someone who has never seen it before has to study and think about it before realizing the nuanced impact it has. The "normal" behavior (given the original method/overload) still looks correct to the casual code reviewer, so where an opt-in to the new behavior is important, it could easily be missed.

Importantly, we can add a threading analyzer that says that we cannot call that other API as it doesn't match the semantics we expect.

CyrusNajmabadi commented 5 years ago

they're actually quite different.

This is not a compelling argument for me. If they are 'completely different', then i would argue that Roslyn should not use JTF. We should not be mixing/matching totally different threading/concurrency systems. Because that's exactly how we end up in the state we got into, where we have broken behavior and not a single person (with probably 50+ years of combined experience around threading/TPL/JTF) had a clue that there was a problem.

You're telling me they're different. I get that. I'm saying: i don't want to mix/match such differing systems, because it just adds complexity in a system we have spent umpteen years trying to keep simple in order to prevent these problems in the first place. So, given that, my end feelings are:

  1. Roslyn should not use TPL. (massively expensive, likely not going to happen any time soon).
  2. Roslyn should not use JTF. Unfortunate. JTF has nice benefits. I would like to use it.
  3. JTF makes concessions to behave more like TPL. My preference. Makes it feel like something we can more safely move to. Makes JTF and TPL work better together. Makes it easier to write analyzers that check for problems and which guides the team to use the right stuff. etc. etc.

AArnott commented 5 years ago

So we've been explicit, and those descendent tasks must see that request because they are going to run on the same thread where we made that statement.

That's fine. I'm not against you designing your system as you have. If you want to constrain yourself to one thread, and explicitly cancel and check tokens, that's great. Nothing wrong with reusing CancellationTokens for this -- don't reinvent them.

My argument is that just because TPL behaved a particular way, JTF doesn't necessarily have to conform to that precedent (that was never a precedent for JTF to begin with). And yet you can continue being explicit with cancellation -- but as you migrate from TPL to async/await, you need to change your assumptions about how cancellation tokens are honored from "TPL will handle it for us" to "we need to throw if it's canceled".

CyrusNajmabadi commented 5 years ago

Moved this from the other thread:

TPL is largely dead.

This is the first i'm hearing anything of the sort... can you link to any sort of statements to that effect?

I say that because the general consensus is that folks should typically not use Task.Factory.StartNew or Task.ContinueWith any more, but rather use async/await patterns

Sure... but what i was talking about still applies with async/await. I was simply talking about .ContinueWith since that's the underlying mechanism this boils down to. It doesn't actually depend on anything async/awaity.

i.e. we use async/await heavily in Roslyn. But we also expect that code that is scheduled back to the UI thread (i.e. because we might have used .ConfigureAwait(true) on code that started from the UI thread) will still see cancellations made on the UI thread.

And the rules don't have to be the same.

I do agree on that. But that means that from the Roslyn side of things, i'd prefer to not mix and match. Because, well, this is the very realistic outcome. Concurrency is already extremely hard. And one of the ways that Roslyn-IDE has managed that was to be very very very very very careful about things. We intentionally do not try to be fancy most of the time. We intentionally have tried to only use a small part of the surface area that TPL provides. We intentionally have opted for simplicity, consistency, and clear semantics above everything, so that we can deliver a correct system first, with everything else following.

Indeed, for Roslyn IDE, the core goals are:

  1. Correctness first. It needs to do things properly, or else, there's no point.
  2. Non-blocking second. Given 'correctness', our primary use for TPL is simply to get things off the UI thread.
  3. Performance. We can benefit here by also being able to leverage threadpools. But that's really just an added benefit in some cases.

Many of us cut our teeth on the preceding systems that were overly complex and basically had race conditions galore. We intentionally moved far away from that for Roslyn, realizing that even having concurrency/asynchrony in the first place was already increasing complexity by orders of magnitude. Given that, we intentionally tried to keep things as simple as we possibly could. And that means carving out a narrow section of TPL we could completely understand (and teach to the team) and religiously following patterns and practices to make it possible to sanely develop and maintain this code without having race conditions every day of the week. It's worth pointing out that some parts of Roslyn didn't go this route, and we're still paying the price with flakiness there to this day.

but with respect, you're not the average customer.

We're not. But, tbh, i think we're an important customer in that we strive for simplicity, safety, consistency and clarity much of the time. IMO, a lot of what we've learned, and a lot of the patterns and practices we've adopted to allow for concurrency, while still being safe/correct, would be valuable for others to follow. These are tactics created to actually rein in the complexity of concurrency, and (as we can see here) when the system gets more complex and inconsistent, even incredibly capable and experienced teams can introduce very worrying bugs.

CyrusNajmabadi commented 5 years ago

My argument is that just because TPL behaved a particular way, JTF doesn't necessarily have to conform to that precedent (that was never a precedent for JTF to begin with). And yet you can continue being explicit with cancellation -- but as you migrate from TPL to async/await, you need to change your assumptions about how cancellation tokens are honored from "TPL will handle it for us" to "we need to throw if it's canceled".

That is certainly an option. But i personally don't think it's a good one. It violates a core engineering goal Roslyn had around concurrency which was to KISS.

Needing to understand these sorts of subtle differences is the antithesis of KISS. Note: if JTF expects that it is used in isolation of TPL, then it's certainly the case that it is abiding by KISS itself. The problem then exists for the team that wants to have a reasonable migration path from TPL to KISS.

So, my question for you is: would you be amenable to making changes that go against JTF's internal simplicity, in order to help teams with the transition path from TPL to it? I personally think it would be a good idea. :)

sharwell commented 5 years ago

@CyrusNajmabadi It's not vs-threading that's violating KISS here. Roslyn's use of CancellationToken as a basis for algorithm correctness is already too complex; vs-threading operates under even simpler considerations. Thus KISS applied to vs-threading means this situation does not need special casing.

AArnott commented 5 years ago

TPL is largely dead.

This is the first i'm hearing anything of the sort... can you link to any sort of statements to that effect?

Maybe "dead" is too strong a word. But Stephen Toub I believe agrees, and other influential folks (including David Fowler) have repeatedly said that with async/await, the .NET 4.0 era TaskScheduler patterns are now in the domain of "rocket scientists". This is because it's tedious, hard to code right, and even harder to code more efficiently than async/await is. So much work went into the compilers to minimize allocations and overhead for async/await, that it's become quite difficult to write TPL code that is more efficient (GC pressure-wise, at least) than async/await.

CyrusNajmabadi commented 5 years ago

Maybe "dead" is too strong a word

:D

CyrusNajmabadi commented 5 years ago

Look. I'm not disagreeing with your guys' points. :)

I'm simply stating: this issue is a clear indication of something concerning. I'm coming from the direction of caring about the engineering here and the concerns about reliability and correctness. It's not an exaggeration to say that issues here are some of the worst Roslyn has ever had to deal with.

As such, i come from the perspective of wanting as safe and understandable a system as possible. And when 'the new api' behaves differently from 'the api that was the one to use up till now', and when we have the direct evidence of the sort of problem you run into because of that, then all the theoretical hopes and desires around simplicity and whatnot get thrown out the window for me :)

At this point, i can't really add more. The request has been made for an API that behaves more like TPL. If you want to add it, great. If you don't, that's fine.

--

In the meantime, if there are other areas where you know that JTF deviates from TPL-like behaviors, can you let me know? I want to know if there are things i need to be carefully considering when i review code that uses JTF. Thanks! :)

AArnott commented 5 years ago

Regarding your enumeration of the options for mixing TPL with JTF, @CyrusNajmabadi, I sympathize. I almost never write TaskScheduler code any more, but I use JTF and Task.Run plenty.

TPL offers no way to avoid deadlocks with the main thread when synchronously blocking the main thread may be necessary (which of course, it is). So your option 2 (use TPL and not JTF) is simply not an option that I can imagine a working system for. Unless you can see something I can't in that area, that leaves mixing the two worlds, or abandoning TPL. Personally, I would go with "gradually migrate from TPL to async/await/JTF". It's low cost, and not unnecessarily destabilizing.

CyrusNajmabadi commented 5 years ago

Thus KISS applied to vs-threading means this situation does not need special casing.

Note: simplicity can clearly mean different things to different people. Here's an example of something TPL does that it didn't necessarily have to. Not doing it would have made TPL simpler, but would have made consuming TPL much harder:

Specifically, TPL has the behavior that any .ContinueWith'd task will see the writes of its antecedent tasks. This is true regardless of what threads either of those tasks run on.

Again, it wasn't necessary that TPL ensure that. It would have been simpler for it not to. It would have been a "special-case" it could have avoided. But it would have made using TPL so much harder as code would have to work much harder to ensure that writes by tasks were visible to descendent tasks.
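
A minimal sketch of that guarantee, assuming nothing beyond the BCL (class and field names are hypothetical): the antecedent writes a plain field on a thread-pool thread, and the continuation reads it without any lock or `volatile`, possibly on a different thread.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch: a ContinueWith continuation observes the writes of
// its antecedent, whichever threads the two tasks happen to run on.
public static class AntecedentWritesSketch
{
    private sealed class Shared
    {
        public int Value; // plain field: no volatile, no lock
    }

    public static int Run()
    {
        var shared = new Shared();

        Task<int> chain = Task
            .Run(() => { shared.Value = 42; })                       // antecedent writes
            .ContinueWith(_ => shared.Value, TaskScheduler.Default); // continuation reads

        return chain.Result;
    }
}
```

`AntecedentWritesSketch.Run()` returns 42 because the continuation cannot start until the antecedent has completed and its writes have been published.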

So, when i say "KISS" i'm not saying from the library's perspective. I'm saying "from the consumer's" perspective. It's much simpler to me that if i cancel a token on a particular thread that i can know that dependent tasks will certainly not execute on that same thread. This is consistent with the idea that "descendent tasks see my writes". It's strange for me to cancel and then have a descendent task not respect that given that it can only run after me (since it uses my thread) but seems to behave in a manner where even though it must run after me, and even though i know it sees my write, and even though i did cancel it, it doesn't notice.

I get that from an impl perspective it's 'simpler'. But from a consumption perspective it's more complex, especially as it deviates from other read/write guarantees.

sharwell commented 5 years ago

At this point I lean against API or behavior changes. The one motivating example presented so far is a case of desiring that implicit cancellation be respected at a location simply to avoid calling ThrowIfCancellationRequested. It's generally safe to omit the latter without impacting correctness, and in the remaining cases it seems desirable for the cancellation to be explicit.

I was interested in seeing if cancellation on the UI thread could force future continuations to use the thread pool, but internal telemetry indicates such a change would be prohibitively expensive in a subset of cases that happen to be closely monitored.

AArnott commented 5 years ago

Specifically, TPL has the behavior that any .ContinueWith'd task will see the writes of its antecedent tasks. This is true regardless of what threads either of those tasks run on. Again, it wasn't necessary that TPL ensure that. It would have been simpler for it not to.

I can't imagine a world where TPL didn't provide that guarantee. Task.ContinueWith by definition of the very method name suggests that the next task won't execute before the prior one completes. Why do you say TPL could have gotten away without that guarantee?

So, when i say "KISS" i'm not saying from the library's perspective. I'm saying "from the consumer's" perspective.

To be clear, JTF is anything but simple internally. I am fully in favor of designing for simplicity in consumption of the library. But that's why I hesitate to add more API -- not because JTF becomes more complicated to implement but because it becomes more complicated to consume.

It's much simpler to me that if i cancel a token on a particular thread that i can know that dependent tasks will certainly not execute on that same thread.

Your scenario isn't a dependent task. In TPL there were clear task->continuewith dependencies where one task depended on another. But in async/await, there aren't explicit task dependencies. JTF had to invent that concept. In fact, in your case it's not even an implicit dependency, AFAIK. The use case Roslyn is looking at is rather a case of an unrelated async method requesting the main thread, after which someone else cancels the token. There's no dependency between two tasks there.

CyrusNajmabadi commented 5 years ago

So your option 2 (use TPL and not JTF) is simply not an option that I can imagine a working system for.

That's obviously untrue. Roslyn operated for like 10+ years with TPL and without JTF, while being able to use the UI thread. We designed for this and we very carefully and conscientiously implemented many components that worked fine in this regard.

And, IMO, this was a better state to be in. We never took it for granted that we could use the UI thread. Instead, we understood very well the risks involved and we designed and implemented our features very carefully to be explicit about how this works so it would be all safe. This ended up with very understandable systems IMO. Because we couldn't just easily do this sort of thing, we had to be very careful and clear about how we actually did it. And, in the end that was usually much better, because we weren't haphazardly doing things (like causing UI calls) at times when it was inappropriate.

So again, i reject the idea that you cannot have "a working system for" this case. We accomplished it quite well in a huge number of different domains. i.e. the number of different sorts of UI services that roslyn needs to interact with (while also wanting to be async+bg oriented) is huge. But we were able to conscientiously design ways to deal with this in all circumstances. And, as stated, i think forcing the separation was a good thing.

--

Also, since moving to JTF, i've had to block a bunch of PRs that have tried to now use JTF at inappropriate times. In the past, even trying to use the UI thread in the way the PRs were written simply wouldn't work. You would deadlock, and you'd immediately know this was a bad thing. With JTF, there is the siren call of "it's ok to make this call to the UI thread, and JTF will take care of it for you", when really it isn't. Yes, you won't deadlock. But you'll potentially massively degrade the experience of that feature.

Sometimes it's better to actually consider the UI thread something you should be extremely wary of using, and something that if you misuse causes major problems. It helps force you away from it, and it helps force you to design systems that are much more conscientious about how it is used. That's the approach Roslyn took for many years, and i think it very much was a good thing.

CyrusNajmabadi commented 5 years ago

I can't imagine a world where TPL didn't provide that guarantee. Task.ContinueWith by definition of the very method name suggests that the next task won't execute before the prior one completes. Why do you say TPL could have gotten away without that guarantee?

I can't imagine a world where JTF didn't provide the guarantee we're speaking about now. And yet here we are :)

Also, as an example, TPL got away with not having the guarantee that a descendant task wouldn't run prior to its antecedents completing. So, frankly, i could imagine them doing just about anything.

They could have easily said: if you want to see the written data, pass things along in your task-results to descendent tasks to read.

CyrusNajmabadi commented 5 years ago

But in async/await, there aren't explicit task dependencies.

Whether it is implicit or explicit doesn't make a difference to me. The same behavior should hold for implicit dependencies IMO. Having things be different just makes things complex on the consumption side.

AArnott commented 5 years ago

I can't imagine a world where JTF didn't provide the guarantee we're speaking about now. And yet here we are :)

I thought you were going to say that. But the difference is, ContinueWith screams "this happens later" within its own domain (and more broadly, I argue). But within the domain of async methods, a cancellation token has no guarantee whatsoever of being honored. That's a very strong precedent that JTF didn't set -- we just live with it.

CyrusNajmabadi commented 5 years ago

ContinueWith screams "this happens later" within its own domain (and more broadly, I argue).

Yes.... and when i say "ContinueWith" with this cancellation token, and it says "this will happen later", and i then cancel the token, i expect that the cancellation token would be respected if that later point is on my same thread... :)

AArnott commented 5 years ago

So your option 2 (use TPL and not JTF) is simply not an option that I can imagine a working system for.

That's obviously untrue. Roslyn operated for like 10+ years with TPL and without JTF, while being able to use the UI thread. We designed for this and we very carefully and conscientiously implemented many components that worked fine in this regard.

And yet, Roslyn eventually switched to JTF. Why is that? Did they need something that they couldn't get otherwise? As scenarios expand, and the need to interop better with other VS components grows, JTF becomes a more pressing requirement for anything that is async and may require the UI thread, such as Roslyn.

CyrusNajmabadi commented 5 years ago

And yet, Roslyn eventually switched to JTF. Why is that?

Don't know. I was very resistant to it. But i'm also off the team. I'm not a priori against it being used. But my preference is toward simplicity for precisely the reasons that started this entire conversation.

As scenarios expand, and the need to interop better with other VS components grows, JTF becomes a more pressing requirement for anything that is async and may require the UI thread, such as Roslyn.

Honestly, i don't see why that's the case at all. Roslyn interops with dozens (maybe hundreds) of VS components. And due to the intense UI-affinitized behavior of VS, having to be UI thread-aware was baked into Roslyn-IDE from day one.

In 100% of cases, we were able to properly do things in a conscientious and methodical fashion.

--

Note: again, i'm not against JTF. What i'm pushing for is a path to use JTF that has few "gotchas" as possible. I'm surprised that this would even be contentious. We're all engineers here. We know how problematic it can be when systems behave in subtly different manners. I'm just trying to find effective ways to be able to be an engineer that can use these libraries in a safe manner to deliver an altogether great product to customers.

Also, as an aside, i'm going to ignore me and just talk about Sam/Jason. They are, without question, two of the absolute best people i've known when it comes to understanding the deep complexities around threading/async/await/tasks/VS/STA/ui-threads/component-designs/etc. Even with both of them carefully examining the changes to JTF (and even with me closely examining each line and giving lots of feedback on these PRs) this issue was missed. That's a big red flag that there's an issue here. I'd like Roslyn to be able to use JTF effectively. But part of that, to me, is having confidence that the rules and behaviors i've been able to trust from the TPL world are held in the JTF world.

So, my hope would be that this is the only place they deviate. But i have no confidence or reason to believe that's the case. That's why i asked if you could enumerate any other known differences. I don't want to continue learning about the differences because of crashes (or worse yet, subtle corruptions that may take months to discover). I'd like to know about them up front so i can be auditing code, and keeping that all in mind when i do PRs :)

AArnott commented 5 years ago

Sometimes it's better to actually consider the UI thread something you should be extremely wary of using, and something that if you misuse causes major problems. It helps force you away from it, and it helps force you to design systems that are much more conscientious about how it is used. That's the approach Roslyn took for many years, and i think it very much was a good thing.

:smile: It has come up a few times recently that JTF has made accessing the UI thread so easy that perhaps it's a disservice to itself. I always chuckle a bit when we discuss such a sentiment. I'm quite happy with the fact that we've now moved the problem space from "it's too hard to do it" to "it's so easy but we need to think about when to use it". As a product (or actually, a set of products since JTF is now used in several) we've been able to accomplish many things as far as solving UI delays and enabling greater concurrency and thus customer performance wins as a result of JTF making async/await compatible with UI thread requirements and synchronously blocking that UI thread.

We've heard two engineers' votes here from Roslyn: one for, one against. I'll keep the issue active for a while longer to see if anyone else has votes or arguments for the proposed change.

CyrusNajmabadi commented 5 years ago

We've heard two engineers' votes here from Roslyn: one for, one against. I'll keep the issue active for a while longer to see if anyone else has votes or arguments for the proposed change.

Sounds good :)

AArnott commented 5 years ago

Great. So not to continue the debate, but to answer a question you asked that I think leads to additional value we can glean from this discussion...

@CyrusNajmabadi said:

So, my hope would be that this is the only place they deviate. But i have no confidence or reason to believe that's the case. That's why i asked if you could enumerate any other known differences.

Enumerating differences is tough since the two systems are modeled so differently. But I can share common gotchas that I've seen from the many teams that have migrated so far. And we have analyzers that codify guards for those gotchas wherever we have the means (and time) to do so. I had written up a list of a few gotchas, but then realized that really, our Threading Cookbook and threading analyzers pretty much list all of them (including this one, now that my doc fix PR is in).

CyrusNajmabadi commented 5 years ago

This is a lot to digest. I'll try to budget a few hours to try to internalize it all. It def makes me more wary about the usage of JTF here.

jasonmalinowski commented 5 years ago

@AArnott do we have concrete examples in VS today that rely on the existing behavior for correctness? I ask because I'm seeing both the places where Roslyn is being broken and places in the VS code that (to me at least) also look suspicious. I can understand if we want to add an option or analyzer to avoid a breaking change because yes, it's potentially breaking and I don't dispute that. But how much code is relying on this, intentionally? Counting up the uses I'm seeing a bunch that expect/probably want the behavior where cancellation is raised on the UI thread, and I'm not seeing the other; if that's just me looking in the wrong place I'd want to be corrected ASAP.

I do see the "things shouldn't cancel if they've done the work" argument, but I guess that doesn't seem particularly meaningful to me here. I'd consider that a weak guideline at best, and we've had good reasons to break that in the past. For example, Roslyn's implementation of cancellation-aware AsyncLazy will return a cancelled task for GetValueAsync(), even if we already had a cached result. We found it was better to give a cancelled task back when somebody wants a syntax tree versus let them start to analyze it, observe the cancellation later, and then give up anyways some time later. Put another way, if we didn't have that in AsyncLazy, we'd probably expect all consumers to do that to the point that we'd require an analyzer, at which point we didn't actually gain anything by following this "rule".
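
A minimal sketch of that "check the token even when the answer is already cached" policy. `CachedValue<T>` is a hypothetical stand-in written for illustration; it is not Roslyn's actual `AsyncLazy` implementation, which also handles computing the value.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a cancellation-aware lazy whose value is already computed.
public sealed class CachedValue<T>
{
    private readonly T _value;

    public CachedValue(T value) => _value = value;

    public Task<T> GetValueAsync(CancellationToken cancellationToken)
    {
        // Observe the token first, even though the value is available for free.
        if (cancellationToken.IsCancellationRequested)
            return Task.FromCanceled<T>(cancellationToken);

        return Task.FromResult(_value);
    }
}

public static class Program
{
    public static void Main()
    {
        var cache = new CachedValue<string>("syntax tree");
        using var cts = new CancellationTokenSource();

        Console.WriteLine(cache.GetValueAsync(cts.Token).Status); // before cancellation
        cts.Cancel();
        Console.WriteLine(cache.GetValueAsync(cts.Token).Status); // after cancellation
    }
}
```

The caller gets a canceled task on the second call, so it stops right away instead of starting analysis that it would abandon at its next cancellation check.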

And yet, Roslyn eventually switched to JTF. Why is that?

I don't consider Roslyn as having moved to JTF, at least not in a meaningful way. Yes, it now consumes the NuGet packages, but the majority of our use is either SwitchToMainThreadAsync, or some limited cases in the VS codebase when we're interacting with legacy components. But we're not writing idiomatic JTF code by any stretch of the imagination, and doing so in many cases is still an API back-compat problem. To me the primary benefit up to this point has just been deleting our code that tried to give us an implementation of "am I on the UI thread? I'd like to assert that" and "give me a TaskScheduler for running on the UI thread". If those helpers were available directly there's a lot of places we might have just used those, and I'd still use those. :smile:

It has come up a few times recently that JTF has made accessing the UI thread so easy that perhaps it's a disservice to itself. I always chuckle a bit when we discuss such a sentiment.

I'd say don't chuckle; this to me is the truth and why I've loved that Roslyn didn't use JTF. It means when Roslyn is using the UI thread we are explicit about it and careful about it. When I did the Roslyn part of the project system/language service refactoring recently, there are totally places where I was on a background thread and a SwitchToMainThread() would have made my life easier. But the entire reason we found ourselves rewriting this code was to detach it from the UI thread; I'm cheering the fact that while I've been doing that the shell has been making more services available off the UI thread, which just negates the need for us to use JTF, and means the UI will be less blocked. That's the ultimate win in my book.

AArnott commented 5 years ago

I'd say don't chuckle; ... When I did the Roslyn part of the project system/language service refactoring recently, there are totally places where I was on a background thread and a SwitchToMainThread() would have made my life easier. But the entire reason we found ourselves rewriting this code was to detach it from the UI thread;... I'm cheering the fact that while I've been doing that the shell has been making more services available off the UI thread, which just negates the need for us to use JTF, and means the UI will be less blocked.

Most components don't have the enviable opportunity to rewrite themselves like Roslyn did from the old C# project system and compiler. We have to make it easy to go from UI thread locked to async and enable switching to and from the UI thread easily or almost no one would have ever gone async in VS since they have to be allowed to do so progressively.

I've always maintained that JTF is a bridge between sync and async worlds. As such, if it serves its purpose well, one day we may not need JTF any more since VS will be 100% async.

CyrusNajmabadi commented 5 years ago

I'd say don't chuckle; this to me is the truth and why I've loved that Roslyn didn't use JTF.

Agreed. It was very much a good thing (regardless of TPL/JTF/VS/whatever). It was very valuable to Roslyn to have to be extremely cautious and judicious about this stuff. We've all experienced the component that is technically async but still has a terrible UI experience due to misusing the UI thread. By both explicitly not having easy ways to get to the UI thread, and also having it be that we would generally deadlock if someone even tried this, it forced Roslyn to design effective solutions in these scenarios. Contrast this with previous versions of C# which you might call "correct" but which would also be considered "extremely poor UI citizens" wrt how the UI thread was used. We didn't even allow ourselves to get to that point with Roslyn by basically making the floor lava here :)

To me, the "code just calls willy-nilly into the UI" is effectively as bad as the "STA code pumps on while taking locks" craziness. At least it's always explicit when you're going to the UI thread. But being simple and easy is not actually a virtue for me as i very much never want people to do it haphazardly in roslyn. Sometimes it's good that dangerous stuff is hard. It makes people really have to think very carefully and judiciously about things.

I've always maintained that JTF is a bridge between sync and async worlds

It's worth pointing out that i don't think that's what Roslyn has really used it for. It seems primarily to have been used to replace some components we wrote that were TaskScheduler-based for working with the UI thread. We had a system that did work (and did behave with TPL semantics we expected). But, i think we're always happy to remove code if it turns out to be unnecessary. There was just a mismatch in expectations. We thought "SwitchToMainThreadAsync" was equivalent to "Schedule to run on UI thread, with the same cancellation token guarantees as what TPL offers". We've clearly learned the latter part isn't the case :)

Note: i'm not unhappy with us using it, and i think it works well for this purpose. It's just an important detail we have to add to our mental toolkit (and hopefully to our analyzer toolkit) for future safety.
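
One way to carry that detail in code is to re-check the token right after the switch. This is a sketch under assumptions: `RunOnMainThreadAsync` is a hypothetical helper, and `switchAsync` is a stand-in delegate so the sample runs without the vs-threading package. In real code it might wrap `joinableTaskFactory.SwitchToMainThreadAsync(ct)`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SwitchThenRecheck
{
    // Hypothetical helper: switch threads, then re-check the token before doing any work.
    public static async Task RunOnMainThreadAsync(
        Func<CancellationToken, Task> switchAsync,
        Action work,
        CancellationToken cancellationToken)
    {
        await switchAsync(cancellationToken);

        // Do not assume the switch itself observed the token.
        cancellationToken.ThrowIfCancellationRequested();
        work();
    }

    // Returns whether the work ran when cancellation raced with the switch.
    public static async Task<bool> DemoAsync()
    {
        using var cts = new CancellationTokenSource();
        bool workRan = false;

        // Stand-in switch that completes successfully even though the token
        // is canceled while the switch is in flight.
        Func<CancellationToken, Task> fakeSwitch = _ =>
        {
            cts.Cancel();
            return Task.CompletedTask;
        };

        try
        {
            await RunOnMainThreadAsync(fakeSwitch, () => workRan = true, cts.Token);
        }
        catch (OperationCanceledException)
        {
            // Expected: the explicit re-check threw.
        }

        return workRan;
    }

    public static void Main()
    {
        Console.WriteLine($"work ran: {DemoAsync().GetAwaiter().GetResult()}");
    }
}
```

With the re-check in place the work is skipped, regardless of whether the switch implementation throws on a canceled token.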

CyrusNajmabadi commented 5 years ago

Roslyn's implementation of cancellation-aware AsyncLazy will return a cancelled task for GetValueAsync(), even if we already had a cached result. We found it was better to give a cancelled task back when somebody wants a syntax tree versus let them start to analyze it, observe the cancellation later, and then give up anyways some time later.

Agreed. I view cancellation in the inverted fashion. When someone says "i care about cancellation on this token" they are making an affirmative statement that "i really don't need to run if this cancellation token triggers". So it is less efficient to ever run them since they'll now do work in a context which they already indicated they didn't want to do work for.

I imagine that's why TPL aggressively doesn't bother running those tasks in that case either. The task already told them not to bother, so at best it's just wasted cycles, at worst it's computation that does something it shouldn't.

In this regard, cancellation works under the same TPL assumptions as all the other 'Continue With Options'. I.e. you can say "only continue with this if it ran to completion" or "if it failed" or "if it canceled" etc. By passing cancellationTokens along you are basically saying "don't run this if canceled" without having to redundantly pass that continuation option along.
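
For comparison, a small sketch of the continuation-options side of that analogy, using only `System.Threading.Tasks` (the class name is illustrative). A continuation whose `TaskContinuationOptions` condition is not met is never run and is reported as canceled, much like one whose token was canceled before it started.

```csharp
using System;
using System.Threading.Tasks;

public static class ContinuationOptionsDemo
{
    // Returns the status of a continuation whose condition was not met.
    public static TaskStatus Run()
    {
        var antecedent = new TaskCompletionSource<int>();

        // "Only continue with this if it ran to completion."
        Task continuation = antecedent.Task.ContinueWith(
            t => Console.WriteLine(t.Result),
            TaskContinuationOptions.OnlyOnRanToCompletion);

        // The antecedent fails, so the condition above is not satisfied.
        antecedent.SetException(new InvalidOperationException("antecedent failed"));

        try { continuation.Wait(); }
        catch (AggregateException) { /* the skipped continuation surfaces as canceled */ }

        return continuation.Status;
    }

    public static void Main() => Console.WriteLine(Run());
}
```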

AArnott commented 5 years ago

Sometimes it's good that dangerous stuff is hard.

I have to disagree. Being dangerous is bad enough without it also being hard to do. Before JTF, there were a dozen ways to get to the UI thread, and none of them worked 100% of the time. And predicting which one would work for a given scenario was usually impossible to do, so we'd squint and give it our best shot, then wait for the deadlock bugs to come in. That's no way to develop a stable product.

With JTF, there's always just one way to do it. It's simple, it works, and if you consider a UI thread dependency to be evil, it's easy to notice that you have one and block its introduction in a PR review.

By both explicitly not having easy ways to get to the UI thread, and also having it be that we would generally deadlock if someone even tried this, it forced Roslyn to design effective solutions in these scenarios.

If Roslyn really could go on perpetually without JTF, I guess it's only because you found ways to avoid the UI thread in any path that could become UI blocking and you managed to convince all your partners to never add thread affinity to their code that you called as well. If you can pull that off, more power to ya. You probably are helping VS be more responsive overall and I salute you. But that's a very hard thing to accomplish generally, if even possible -- and for many VS components, it simply isn't possible without serious backward compat breaking changes that would block shipping the product.

But anyway I think I'm digressing into debating the overall JTF rather than focusing on your specific request, which I think we can really distill down to the question that @jasonmalinowski is providing data for on another thread: "Is most code that calls STMTA with a cancellation token meant to proceed if the token is canceled, or are most authors assuming the token hasn't canceled after that point?" If we fix more subtle bugs than we introduce (by a healthy margin) by making a behavioral change, then I expect I'll come around to being willing to throw while still on the UI thread. I don't expect to come around to the idea of waiving an available UI thread to wait and throw later on a threadpool thread as that would create a "by design" perf delay that makes cancellation slower than successful completion. So far, the data Jason is collecting supports the behavioral change.

We had a system that did work (and did behave with TPL semantics we expected).

I'm really curious about this. Did you get it to work by eliminating all UI thread dependencies in code that might be synchronously blocked on by the UI thread? Or did you come up with your own way to allow UI thread-bound task continuations to execute even while you were blocking the UI thread? Did you rely on RPC anywhere?

I imagine that's why TPL aggressively doesn't bother running those tasks in that case either. The task already told them not to bother, so at best it's just wasted cycles, at worst it's computation that does something it shouldn't.

That's a very good point. And aligns with what Jason's code searches are turning up too, I think.

My own argument that STMTA should behave like other async methods is somewhat weakened when we consider that it is not a typical async method. It literally returns a custom awaiter that does nothing but reschedule work (very analogous to a TPL continuation), so perhaps in this async method, following the TPL behavior makes sense. Not because it's TPL and we're modeling after it, but because the original reason TPL took that path also applies to STMTA scenarios.

If we do change the behavior, we need to consider this, and I'd like your input on it. What should happen if you're already on the main thread but your token is canceled when you call STMTA, assuming alwaysYield: false?

CyrusNajmabadi commented 5 years ago

Or did you come up with your own way to allow UI thread-bound task continuations to execute even while you were blocking the UI thread? Did you rely on RPC anywhere?

We just designed things to never do this. If you were on the UI thread, it was only acceptable to ever block a completely computation-bound bg task that itself never had any sort of UI need. And, normally, we've tried to make it so that the UI thread just doesn't try to even block those guys. We've had to in a small number of cases because there is no other way. But, if we have requirements on other systems we consider that to be a responsibility of the UI thread, and not of the BG work we've kicked off. It's kept us sane, and has worked well for pretty much everything that's been thrown at us so far :)

CyrusNajmabadi commented 5 years ago

If we do change the behavior, we need to consider this, and I'd like your input on it. What should happen if you're already on the main thread but your token is canceled when you call STMTA, assuming alwaysYield: false?

Is there something wrong with cancellation at that point? I do assume that any time i await a call that is passed a real cancellation token that it can, well, cancel :) This could be an 'await' to just any method-that-takes-cancellation, and any method-that-takes-cancellation may throw on the very first line, and that would be fine. So i'm not seeing how things change for STMTA.

Is there a reason to expect that it would behave differently, even if you were on the UI thread and called into it? Thanks!

CyrusNajmabadi commented 5 years ago

With JTF, there's always just one way to do it. It's simple, it works, and if you consider a UI thread dependency to be evil, it's easy to notice that you have one and block its introduction in a PR review.

Good point. I can very much respect the value and sanity that brings to many teams :)

CyrusNajmabadi commented 5 years ago

and you managed to convince all your partners to never add thread affinity to their code that you called as well

Basically, as long as VS has been around, you could never change thread affinity (i mean, i suppose you could make a STA component MTA... not sure if anyone ever did though). So, we tried to get free-threaded components from lots of teams. And, if we couldn't, we just accepted the UI thread as a necessary evil and tried to design things so that regardless of that evil, we weren't making things worse.

sharwell commented 5 years ago

I do assume that any time i await a call that is passed a real cancellation token that it can, well, cancel

I never think of it this way. I expect cancellation to trigger the fastest path to the end of the method, which would be successful completion if the remaining work is cheaper than throwing an exception.

CyrusNajmabadi commented 5 years ago

I wasn't saying that it will cancel. Merely that it can cancel. i.e. if given:

Foo(cancellationToken);

If i pass that with a cancelled token, then my expectation is absolutely that this can cancel (at any point or any time), including (but obviously not limited to) immediately upon entering the method. Ergo, the same holds true for STMTA. if i call it with a canceled token, then having it throw seems utterly reasonable and in line with what anyone might think was going to happen for any cancellation aware method.

I expect cancellation to trigger the fastest path to the end of the method

Looking at roslyn, it looks like that's what... one in a 100 occurrences**? I can't even see anything in the framework that would generally make me think this is reasonably common.

--

** Seriously, in roslyn fast-path checks happen 8 times. Cancellation throwing happens 800. I would say that's practically noise for the former case.
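
To make the two styles being counted concrete, here is a minimal sketch. It assumes "throwing" means `ThrowIfCancellationRequested` and "fast-path" means an `IsCancellationRequested` check followed by an early return; the method names are made up for illustration and are not taken from Roslyn.

```csharp
using System;
using System.Threading;

public static class CancellationStyles
{
    // Throwing style: observe the token and throw OperationCanceledException.
    public static int SumOrThrow(int[] items, CancellationToken cancellationToken)
    {
        int sum = 0;
        foreach (int item in items)
        {
            cancellationToken.ThrowIfCancellationRequested();
            sum += item;
        }
        return sum;
    }

    // Fast-path style: leave early without an exception and report it via the return value.
    public static bool TrySum(int[] items, CancellationToken cancellationToken, out int sum)
    {
        sum = 0;
        foreach (int item in items)
        {
            if (cancellationToken.IsCancellationRequested)
                return false;
            sum += item;
        }
        return true;
    }

    public static void Main()
    {
        var items = new[] { 1, 2, 3 };
        using var cts = new CancellationTokenSource();
        cts.Cancel();

        bool completed = TrySum(items, cts.Token, out _);
        Console.WriteLine($"TrySum completed: {completed}");

        try { SumOrThrow(items, cts.Token); }
        catch (OperationCanceledException) { Console.WriteLine("SumOrThrow threw"); }
    }
}
```

Both stop promptly on a canceled token; they differ only in how the caller learns about it.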

AArnott commented 5 years ago

I expect cancellation to trigger the fastest path to the end of the method

Looking at roslyn, it looks like that's what... one in a 100 occurrences**? I can't even see anything in the framework that would generally make me think this is reasonably common.

Um, what? Are you saying, @CyrusNajmabadi that you wouldn't be surprised if canceling an async method would make the method take longer to complete than if you hadn't canceled it?

jasonmalinowski commented 5 years ago

I expect cancellation to trigger the fastest path to the end of the method, which would be successful completion if the remaining work is cheaper than throwing an exception.

How expensive is throwing an exception? How many equivalent IL instructions is it? "Do the cheapest" is definitely a good argument, do we have any numbers to know what that actually is?

jasonmalinowski commented 5 years ago

If we fix more subtle bugs than we introduce (by a healthy margin) by making a behavioral change, then I expect I'll come around to being willing to throw while still on the UI thread. I don't expect to come around to the idea of waiving an available UI thread to wait and throw later on a threadpool thread as that would create a "by design" perf delay that makes cancellation slower than successful completion. So far, the data Jason is collecting supports the behavioral change.

Fully agreed that jumping back to the thread pool to raise the cancellation is silly, so let's not do that. And I'm also 100% for just an analyzer if we determine it is too much of a risk to change the behavior. More than anything else, I'd love it if owners of the code using SwitchToMainThreadAsync that did cancellation could look at their code and see if there are in fact issues there or not. It looks fishy, but all it takes is one case where it's clearly a regression and well, so much for a behavior change.

My own argument that STMTA should behave like other async methods is somewhat weakened when we consider that it is not a typical async method.

Agreed fully: this is as magic of a method as you get; if there was a method where usual rules wouldn't apply, it's here!

CyrusNajmabadi commented 5 years ago

Um, what? Are you saying, @CyrusNajmabadi that you wouldn't be surprised if canceling an async method would make the method take longer to complete than if you hadn't canceled it?

I'm saying: i find it to be completely acceptable and normal for a canceled method to throw a cancellation exception.

It's literally the norm. It's how we ourselves write 99%+ of all our cancellation code. We've got a decade of experience telling us that this is fine. So, as i mentioned on your original question: I am 100% ok if STMTA throws if it gets a canceled cancellation token.

CyrusNajmabadi commented 5 years ago

How expensive is throwing an exception? How many equivalent IL instructions is it? "Do the cheapest" is definitely a good argument, do we have any numbers to know what that actually is?

Given that this is how roslyn basically behaves all the time, and we're massively concurrent+cancelling, i'd say: this is not a concern that we care about.

The only time i've ever had an issue here was at one point when the debugger had a massive perf hit when an exception was thrown. In Roslyn this was then painful as you could get hundreds of cancels a second as things like typing kicked off work only to have them cancel a short while later.

However, that issue was at least 5+ years ago, and there haven't been any problems around cancellation and perf. Indeed, as Jason mentioned, cancelling early is great for perf because we then don't run any computation that is now unnecessary since we already said "we don't want to run anything if cancellation happens".

AArnott commented 5 years ago

For the record, I'm not concerned about the perf cost of throwing an exception. I'm concerned about the perf cost of waiving access to our current UI thread and waiting for an arbitrarily long period of time for a threadpool thread to come along so we can throw the exception on the threadpool instead of the main thread. Throwing from a threadpool thread wouldn't be a breaking change. Throwing from the main thread would be. But throwing from the main thread makes more sense from a perf perspective.

CyrusNajmabadi commented 5 years ago

Throwing from a threadpool thread wouldn't be a breaking change. Throwing from the main thread would be. But throwing from the main thread makes more sense from a perf perspective.

Why would throwing from the main thread be a breaking change? Isn't that what most code that follows STMTA will do when it itself uses that cancellation token? Isn't that what roslyn is doing here?

Do we have an expectation that that would actually break people?