scala / bug


Nested Future blocks do not run in parallel in Scala 2.13.x #12089

Closed lihaoyi closed 3 years ago

lihaoyi commented 4 years ago

reproduction steps

In the code below, we can see that in Scala 2.13.2, the three nested Future blocks run sequentially, whereas in Scala 2.12.12 they run in parallel. I would expect them to run in parallel in both cases, as I am spawning a small number of Futures and have 16 cores on this machine that should be able to run them in parallel.

$ sbt '++2.13.2!; console'
Welcome to Scala 2.13.2 (OpenJDK 64-Bit Server VM, Java 1.8.0_252).
Type in expressions for evaluation. Or try :help.

scala> import scala.concurrent._, ExecutionContext.Implicits._, duration.Duration.Inf

scala> def slow(key: String) = Future{ println(s"$key start"); Thread.sleep(1000); println(s"$key end"); key }

scala> def runAsyncSerial(): Future[Seq[String]] = slow("A").flatMap { a => Future.sequence(Seq(slow("B"), slow("C"), slow("D"))) }

scala> Await.result(runAsyncSerial(), Inf)
A start
A end
D start
D end
C start
C end
B start
B end
val res0: Seq[String] = List(B, C, D)

$ sbt '++2.12.12!; console'
Welcome to Scala 2.12.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_252).

scala> import scala.concurrent._, ExecutionContext.Implicits._, duration.Duration.Inf

scala> def slow(key: String) = Future{ println(s"$key start"); Thread.sleep(1000); println(s"$key end"); key }

scala> def runAsyncSerial(): Future[Seq[String]] = slow("A").flatMap { a => Future.sequence(Seq(slow("B"), slow("C"), slow("D"))) }

scala> Await.result(runAsyncSerial(), Inf)
A start
A end
B start
C start
D start
B end
C end
D end
res0: Seq[String] = List(B, C, D)

AFAICT this is a minimal repro; either (A) wrapping the Thread.sleep in blocking(), (B) using an explicit thread-pool execution context, or (C) removing the flatMap wrapper all seem to make it run correctly in parallel on 2.13.2. This applies to all 2.13.x versions.
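As a minimal illustrative sketch, workaround (A) just wraps the sleep in blocking so the global pool can spawn compensating threads (slowBlocking here is only an illustrative name, not part of the repro):

import scala.concurrent._, ExecutionContext.Implicits._

// same as slow() above, but the sleep is marked as a blocking call
def slowBlocking(key: String): Future[String] =
  Future { println(s"$key start"); blocking(Thread.sleep(1000)); println(s"$key end"); key }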

OTOH -Ydelambdafy:inline or replacing the flatMap with map + flatten seems to make no difference. Neither does changing the JVM version from 8 to 11, or swapping between sbt console and Ammonite, or moving it from the REPL into a .sc script or full .scala file in a Mill project.

In the above example I'm spawning a small number of Futures, but in the case where I hit this in the wild I was spawning 100s of Futures and having them all run serially instead of in parallel, resulting in a 16x slowdown over what I was expecting

Notably, this slowdown applies regardless of how long the operations are: sixteen 1000ms operations that used to finish in roughly 1000ms when run in parallel now take 16000ms, and sixteen 1ms operations go from roughly 1ms to 16ms. Both could be equally bad depending on where it happens (e.g. in a lot of backend services, an extra 15ms on every operation would violate SLAs).

Furthermore, the Futures documentation precisely and accurately describes the behavior of the ExecutionContext.global pre-2.13, as @jimm-porch has pointed out (https://docs.scala-lang.org/overviews/core/futures.html):

The number of concurrently blocking computations can exceed the parallelism level only if each blocking call is wrapped inside a blocking call (more on that below).

Last but not least, you must remember that the ForkJoinPool is not designed for long lasting blocking operations.

This description is no longer accurate as of 2.13. From what I have gathered in the discussion on this thread, the 2.13 behavior is better described as:

The number of any concurrent computations can exceed one only if each call is wrapped in a blocking call

Last but not least, you must remember that the ForkJoinPool is not designed for CPU-bound operations.

SethTisue commented 4 years ago

my understanding is that this is working as designed, as per https://github.com/scala/scala/pull/7470, and that you're expected to deal with it with:

wrapping the Thread.sleep in blocking()

cc @viktorklang

SethTisue commented 4 years ago

(Perhaps https://github.com/scala/scala/releases/tag/v2.13.0 should have some wording about this.)

lihaoyi commented 4 years ago

A silent change in time taken to run the simplest of behaviors from O(1) to O(n) is certainly not what I would expect when I read “faster and more robust” 😛

The linked PR talks about "combinators". I'm not using any combinators here, except for the wrapping flatMap, which is really irrelevant. The only thing I'm doing is spawning Future{...} blocks.

som-snytt commented 4 years ago

On the gitter, I said, "it would be nice if the docs explained the idiom and also why it isn't self-explanatory."

The side-effecting onComplete and foreach are get-out-of-jail-free, perhaps because of the guidance that you don't care when they run or in what order. Maybe also because it represents extra or other work.

Someone also quoted the code comment for batching that is not in the docs. (I just remembered that I've had to quote before, too. Apparently Scaladoc lets you include --access again for non-public members.) The quote breaks off before the caveat:

However, if tasks passed to the Executor are blocking or expensive, this optimization can prevent work-stealing and make performance worse. A batching executor can create deadlocks if code does not use `scala.concurrent.blocking` when it should, because tasks created within other tasks will block on the outer task completing. This executor may run tasks in any order, including LIFO order.

As shown above, the comment should read, "especially LIFO order."

viktorklang commented 4 years ago

A silent change in time taken to run the simplest of behaviors from O(1) to O(n) is certainly not what I would expect when I read “faster and more robust” 😛

If you didn't do the Thread.sleep() it would most likely be faster. Also, not surrounding it in blocking{} is inadvisable.

The linked PR talks about "combinators". I'm not using any combinators here, except from the wrapping flatMap which is really irrelevant. The only thing I'm doing is spawning Future{...} blocks

From the Future.apply scaladoc:

The following expressions are equivalent:

val f1 = Future(expr)
val f2 = Future.unit.map(_ => expr)
val f3 = Future.unit.transform(_ => Success(expr))

If you want to run a side-effect you should consider Future.unit.foreach(_ => effect) or Future.unit.onComplete(_ => effect)

lihaoyi commented 4 years ago

I don't want to run a side effect, I just want to be able to parallelize these computations using Futures, the same way I have been doing since Scala 2 introduced Futures and Promises in SIP14. What puzzles me most is that surely I can't be the first person bumping into these issues? I assume that other people also rely on Futures to parallelize things? The above code snippet isn't exactly an esoteric edge case.

The time taken to run each individual task seems irrelevant here: we have code that used to be parallel but now isn't, and the slowdown can be made arbitrarily large depending on how many Futures I kick off. I hit this kicking off 100ish Futures on a 16-core machine and having them all complete sequentially, a 16x slowdown. If I were running on a beefy EC2 instance like an m5.24xlarge with 96 cores, it would have been a 96x slowdown. It would take quite a lot of reduction in fixed overhead to overcome a 96x slowdown due to loss of parallelization.

Do Futures created in other Futures never run in parallel any more when using the default ExecutionContext? If not, then when do they still run in parallel? This isn't just a theoretical concern: in one part of our system it was causing 16x slowdowns, in another it was causing mysterious 5-10 second delays in our asynchronous push-update code paths. Both of these are real user-facing consequences that I, as the product owner, can't brush under the rug, so I'd like to be able to understand the mechanism that's causing this to behave as it does.

viktorklang commented 4 years ago

@lihaoyi Whether or not something will run in parallel is up to the ExecutionContext. With the default EC it will parallelize "top-most" Futures (assuming that it has more than one thread) and break off eligible "callbacks" once a blocking{} is encountered. Assuming you have written your code to take an implicit ExecutionContext you can pass in one which gives you the desired behavior—there is always a tradeoff between performance and coordination overhead.
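(A rough sketch of passing in such an ExecutionContext, assuming the repro's methods take it implicitly; the pool size of 16 is arbitrary:)

import java.util.concurrent.Executors
import scala.concurrent._

// a plain fixed-size pool instead of the batching global EC
implicit val parallelEc: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))

def slow(key: String)(implicit ec: ExecutionContext): Future[String] =
  Future { println(s"$key start"); Thread.sleep(1000); println(s"$key end"); key }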

What could be considered is a system property to turn batching off, if one wants to maximize parallelism.

lihaoyi commented 4 years ago

With the default EC it will parallelize "top-most" Futures (assuming that it has more than one thread) and break off eligible "callbacks" once a blocking{} is encountered

Thanks Viktor! This is exactly what I wanted to know. I'm aware of the tradeoff between parallelism and coordination overhead. I must say that I have never seen the blocking{} semantics explained before in such a clear and concise fashion

While I don't really agree with changing the choice of tradeoff so drastically for the global execution context in Scala 2.13, it is what it is.

@SethTisue is there any way we can advertise this large change in behavior more broadly? I do think of myself as someone who knows my way around Scala and Futures, but this change in Scala 2.13 completely blindsided me. I spent months wondering why my Futures were sometimes not scheduling promptly (but never in a minimal dev environment, only in production under load!) and a full day trying to minimize the test case presented above from a program behaving 16x slower than it should. I imagine the vast majority of Scala developers would have even more difficulty than I would if they bump into this behavior change in the wild.

lihaoyi commented 4 years ago

Also, if this is expected behavior should we close this issue?

som-snytt commented 4 years ago

I was going to suggest keeping this issue open for improving the docs for clarification. After re-reading the Scaladocs, Javadocs, and the overview section, I understand that task scheduling is up to the execution context, that fork-join tasks should be of an appropriate size, that I don't know what API incurs a submit or a fork in the global context, and that I'm not convinced blocking is a great way to say "this bit of code doesn't block but is an appropriate unit of work for work-stealing." (The Javadocs say a unit of work should be yea-big so scheduling overhead doesn't swamp useful work. What if future used a macro that examined byte code to estimate benefitsFromBatching?)

The overview has a stub section for "extensions", so maybe a point of adumbration would show how to write some code to improve fork-join ergonomics.

Alternatively, a blog post about the Klang levels, K0 thru K3, to correlate expertise with what usages one should attempt. Maybe the API should all take an implicit K-level, import concurrent.K1 would enable certain combinators but no blocking.

I see I was due for a semi-annual review of how do futures work again? because my last was for https://github.com/scala/bug/issues/11827 which was also unexpectedly expected behavior. There must be an Onion headline along the lines of, "Coder most surprised to learn they violated the principle of least surprise."

viktorklang commented 4 years ago

@som-snytt

I'm not convinced blocking is a great way to say "this bit of code doesn't block but is an appropriate unit of work for work-stealing."

That's not really what happens though—if there are operations enqueued locally (on the current thread) then those will be submitted to the pool for execution prior to running the blocking chunk of code, in order to preserve liveness under blocking. This, paired with ManagedBlockers in FJP, works rather nicely by trying to ensure that there are threads available to process non-blocking operations in the face of blocking operations.

nadavwr commented 4 years ago

Prior to 2.13 I remember first seeing blocking used to convey non-blocking fork in the context of Akka's default dispatcher, which at the time was already batching.

Now that ExecutionContext is batching in Scala 2.13, perhaps adding fork as an alias to blocking would be advisable? This would better convey user intent—the user intends the nested Future not to be linked to the same fiber.

mushtaq commented 4 years ago

@viktorklang a clarifying question:

If the slow method above does not use Thread.sleep, instead it represents a non-blocking web-service call (say using Akka Http client), then the nested invocations of slow within runAsyncSerial will run in parallel, right?

viktorklang commented 4 years ago

@mushtaq It depends on how/where the async web-service calls are processed—they would be initiated serially by the Global EC, but their processing may occur on some other pool in parallel.

viktorklang commented 4 years ago

@nadavwr

Now that ExecutionContext is batching in Scala 2.13, perhaps adding fork as an alias to blocking would be advisable? This would better convey user intent—the user intends the nested Future not to be linked to the same fiber.

fork wouldn't be semantically correct since the block of code inside blocking{} is not forked, but any locally (thread locally) enqueued Future-operations would be resubmitted to the pool, so that they can be processed by some other thread.

viktorklang commented 4 years ago

@lihaoyi One option would be to treat Future.apply as a fork (semantically) which would then remove the similarity from Future.unit.map(_ => block) in which case users could themselves select if they wanted fork-like semantics or not.

alexandru commented 4 years ago

Future.apply being a "fork" might be the correct option because users expect it.

In tutorials on Future I've seen it mentioned that Future.apply forks and that if people don't want that then Future.successful should be used.

If an alternative is desired that's only memory-safe (i.e. one that registers a Runnable with Batching in the underlying thread-pool), then that alternative should be separate from Future.apply. I'd suggest Future.trampoline 🙂

nadavwr commented 4 years ago

@viktorklang Right. Perhaps def fork[A](f: => Future[A]): Future[A] = blocking(f) then?

Future.apply does sound pleasantly unsurprising.

mushtaq commented 4 years ago

@mushtaq It depends on how/where the async web-service calls are processed—they would be initiated serially by the Global EC, but their processing may occur on some other pool in parallel.

We are using the same EC (ActorSystem.executionContext) for all non-blocking calls, including the web client. If I understand you correctly @viktorklang, nested Future.sequence will not parallelise in that case. If so, that is really sad 😞

I feel that this is a very fundamental change, and even long-time scala.concurrent.Future users may not have understood this implication yet.

I think some guidance on how to retain the parallel nature of nested Future.sequence calls in purely asynchronous applications will help. Is there any role of blocking in such cases? Wrapping non-blocking web-client calls with blocking does not look right. Maybe, we should use a separate EC even for non-blocking web calls?

viktorklang commented 4 years ago

@mushtaq It completely depends on how ActorSystem.executionContext is implemented.

viktorklang commented 4 years ago

@nadavwr

Right. Perhaps def fork[A](f: => Future[A]): Future[A] = blocking(f) then?

No, that doesn't help.

This would though:

https://github.com/scala/scala/pull/9129/files

lihaoyi commented 4 years ago

I don't think special casing Future.apply is the right way to go. I would expect the following to run in parallel as well:

def slow(key: String) = Future{ println(s"$key start") }.map{_ => Thread.sleep(1000); println(s"$key end"); key }

def runAsyncSerial(): Future[Seq[String]] = slow("A").flatMap { a => Future.sequence(Seq(slow("B"), slow("C"), slow("D"))) }

Is it possible to only batch things together when we're actually chaining operations on the same Future? I want this to be one batch

Future{ Future(...).map(...).map(...).map(...).map(...) }

And I want this to be two batches:

Future{ (Future(...).map(...).map(...), Future(...).map(...).map(...)) }

I think that would be intuitive and would not surprise anyone, whereas the current batching of completely unrelated futures is very surprising

jrudolph commented 4 years ago

At least for Future.apply the 2.13 behavior is unfortunate, indeed.

I think that would be intuitive and would not surprise anyone, whereas the current batching of completely unrelated futures is very surprising

AFAICS this works as expected with Viktor's patch as long as you have a Future.apply in the beginning of each branch that you want to run in parallel. I'm not sure if it's guaranteed to be run in parallel or if it is just more likely. I'm inclined to think that it's guaranteed because all those forked initial future thunks will not be batched but will be put into task submission queues of the underlying EC. And each thread will first work on its batching queue before looking at other tasks, so the parallel tasks should be up for grabs for other threads.

Is it possible to only batch things together when we're actually chaining operations on the same Future? I want this to be one batch

Future{ Future(...).map(...).map(...).map(...).map(...) }

And I want this to be two batches:

Future{ (Future(...).map(...).map(...), Future(...).map(...).map(...)) }

The first example doesn't make sense because there's a single dependency chain between all tasks, so, of course, they must be run sequentially. The question is more if in

val a: Future[...] = ...
Future.sequence(Seq(a.map(longRunning), a.map(longRunning2)))

those two long-running tasks would be run in parallel because they depend on the same original future. If that's not working with Viktor's patch, I'd still find it acceptable if it required using Future.apply around those long-running chunks of work as well.

In general, it's tempting to simply use Future.apply (or map or whatever) for long-running CPU-intensive tasks for parallelization, but for best performance you will always need fine-tuning for these kinds of loads. Putting long-running tasks on the main EC of your application will have almost the same bad thread-starvation effects as blocking tasks. Using blocking still wouldn't be the right thing, because you specifically don't want to spawn extra threads if you are already CPU-bound. In many cases, you might want to either break tasks down into smaller pieces or run them on dedicated ECs to keep the main ECs free for asynchronous task handling.
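(A rough sketch of that last option, with made-up names; the point is just to keep CPU-heavy work off the main/global EC by passing a dedicated one explicitly:)

import java.util.concurrent.Executors
import scala.concurrent._

// a dedicated pool for CPU-heavy work, sized to the machine
val cpuBound: ExecutionContext =
  ExecutionContext.fromExecutor(
    Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors()))

def heavy(n: Int): Int = (1 to n).foldLeft(0)(_ + _)  // stand-in for real CPU-bound work

// Future.apply accepts the EC explicitly, so only this step runs on cpuBound
def compute(n: Int): Future[Int] = Future(heavy(n))(cpuBound)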

lihaoyi commented 4 years ago

AFAICS this works as expected with Viktor's patch as long as you have a Future.apply in the beginning of each branch that you want to run in parallel.

This is good enough. As you said, if both sides start with a: Future[...], I would expect the maps to run in parallel. Future.apply shouldn't be special; what I want is for unrelated tasks in the computation graph to run in parallel.

In general, it's tempting to simply use Future.apply or (map or whatever) for long-running CPU intensive tasks for parallelization but for best performance you will always need fine tuning for these kind of loads. Putting long running tasks on the main EC of your application will have almost the same bad effects of thread starvation as blocking tasks. Using blocking still wouldn't be the right thing because you exactly don't want to spawn extra threads if you are already CPU-bound. In many cases, you might want to either break down tasks into smaller pieces or run them on dedicated ECs to keep the main ECs free for asynchronous task handling.

This may or may not be true, but it's somewhat irrelevant here: these problems are well known, and well understood, well beyond the Scala ecosystem. I can take a colleague who's familiar with Node.js, Java's ThreadPoolExecutors, or Python's multiprocessing.pool and have them immediately understand the problems with thread starvation.

What I'm asking for isn't to magically solve thread starvation, what I'm asking for is to have a predictable behavior that unrelated parts of an asynchronous computation will proceed in parallel, as far as the threads allow. This was the case for the global ExecutionContext for 7 years from Jan 2013, and only changed in June 2019. We can wring our hands forever about what the "ideal" solution would be, but ExecutionContext.global was a working solution that I had put into production many times over the past 7 years, and used in countless onboarding and teaching material. I'm rather unhappy to see my 7 years of production systems, blog posts, tech talks, and a book so casually broken in the name of saving a few nanoseconds.

jrudolph commented 4 years ago

saving a few nanoseconds

What I wanted to say is that in the end, there are at least two different use cases for Futures: one is efficient management of asynchronous callbacks (that are usually short running) vs. using the same infrastructure for efficiently running long-running tasks. Maybe people who mostly run things with async in mind are rather glad about the performance improvements (also because they already manage the long-running tasks in special ways anyway). So, it really is a trade-off that probably went too far with the change in 2.13. With the suggested change to Future.apply, maybe the default solution is better balanced for most cases? But I wouldn't want to be the judge here, it might be that there's no solution that fits all (in which case avoiding breaking stuff might be the better choice).

We can wring our hands forever about what the "ideal" solution would be, this was a working solution

I think we need to wring our hands exactly because things always change, and we need to argue about which improvements we can make and which ones we might not want to make. Stability is important, but there must be room for improvements.

eed3si9n commented 4 years ago

.. ExecutionContext.global was a working solution that I had put into production many times over the past 7 years, and used in countless onboarding and teaching material. I'm rather unhappy to see my 7 years of production systems, blog posts, tech talks, and a book so casually broken in the name of saving a few nanoseconds.

+1 on Haoyi here.

Any changes to the standard library should be made carefully, such that they won't change the semantics of production code, especially for a popular feature like ExecutionContext.global. Given that the semantics can change by bringing in a new ExecutionContext, would it be possible to make the BatchedExecutor an explicit opt-in instead?

viktorklang commented 4 years ago

Some clarifications seem to be in order,

AFAICS there is no semantical change—there is a situation where long-running/blocking operations executed in a nested fashion without blocking{}-demarcation are executed serially rather than in parallel. So the observable behavior is that it will execute more slowly, not semantically differently (as there is never any guarantee that there will be available threads). Note that this only applies to ExecutionContext.global (at this point). Although inconclusive, the fact that this Issue was raised now rather than during the 2.13 RC cycle or in .0, .1, etc. seems to support the above claim.

Making ’global’ be opt-in w.r.t. batching doesn’t make much sense (after thinking about it), since you would have no idea about what code would make what assumption. ’global’ has to have an open-world assumption since it has 0 knowledge about the workloads it will see at runtime—and it is indeed a tricky thing to make something general also optimal.

Ultimately Futures can make no guarantee about parallelism since it is up to the EC to decide, and a potential solution here is to provide more hints to ’global’.

Now, in an effort to increase the potential parallelism it is possible to make Future.apply behave as in my linked PR—and I would be beyond thrilled to get feedback on that approach!

Cheers, V


Jasper-M commented 4 years ago

@viktorklang

Now, in an effort to increase the potential parallelism it is possible to make Future.apply behave as in my linked PR—and I would be beyond thrilled to get feedback on that approach!

IIUC that would mean:

Future{ (Future(A).map(B).map(C), Future(D).map(E).map(F)) }

Future{ Future(G).map(H).flatMap{_ => I; Future(J).map(K)} }

// A, B, C run in one batch.
// D, E, F run in one batch.
// G, H, I run in one batch.
// J, K run in one batch.

That seems totally acceptable to me. Almost the best of both worlds I guess. Ideally you'd probably want A and B in Future(A).flatMap(_ => Future(B)) to run in one batch, but then you could get in trouble again with stuff like Future(...).flatMap{_ => Future(slow code here); more slow code; Future(...)}. So IMHO it looks like a good compromise.

@lihaoyi

I don't think special casing Future.apply is the right way to go. I would expect the following to run in parallel as well:

def slow(key: String) = Future{ println(s"$key start") }.map{_ => Thread.sleep(1000); println(s"$key end"); key }

def runAsyncSerial(): Future[Seq[String]] = slow("A").flatMap { a => Future.sequence(Seq(slow("B"), slow("C"), slow("D"))) }

Unless I'm missing something here B, C and D can run in parallel in this piece of code, no?

jrudolph commented 4 years ago

AFAICS there is no semantical change—there is a situation where long-running/blocking operations executed in a nested fashion without blocking{}-demarcation are executed serially rather than in parallel.

Not sure I can follow that formal argument. It's a rather weak guarantee if only the result stays the same but not the rough runtime/execution behavior. If that's the standard, why bother with improving performance in the first place?

IMO the only justification for changing behavior in such a radical way in a minor release would be to show real-world, non-synthetic benchmarks demonstrating that batching is significantly beneficial in more cases than it hurts in cases like the given one (if that comparison can be done). Of course, the Future.apply case is the most important one, but since Futures are a general tool and you cannot foresee usages, it would be least surprising if all combinators showed the same behavior.

lihaoyi commented 4 years ago

AFAICS there is no semantical change—there is a situation where long-running/blocking operations executed in a nested fashion without blocking{}-demarcation are executed serially rather than in parallel. So the observable behavior is that it will execute slower, not semantically different (as there is never any guarantee that there will be available threads)

This is a distinction without a difference. Adding Thread.sleep(1000) into every Future operation would be a perfectly compatible change by your definition, and it would break our systems just like this change did. This change in particular caused multiple user-facing bugs in our systems: I'm not speaking about hypotheticals when I say some bugs took months to figure out, and with visible user-facing degradation of our service the whole time.

I doubt I'm the only one who'll hit such issues upgrading to 2.13.x. After all, we did spend half a decade telling people to import ExecutionContext.Implicits.global, and I can probably count on one hand the number of people I've personally met who know how to properly use blocking{...} blocks. Everyone in between is going to have the same issues I had.

Although inconclusively, the fact that this Issue was raised now rather than during the 2.13 RC cycle or .0, .1 etc Seems to support the above claim.

Fun fact: this issue was actually noticed by our users on the 2.13.1 release, but it took us until 2.13.3 to figure out what was going on and open this issue! It turns out that subtle, silent, and totally unannounced performance regressions that only occur under load are not easy issues to investigate. You are effectively arguing that because the bug was so tricky to investigate, it isn't a real issue.

There are many ways this change could have been rolled out without breakage: opt-in via an environment variable or JVM property (as suggested by @eed3si9n), opt-in by importing ExecutionContext.Batching.batching instead of ExecutionContext.Implicits.global. Those approaches could still be implemented today in preparation for 2.13.4, at least mitigating the situation for people upgrading directly from 2.12 -> 2.13.4.

I'd love to tell people to use the Scala standard library ExecutionContext.global for a variety of use cases, as I have been doing heavily for the past 7 years, and was planning on doing so for the next 7 years in my book. But given this breaking change with no warning, change-management plan, deprecation period, backwards compatibility story, or easy migration path, I'm forced to consider my alternatives (honestly none of them great).

lihaoyi commented 4 years ago

To answer @Jasper-M

IIUC that would mean:

Future{ (Future(A).map(B).map(C), Future(D).map(E).map(F)) }

Future{ Future(G).map(H).flatMap{_ => I; Future(J).map(K)} }

// A, B, C run in one batch.
// D, E, F run in one batch.
// G, H, I run in one batch.
// J, K run in one batch.

That seems totally acceptable to me. Almost the best of both worlds I guess. Ideally you'd probably want A and B in Future(A).flatMap(_ => Future(B)) to run in one batch, but then you could get in trouble again with stuff like Future(...).flatMap{_ => Future(slow code here); more slow code; Future(...)}. So IMHO it looks like a good compromise.

What about the following?

val p = Promise[T]
val f = p.future
Future{ (f.map(B).map(C), f.map(E).map(F)) }

There are no Future.apply calls at all here; I still expect B and C to be able to run in parallel with E and F.

In the end Future.apply is equivalent to Future.unit.map, not just in the code, but also in people's understanding and expectations. This is something that @viktorklang has been repeatedly harping on over the years, and is a great foundation to build understanding. I don't think breaking that equivalence is the right thing to do: it adds complexity to an otherwise clean underlying programming model, and as the example above demonstrates doesn't fix the problem at all

Jasper-M commented 4 years ago

@lihaoyi That's a good counter-example. If I understand the PR correctly you'd have to wrap f in Future.delegate.

Future{ (Future.delegate(f).map(B).map(C), Future.delegate(f).map(E).map(F)) }

Not a great solution though, cause now—even more than with Future.apply—you'd have to know about all these implementation details in order to adapt your code.

viktorklang commented 4 years ago

I want to clarify to readers of this Issue that the 2.13 ’global’ EC change can cause a (noticeable!) performance regression, not a semantical regression. As performance regressions are not desirable, we’re investigating in this Issue what can be done about it.

In order to remain constructive and solution-oriented, let’s figure out a path forward.

Current options are:

A) Remove BatchingExecutor from ’global’ (will be a performance regression for other use-cases, and I wouldn’t be surprised if someone opens an Issue about that once they notice)
B) Make batching opt-in via sys props (will be a performance regression requiring to add said sys prop)
C) Make batching opt-out (requires adding said sys prop if one notices a problem)
D) Make Future.apply not batchable as seen in PR (unknown performance impact)
E) Somehow make it possible to work-steal from the current Batch? (Unlikely given on-stack execution etc)
F) Other constructive suggestions are most welcome here!

Also, it is clear that the behavior of the global EC needs more documentation, so that reliance on un(der)documented behavior can be avoided. Also, we need to advertise ’blocking{}’ in more places so that pools do not suffer from thread starvation.

Cheers, V

ewoutvonk commented 4 years ago

I didn't get the full gist of the thread yet, only skimmed for now (didn't run into this yet, still on 2.12.x). Need to dive in further for option D. But the core of the matter is what @lihaoyi stated (production impact):

But given this breaking change with no warning, change-management plan, deprecation period, backwards compatibility story, or easy migration path, I'm forced to consider my alternatives (honestly none of them great).

Semantic Versioning 2.0.0 states:

  1. Major version X (X.y.z | X > 0) MUST be incremented if any backwards incompatible changes are introduced...

So to me, the deprecation period would be 2.13.x, with 3.0 making the incompatible change, allowing for a smoother migration path. So I would say:

And, if at all possible, allow switching to option D one way or the other, so it can be A/B tested under production-like loads (only if people find this option interesting).

dwijnand commented 4 years ago

@lihaoyi

Fun fact: this issue was actually noticed by our users in the 2.13.1 release, but it took us until 2.13.3 for us to figure out what's going on and open this issue!

Is that because they upgraded straight to 2.13.1? Because the change happened in https://github.com/scala/scala/releases/tag/v2.13.0.

@ewoutvonk

Semantic Versioning 2.0.0 states:

  1. Major version X (X.y.z | X > 0) MUST be incremented if any backwards incompatible changes are introduced...

So to me, the deprecation period would be 2.13.x and having 3.0 make the incompatible change

For a number of reasons (including just historical), Scala's versioning is Epoch.Major.Minor, so 2.13.0 was the latest major version release of Scala.

lihaoyi commented 4 years ago

Is that because they upgraded straight to 2.13.1?

Yes, it's because we upgraded straight to 2.13.1. Version upgrades take time, and are just one among many professional priorities. I expect to be doing "upgrading to 2.13.x" on and off for the next few years, just as I have been "upgrading to 2.12.x" on and off for the last few. We'll be upgrading to whatever the latest point release is unless there's a compelling reason not to.

Part of the reason I'm making noise about this now, when 2.13.3 is already out, is because I expect a large majority of the Scala community has not upgraded past 2.12. The recent Jetbrains survey claims 40% is on 2.13, and I'm guessing that's a rather optimistic count: the Spark community just got official 2.12 support a few months ago, and the bulk of them are likely on 2.11, upgrading at their own schedule. While this regression may have been a "fait accompli" for the past year, people hitting this for the first time as they upgrade to 2.13 (and realizing they are hitting this; took us 6 months!) has likely only started.

dwijnand commented 4 years ago

Agreed. I didn't mean to imply surprise or disbelief that you didn't immediately upgrade to 2.13.0 😄 I just want to avoid a misunderstanding that the change was in 2.13.1. It's unfortunate that this has only come up now rather than pre-release, but it's a common occurrence in software in general. Thank you for the feedback.

AkashicSeer commented 4 years ago

Breaking, unannounced changes like this ruin trust in languages. It makes me question whether I want to use Scala on large projects. What if another silent, unannounced breaking change happens in the future? I don't feel I can trust the language. The best thing the Scala team could do is to never make unannounced, undocumented changes.

viktorklang commented 4 years ago

This was not a silent, unannounced change as it was released from the first Scala 2.13.0-RC1 (a new major version) and listed in the release notes: https://github.com/scala/scala/releases/tag/v2.13.0-RC1

It has already been made clear that performance regressions are undesirable, so again, let’s focus on possible paths forward and stay constructive.

Thanks, V

SethTisue commented 4 years ago

and listed in the release notes: https://github.com/scala/scala/releases/tag/v2.13.0-RC1

And in https://github.com/scala/scala/releases/tag/v2.13.0, too — though if the content and wording there can be improved, it certainly isn't too late to do so. As others have pointed out, 2.13 adoption is a long-term, ongoing process. People will be consulting the 2.13.0 release notes for years to come.

@AkashicSeer Please don't use scala/bug for this kind of commentary. I have hidden your comment as off-topic.

Tickets here should stay focused on specific technical problems and proposed solutions, not on wider issues around release planning, adoption, and so forth. There's plenty of other places to have those discussions.

jimm-porch commented 4 years ago

Before I say anything let me say I respect everyone on this thread, and have learned a ton from many of you (via YouTube, blog posts, Twitter, etc). I'm a big fan of ScalaTags. And I'm reluctant to jump in, in case I'm misunderstanding a deeper technical issue here, but I'm going to be foolish this time and probably embarrass myself somehow.

I started using Scala and futures about 5 years ago with v2.11, and I've always been under the impression that we were not supposed to run any long-running code inside of the default ExecutionContext. I feel like that's been pretty clear in the docs since at least 2.11.

The 2.11 scala.concurrent package Scaladocs point to this documentation. In the section on the global EC:

The number of concurrently blocking computations can exceed the parallelism level only if each blocking call is wrapped inside a blocking call (more on that below). Otherwise, there is a risk that the thread pool in the global execution context is starved, and no computation can proceed.

Last but not least, you must remember that the ForkJoinPool is not designed for long lasting blocking operations.

I have used this simple approach with new Scala developers for many years:
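(A minimal illustrative sketch of that kind of approach, assuming it amounts to wrapping anything long-running in blocking when on the default EC; fetchReport is a made-up name:)

import scala.concurrent._, ExecutionContext.Implicits.global

// anything that blocks or runs long goes inside blocking{},
// so the global pool knows it may need to compensate with extra threads
def fetchReport(id: String): Future[String] = Future {
  blocking {
    Thread.sleep(2000)  // stand-in for a slow JDBC/HTTP call
    s"report-$id"
  }
}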

My point is: the documentation has been telling us since 2.11 that if we use the default EC, we should only run long-running operations inside blocking. It doesn't seem like the library authors are breaking a major contract by changing the behavior in this way for 2.13, since the documentation has said "please don't use it this way" for two major versions.

eed3si9n commented 4 years ago

I followed the discussion leading up to Scala 2.13.0 fairly closely, partly because there were a few changes I'd put in, and also because I was on the Scala team overhearing discussions etc., but personally I was not aware that the following line

  • Made the global ExecutionContext “batched”. (#7470)

implied the observable, breaking changes discussed in this issue.

Even today, looking at https://github.com/scala/scala/pull/7470, it just says that the performance is better, but it doesn't say anywhere that code that used to run in parallel will now run sequentially.

The documentation states that

Futures provide a way to reason about performing many operations in parallel – in an efficient and non-blocking way.

If we can't tell what runs in parallel, I'd say we cannot reason about the operations in parallel. Regarding ExecutionContext.global, it says that it's backed by ForkJoinPool, but the change that Viktor mentions:

With the default EC it will parallelize "top-most" Futures (assuming that it has more than one thread) and break off eligible "callbacks" once a blocking{} is encountered.

I don't think that was implied in the overview. To some degree, whether something is blocking or wrapped in blocking { ... } is a red herring. As Haoyi's example showed, what becomes top-level could change depending on the structure of the code:

val p = Promise[T]
val f = p.future
Future{ (f.map(B).map(C), f.map(E).map(F)) }

IMO, the safest thing to do would be to introduce an opt-in like import ExecutionContext.topLevelOnly and make ExecutionContext.global behave the same as Scala 2.12.x. (The other option would be to deprecate ExecutionContext.global and force everyone to pick one but that didn't go so well for IEEE sorting order.)

alexandru commented 4 years ago

First of all introducing "batching" executors was the right decision. We've been using batching executors in Monix for a long time now. The problem isn't global, in my opinion the problem is Future's API.

On the proposed solutions:

A) Remove BatchingExecutor from ’global’ (will be a performance regression for other use-cases, and I wouldn’t be surprised if someone opens an Issue about that once they notice)

I would not go with this, first of all because ExecutionContext.Implicits.global isn't the only thread pool that people use. For example, what about Akka's fork-join dispatchers? There's also Monix's Scheduler, a drop-in replacement for ExecutionContext, which is already batched and could make use of the new Batchable. It would be a shame if we couldn't, because people expect Future.apply to fork.

As soon as people initialize, by mistake, a batching executor, the problem is coming back with a vengeance.

B) Make batching opt-in via sys props (will be a performance regression requiring to add said sys prop)
C) Make batching opt-out (requires adding said sys prop if one notices a problem)

I think we should avoid any solution based on system properties. System properties are meant for optimizations.

When people expect some tasks to run in parallel, because it worked like that for the last 7 years, but it no longer does, this is no longer an optimization; rather, it becomes a correctness issue.

D) Make Future.apply not batchable as seen in PR (unknown performance impact)

Yes, this is in my opinion the obviously correct approach (even if it's debatable if that's enough).

But I would go further: I would deprecate Future.apply because it would be a confusing name for what it does and then introduce 2 new operations:

  1. Future.fork
  2. Future.trampoline (or Future.batch?)

And if the deprecation of Future.apply isn't acceptable, I would still introduce Future.trampoline. And then you could change the docs to say that Future.trampoline(f) <-> Future.unit.map(_ => f).

viktorklang commented 4 years ago

And if the deprecation of Future.apply isn't acceptable, I would still introduce Future.trampoline. And then you could change the docs to say that Future.trampoline(f) <-> Future.unit.map(_ => f).

@alexandru There would be no guarantee that the execution would be trampolined though—as it is up to the EC to decide.

mdedetrich commented 4 years ago

Here are my many cents

  1. There was never a guarantee that any code running Future would be either sequential or parallel, because of its core design around ExecutionContext. The whole point of Future is that it lets you run computations on defined executors, and it's up to executors to determine whether something is run in parallel or sequentially (I am agreeing with @viktorklang here). If you want to be explicit about concurrency (i.e. what is run sequentially and what is run in parallel) and also have the ability to reason about it, then you should look at IO types such as Monix Task or FS2 IO. Also, a lot of people are confusing performance with concurrency; they aren't the same thing. If you do want to be explicit about concurrency with Future, then explicitly specify ExecutionContexts to the map/flatMap arguments (see point 3).
  2. Don't use Thread.sleep ever when using Future. This is bad/brittle design that was bound to blow up at some point. You should be using something like akka.pattern.after if you want a Thread.sleep like mechanism when working with Future. If you do really need to use long blocking calls (such as Thread.sleep) then you should be using a specific ExecutionContext/blocking for this (as is clearly documented and pointed out here https://github.com/scala/bug/issues/12089#issuecomment-663232629)
  3. Don't import an implicit global ExecutionContext wherever you use Future. This is also bad design. Instead, you should be passing the ExecutionContext around implicitly and, at least for typical apps, supply it in one place (i.e. in Main). If you were doing this, then even with this "breaking behavior" the fix would literally be one line of code (see the sketch after this list). This design is also intentional, because one of the points of Future is that you may want to override a specific computation with a custom ExecutionContext (a good example is GUI applications, where you have one ExecutionContext backed by the UI thread and another general ExecutionContext; any calls to update the UI can be done by using map/flatMap and providing the UI ExecutionContext explicitly).
  4. The latest change was made because Future had terrible performance for hot sequential loops, which basically means we traded one performance default for another. Not to point fingers here, but Future received a lot of unjustified bad press from certain actors creating deliberate scenarios/presentations/etc. that made Future look bad (e.g. a presentation showing how slow Future is when calculating Fibonacci via Future map calls, even though the main use case of Future was never as a general-purpose IO type like in Haskell but rather as a way to deal with HTTP/TCP/etc. calls efficiently). In other words, I am pretty sure that if people hadn't deliberately misrepresented Future in a bad light then @viktorklang probably wouldn't have felt "pushed" to change the default ExecutionContext to a batching one (although a batching one should still have been implemented).
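(A minimal sketch of point 3, with made-up names, just to show that the EC is then chosen in exactly one place:)

import scala.concurrent._

object UserService {
  // library code never imports a global EC; it just requires one
  def loadUser(id: Long)(implicit ec: ExecutionContext): Future[String] =
    Future(s"user-$id")
}

object Main extends App {
  // the EC is supplied exactly once, here; swapping it for a
  // non-batching or dedicated pool is a one-line change
  implicit val ec: ExecutionContext = ExecutionContext.global
  UserService.loadUser(42L).foreach(println)
}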

@lihaoyi At the risk of sounding blunt/direct, I think the issue you are describing is actually a non-issue. Although there is an argument that behavior was changed, this argument is weaker if you misused a certain abstraction (in this case Future) and relied on this misuse (whether intentional or not). I personally don't have a problem with changing back the default ExecutionContext, but at the same time, from the sounds of it, the code is written in a way that relies on unclear behavior, and this will always be a problem. You will likely get other issues in the Future when something else changes, even if it is once in 7 years.

alexandru commented 4 years ago

@viktorklang

@alexandru There would be no guarantee that the execution would be trampolined though—as it is up to the EC to decide.

Doesn't matter, in practice most execution contexts will eventually be batched, because the advantages are obvious. And people need to be able to specify via the API a hard fork.

Naming is hard though for the trampoline. Maybe name it Future.eval? As long as you can specify via the API which is a hard fork and which isn't.

alexandru commented 4 years ago

@mdedetrich

There was never a guarantee that any code running Future would either be sequential or parallel because of its core design with ExecutionContext.

The purpose of Future is to describe the results of asynchronous tasks, and as such those tasks need the ability to run either sequentially or in parallel, with the Future being in charge of combining the results; otherwise there's no point in using it. You are right that the parallelism of Future.sequence(Seq(Future(1), Future(2), Future(3))) was never guaranteed. But that's a problem with Future's API that needs to be fixed, as we need a way to specify "run these Future-producing tasks in parallel".

Note that it doesn't really matter if the guarantee was there or not. If that's how people ended up using it in practice and it's been used like that for the last 7 years, then it is de facto standard.

So at this point we should do something to fix this, while allowing for the new optimizations, which I like very much too.

As I mentioned above, the problem IMO is the behavior of Future.apply. Let's leave Future.apply as it was before the change, and introduce a new API for batched evaluation.

mdedetrich commented 4 years ago

@alexandru

The purpose of Future is to describe results of asynchronous tasks and as such those tasks need the ability to run either sequentially or in parallel, with the Future being in charge of combining the results, otherwise there's no point in using it.

The thing is, this is still the case assuming you are using Future properly. The batching ExecutionContext that @viktorklang made the default is only meant to batch extremely fast instructions so they execute on the same thread. Futures with the batching ExecutionContext will still run asynchronously as long as you don't misuse it (i.e. putting Thread.sleep in a Future without blocking). So actually, if you had used Future properly, there wouldn't have been a problem.

Note that it doesn't really matter if the guarantee was there or not. If that's how people ended up using it in practice and it's been used like that for the last 7 years, then it is de facto standard.

Well, that's the problem. People were misusing Future (this misuse was clearly documented) and then got upset when their code broke. I understand the problem with code changing behavior, but if we keep up this attitude then the Future abstraction will become entirely useless and impossible to reason about (which you pointed out earlier).

And this is really the issue, an abstraction is only as useful as the guarantees/contracts it can provide. If people misuse abstractions and then people rely on these misuses then the abstractions can't properly evolve.

I mean if Scala had a decent linting history, doing something like

Future { Thread.sleep(1000) }

would have thrown a linting warning saying "Please use blocking when putting blocking functions inside of Future, otherwise such behavior is undefined".

As I mentioned above, the problem IMO is the behavior of Future.apply. Let's leave Future.apply as it was before the change, and introduce a new API for batched evaluation.

As mentioned before, this is not going to fix the problem, and it will likely cause even more problems because it's changing many more things. If you don't want to break people's code then I would revert the change to the default ExecutionContext and maybe change it in a future Scala version, once we put in warnings like the one mentioned above, so it's not such a surprise for people using Future.

Jasper-M commented 4 years ago

@mdedetrich I don't think you can see a Future { blockingOperation() } or Future { veryLongComputation() } in isolation and conclude that the programmer is doing something wrong. Whether it's problematic or not depends on the ExecutionContext that is being used.

It used to be the case that in practice ExecutionContext.global worked fine for parallelizing Future { veryLongComputation() } as long as you made sure no other parts of your application were scheduling futures on it. Though using a fixedThreadPool would probably have been better.

mdedetrich commented 4 years ago

@Jasper-M It was documented as incorrect usage in Future's documentation regardless of the ExecutionContext being used. Scala literature also states that you should never make blocking calls in a Future, because even if it did work you would easily create other problems (i.e. thread starvation).

The whole point of blocking is that the ExecutionContext is completely unaware of what code is being run through it so blocking is meant to signal that "this is a long running thread blocking operation", a lot of other IO types also need similar functionality for the same reason. If no one uses blocking then you get problems which actually the latest change in Scala 2.13.x was meant to solve (ironically).

It used to be the case that in practice ExecutionContext.global worked fine for parallelizing Future { veryLongComputation() }

Yes, but that's a different discussion. The issue is not about having a

Future { veryLongComputation()}

but rather

Future { veryLongBlockingComputation()}

The former is fine and causes no issues with the new default batching ExecutionContext since it detects such scenarios (i.e. if it sees that it's long enough then it will execute it asynchronously). The issue is purely with the latter, which is a misuse.