Open iRevive opened 3 years ago
The service consumes messages from a queue/socket (Kafka, ZMQ, etc.) and broadcasts every message to N different pipelines.
Perhaps I'm doing it wrong and I don't need `broadcastTo(...)` at all? Can I achieve similar behavior using a different API? I was thinking about one `fs2.concurrent.Queue` per pipeline as an alternative solution.
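A minimal sketch of the queue-per-pipeline alternative mentioned above, assuming fs2 2.x / cats-effect 2 (the versions used later in this thread). The pipelines here are hypothetical stand-ins that just count elements; the real service would plug in its actual processing stages:

```scala
import java.util.concurrent.atomic.AtomicLong

import cats.effect.{ContextShift, IO}
import cats.syntax.all._
import fs2.{Pipe, Stream}
import fs2.concurrent.Queue

import scala.concurrent.ExecutionContext

object QueuePerPipelineSketch {
  implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)

  val processed = new AtomicLong(0)

  // Hypothetical pipelines standing in for the real per-destination processing.
  val pipelines: List[Pipe[IO, Int, Unit]] =
    List.fill(2)(_.evalMap(_ => IO(processed.incrementAndGet()).void))

  val source: Stream[IO, Int] = Stream.range(1, 100).covary[IO] // 99 elements

  val program: IO[Unit] =
    Stream
      .eval(pipelines.traverse(_ => Queue.boundedNoneTerminated[IO, Int](64)))
      .flatMap { queues =>
        // Producer: offer every element to each queue, then signal end-of-stream.
        val produce =
          source.evalMap(i => queues.traverse_(_.enqueue1(Some(i)))) ++
            Stream.eval_(queues.traverse_(_.enqueue1(None)))
        // Consumers: each pipeline drains its own queue, all joined concurrently.
        val consume =
          Stream
            .emits(queues.zip(pipelines))
            .map { case (q, pipe) => q.dequeue.through(pipe) }
            .parJoinUnbounded
        consume.concurrently(produce)
      }
      .compile
      .drain

  def main(args: Array[String]): Unit = {
    program.unsafeRunSync()
    println(processed.get) // 99 elements x 2 pipelines = 198
  }
}
```

Unlike `broadcastThrough`, each pipeline here consumes at its own pace from its own bounded queue, so a slow pipeline only backpressures the producer once its queue fills.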
I haven't measured the performance of akka-streams yet, but I've reproduced almost everything you've written about fs2 using JMH. This is the benchmark:
```scala
@State(Scope.Thread)
class BroadcastBenchmark {

  val executor = Executors.newSingleThreadExecutor

  implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.fromExecutor(executor))
  implicit val concurrent: Concurrent[IO] = IO.ioConcurrentEffect

  @Param(Array("1000", "100000"))
  var streamSize: Int = _

  val fs2Stream = Stream.range(1, streamSize).covary[IO] // Every number goes to its own chunk

  val drainPipe: Pipe[IO, Int, INothing] = _.drain

  @Benchmark
  def noPipe(): Unit =
    fs2Stream.compile.drain.unsafeRunSync()

  @Benchmark
  def through(): Unit =
    fs2Stream.through(drainPipe).compile.drain.unsafeRunSync()

  @Benchmark
  def broadcastTo1(): Unit =
    fs2Stream.broadcastThrough(1)(drainPipe).compile.drain.unsafeRunSync()

  @Benchmark
  def broadcastTo10(): Unit =
    fs2Stream.broadcastThrough(10)(drainPipe).compile.drain.unsafeRunSync()

  @Benchmark
  def broadcastTo100(): Unit =
    fs2Stream.broadcastThrough(100)(drainPipe).compile.drain.unsafeRunSync()

  @TearDown
  def shutdown(): Unit = executor.shutdown()
}
```
This is the result:

```
[info] Benchmark                          (streamSize)   Mode  Cnt        Score        Error  Units
[info] BroadcastBenchmark.broadcastTo1            1000  thrpt    5     4245,855 ±   1822,801  ops/s
[info] BroadcastBenchmark.broadcastTo1          100000  thrpt    5     4369,746 ±    117,701  ops/s
[info] BroadcastBenchmark.broadcastTo10           1000  thrpt    5     1097,015 ±     32,263  ops/s
[info] BroadcastBenchmark.broadcastTo10         100000  thrpt    5     1104,885 ±    113,161  ops/s
[info] BroadcastBenchmark.broadcastTo100          1000  thrpt    5      119,786 ±      1,063  ops/s
[info] BroadcastBenchmark.broadcastTo100        100000  thrpt    5      121,689 ±     19,877  ops/s
[info] BroadcastBenchmark.noPipe                  1000  thrpt    5  1028299,135 ± 148425,624  ops/s
[info] BroadcastBenchmark.noPipe                100000  thrpt    5  1054733,488 ±  66014,908  ops/s
[info] BroadcastBenchmark.through                 1000  thrpt    5   405338,190 ±   4622,959  ops/s
[info] BroadcastBenchmark.through               100000  thrpt    5   382236,503 ±  37407,147  ops/s
```
The most interesting part of the result is that `broadcastTo1` is one hundred times slower than `through`, despite being semantically equivalent.
@nikiforo thank you for the benchmark. The difference between `through` and `broadcast` is significant.

Am I reading the benchmark output lines for `streamSize` ∈ {1000, 100000} correctly that even though we're processing 100 times more elements in the `streamSize` = 100000 run, the runtime is more or less the same as with `streamSize` = 1000?
@balthz, :blush: it seems that `fs2Stream` should be set in `@Setup`; I'll fix it and resubmit.
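For readers unfamiliar with the bug being fixed here: JMH constructs the `@State` object first and injects `@Param` values into the field afterwards, so a `val` computed in the constructor body observes the default `0`. A plain-Scala sketch of the same initialization-order trap (no JMH involved; class and field names are made up):

```scala
// Mimics the JMH lifecycle: the state object is constructed first,
// and @Param injection happens afterwards, before any benchmark runs.
class BenchState {
  var streamSize: Int = _                  // where @Param would be injected
  val eagerRange = Range(1, streamSize)    // evaluated at construction: streamSize is still 0
  def lazyRange  = Range(1, streamSize)    // evaluated per call: sees the injected value
}

object InitOrderDemo {
  def main(args: Array[String]): Unit = {
    val state = new BenchState
    state.streamSize = 100000              // what JMH's param injection effectively does
    println(state.eagerRange.size)         // prints 0: the eager val captured the default
    println(state.lazyRange.size)          // prints 99999: the def reads the injected value
  }
}
```

This is why the first benchmark effectively streamed zero elements regardless of `streamSize`; moving the stream construction into an `@Setup` method (or a `def`) makes it see the injected parameter.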
@balthz, I've changed two things in the benchmark. The updated benchmark:
```scala
@State(Scope.Thread)
class BroadcastBenchmark {

  val jExecutor = Executors.newSingleThreadExecutor
  val sExecutor = ExecutionContext.fromExecutor(jExecutor)
  val blocker   = Blocker.liftExecutionContext(sExecutor)

  implicit val cs: ContextShift[IO] = IO.contextShift(sExecutor)
  implicit val concurrent: Concurrent[IO] = IO.ioConcurrentEffect

  val drainPipe: Pipe[IO, Int, INothing] = _.drain

  @Param(Array("10", "1000", "100000"))
  var streamSize: Int = _

  var fs2Stream: Stream[IO, Int] = _

  @Setup
  def start(): Unit = {
    fs2Stream = Stream.range(1, streamSize).covary[IO] // Every number goes to its own chunk
  }

  @Benchmark
  def noPipe(): Unit = execute(fs2Stream)

  @Benchmark
  def through(): Unit = execute(fs2Stream.through(drainPipe))

  @Benchmark
  def broadcastTo1(): Unit = execute(fs2Stream.broadcastThrough(1)(drainPipe))

  @Benchmark
  def broadcastTo10(): Unit = execute(fs2Stream.broadcastThrough(10)(drainPipe))

  @Benchmark
  def broadcastTo100(): Unit = execute(fs2Stream.broadcastThrough(100)(drainPipe))

  @TearDown
  def shutdown(): Unit = jExecutor.shutdown()

  private def execute[O](s: Stream[IO, O]) = blocker.blockOn(s.compile.drain).unsafeRunSync()
}
```
The result:

```
[info] Benchmark                          (streamSize)   Mode  Cnt      Score      Error  Units
[info] BroadcastBenchmark.broadcastTo1              10  thrpt    7   3011,053 ±  168,120  ops/s
[info] BroadcastBenchmark.broadcastTo1            1000  thrpt    7    107,267 ±    3,405  ops/s
[info] BroadcastBenchmark.broadcastTo1          100000  thrpt    7      1,110 ±    0,023  ops/s
[info] BroadcastBenchmark.broadcastTo10             10  thrpt    7    642,015 ±   37,393  ops/s
[info] BroadcastBenchmark.broadcastTo10           1000  thrpt    7     15,928 ±    0,631  ops/s
[info] BroadcastBenchmark.broadcastTo10         100000  thrpt    7      0,159 ±    0,006  ops/s
[info] BroadcastBenchmark.broadcastTo100            10  thrpt    7     51,828 ±    3,519  ops/s
[info] BroadcastBenchmark.broadcastTo100          1000  thrpt    7      0,809 ±    0,077  ops/s
[info] BroadcastBenchmark.broadcastTo100        100000  thrpt    7      0,008 ±    0,001  ops/s
[info] BroadcastBenchmark.noPipe                    10  thrpt    7  86335,522 ± 2813,905  ops/s
[info] BroadcastBenchmark.noPipe                  1000  thrpt    7   8454,641 ±  520,836  ops/s
[info] BroadcastBenchmark.noPipe                100000  thrpt    7     96,398 ±    1,572  ops/s
[info] BroadcastBenchmark.through                   10  thrpt    7  55143,756 ± 2378,198  ops/s
[info] BroadcastBenchmark.through                 1000  thrpt    7   2022,979 ±   62,640  ops/s
[info] BroadcastBenchmark.through               100000  thrpt    7     20,612 ±    0,794  ops/s
```
The ratio of `BroadcastBenchmark.broadcastTo1` to `BroadcastBenchmark.through` throughput is now 1 : 20.
@nikiforo I have run the benchmarks today, using the current `main` branch (the to-be 3.0.0 release). It seems that the problem has been solved in this release. Given the differences between the two branches, I'm not sure whether any fix can easily be backported.
```
[info] Benchmark                          (streamSize)   Mode  Cnt      Score       Error  Units
[info] BroadcastBenchmark.broadcastTo1              10  thrpt    7  59585.114 ±   779.182  ops/s
[info] BroadcastBenchmark.broadcastTo1            1000  thrpt    7  59281.966 ±   414.307  ops/s
[info] BroadcastBenchmark.broadcastTo1          100000  thrpt    7  57560.715 ±  5457.228  ops/s
[info] BroadcastBenchmark.broadcastTo10             10  thrpt    7  58253.115 ±   779.112  ops/s
[info] BroadcastBenchmark.broadcastTo10           1000  thrpt    7  57614.055 ±  3648.170  ops/s
[info] BroadcastBenchmark.broadcastTo10         100000  thrpt    7  54289.644 ± 11283.495  ops/s
[info] BroadcastBenchmark.broadcastTo100            10  thrpt    7  56528.557 ±   324.216  ops/s
[info] BroadcastBenchmark.broadcastTo100          1000  thrpt    7  57334.595 ±  1157.871  ops/s
[info] BroadcastBenchmark.broadcastTo100        100000  thrpt    7  57276.760 ±   558.260  ops/s
[info] BroadcastBenchmark.noPipe                    10  thrpt    7  60658.243 ±  1057.896  ops/s
[info] BroadcastBenchmark.noPipe                  1000  thrpt    7  60790.862 ±   363.710  ops/s
[info] BroadcastBenchmark.noPipe                100000  thrpt    7  61149.630 ±   466.613  ops/s
[info] BroadcastBenchmark.through                   10  thrpt    7  59219.496 ±   581.634  ops/s
[info] BroadcastBenchmark.through                 1000  thrpt    7  59507.022 ±   326.828  ops/s
[info] BroadcastBenchmark.through               100000  thrpt    7  58832.862 ±   248.395  ops/s
```
Well, I'm about to reimplement `broadcastThrough` anyway (and remove `broadcast` and `Broadcast`). I do wonder why the performance has changed; I don't think we've touched it yet. Maybe it's just CE3 being faster.

> val executor = Executors.newSingleThreadExecutor

mmm
@SystemFw So, the benchmarks that I ran with the new version may have been different, due to the translation from CE2 to CE3: https://github.com/typelevel/fs2/pull/2296. It does not seem as if that single-threaded executor got applied to them.
> val executor = Executors.newSingleThreadExecutor
>
> mmm
Actually, that was intentional! :))) And I want to discuss it a bit, if you don't mind.
I'm not sure whether that is correct. Moreover, I've seen that `.global` is usually used in benchmarks. I've read the recommendation to have at least two threads to eliminate halts (but from my POV this rule should apply to real code, not benchmarks). However, I think it's fairer to compare performance (at least in this particular example) using a single-threaded `Concurrent`.
I've also tried to google something with `jmh`, `thread pool`, etc., but found nothing. Do you have any guidelines or recommendations about JMH benchmarking and thread pools? And why is a `singleThreadExecutor` a no-go?
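To make the trade-off concrete, here is a small stdlib-only sketch (no fs2 or JMH; the object and method names are made up) showing what a single-thread executor implies: every submitted task, including all concurrent broadcast consumers, serializes onto one carrier thread, whereas `ExecutionContext.global` can spread them across cores:

```scala
import java.util.concurrent.Executors

import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object SingleThreadDemo {

  // Run ten "concurrent" tasks and report how many distinct carrier
  // threads actually executed them.
  def distinctThreadCount(): Int = {
    val executor = Executors.newSingleThreadExecutor()
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)
    val names  = Future.traverse((1 to 10).toList)(_ => Future(Thread.currentThread.getName))
    val result = Await.result(names, 5.seconds).distinct
    executor.shutdown()
    result.size
  }

  def main(args: Array[String]): Unit =
    println(distinctThreadCount()) // prints 1: all tasks serialize onto the single thread
}
```

That serialization is arguably what makes the comparison "fair" here, but it also hides any parallelism a runtime could otherwise exploit, which matters for CE3's work-stealing pool mentioned below.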
@SystemFw @diesalbla
> @nikiforo I have run the benchmarks today, using the current `main` branch (the to-be 3.0.0 release). It seems that the problem has been solved in this release.
@diesalbla, I expected to see `broadcastTo100` ten times slower than `broadcastTo10`, which in turn should be ten times slower than `broadcastTo1` :confused:
Moreover, I thought the runtime should increase linearly with the size of the supplied stream :fearful:
@nikiforo The reason I commented on that is that a large amount of Cats Effect 3's performance improvements (in real-world scenarios, not just flatMap loops) comes from its new work-stealing thread pool, so if you don't use it you might miss out on significant perf.
@diesalbla @SystemFw,
```scala
package fs2
package benchmark

import cats.effect.IO
import cats.effect.unsafe.implicits.global

import org.openjdk.jmh.annotations.{Benchmark, Param, Scope, State}

@State(Scope.Thread)
class BroadcastBenchmark {

  val drainPipe: Pipe[IO, Int, INothing] = _.drain

  @Param(Array("10", "1000", "100000"))
  var streamSize: Int = _

  def fs2Stream: Stream[IO, Int] = Stream.range(1, streamSize).covary[IO] // Every number goes to its own chunk

  @Benchmark
  def noPipe(): Unit = execute(fs2Stream)

  @Benchmark
  def through(): Unit = execute(fs2Stream.through(drainPipe))

  @Benchmark
  def broadcastTo1(): Unit = execute(fs2Stream.broadcastThrough(drainPipe))

  @Benchmark
  def broadcastTo10(): Unit = execute(fs2Stream.broadcastThrough(Seq.fill(10)(drainPipe): _*))

  @Benchmark
  def broadcastTo100(): Unit = execute(fs2Stream.broadcastThrough(Seq.fill(100)(drainPipe): _*))

  private def execute[O](s: Stream[IO, O]) = s.compile.drain.unsafeRunSync()
}
```
```
commit 8711623915614d1a789acf88c5e488c735cf32f7 (HEAD, origin/main)
Merge: a95ee4491 25a9cfa7f
Author: Lars Hupel <lars.hupel@mytum.de>
Date:   Wed Mar 10 20:37:19 2021 +0100
```

```
benchmark/jmh:run -i 5 -wi 3 -f1 -t1 fs2.benchmark.BroadcastBenchmark
```
```
[info] Benchmark                          (streamSize)   Mode  Cnt      Score       Error  Units
[info] BroadcastBenchmark.broadcastTo1              10  thrpt    5   2417,671 ±   250,849  ops/s
[info] BroadcastBenchmark.broadcastTo1            1000  thrpt    5     65,955 ±    24,632  ops/s
[info] BroadcastBenchmark.broadcastTo1          100000  thrpt    5      0,809 ±     0,209  ops/s
[info] BroadcastBenchmark.broadcastTo10             10  thrpt    5   1040,351 ±   108,177  ops/s
[info] BroadcastBenchmark.broadcastTo10           1000  thrpt    5     26,999 ±     3,637  ops/s
[info] BroadcastBenchmark.broadcastTo10         100000  thrpt    5      0,272 ±     0,057  ops/s
[info] BroadcastBenchmark.broadcastTo100            10  thrpt    5    136,757 ±    21,407  ops/s
[info] BroadcastBenchmark.broadcastTo100          1000  thrpt    5      2,850 ±     0,404  ops/s
[info] BroadcastBenchmark.broadcastTo100        100000  thrpt    5      0,031 ±     0,003  ops/s
[info] BroadcastBenchmark.noPipe                    10  thrpt    5  87944,581 ± 11867,569  ops/s
[info] BroadcastBenchmark.noPipe                  1000  thrpt    5  10571,977 ±   778,623  ops/s
[info] BroadcastBenchmark.noPipe                100000  thrpt    5    118,984 ±     7,147  ops/s
[info] BroadcastBenchmark.through                   10  thrpt    5  63908,399 ±  2748,942  ops/s
[info] BroadcastBenchmark.through                 1000  thrpt    5   3379,372 ±    34,668  ops/s
[info] BroadcastBenchmark.through               100000  thrpt    5     37,355 ±     1,078  ops/s
```
The ratio of `BroadcastBenchmark.broadcastTo1` to `BroadcastBenchmark.through` throughput is now 1 : 25 :worried: :scream:
Hi team, is there an update on this? Thanks.
I have retried the benchmarks today, on my laptop, with the current `main` branch:
```
Benchmark        (streamSize)   Mode  Cnt      Score      Error  Units
broadcastTo1               10  thrpt    5   1610.128 ±  981.404  ops/s
broadcastTo1             1000  thrpt    5     69.026 ±    5.545  ops/s
broadcastTo1           100000  thrpt    5      0.564 ±    0.297  ops/s
broadcastTo10              10  thrpt    5    703.255 ±  106.253  ops/s
broadcastTo10            1000  thrpt    5     17.701 ±    3.800  ops/s
broadcastTo10          100000  thrpt    5      0.169 ±    0.032  ops/s
broadcastTo100             10  thrpt    5     99.769 ±    2.559  ops/s
broadcastTo100           1000  thrpt    5      1.997 ±    0.185  ops/s
broadcastTo100         100000  thrpt    5      0.019 ±    0.003  ops/s
noPipe                     10  thrpt    5  84179.766 ± 1327.783  ops/s
noPipe                   1000  thrpt    5   5933.968 ±   66.318  ops/s
noPipe                 100000  thrpt    5     56.078 ±    0.651  ops/s
through                    10  thrpt    5  54918.753 ±  255.205  ops/s
through                  1000  thrpt    5   1702.249 ±   37.491  ops/s
through                100000  thrpt    5     15.439 ±    0.072  ops/s
```
There is still a performance difference between `through` and `broadcastTo1`.
Using `Stream#broadcastTo` with a high-message-rate source leads to increased CPU usage; meanwhile, Akka shows 2-3x less CPU usage. I tried different JVMs: OpenJDK 11, OpenJDK 14, and OpenJDK 14 OpenJ9. The CPU usage is more or less the same on every virtual machine.
Prerequisites
JVM: OpenJDK 64-Bit Server VM (11.0.4+11, mixed mode)
fs2: 2.4.6
akka-streams: 2.6.6
Scenario 1. fs2. map. ~15% CPU usage
Scenario 2. fs2. Broadcast to 1 pipe. ~28% CPU usage
Scenario 3. fs2. Broadcast to 100 pipes. 50-80% CPU usage
Scenario 4. Akka. Broadcast to 100 pipes. ~28% CPU usage
Related issues: https://github.com/typelevel/fs2/issues/1406 https://github.com/typelevel/fs2/issues/1469