I wonder if this is related to https://github.com/typelevel/fs2/pull/2734. Would you mind trying with these versions?
before: 3.2-111-594f8fe
after: 3.2-124-832c161
@armanbilge I was thinking about this change as well.
Sure, I will rerun benchmarks against these versions.
@armanbilge not much of a difference 🤔
Benchmark Mode Cnt Score Error Units
DriverBenchmark.createOne avgt 25 4.753 ± 0.383 ms/op
DriverBenchmark.return100Records avgt 25 10.699 ± 0.411 ms/op
DriverBenchmark.returnRandomUUID avgt 25 3.078 ± 0.283 ms/op
Benchmark Mode Cnt Score Error Units
DriverBenchmark.createOne avgt 25 4.794 ± 0.494 ms/op
DriverBenchmark.return100Records avgt 25 10.758 ± 0.408 ms/op
DriverBenchmark.returnRandomUUID avgt 25 3.065 ± 0.281 ms/op
Benchmark Mode Cnt Score Error Units
DriverBenchmark.createOne avgt 25 4.838 ± 0.420 ms/op
DriverBenchmark.return100Records avgt 25 11.064 ± 0.435 ms/op
DriverBenchmark.returnRandomUUID avgt 25 3.119 ± 0.326 ms/op
If you have bandwidth to bisect further, you can use the --first-parent
option to get only merge commits and use the hash to look up the version in maven:
https://repo1.maven.org/maven2/co/fs2/fs2-io_2.13/
@armanbilge I will experiment with different versions during the day
Thanks!
Nothing really jumps out when looking at the diff between v3.2.2 and 594f8fe. Most likely culprits:
Indeed, my project depends on cats-effect-std, hence the IO version comes in transitively via the fs2 dependency.
fs2 3.2.2 with cats-effect 3.3.0 shows the same throughput as 3.2.3.
Thank you for your help and sorry for wasting your time.
Not a waste of time at all, thanks for reporting! It would be good to understand what's causing the slowdown in cats-effect. My guess is that this is the overhead of tracing, but that's a total guess.
Cats Effect 3.0.0 did not have tracing. It came with 3.2.0. The latest fs2 3.2.3 should be on CE 3.3.0.
Can you try the benchmark with the system property -Dcats.effect.tracing.mode=none? More details here: https://typelevel.org/cats-effect/docs/tracing#configuration
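For reference, a hedged sketch of one way to get that property into the forked JMH JVMs (the benchmark class below is a stand-in for illustration, not the actual DriverBenchmark from this thread):

```scala
import org.openjdk.jmh.annotations._

// Stand-in benchmark: @Fork(jvmArgsAppend = ...) appends the system property
// to the forked benchmark JVMs, which is where it needs to land for JMH runs.
@State(Scope.Benchmark)
@Fork(value = 1, jvmArgsAppend = Array("-Dcats.effect.tracing.mode=none"))
class TracingDisabledBenchmark {

  @Benchmark
  def baseline(): Int = 1 + 1
}
```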
I'm running benchmarks on a different machine so numbers differ from yesterday.
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 7.010 ± 0.337 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 246.768 ± 12.250 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3068430.610 ± 10566.046 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 245.038 ± 18.895 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3051556.639 ± 232239.895 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.052 ± 0.101 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 705.673 ± 1401.404 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.419 ± 0.463 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 5707.785 ± 6392.072 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 80.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 275.000 ms
DriverBenchmark.return100Records:·stack avgt NaN ---
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 7.986 ± 0.315 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 217.543 ± 9.826 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3097923.212 ± 9261.729 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 215.997 ± 25.889 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3074210.574 ± 334508.436 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.031 ± 0.037 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 454.083 ± 576.467 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.467 ± 0.715 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 6855.137 ± 10494.727 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 77.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 282.000 ms
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 7.792 ± 0.323 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 220.893 ± 9.084 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3065325.686 ± 21557.399 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 221.037 ± 22.768 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3065786.971 ± 270556.884 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.063 ± 0.116 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 943.848 ± 1749.708 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.421 ± 0.573 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 6075.865 ± 8203.016 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 80.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 281.000 ms
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 10.010 ± 0.504 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 178.274 ± 9.522 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3185210.074 ± 13694.388 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 176.030 ± 26.087 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3143461.596 ± 418077.571 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.051 ± 0.084 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 910.347 ± 1460.952 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.488 ± 0.541 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 8868.587 ± 9979.185 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 58.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 408.000 ms
DriverBenchmark.return100Records:·stack avgt NaN ---
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 7.325 ± 0.355 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 249.123 ± 12.370 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3244986.275 ± 11358.509 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 251.110 ± 25.120 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3270291.158 ± 288189.174 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.036 ± 0.072 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 513.287 ± 1066.717 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.371 ± 0.447 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 5275.698 ± 6492.410 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 87.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 255.000 ms
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 8.001 ± 0.343 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 217.396 ± 9.389 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3197155.595 ± 16900.517 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 218.988 ± 24.045 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3362897.380 ± 395288.166 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.027 ± 0.035 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 437.723 ± 581.754 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.467 ± 0.715 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 7273.963 ± 11126.741 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 77.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 299.000 ms
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 7.662 ± 0.310 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 237.331 ± 10.179 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3219401.063 ± 15108.885 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 241.757 ± 23.280 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3283962.366 ± 312667.926 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.024 ± 0.030 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 346.730 ± 441.142 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.350 ± 0.505 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 5005.121 ± 7250.275 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 80.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 284.000 ms
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 10.424 ± 0.391 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 182.919 ± 6.906 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3410284.489 ± 39986.912 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space avgt 25 182.465 ± 26.496 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm avgt 25 3402766.732 ± 477752.507 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen avgt 25 0.143 ± 0.145 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm avgt 25 2694.292 ± 2722.628 B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space avgt 25 0.768 ± 0.501 MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm avgt 25 14438.726 ± 9527.695 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 61.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 409.000 ms
Thanks for the update!
fs2 3.2.2 with cats-effect 3.3.0 shows the same throughput as 3.2.3.
Can you repeat this experiment with tracing disabled?
Sorry, actually no. I think you want to compare fs2 3.2.2 with CE 3.2.9 and CE 3.3.0, both with tracing disabled.
@armanbilge I've updated the comment above
@iRevive Thanks! For the fs2 3.2.2 / CE 3.2.9, I assume that was with tracing? Can you run those with tracing disabled too? (same option)
@armanbilge yeah, the difference is caused by tracing.
I've updated my comment with extra benchmarks. Seems I covered all possible combinations.
Thanks for all your efforts, and sorry for the confusion. I think these are the 4 combinations we need:
You've done one of the first two, but I'm not sure which (I assume with tracing enabled, because that's the default).
Sorry to be a nuisance. Running CE 3.2.9 with tracing disabled is helpful to check whether the performance change was caused by changes in tracing or by changes in the runtime itself.
@armanbilge done, see the updated comment https://github.com/typelevel/fs2/issues/2746#issuecomment-988800286
Awesome, thank you! 😁 let's have a look..
Looking at the time delta, this seems about in line with what I would expect for tracing code which is mostly compute bound: about a 25% difference. Fully compute bound would be about 30%, more than likely.
The really interesting thing here is the GC time, though. The tracing benchmark hit fewer GC iterations but still took almost twice as long. I wonder if that means we could optimize tracing a bit further in practice by streamlining GC costs?
Edit: actually that shift only happened in 3.3.0, so it's almost certainly being caused by the weak bag shenanigans. We're forcing the GC to work harder in order to avoid bogging down the critical path.
Benchmark 3.2.9- 3.2.9+ 3.3.0- 3.3.0+ 3.2.9- 3.2.9+ 3.3.0- 3.3.0+
DriverBenchmark.return100Records 7.010 7.986 7.792 10.010 7.325 8.001 7.662 10.424
DriverBenchmark.return100Records:·gc.alloc.rate 246.768 217.543 220.893 178.274 249.123 217.396 237.331 182.919
DriverBenchmark.return100Records:·gc.alloc.rate.norm 3068430.610 3097923.212 3065325.686 3185210.074 3244986.275 3197155.595 3219401.063 3410284.489
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space 245.038 215.997 221.037 176.030 251.110 218.988 241.757 182.465
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm 3051556.639 3074210.574 3065786.971 3143461.596 3270291.158 3362897.380 3283962.366 3402766.732
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen 0.052 0.031 0.063 0.051 0.036 0.027 0.024 0.143
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm 705.673 454.083 943.848 910.347 513.287 437.723 346.730 2694.292
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space 0.419 0.467 0.421 0.488 0.371 0.467 0.350 0.768
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm 5707.785 6855.137 6075.865 8868.587 5275.698 7273.963 5005.121 14438.726
DriverBenchmark.return100Records:·gc.count 80.000 77.000 80.000 58.000 87.000 77.000 80.000 61.000
DriverBenchmark.return100Records:·gc.time 275.000 282.000 281.000 408.000 255.000 299.000 284.000 409.000
Thanks to the tabular format, it's easier to see that I'm probably wrong on the compute-bound bit. Notice how tracing enabled/disabled makes very little difference on 3.2.9, but it matters a lot on 3.3. That suggests the fiber dump mechanism is primarily at fault. Absolutely fascinating.
Can you open an issue on CE for this? I think this is a meaningful performance regression.
We can transfer this one actually.
We seem to be hitting this issue particularly hard in Scala.js (dev environment), with a performance degradation of a whole order of magnitude. For example, an iteration over ~100 IO calls which took between 1 and 2 seconds in 3.2.9 now takes around 30 seconds in 3.3.0.
So far, I've been unable to avoid the issue by setting process.env.CATS_EFFECT_TRACING_MODE="none". Even though it's set when the application starts (before invoking any Scala.js code), the performance degradation persists.
Any pointers for further exploring solutions would be welcome.
@rpiaggio thanks for reporting and sorry for your troubles. Yes, unfortunately tracing does have a far more severe impact on JS than JVM for reasons we are still trying to understand. This is why it is only enabled in development mode. Which browser are you using?
However, setting CATS_EFFECT_TRACING_MODE as you've done should for the most part disable it; I'm not sure why that's not working for you. I'm assuming this is related to the explore project linked above, so I'll take a look there.
See also:
@rpiaggio one idea, to check that CATS_EFFECT_TRACING_MODE is working: can you raise an exception from inside CE and report the stacktrace? If it's enhanced, then that means tracing is still enabled.
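A minimal sketch of that check, assuming a plain cats-effect IOApp (the app and names here are illustrative, not code from the project in question):

```scala
import cats.effect.{IO, IOApp}

object TracingCheck extends IOApp.Simple {

  // If tracing is active, the printed stack trace will contain enhanced frames
  // pointing at the map/flatMap call sites below; if it is disabled, only the
  // raw runtime frames show up.
  def run: IO[Unit] =
    IO(1)
      .map(_ + 1)
      .flatMap(_ => IO.raiseError[Unit](new RuntimeException("tracing check")))
      .handleErrorWith(e => IO(e.printStackTrace()))
}
```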
@armanbilge Thank you for the links.
Which browser are you using?
Chrome Version 96.0.4664.93 (Official Build) (x86_64)
I'll take a look there.
Thank you!
@rpiaggio one idea, to check that CATS_EFFECT_TRACING_MODE is working: can you raise an exception from inside CE and report the stacktrace? If it's enhanced, then that means tracing is still enabled.
Good idea. I tried your suggestion and I'm actually not getting enhanced stacktraces.
This is what I get when setting CATS_EFFECT_TRACING_MODE: "none":
vs what I get when not setting CATS_EFFECT_TRACING_MODE:
Aha, that is helpful. So now that we've confirmed tracing is off, you are definitely still seeing a significant performance regression? That is concerning!
The FiberAwareEC is still being constructed in this case because presumably Chrome supports weak maps? If yes, then we have a lot of wrapping and unwrapping of Runnables, and it needs to be measured.
I would be really surprised if that's a 15x regression though, that doesn't quite seem right.
Ah, @vasilmkd is right, my bad 😅 we need one more conditional there to check if tracing is disabled. Thanks, this is an easy fix!
So I think we're probably putting just a bit more strain than we should be on the GC. This doesn't show up as problematic in our own benchmarks since we don't stress the heap very much, but in scenarios like the one in the OP it becomes much more pronounced. A 20% hit for a feature which is intended to be lossless in terms of performance is a bit too much, IMO. At the very least it would merit some more granular configurability, since the scenario here is clearly not sensitive to the performance hit of tracing itself, but is sensitive to the weak tracking.
I think a good action item here would be to try to replicate something like this in our own performance tests. If we can create a benchmark which reproduces the problem, we can try to iterate on the implementation more effectively. My guess is that we may have to back away from the weak thread-local striping, but maybe there's a way we can be creative to keep that optimization intact.
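Along those lines, a rough sketch of what such a reproduction might look like (purely illustrative, not an existing CE benchmark; the iteration count is arbitrary):

```scala
import java.util.concurrent.TimeUnit

import cats.effect.IO
import cats.effect.unsafe.implicits.global
import org.openjdk.jmh.annotations._

// Many tiny IO allocations per operation, so the run is dominated by heap
// pressure and per-fiber weak tracking rather than by real I/O.
@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MILLISECONDS)
class HeapPressureBenchmark {

  @Benchmark
  def manySmallIOs(): Long = {
    def loop(i: Int, acc: Long): IO[Long] =
      if (i >= 100000) IO.pure(acc)
      else IO(acc + i).flatMap(loop(i + 1, _))

    loop(0, 0L).unsafeRunSync()
  }
}
```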
We can always replace the WeakHashMap with your original idea; I don't think the thread locality is the problem.
WeakHashMap does allocate on every use and does a bit of management on every access (to remove expired keys), and both of these things are not something we ideally want. Another alternative is to develop our own weak hash map with features that are more desirable for Cats Effect, like fewer allocations and faster access.
@vasilmkd This would need some measurement, but if I correctly understand the way the GC works, the WeakReference itself is more the problem. I actually wonder if we can batch the weak references to avoid having one per fiber.
I mostly brought up the thread locality because, if we don't leverage the GC to perform the cleanup, we have to find some other way to solve the problem of an IOCont resuming on a different thread than the one which suspended it.
I don't see how WeakReferences can be avoided and/or not be 1:1 with what you're trying to hold in a data structure. If you don't use a WeakReference and you still hold a reference to an object, then that reference is strong and the object is not eligible for collection.
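As a toy illustration of that constraint (nothing to do with CE internals; Payload is a made-up class):

```scala
import java.lang.ref.WeakReference

object WeakRefDemo extends App {
  final class Payload(val id: Int)

  var strong: Payload = new Payload(1)
  val weak = new WeakReference(strong)

  // Drop the only strong reference; the object is now eligible for collection
  // precisely because the remaining reference is weak.
  strong = null
  System.gc() // best-effort request, so the outcome below is only "likely"
  Thread.sleep(100)

  // Typically prints None once the referent has been collected.
  println(Option(weak.get()).map(_.id))
}
```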
@iRevive If you could try the following snapshot, that would be amazing. Thank you.
"io.vasilev" %% "cats-effect" % "3.3-87-1d1f0ff"
That's what I got today:
3.3-87-1d1f0ff - 3.3-87-1d1f0ff + 3.3.0 - 3.3.0 +
throughput/avgt 9.096 10.316 9.083 10.790 ms/op
gc.alloc.rate 198.949 180.254 198.430 174.146 MB/sec
gc.alloc.rate.norm 3250096.228 3364384.042 3246550.722 3390473.917 B/op
gc.churn.G1_Eden_Space 198.509 183.132 203.094 177.976 MB/sec
gc.churn.G1_Eden_Space.norm 3250850.493 3415242.285 3321291.087 3474275.752 B/op
gc.churn.G1_Old_Gen 0.152 0.074 0.121 0.079 MB/sec
gc.churn.G1_Old_Gen.norm 2612.132 1460.794 2061.829 1577.063 B/op
gc.churn.G1_Survivor_Space 0.599 0.620 0.486 0.555 MB/sec
gc.churn.G1_Survivor_Space.norm 10117.412 11649.115 7970.496 10718.606 B/op
gc.count 68.000 66.000 67.000 62.000 counts
gc.time 325.000 364.000 343.000 448.000 ms
What about the throughput results? The GC time is lower with the fix.
The 3.3-87-1d1f0ff is roughly 5% faster than 3.3.0.
How does that compare to 3.2.9? Is that still not sufficient?
I did several measurements yesterday. I've increased the measurement time to decrease the margin of error for the score.
- means without tracing, + means with tracing:
3.2.9 - 3.2.9 + 3.3.0 - 3.3.0 + 3.3-87-1d1f0ff - 3.3-87-1d1f0ff +
op average time (ms) 6.738 7.914 7.272 9.387 7.220 9.191
gc.alloc.rate 415.267 359.753 383.671 309.012 384.822 312.819
gc.alloc.rate.norm 3133369.005 3186737.717 3120760.158 3240605.996 3108259.130 3218716.403
gc.churn.G1_Eden_Space 419.880 363.964 387.659 313.066 388.866 315.969
gc.churn.G1_Eden_Space.norm 3168091.608 3224107.625 3153284.906 3283367.318 3140929.511 3251105.574
gc.churn.G1_Old_Gen 0.010 0.006 0.009 0.051 0.012 0.009
gc.churn.G1_Old_Gen.norm 74.127 57.027 76.633 542.592 95.361 89.211
gc.count 779.000 675.000 668.000 486.000 721.000 586.000
gc.time 648.000 575.000 578.000 1844.000 609.000 837.000
The GC time is lower with the fix. And 3.3-87-1d1f0ff works slightly faster when tracing is enabled (~2-3%) compared to 3.3.0.
From what I see, enabled tracing affects the throughput of the library by ~15% on 3.2.9 and ~22% on 3.3-87-1d1f0ff.
If we compare 3.2.9 with 3.3-87-1d1f0ff, we have the following:
- with tracing disabled, 3.3-87-1d1f0ff is ~7% slower
- with tracing enabled, 3.3-87-1d1f0ff is ~14% slower
is ~14% slowerThe newly-released 3.3.2 includes the fixes tested here. One thing I would really like to do is figure out why tracing disabled has more overhead in this release than in previous ones. We'll leave this issue open to track that problem.
Regarding the increase in overhead with tracing enabled, I think we can definitely continue to iterate on that case, but I expect that the scenario described in the OP is always going to be sub-optimal. From what I can tell, @iRevive is describing an application which evalMaps over a Stream[F, Byte], or rather something relatively close to it. The problem with these types of workloads is that they innately produce an enormous amount of heap pressure, meaning they end up entirely GC-bound, which is why the weak bag overhead starts to dominate.
Cats Effect is fundamentally optimized around the following assumption: the number of IO instances constructed is O(kn), where n is the number of network/file I/O operations and k is some (probably large, possibly amortized) constant factor. Running evalMap on a Stream[F, Byte] violates this assumption because evalMap must produce at least one (and actually many more than one) F instance per Byte, and thus k is proportional to the size of the stream and not itself a constant. On some level, Cats Effect (and indeed, all JVM frameworks, effect-system or otherwise) will always do quite poorly in these types of workloads.
Fwiw, the way to avoid this and recover the constant-factor k is to use .chunks (or similar) on the Stream and only allocate IOs around the chunk boundaries, rather than within them. Maintaining this pattern throughout your application cleanly separates your hot path (which will contain no IO allocations) and should dramatically improve performance, with or without tracing.
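To make the suggestion concrete, a small sketch of the two shapes (handleByte and handleChunk are hypothetical stand-ins for whatever per-element work the application does):

```scala
import cats.effect.IO
import fs2.{Chunk, Stream}

object ChunkedProcessing {
  // Hypothetical per-element and per-chunk handlers.
  def handleByte(b: Byte): IO[Unit]         = IO.unit
  def handleChunk(c: Chunk[Byte]): IO[Unit] = IO.unit

  // Anti-pattern for hot paths: at least one IO per byte, so the number of IO
  // allocations grows with the size of the stream (k is no longer constant).
  def perByte(bytes: Stream[IO, Byte]): Stream[IO, Unit] =
    bytes.evalMap(handleByte)

  // Preferred: allocate IOs only at chunk boundaries; the per-byte work stays
  // inside the chunk and allocates no IO at all.
  def perChunk(bytes: Stream[IO, Byte]): Stream[IO, Unit] =
    bytes.chunks.evalMap(handleChunk)
}
```

With the second shape, the effect allocations scale with the number of chunks rather than the number of bytes.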
Anyway, I'm making a few assumptions about the OP case, but I think I'm at least close to the mark. :-) The point I'm making is that we probably can't make OP's scenario fast, just because the JVM is… the JVM. What we can do though is fix the tracing-disabled performance regression (which is still open), and ensure that tracing-enabled performance overhead is kept to a minimum (which is what was done in 3.3.2).
@djspiewak thank you for the details.
According to StreamDecoder L56, socket.reads.through(streamDecoder.toPipeByte) should operate on a chunk rather than a single byte.
On the other hand, StreamDecoder re-evaluates the decoding step if the current bytes are not enough to parse the packet. I can assume this happens quite often, especially in the scenario I benchmarked.
I will experiment with the given suggestion:
the way to avoid this and recover the constant-factor k is to use .chunks (or similar) on the Stream and only allocate IOs around the chunk boundaries, rather than within them
Perhaps a custom accumulating stage (before decoding) will do the trick.
Perhaps a custom accumulating stage (before decoding) will do the trick.
I think that's a good idea! This should also result in less backtracking in scodec. It's tricky though because you have to try to predict the "right" chunk size. Hard to give good advice here without seeing your problem space, but accumulating into larger chunks is definitely what I would try, particularly if it can be done on the sending side rather than on the receiving side.
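One possible shape for such an accumulating stage on the receiving side, sketched with fs2's groupWithin (the 4096-byte threshold and 10 ms flush window are illustrative, not tuned values):

```scala
import scala.concurrent.duration._

import cats.effect.IO
import fs2.Stream

object Accumulate {
  // Re-chunk the incoming bytes into larger batches before they reach the
  // decoder, so it sees fewer, bigger chunks and backtracks less often.
  def accumulate(bytes: Stream[IO, Byte]): Stream[IO, Byte] =
    bytes
      .groupWithin(4096, 10.millis) // emit when 4096 bytes arrive or 10 ms pass
      .flatMap(Stream.chunk)        // flatten back into a byte stream
}
```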
@iRevive Have you tested the most recent release?
@He-Pin not really, to be honest.
I guess I can run some benchmarks this weekend to see the difference.
Quite a lot of things have changed: internal implementation and optimization, testing platform (apple intel -> apple m1), JVM (11 -> 17). But the pattern is still there.
VM version: JDK 17.0.3, OpenJDK 64-Bit Server VM, 17.0.3+7-LTS
Machine: Apple M1 Pro, 32 GB RAM
OS: macOS 13.5.1 (Ventura)
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 3.108 ± 0.069 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 852.355 ± 21.231 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3141439.919 ± 16541.680 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 107.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 66.000 ms
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 3.041 ± 0.139 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 869.532 ± 43.311 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3135138.349 ± 17124.071 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 100.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 67.000 ms
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 3.895 ± 0.334 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 689.632 ± 61.692 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3196648.931 ± 16613.670 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 54.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 347.000 ms
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 3.062 ± 0.099 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 853.968 ± 29.860 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3104378.022 ± 12574.350 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 101.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 67.000 ms
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 3.377 ± 0.077 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 766.447 ± 18.633 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 3068305.840 ± 23767.470 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 98.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 61.000 ms
Benchmark Mode Cnt Score Error Units
DriverBenchmark.return100Records avgt 25 3.198 ± 0.315 ms/op
DriverBenchmark.return100Records:·gc.alloc.rate avgt 25 795.719 ± 70.716 MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 2996199.095 ± 27658.703 B/op
DriverBenchmark.return100Records:·gc.count avgt 25 104.000 counts
DriverBenchmark.return100Records:·gc.time avgt 25 68.000 ms
Tracing in 3.5.1 is less noticeable: ~5.3%, vs ~15-21% in 3.3.0.
There is also an official driver, written in Java, that utilizes Netty under the hood. The numbers are the following:
Benchmark Mode Cnt Score Error Units
JavaDriverBenchmark.return100Records avgt 25 2.666 ± 0.532 ms/op
JavaDriverBenchmark.return100Records:·gc.alloc.rate avgt 25 20.940 ± 7.373 MB/sec
JavaDriverBenchmark.return100Records:·gc.alloc.rate.norm avgt 25 57620.287 ± 18067.576 B/op
JavaDriverBenchmark.return100Records:·gc.count avgt 25 ≈ 0 counts
The Scala driver is 13-18% slower, but the allocation rate is completely different. There are several reasons for that: the Scala driver utilizes immutable structures, refined types, extra levels of abstraction, and the FS2 NIO socket layer, while Netty is doing Netty things :)
Also, a while ago I was experimenting with the Netty socket layer instead of the FS2 NIO one (mostly code adapted from https://github.com/typelevel/fs2-netty), and the allocation rate dropped by 2-2.5x. The allocation rate was never an issue for the project (at least for my team), so I didn't progress far with that.
@iRevive thanks for that update!
- Tracing in 3.5.1 is less noticeable: ~5.3%, vs ~15-21% in 3.3.0
Interesting, I'm not sure what's significantly changed between those releases besides the integrated timers.
Along those lines ...
Bonus point
If you have any extra bandwidth, it might be interesting to try out the JDK polling system (https://github.com/typelevel/cats-effect/issues/3692#issue-1757703692) and also the io_uring polling system (https://github.com/typelevel/cats-effect/issues/3692#issuecomment-1696481951)
Scala driver is 13-18% slower
The io_uring polling system gave Ember a significant speedup, would be cool if it makes a difference here.
This feels like it has essentially quiesced, so I'm going to close it. Please feel free to poke if you feel it should be reopened.
This issue was originally created in fs2, so version numbers below refer to fs2.
I've observed a performance degradation in some scenarios in version 3.2.3.
The throughput on small byte streams remained the same. The throughput on bigger byte streams decreased by roughly ~20%.
The memory allocation decreased by 10-15% on bigger byte streams.
See the benchmarks below for more details.
Stream usage
The project utilizes a TCP socket from the fs2-io module. I cannot share many details due to NDA, but the generalized usage of the socket is roughly the following:
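The actual snippet isn't included here; as a rough, hypothetical illustration of this kind of setup (Response, responseCodec, and the connection handling are placeholders, and the scodec-stream import path depends on the version in use):

```scala
import cats.effect.IO
import com.comcast.ip4s.{Host, SocketAddress}
import fs2.Stream
import fs2.io.net.Network
import scodec.Codec
import scodec.stream.StreamDecoder

object Client {
  // Hypothetical wire type and codec.
  final case class Response(payload: Vector[Byte])

  def responses(
      address: SocketAddress[Host],
      responseCodec: Codec[Response]
  ): Stream[IO, Response] = {
    val decoder: StreamDecoder[Response] = StreamDecoder.many(responseCodec)
    Stream
      .resource(Network[IO].client(address))                           // connect the TCP socket
      .flatMap(socket => socket.reads.through(decoder.toPipeByte[IO])) // decode the byte stream
  }
}
```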
Consumed bytes per single invocation:
3.2.2
Operation average time:
Memory allocation:
3.2.3
Operation average time:
Memory allocation: