typelevel / cats-effect

The pure asynchronous runtime for Scala
https://typelevel.org/cats-effect/
Apache License 2.0

[3.3.0] Performance regression #2634

Closed by iRevive 10 months ago

iRevive commented 2 years ago

This issue was originally created in fs2, so version numbers below refer to fs2.

I've observed a performance degradation in some scenarios on version 3.2.3.

The throughput on small byte streams remained the same, while the throughput on bigger byte streams decreased by roughly 20%.
The memory allocation rate decreased by 10-15% on bigger byte streams.

See the benchmarks below for more details.

Stream usage

The project utilizes a TCP socket from the fs2-io module. I cannot share many details due to NDA, but the generalized usage of the socket looks like this:

// Decode the raw byte stream into domain structures
val streamDecoder: StreamDecoder[Structure] =
  StreamDecoder.many(StructureDecoder)

socket.reads                                           // Stream[F, Byte] from the TCP socket
  .through(streamDecoder.toPipeByte)                   // bytes -> Structure values
  .evalMap(structure => queue.offer(Right(structure))) // hand each structure to a queue
  .compile
  .background                                          // run as a background fiber

Consumed bytes per single invocation:

3.2.2

Operation average time:

Benchmark                           Mode  Cnt   Score        Error   Units
DriverBenchmark.createOne           avgt   25   4.725 ±      0.396   ms/op
DriverBenchmark.return100Records    avgt   25   8.613 ±      0.488   ms/op
DriverBenchmark.returnRandomUUID    avgt   25   3.542 ±      1.156   ms/op

Memory allocation:

DriverBenchmark.return100Records:·gc.alloc.rate                        201.323 ±     11.576  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm               3117144.635 ±  13036.746    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space               199.083 ±     23.093  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm      3076695.652 ± 283692.848    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen                    0.074 ±      0.113  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm            1266.657 ±   1981.461    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space             0.484 ±      0.444  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm     7614.493 ±   6998.312    B/op
DriverBenchmark.return100Records:·gc.count                              72.000               counts
DriverBenchmark.return100Records:·gc.time                              300.000                   ms

3.2.3

Operation average time:

Benchmark                           Mode  Cnt   Score        Error   Units
DriverBenchmark.createOne           avgt   25   4.862 ±      0.414   ms/op
DriverBenchmark.return100Records    avgt   25  11.008 ±      0.356   ms/op
DriverBenchmark.returnRandomUUID    avgt   25   3.068 ±      0.299   ms/op

Memory allocation:

DriverBenchmark.return100Records:·gc.alloc.rate                        169.961 ±      6.376  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm               3383113.516 ±  12432.017    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space               175.288 ±     24.073  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm      3484911.125 ± 432781.648    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen                    0.061 ±      0.075  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm            1224.015 ±   1529.533    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space             0.484 ±      0.566  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm     9489.291 ±  11071.370    B/op
DriverBenchmark.return100Records:·gc.count                              56.000               counts
DriverBenchmark.return100Records:·gc.time                              433.000                   ms
armanbilge commented 2 years ago

I wonder if this is related to https://github.com/typelevel/fs2/pull/2734. Would you mind trying with these versions? before: 3.2-111-594f8fe after: 3.2-124-832c161

iRevive commented 2 years ago

@armanbilge I was thinking about this change as well.
Sure, I will rerun benchmarks against these versions.

iRevive commented 2 years ago

@armanbilge not much of a difference 🤔

3.2-111-594f8fe

Benchmark                         Mode  Cnt        Score        Error   Units
DriverBenchmark.createOne         avgt   25        4.753 ±      0.383   ms/op
DriverBenchmark.return100Records  avgt   25       10.699 ±      0.411   ms/op
DriverBenchmark.returnRandomUUID  avgt   25        3.078 ±      0.283   ms/op

3.2-124-832c161

Benchmark                         Mode  Cnt        Score        Error   Units
DriverBenchmark.createOne         avgt   25        4.794 ±      0.494   ms/op
DriverBenchmark.return100Records  avgt   25       10.758 ±      0.408   ms/op
DriverBenchmark.returnRandomUUID  avgt   25        3.065 ±      0.281   ms/op

3.2.3

Benchmark                         Mode  Cnt        Score        Error   Units
DriverBenchmark.createOne         avgt   25        4.838 ±      0.420   ms/op
DriverBenchmark.return100Records  avgt   25       11.064 ±      0.435   ms/op
DriverBenchmark.returnRandomUUID  avgt   25        3.119 ±      0.326   ms/op
armanbilge commented 2 years ago

If you have bandwidth to bisect further, you can use the --first-parent option to get only merge commits and use the hash to look up the version in maven: https://repo1.maven.org/maven2/co/fs2/fs2-io_2.13/

iRevive commented 2 years ago

@armanbilge I will experiment with different versions during the day

armanbilge commented 2 years ago

Thanks!

mpilquist commented 2 years ago

Nothing really jumps out when looking at the diff between v3.2.2 and 594f8fe. Most likely culprits:

iRevive commented 2 years ago

Indeed, my project depends on cats-effect-std directly, so the cats-effect runtime version comes in transitively via fs2.

fs2 3.2.2 with cats-effect 3.3.0 shows the same throughput as 3.2.3.

Thank you for your help and sorry for wasting your time.

mpilquist commented 2 years ago

Not a waste of time at all, thanks for reporting! It would be good to understand what's causing the slowdown in cats-effect. My guess is that this is the overhead of tracing, but that's a total guess.

vasilmkd commented 2 years ago

Cats Effect 3.0.0 did not have tracing. It came with 3.2.0. The latest fs2 3.2.3 should be on CE 3.3.0.

mpilquist commented 2 years ago

Can you try the benchmark with the system property -Dcats.effect.tracing.mode=none? More details here: https://typelevel.org/cats-effect/docs/tracing#configuration
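For reference, one way to wire that system property into a forked benchmark run is via build.sbt; the sketch below assumes the sbt-jmh plugin's Jmh scope, which may differ from the benchmark setup actually used here:

```scala
// build.sbt (sketch): pass the tracing flag to the forked benchmark JVM.
// The Jmh scope is an sbt-jmh assumption; adjust to your own setup.
Jmh / fork := true
Jmh / javaOptions += "-Dcats.effect.tracing.mode=none"
```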

iRevive commented 2 years ago

I'm running benchmarks on a different machine so numbers differ from yesterday.

fs2 3.2.2

cats-effect 3.2.9 without tracing

Benchmark                                                             Mode  Cnt        Score        Error   Units
DriverBenchmark.return100Records                                      avgt   25        7.010 ±      0.337   ms/op
DriverBenchmark.return100Records:·gc.alloc.rate                       avgt   25      246.768 ±     12.250  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm                  avgt   25  3068430.610 ±  10566.046    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space              avgt   25      245.038 ±     18.895  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm         avgt   25  3051556.639 ± 232239.895    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen                 avgt   25        0.052 ±      0.101  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm            avgt   25      705.673 ±   1401.404    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space          avgt   25        0.419 ±      0.463  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm     avgt   25     5707.785 ±   6392.072    B/op
DriverBenchmark.return100Records:·gc.count                            avgt   25       80.000               counts
DriverBenchmark.return100Records:·gc.time                             avgt   25      275.000                   ms
DriverBenchmark.return100Records:·stack                               avgt               NaN                  ---

cats-effect 3.2.9 with tracing

Benchmark                                                               Mode  Cnt        Score        Error   Units
DriverBenchmark.return100Records                                        avgt   25        7.986 ±      0.315   ms/op
DriverBenchmark.return100Records:·gc.alloc.rate                         avgt   25      217.543 ±      9.826  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm                    avgt   25  3097923.212 ±   9261.729    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space                avgt   25      215.997 ±     25.889  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm           avgt   25  3074210.574 ± 334508.436    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen                   avgt   25        0.031 ±      0.037  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm              avgt   25      454.083 ±    576.467    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space            avgt   25        0.467 ±      0.715  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm       avgt   25     6855.137 ±  10494.727    B/op
DriverBenchmark.return100Records:·gc.count                              avgt   25       77.000               counts
DriverBenchmark.return100Records:·gc.time                               avgt   25      282.000                   ms

cats-effect 3.3.0 without tracing

Benchmark                                                              Mode  Cnt        Score        Error   Units
DriverBenchmark.return100Records                                       avgt   25        7.792 ±      0.323   ms/op
DriverBenchmark.return100Records:·gc.alloc.rate                        avgt   25      220.893 ±      9.084  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm                   avgt   25  3065325.686 ±  21557.399    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space               avgt   25      221.037 ±     22.768  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm          avgt   25  3065786.971 ± 270556.884    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen                  avgt   25        0.063 ±      0.116  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm             avgt   25      943.848 ±   1749.708    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space           avgt   25        0.421 ±      0.573  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm      avgt   25     6075.865 ±   8203.016    B/op
DriverBenchmark.return100Records:·gc.count                             avgt   25       80.000               counts
DriverBenchmark.return100Records:·gc.time                              avgt   25      281.000                   ms

cats-effect 3.3.0 with tracing

Benchmark                                                              Mode  Cnt        Score        Error   Units
DriverBenchmark.return100Records                                       avgt   25       10.010 ±      0.504   ms/op
DriverBenchmark.return100Records:·gc.alloc.rate                        avgt   25      178.274 ±      9.522  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm                   avgt   25  3185210.074 ±  13694.388    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space               avgt   25      176.030 ±     26.087  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm          avgt   25  3143461.596 ± 418077.571    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen                  avgt   25        0.051 ±      0.084  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm             avgt   25      910.347 ±   1460.952    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space           avgt   25        0.488 ±      0.541  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm      avgt   25     8868.587 ±   9979.185    B/op
DriverBenchmark.return100Records:·gc.count                             avgt   25       58.000               counts
DriverBenchmark.return100Records:·gc.time                              avgt   25      408.000                   ms
DriverBenchmark.return100Records:·stack                                avgt               NaN                  ---

fs2 3.2.3

cats-effect 3.2.9 without tracing

Benchmark                                                             Mode  Cnt        Score        Error   Units
DriverBenchmark.return100Records                                      avgt   25        7.325 ±      0.355   ms/op
DriverBenchmark.return100Records:·gc.alloc.rate                       avgt   25      249.123 ±     12.370  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm                  avgt   25  3244986.275 ±  11358.509    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space              avgt   25      251.110 ±     25.120  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm         avgt   25  3270291.158 ± 288189.174    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen                 avgt   25        0.036 ±      0.072  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm            avgt   25      513.287 ±   1066.717    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space          avgt   25        0.371 ±      0.447  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm     avgt   25     5275.698 ±   6492.410    B/op
DriverBenchmark.return100Records:·gc.count                            avgt   25       87.000               counts
DriverBenchmark.return100Records:·gc.time                             avgt   25      255.000                   ms

cats-effect 3.2.9 with tracing

Benchmark                                                              Mode  Cnt        Score        Error   Units
DriverBenchmark.return100Records                                       avgt   25        8.001 ±      0.343   ms/op
DriverBenchmark.return100Records:·gc.alloc.rate                        avgt   25      217.396 ±      9.389  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm                   avgt   25  3197155.595 ±  16900.517    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space               avgt   25      218.988 ±     24.045  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm          avgt   25  3362897.380 ± 395288.166    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen                  avgt   25        0.027 ±      0.035  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm             avgt   25      437.723 ±    581.754    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space           avgt   25        0.467 ±      0.715  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm      avgt   25     7273.963 ±  11126.741    B/op
DriverBenchmark.return100Records:·gc.count                             avgt   25       77.000               counts
DriverBenchmark.return100Records:·gc.time                              avgt   25      299.000                   ms

cats-effect 3.3.0 without tracing

Benchmark                                                              Mode  Cnt        Score        Error   Units
DriverBenchmark.return100Records                                       avgt   25        7.662 ±      0.310   ms/op
DriverBenchmark.return100Records:·gc.alloc.rate                        avgt   25      237.331 ±     10.179  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm                   avgt   25  3219401.063 ±  15108.885    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space               avgt   25      241.757 ±     23.280  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm          avgt   25  3283962.366 ± 312667.926    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen                  avgt   25        0.024 ±      0.030  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm             avgt   25      346.730 ±    441.142    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space           avgt   25        0.350 ±      0.505  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm      avgt   25     5005.121 ±   7250.275    B/op
DriverBenchmark.return100Records:·gc.count                             avgt   25       80.000               counts
DriverBenchmark.return100Records:·gc.time                              avgt   25      284.000                   ms

cats-effect 3.3.0 with tracing

Benchmark                                                              Mode  Cnt        Score        Error   Units
DriverBenchmark.return100Records                                       avgt   25       10.424 ±      0.391   ms/op
DriverBenchmark.return100Records:·gc.alloc.rate                        avgt   25      182.919 ±      6.906  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm                   avgt   25  3410284.489 ±  39986.912    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space               avgt   25      182.465 ±     26.496  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm          avgt   25  3402766.732 ± 477752.507    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen                  avgt   25        0.143 ±      0.145  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm             avgt   25     2694.292 ±   2722.628    B/op
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space           avgt   25        0.768 ±      0.501  MB/sec
DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm      avgt   25    14438.726 ±   9527.695    B/op
DriverBenchmark.return100Records:·gc.count                             avgt   25       61.000               counts
DriverBenchmark.return100Records:·gc.time                              avgt   25      409.000                   ms
armanbilge commented 2 years ago

Thanks for the update!

fs2 3.2.2 with cats-effect 3.3.0 shows the same throughput as 3.2.3.

Can you repeat this experiment with tracing disabled?

Sorry, actually no. I think you want to compare fs2 3.2.2 with CE 3.2.9 and CE 3.3.0, both with tracing disabled.

iRevive commented 2 years ago

@armanbilge I've updated the comment above

armanbilge commented 2 years ago

@iRevive Thanks! For the fs2 3.2.2 / CE 3.2.9, I assume that was with tracing? Can you run those with tracing disabled too? (same option)

iRevive commented 2 years ago

@armanbilge yeah, the difference is caused by tracing.

I've updated my comment with extra benchmarks. Seems I covered all possible combinations.

armanbilge commented 2 years ago

Thanks for all your efforts, and sorry for the confusion. I think these are the 4 combinations we need:

You've done one of the first two, but I'm not sure which (I assume with tracing enabled, because that's the default).

Sorry to be a nuisance. Running CE 3.2.9 with tracing disabled helps check whether the performance change was caused by changes in tracing or by changes in the runtime itself.

iRevive commented 2 years ago

@armanbilge done, see the updated comment https://github.com/typelevel/fs2/issues/2746#issuecomment-988800286

armanbilge commented 2 years ago

Awesome, thank you! 😁 let's have a look..

djspiewak commented 2 years ago

Looking at the time delta, this seems about in line with what I would expect for tracing on code which is mostly compute-bound: about a 25% difference. Fully compute-bound would be about 30%, more than likely.

The really interesting thing here is the GC time, though. The tracing benchmark hit fewer GC iterations but still took almost twice as long. I wonder if that means that we could optimize tracing a bit further in practice by streamlining GC costs?

Edit: actually that shift only happened in 3.3.0, so it's almost certainly being caused by the weak bag shenanigans. We're forcing the GC to work harder in order to avoid bogging down the critical path.

nikiforo commented 2 years ago

(legend: "-" = tracing disabled, "+" = tracing enabled; the first four columns are fs2 3.2.2, the last four are fs2 3.2.3)

Benchmark: return100Records          3.2.9-        3.2.9+        3.3.0-        3.3.0+        3.2.9-        3.2.9+        3.3.0-        3.3.0+
avgt (ms/op)                          7.010         7.986         7.792        10.010         7.325         8.001         7.662        10.424
gc.alloc.rate                       246.768       217.543       220.893       178.274       249.123       217.396       237.331       182.919
gc.alloc.rate.norm              3068430.610   3097923.212   3065325.686   3185210.074   3244986.275   3197155.595   3219401.063   3410284.489
gc.churn.G1_Eden_Space              245.038       215.997       221.037       176.030       251.110       218.988       241.757       182.465
gc.churn.G1_Eden_Space.norm     3051556.639   3074210.574   3065786.971   3143461.596   3270291.158   3362897.380   3283962.366   3402766.732
gc.churn.G1_Old_Gen                   0.052         0.031         0.063         0.051         0.036         0.027         0.024         0.143
gc.churn.G1_Old_Gen.norm            705.673       454.083       943.848       910.347       513.287       437.723       346.730      2694.292
gc.churn.G1_Survivor_Space            0.419         0.467         0.421         0.488         0.371         0.467         0.350         0.768
gc.churn.G1_Survivor_Space.norm    5707.785      6855.137      6075.865      8868.587      5275.698      7273.963      5005.121     14438.726
gc.count                             80.000        77.000        80.000        58.000        87.000        77.000        80.000        61.000
gc.time                             275.000       282.000       281.000       408.000       255.000       299.000       284.000       409.000
djspiewak commented 2 years ago

Thanks to the tabular format, it's easier to see that I'm probably wrong on the compute-bound bit. Notice how enabling or disabling tracing makes very little difference on 3.2.9, but it matters a lot on 3.3. That suggests the fiber dump mechanism is primarily at fault. Absolutely fascinating.

Can you open an issue on CE for this? I think this is a meaningful performance regression.

armanbilge commented 2 years ago

We can transfer this one actually.

vasilmkd commented 2 years ago

https://github.com/typelevel/cats-effect/pull/2368#issuecomment-925306033

rpiaggio commented 2 years ago

We seem to be hitting this issue particularly hard in Scala.js (dev environment), with a performance degradation of a whole order of magnitude. For example, an iteration over ~100 IO calls which took between 1 and 2 seconds in 3.2.9 now takes around 30 seconds in 3.3.0.

So far, I've been unable to avoid the issue by setting process.env.CATS_EFFECT_TRACING_MODE="none". Even though it's set when the application starts (before invoking any Scala.js code), the performance degradation persists.

Any pointers for further exploring solutions would be welcome.

armanbilge commented 2 years ago

@rpiaggio thanks for reporting and sorry for your troubles. Yes, unfortunately tracing does have a far more severe impact on JS than JVM for reasons we are still trying to understand. This is why it is only enabled in development mode. Which browser are you using?

However, setting CATS_EFFECT_TRACING_MODE as you've done should for the most part disable it; I'm not sure why that's not working for you. I'm assuming this is related to the explore project linked above, so I'll take a look there.

See also:

armanbilge commented 2 years ago

@rpiaggio one idea, to check that CATS_EFFECT_TRACING_MODE is working—can you raise an exception from inside CE and report the stacktrace? If it's enhanced then that means tracing is still enabled.
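A minimal sketch of such a probe (the object name is illustrative; assumes cats-effect 3's IOApp): raise an exception inside IO and print its stack trace; if the trace contains enhanced (tracing) frames, tracing is still enabled.

```scala
import cats.effect.{IO, IOApp}

// Hypothetical tracing probe: inspect the printed stack trace.
// Enhanced frames in the output mean cats-effect tracing is active.
object TracingProbe extends IOApp.Simple {
  val run: IO[Unit] =
    IO.raiseError(new Exception("tracing probe"))
      .handleErrorWith(e => IO(e.printStackTrace()))
}
```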

rpiaggio commented 2 years ago

@armanbilge Thank you for the links.

Which browser are you using?

Chrome Version 96.0.4664.93 (Official Build) (x86_64)

I'll take a look there.

Thank you!

rpiaggio commented 2 years ago

@rpiaggio one idea, to check that CATS_EFFECT_TRACING_MODE is working—can you raise an exception from inside CE and report the stacktrace? If it's enhanced then that means tracing is still enabled.

Good idea. I tried your suggestion and I'm actually not getting enhanced stacktraces.

This is what I get when setting CATS_EFFECT_TRACING_MODE: "none": (screenshot of the resulting stack trace)

vs what I get when not setting CATS_EFFECT_TRACING_MODE: (screenshot)

armanbilge commented 2 years ago

Aha, that is helpful. So now that we've confirmed tracing is off, you are definitely still seeing a significant performance regression? That is concerning!

vasilmkd commented 2 years ago

The FiberAwareEC is still being constructed in this case because presumably Chrome supports weak maps? If yes, then we have a lot of wrapping and unwrapping of Runnables and it needs to be measured.

I would be really surprised if that's a 15x regression though, that doesn't quite seem right.

armanbilge commented 2 years ago

Ah, @vasilmkd is right, my bad 😅 we need one more conditional there to check if tracing is disabled. Thanks, this is an easy fix!

djspiewak commented 2 years ago

So I think that we're probably putting just a bit more strain than we should be on the GC. This doesn't show as problematic in our benchmarks since we don't stress the heap too much, but in scenarios like the one in OP it becomes more problematic. A 20% hit for a feature which is intended to be lossless in terms of performance is a bit too much, IMO. At the very least it would merit allowing some more granular configurability, since clearly the scenario here is not sensitive to the performance hit of tracing itself, but is sensitive to the weak tracking.

I think a good action item here would be to try to replicate something like this in our own performance tests. If we can create a benchmark which reproduces the problem, we can try to iterate on the implementation more effectively. My guess is that we may have to back away from the weak thread-local striping, but maybe there's a way we can be creative to keep that optimization intact.

vasilmkd commented 2 years ago

We can always replace the WeakHashMap with your original idea, I don't think the thread locality is the problem.

WeakHashMap allocates on every use and does a bit of management on every access (to remove expired keys), neither of which is something we ideally want. Another alternative is to develop our own weak hash map with features more desirable for Cats Effect, such as fewer allocations and faster access.

djspiewak commented 2 years ago

@vasilmkd This would need some measurement, but if I correctly understand the way the GC works, WeakReference itself is more the problem. I actually wonder if we can batch the weak references to avoid having one per fiber.

I mostly brought up the thread locality because, if we don't leverage the GC to perform the cleanup, we have to find some other way to solve the problem of an IOCont resuming on a different thread than the one which suspended it.

vasilmkd commented 2 years ago

I don't see how WeakReferences can be avoided and/or not be 1:1 with what you're trying to hold in a data structure. If you don't use a WeakReference and you still hold a reference to an object, then that reference is strong and the object is not eligible for collection.
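To illustrate the point (a stdlib-only sketch, not Cats Effect's actual data structure): an object held only through a java.lang.ref.WeakReference becomes eligible for collection, whereas any surviving strong reference keeps it alive.

```scala
import java.lang.ref.WeakReference

// Stdlib-only illustration: weak vs strong reachability.
object WeakRefDemo {
  final class Payload(val id: Int)

  def main(args: Array[String]): Unit = {
    var strong: Payload = new Payload(42)
    val weak = new WeakReference(strong)

    // While the strong reference exists, the referent is reachable.
    assert(weak.get() != null)

    // Drop the strong reference: the payload is now only weakly reachable
    // and becomes eligible for collection (actual timing is GC-dependent).
    strong = null
    System.gc()
    val state = Option(weak.get()).fold("collected")(p => s"alive(${p.id})")
    println(s"after gc: $state")
  }
}
```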

vasilmkd commented 2 years ago

@iRevive If you could try the following snapshot, that would be amazing. Thank you.

"io.vasilev" %% "cats-effect" % "3.3-87-1d1f0ff"
iRevive commented 2 years ago

That's what I got today (- = without tracing, + = with tracing):

                             3.3-87-1d1f0ff -     3.3-87-1d1f0ff +             3.3.0 -             3.3.0 +
throughput/avgt                         9.096               10.316               9.083              10.790      ms/op
gc.alloc.rate                         198.949              180.254             198.430             174.146     MB/sec
gc.alloc.rate.norm                3250096.228          3364384.042         3246550.722         3390473.917       B/op
gc.churn.G1_Eden_Space                198.509              183.132             203.094             177.976     MB/sec
gc.churn.G1_Eden_Space.norm       3250850.493          3415242.285         3321291.087         3474275.752       B/op
gc.churn.G1_Old_Gen                     0.152                0.074               0.121               0.079     MB/sec
gc.churn.G1_Old_Gen.norm             2612.132             1460.794            2061.829            1577.063       B/op
gc.churn.G1_Survivor_Space              0.599                0.620               0.486               0.555     MB/sec
gc.churn.G1_Survivor_Space.norm     10117.412            11649.115            7970.496           10718.606       B/op
gc.count                               68.000               66.000              67.000              62.000     counts
gc.time                               325.000              364.000             343.000             448.000         ms
vasilmkd commented 2 years ago

What about the throughput results? The GC time is lower with the fix.

iRevive commented 2 years ago

The 3.3-87-1d1f0ff is roughly 5% faster than 3.3.0.

vasilmkd commented 2 years ago

How does that compare to 3.2.9? Is that still not sufficient?

iRevive commented 2 years ago

I did several measurements yesterday. I've increased the measurement time to decrease the margin of error for the score.

Legend: - means without tracing, + means with tracing:

                                 3.2.9 -           3.2.9 +           3.3.0 -            3.3.0 +     3.3-87-1d1f0ff -     3.3-87-1d1f0ff +
op average time (ms)               6.738             7.914             7.272             9.387                  7.220                9.191

gc.alloc.rate                    415.267           359.753           383.671           309.012                384.822              312.819
gc.alloc.rate.norm           3133369.005       3186737.717       3120760.158       3240605.996            3108259.130          3218716.403
gc.churn.G1_Eden_Space           419.880           363.964           387.659           313.066                388.866              315.969
gc.churn.G1_Eden_Space.norm  3168091.608       3224107.625       3153284.906       3283367.318            3140929.511          3251105.574
gc.churn.G1_Old_Gen                0.010             0.006             0.009             0.051                  0.012                0.009
gc.churn.G1_Old_Gen.norm          74.127            57.027            76.633           542.592                 95.361               89.211
gc.count                         779.000           675.000           668.000           486.000                721.000              586.000
gc.time                          648.000           575.000           578.000          1844.000                609.000              837.000

The GC time is lower with the fix, and 3.3-87-1d1f0ff is slightly faster (~2-3%) than 3.3.0 when tracing is enabled.

From what I see, enabling tracing reduces the throughput of the library by ~15% on 3.2.9 and by ~22% on 3.3-87-1d1f0ff.

If we compare 3.2.9 with 3.3-87-1d1f0ff we have the following:

djspiewak commented 2 years ago

The newly-released 3.3.2 includes the fixes tested here. One thing I would really like to do is figure out why the tracing-disabled mode has more overhead in this release than in previous ones. We'll leave this issue open to track that problem.

Regarding the increase in overhead with tracing enabled, I think we can definitely continue to iterate on that case, but I expect that the scenario described in the OP is always going to be sub-optimal. From what I can tell, @iRevive is describing an application which evalMaps over a Stream[F, Byte], or rather something relatively close to it. The problem with these types of workloads is they innately produce an enormous amount of heap pressure, meaning they end up entirely GC-bound, which is why the weak bag overhead starts to dominate.

Cats Effect is fundamentally optimized around the following assumption: the number of IO instances constructed is O(kn), where n is the number of network/file I/O operations, and k is some (probably large, possibly amortized) constant factor. Running evalMap on a Stream[F, Byte] violates this assumption because evalMap must produce at least one (and actually many more than one) F instance per Byte, and thus k is proportional to the size of the stream and not itself a constant. On some level, Cats Effect (and indeed, all JVM frameworks, effect-system or otherwise) will always do quite poorly in these types of workloads.

Fwiw, the way to avoid this and recover the constant-factor k is to use .chunks (or similar) on the Stream and only allocate IOs around the chunk boundaries, rather than within them. Maintaining this pattern throughout your application cleanly separates your hot path (which will contain no IO allocations) and should dramatically improve performance with or without tracing.
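The distinction can be sketched as follows; `sink` and `sinkChunk` are hypothetical effects standing in for whatever the application does per element, not part of the OP's code:

```scala
import cats.effect.IO
import fs2.{Chunk, Stream}

// Anti-pattern: evalMap allocates at least one IO per Byte, so the
// constant factor k grows with the size of the stream.
def perByte(bytes: Stream[IO, Byte], sink: Byte => IO[Unit]): Stream[IO, Unit] =
  bytes.evalMap(sink)

// Chunk-boundary pattern: pure work happens inside the chunk; only one
// IO is allocated per chunk, restoring a constant k.
def perChunk(bytes: Stream[IO, Byte], sinkChunk: Chunk[Byte] => IO[Unit]): Stream[IO, Unit] =
  bytes.chunks.evalMap(sinkChunk)
```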

Anyway, I'm making a few assumptions about the OP case, but I think I'm at least close to the mark. :-) The point I'm making is that we probably can't make OP's scenario fast, just because the JVM is… the JVM. What we can do though is fix the tracing-disabled performance regression (which is still open), and ensure that tracing-enabled performance overhead is kept to a minimum (which is what was done in 3.3.2).

iRevive commented 2 years ago

@djspiewak thank you for the details.

According to the StreamDecoder L56, socket.reads.through(streamDecoder.toPipeByte) should operate on a chunk rather than a single byte.

On the other hand, StreamDecoder re-evaluates the decoding step when the currently buffered bytes are not enough to parse a packet. I assume that happens quite often, especially in the scenario I benchmarked.

I will experiment with the given suggestion:

the way to avoid this and recover the constant-factor k is to use .chunks (or similar) on the Stream and only allocate IOs around the chunk boundaries, rather than within them

Perhaps a custom accumulating stage (before decoding) will do the trick.

djspiewak commented 2 years ago

Perhaps a custom accumulating stage (before decoding) will do the trick.

I think that's a good idea! This should also result in less backtracking in scodec. It's tricky though because you have to try to predict the "right" chunk size. Hard to give good advice here without seeing your problem space, but accumulating into larger chunks is definitely what I would try, particularly if it can be done on the sending side rather than on the receiving side.
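One way to sketch such an accumulating stage is with fs2's built-in chunkMin, which buffers upstream chunks until a minimum size is reached; the 16 KiB threshold below is an arbitrary assumption, since (as noted above) the "right" size is workload-dependent:

```scala
import cats.effect.IO
import fs2.Stream

// Accumulate small socket reads into chunks of at least 16 KiB before
// handing the bytes to the decoder, so scodec backtracks less often.
def accumulate(bytes: Stream[IO, Byte]): Stream[IO, Byte] =
  bytes
    .chunkMin(16 * 1024) // Stream[IO, Chunk[Byte]]: emit only once >= 16 KiB is buffered
    .unchunks            // flatten back to Stream[IO, Byte], now with larger chunks

// usage (sketch): socket.reads.through(accumulate).through(streamDecoder.toPipeByte)
```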

He-Pin commented 1 year ago

@iRevive Have you tested the most recent release?

iRevive commented 1 year ago

@He-Pin not really, to be honest.

I guess I can run some benchmarks this weekend to see the difference.

iRevive commented 1 year ago

Quite a lot of things have changed: the internal implementation and optimizations, the testing platform (Apple Intel -> Apple M1), and the JVM (11 -> 17). But the pattern is still there.

Testing platform

VM version: JDK 17.0.3, OpenJDK 64-Bit Server VM, 17.0.3+7-LTS
Machine: Apple M1 Pro, 32 GB RAM
OS: macOS 13.5.1 (Ventura)

Benchmarks

CE 3.2.9, FS2 3.2.2, tracing

Benchmark                                             Mode  Cnt        Score       Error   Units
DriverBenchmark.return100Records                      avgt   25        3.108 ±     0.069   ms/op
DriverBenchmark.return100Records:·gc.alloc.rate       avgt   25      852.355 ±    21.231  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm  avgt   25  3141439.919 ± 16541.680    B/op
DriverBenchmark.return100Records:·gc.count            avgt   25      107.000              counts
DriverBenchmark.return100Records:·gc.time             avgt   25       66.000                  ms

CE 3.2.9, FS2 3.2.2, no tracing

Benchmark                                             Mode  Cnt        Score       Error   Units
DriverBenchmark.return100Records                      avgt   25        3.041 ±     0.139   ms/op
DriverBenchmark.return100Records:·gc.alloc.rate       avgt   25      869.532 ±    43.311  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm  avgt   25  3135138.349 ± 17124.071    B/op
DriverBenchmark.return100Records:·gc.count            avgt   25      100.000              counts
DriverBenchmark.return100Records:·gc.time             avgt   25       67.000                  ms

CE 3.3.0, FS2 3.2.2, tracing

Benchmark                                             Mode  Cnt        Score       Error   Units
DriverBenchmark.return100Records                      avgt   25        3.895 ±     0.334   ms/op
DriverBenchmark.return100Records:·gc.alloc.rate       avgt   25      689.632 ±    61.692  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm  avgt   25  3196648.931 ± 16613.670    B/op
DriverBenchmark.return100Records:·gc.count            avgt   25       54.000              counts
DriverBenchmark.return100Records:·gc.time             avgt   25      347.000                  ms

CE 3.3.0, FS2 3.2.2, no tracing

Benchmark                                             Mode  Cnt        Score       Error   Units
DriverBenchmark.return100Records                      avgt   25        3.062 ±     0.099   ms/op
DriverBenchmark.return100Records:·gc.alloc.rate       avgt   25      853.968 ±    29.860  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm  avgt   25  3104378.022 ± 12574.350    B/op
DriverBenchmark.return100Records:·gc.count            avgt   25      101.000              counts
DriverBenchmark.return100Records:·gc.time             avgt   25       67.000                  ms

CE 3.5.1, FS2 3.9.1, tracing

Benchmark                                             Mode  Cnt        Score       Error   Units
DriverBenchmark.return100Records                      avgt   25        3.377 ±     0.077   ms/op
DriverBenchmark.return100Records:·gc.alloc.rate       avgt   25      766.447 ±    18.633  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm  avgt   25  3068305.840 ± 23767.470    B/op
DriverBenchmark.return100Records:·gc.count            avgt   25       98.000              counts
DriverBenchmark.return100Records:·gc.time             avgt   25       61.000                  ms

CE 3.5.1, FS2 3.9.1, no tracing

Benchmark                                             Mode  Cnt        Score       Error   Units
DriverBenchmark.return100Records                      avgt   25        3.198 ±     0.315   ms/op
DriverBenchmark.return100Records:·gc.alloc.rate       avgt   25      795.719 ±    70.716  MB/sec
DriverBenchmark.return100Records:·gc.alloc.rate.norm  avgt   25  2996199.095 ± 27658.703    B/op
DriverBenchmark.return100Records:·gc.count            avgt   25      104.000              counts
DriverBenchmark.return100Records:·gc.time             avgt   25       68.000                  ms

Conclusions

  • Tracing in 3.5.1 is less noticeable ~5.3%; vs ~15-21% in 3.3.0

Bonus point: comparison with Java driver

There is also an official driver, written in Java, that utilizes Netty under the hood. The numbers are the following:

Benchmark                                                 Mode  Cnt      Score       Error   Units
JavaDriverBenchmark.return100Records                      avgt   25      2.666 ±     0.532   ms/op
JavaDriverBenchmark.return100Records:·gc.alloc.rate       avgt   25     20.940 ±     7.373  MB/sec
JavaDriverBenchmark.return100Records:·gc.alloc.rate.norm  avgt   25  57620.287 ± 18067.576    B/op
JavaDriverBenchmark.return100Records:·gc.count            avgt   25        ≈ 0              counts

The Scala driver is 13-18% slower, but the allocation rate is on a completely different scale. There are several reasons for that: the Scala driver uses immutable structures, refined types, extra levels of abstraction, and the FS2 NIO socket layer, while Netty is doing Netty things :)

Also, a while ago, I experimented with a Netty socket layer instead of the FS2 NIO one (mostly code adapted from https://github.com/typelevel/fs2-netty), and the allocation rate dropped by 2-2.5x. The allocation rate was never an issue for the project (at least for my team), so I didn't take that experiment much further.

armanbilge commented 1 year ago

@iRevive thanks for that update!

  • Tracing in 3.5.1 is less noticeable ~5.3%; vs ~15-21% in 3.3.0

Interesting, I'm not sure what's significantly changed between those releases besides the integrated timers.

Along those lines ...

Bonus point

If you have any extra bandwidth, it might be interesting to try out the JDK polling system (https://github.com/typelevel/cats-effect/issues/3692#issue-1757703692) and also the io_uring polling system (https://github.com/typelevel/cats-effect/issues/3692#issuecomment-1696481951)

Scala driver is 13-18% slower

The io_uring polling system gave Ember a significant speedup, would be cool if it makes a difference here.

djspiewak commented 10 months ago

This feels like it has essentially quiesced, so I'm going to close it. Please feel free to poke if you feel it should be reopened.