typelevel / cats-effect

The pure asynchronous runtime for Scala
https://typelevel.org/cats-effect/
Apache License 2.0

Dropping queue high memory usage #4141

Open iRevive opened 4 days ago

iRevive commented 4 days ago

I've been benchmarking otel4s BatchSpanProcessor.scala and noticed a heavy memory consumption.

One of the suspects is the dropping queue. Switching to Queue.bounded dropped memory usage by 30-40%.
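The workaround above can be sketched as follows. This is a hedged, illustrative example (the element type and capacity are hypothetical, not taken from otel4s): a `Queue.bounded` combined with `tryOffer` preserves the dropping queue's discard-on-overflow semantics, since `tryOffer` returns `false` instead of blocking when the queue is full.

```scala
import cats.effect.IO
import cats.effect.std.Queue

// Instead of Queue.dropping, allocate a bounded queue (the cheaper,
// Async-based implementation) with the same capacity.
def makeSpanQueue[A](capacity: Int): IO[Queue[IO, A]] =
  Queue.bounded[IO, A](capacity)

// tryOffer never blocks: it returns false when the queue is full,
// which mirrors the dropping queue's behavior of discarding new elements.
def enqueueSpan[A](queue: Queue[IO, A], span: A): IO[Boolean] =
  queue.tryOffer(span)
```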

I added the dropping queue to the benchmarks and ran a few:

Benchmark                                          (size)   Mode  Cnt         Score     Error   Units
boundedAsyncEnqueueDequeueMany                      32768  thrpt   10       304.622 ±   3.166   ops/s
boundedAsyncEnqueueDequeueMany:gc.alloc.rate        32768  thrpt   10      3922.005 ±  40.728  MB/sec
boundedAsyncEnqueueDequeueMany:gc.alloc.rate.norm   32768  thrpt   10  13501842.848 ±   2.361    B/op
boundedAsyncEnqueueDequeueMany:gc.count             32768  thrpt   10      1077.000            counts
boundedAsyncEnqueueDequeueMany:gc.time              32768  thrpt   10       632.000                ms
boundedAsyncEnqueueDequeueOne                       32768  thrpt   10       307.547 ±   3.854   ops/s
boundedAsyncEnqueueDequeueOne:gc.alloc.rate         32768  thrpt   10      3729.077 ±  46.477  MB/sec
boundedAsyncEnqueueDequeueOne:gc.alloc.rate.norm    32768  thrpt   10  12715370.817 ±   2.316    B/op
boundedAsyncEnqueueDequeueOne:gc.count              32768  thrpt   10      1256.000            counts
boundedAsyncEnqueueDequeueOne:gc.time               32768  thrpt   10       709.000                ms
droppingEnqueueDequeueMany                          32768  thrpt   10       149.328 ±   6.145   ops/s
droppingEnqueueDequeueMany:gc.alloc.rate            32768  thrpt   10      6047.413 ± 248.586  MB/sec
droppingEnqueueDequeueMany:gc.alloc.rate.norm       32768  thrpt   10  42468733.794 ±   4.730    B/op
droppingEnqueueDequeueMany:gc.count                 32768  thrpt   10       897.000            counts
droppingEnqueueDequeueMany:gc.time                  32768  thrpt   10      1296.000                ms
droppingEnqueueDequeueOne                           32768  thrpt   10       175.001 ±   6.565   ops/s
droppingEnqueueDequeueOne:gc.alloc.rate             32768  thrpt   10      6824.599 ± 256.766  MB/sec
droppingEnqueueDequeueOne:gc.alloc.rate.norm        32768  thrpt   10  40895829.028 ±   4.461    B/op
droppingEnqueueDequeueOne:gc.count                  32768  thrpt   10      1567.000            counts
droppingEnqueueDequeueOne:gc.time                   32768  thrpt   10       884.000                ms

The dropping queue's normalized (per-operation) allocation is roughly 3x that of the bounded queue.

JFR data

Bounded queue:

[image: JFR allocation data]

Dropping queue:

[image: JFR allocation data]

Affected queues

The dropping queue is backed by AbstractQueue. The circular buffer and the bounded queue (its Concurrent-based variant) also use AbstractQueue, so they should exhibit the same heavy memory usage, too.

iRevive commented 4 days ago

Side note: If I'm not mistaken, all queues (unbounded, bounded, dropping, etc.) are multi-producer/multi-consumer. If so, is there any plan to provide other specialized variants, such as multi-producer/single-consumer?

armanbilge commented 4 days ago

> The dropping queue's normalized (per operation) allocation is 3x more than the bounded queue.

The bounded queue uses Async to wrap an underlying unsafe queue, whereas the dropping queue is entirely implemented with Concurrent via Ref and immutable data structures. So some disparity is expected ...
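The allocation gap can be illustrated with a simplified sketch (this is not the actual cats-effect implementation, just the general Concurrent/Ref pattern): because the whole state lives in a `Ref` over immutable data, every offer and take allocates a fresh copy of the state, plus the closures passed to `modify`/`update`.

```scala
import cats.effect.IO
import cats.effect.kernel.Ref
import scala.collection.immutable.{Queue => ScalaQueue}

// Simplified sketch of a Concurrent-based dropping queue: all state is an
// immutable ScalaQueue held in a Ref. Hypothetical class, for illustration.
final class DroppingSketch[A](capacity: Int, state: Ref[IO, ScalaQueue[A]]) {
  // Each call allocates a new immutable queue node (or silently drops the
  // element when full), which is where the per-operation allocation comes from.
  def offer(a: A): IO[Unit] =
    state.update(q => if (q.size < capacity) q.enqueue(a) else q)

  // Dequeueing likewise rebuilds the state through modify.
  def tryTake: IO[Option[A]] =
    state.modify { q =>
      q.dequeueOption match {
        case Some((a, rest)) => (rest, Some(a))
        case None            => (q, None)
      }
    }
}
```

An Async-based implementation can instead delegate to a mutable unsafe queue and avoid most of this per-operation churn.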

armanbilge commented 4 days ago

If so, is there any plan to provide other specialized variants, such as multi-producer/single-consumer

This is an FS2 Channel :)
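For reference, a minimal sketch of using `fs2.concurrent.Channel` as a multi-producer/single-consumer primitive (the capacity and values are illustrative): many fibers may call `send`, while `stream` is meant to be consumed by a single consumer.

```scala
import cats.effect.{IO, IOApp}
import cats.syntax.all._
import fs2.concurrent.Channel

object ChannelExample extends IOApp.Simple {
  val run: IO[Unit] =
    Channel.bounded[IO, Int](capacity = 16).flatMap { channel =>
      // Multiple producers may send concurrently.
      val producers = (1 to 4).toList.parTraverse_(i => channel.send(i).void)
      // A single consumer drains the channel as a stream.
      val consumer = channel.stream.evalMap(i => IO.println(i)).compile.drain
      // Close the channel once producers finish, so the consumer terminates.
      (producers *> channel.close.void).both(consumer).void
    }
}
```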

iRevive commented 4 days ago

> The dropping queue's normalized (per operation) allocation is 3x more than the bounded queue.

> The bounded queue uses Async to wrap an underlying unsafe queue, whereas the dropping queue is entirely implemented with Concurrent via Ref and immutable data structures. So some disparity is expected ...

So the alternative would be to implement an Async-based variant, right?