smol-rs / concurrent-queue

Concurrent multi-producer multi-consumer queue
Apache License 2.0
254 stars 22 forks source link

Performance regression on create/drop between 1.1.2 and 2.2.2 #4

Closed jjl closed 4 years ago

jjl commented 4 years ago

It's now about 2.5x what it used to be, from ~200ns to ~525.

ghost commented 4 years ago

Which kind of queue - unbounded(), bounded(1), or bounded(n) where n > 1?

jjl commented 4 years ago

Unbounded, sorry.

ghost commented 4 years ago

Ah yes that makes sense! It's a tradeoff - making bounded(n) and unbounded() slower was necessary to make bounded(1) faster. Still, I think construction could be made very cheap if we used once_cell to lazily initialize stuff.

Does the cost of queue construction matter in your use case?

jjl commented 4 years ago

I don't know that it's the most important thing, but here are before and afters for backplane, and you can see it's a significant proportion of the overall runtime because i've managed to slim backplane down so much.

test create_destroy              ... bench:         520 ns/iter (+/- 42)
test device_monitor_drop         ... bench:       1,612 ns/iter (+/- 490)
test device_monitor_drop_notify  ... bench:       1,390 ns/iter (+/- 261)
test device_monitor_error_notify ... bench:       1,613 ns/iter (+/- 43)
test device_peer_drop_notify     ... bench:       1,651 ns/iter (+/- 121)
test device_peer_error_notify    ... bench:       1,885 ns/iter (+/- 218)
test line_monitor_drop           ... bench:       1,782 ns/iter (+/- 238)
test line_monitor_drop_notify    ... bench:       1,764 ns/iter (+/- 407)
test line_monitor_error_notify   ... bench:       1,813 ns/iter (+/- 203)
test line_peer_drop_notify       ... bench:       1,717 ns/iter (+/- 370)
test line_peer_error_notify      ... bench:       2,052 ns/iter (+/- 40)

test create_destroy              ... bench:         197 ns/iter (+/- 25)
test device_monitor_drop         ... bench:         543 ns/iter (+/- 48)
test device_monitor_drop_notify  ... bench:         735 ns/iter (+/- 47)
test device_monitor_error_notify ... bench:         754 ns/iter (+/- 98)
test device_peer_drop_notify     ... bench:         896 ns/iter (+/- 114)
test device_peer_error_notify    ... bench:         932 ns/iter (+/- 74)
test line_monitor_drop           ... bench:         791 ns/iter (+/- 114)
test line_monitor_drop_notify    ... bench:         955 ns/iter (+/- 135)
test line_monitor_error_notify   ... bench:         971 ns/iter (+/- 154)
test line_peer_drop_notify       ... bench:       1,065 ns/iter (+/- 45)
test line_peer_error_notify      ... bench:       1,175 ns/iter (+/- 162)

Maybe we could add a new type for bounded(1) (and possibly bounded(0)?)to the internal enum? It feels a bit ick, but we should be able to get the best of both worlds (I'm assuming the toolchain is smart enough to optimise through it).

Note that in my use case here, i'm only actually using it as MPSC. I wonder if there's some performance I'm leaving on the table as a result of that. Thoughts? I've got the start of some non-mpsc channels on the backburner also

Oh, these benchmarks create two queues by the way, so you can halve the numbers :)

jjl commented 4 years ago

i had a look and i see you actually did that. so what was it that caused the regression?

jjl commented 4 years ago

(literally the only meaningful change i can see is a box, and i don't think one box is causing this)

ghost commented 4 years ago

I think it’s the box :) Will take a closer look a bit later...

jjl commented 4 years ago

confirmed, it's the box. good god, how slow must the apple system allocator be? :)

ghost commented 4 years ago

Indeed. I got my first MacBook recently and this definitely goes into the bucket of surprises. I guess we should just start benchmarking with jemalloc.

Given that glibc’s allocator was not amazing until recently but has become pretty good now, I have some hopes that apple’s allocator will eventually catch up as well...

ghost commented 4 years ago

I think we can close this since there's not much to do about this issue.