Closed maxall41 closed 1 year ago
0.3.2 is not worth caring about; it inherited a legacy of using ffmpeg to process audio and then contacting it via IPC.
10% CPU per stream is... remarkable. As a sanity check, you are compiling with --release
mode enabled? Playing WAV audio gets me CPU util closer to the below (Ryzen 5900X):
It seems like a lot of the time is spent on opus encoding (Im using a WAV file as a source).
This cannot be avoided unless you feed in an opus source -- libopus costs what it costs. Even so, wav is cheap:
Float Input (No Passthrough)/Single Packet/1
time: [143.37 µs 143.80 µs 144.33 µs]
change: [-2.3016% -0.4741% +1.4039%] (p = 0.61 > 0.05)
No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) high mild
3 (3.00%) high severe
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/1
time: [141.01 µs 141.39 µs 141.82 µs]
change: [-1.3875% +0.2363% +1.8364%] (p = 0.78 > 0.05)
No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) high mild
3 (3.00%) high severe
You could try running cargo make bench
on the repo and seeing whether your numbers agree?
Yeah I made sure to build it with --release
. Looking back at it seems like it's actually pretty much always around 3.3% per stream with peaks at 5.6% not 4 - 10 my bad. But that is still pretty high compared to Lavalink at about 1.1% per stream with 1.4% peaks on the same system.
Here is the full benchmark that i got on the EC2 Instance:
Stereo Target/Stereo Source/48000
time: [7.2219 µs 7.2924 µs 7.3579 µs]
Stereo Target/Mono Source/48000
time: [4.7554 µs 4.7973 µs 4.8365 µs]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
Stereo Target/Stereo Source (Resample)/44100
time: [69.795 µs 70.010 µs 70.232 µs]
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) high mild
3 (3.00%) high severe
Stereo Target/Mono Source (Resample)/44100
time: [56.926 µs 57.132 µs 57.358 µs]
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low mild
2 (2.00%) high mild
2 (2.00%) high severe
Running benches/mixing-task.rs (target/release/deps/mixing_task-0044417f66554dc1)
Benchmarking Float Input (No Passthrough)/Single Packet/1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.5s, enable flat sampling, or reduce sample count to 60.
Float Input (No Passthrough)/Single Packet/1
time: [281.13 µs 283.09 µs 285.74 µs]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
Benchmarking Float Input (No Passthrough)/Single Packet (No Soft-Clip)/1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/1
time: [278.90 µs 280.67 µs 282.95 µs]
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
Float Input (No Passthrough)/n=5 Packets/1
time: [1.3171 ms 1.3208 ms 1.3248 ms]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
Benchmarking Float Input (No Passthrough)/Single Packet/2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.8s, enable flat sampling, or reduce sample count to 60.
Float Input (No Passthrough)/Single Packet/2
time: [323.97 µs 324.99 µs 326.09 µs]
Found 9 outliers among 100 measurements (9.00%)
5 (5.00%) high mild
4 (4.00%) high severe
Benchmarking Float Input (No Passthrough)/Single Packet (No Soft-Clip)/2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.6s, enable flat sampling, or reduce sample count to 60.
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/2
time: [289.49 µs 291.00 µs 293.21 µs]
Found 8 outliers among 100 measurements (8.00%)
5 (5.00%) high mild
3 (3.00%) high severe
Float Input (No Passthrough)/n=5 Packets/2
time: [1.4097 ms 1.4131 ms 1.4168 ms]
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
Benchmarking Float Input (No Passthrough)/Single Packet/4: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.4s, enable flat sampling, or reduce sample count to 50.
Float Input (No Passthrough)/Single Packet/4
time: [336.73 µs 338.84 µs 341.63 µs]
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
Benchmarking Float Input (No Passthrough)/Single Packet (No Soft-Clip)/4: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.1s, enable flat sampling, or reduce sample count to 50.
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/4
time: [310.19 µs 311.66 µs 313.16 µs]
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild
Float Input (No Passthrough)/n=5 Packets/4
time: [1.4855 ms 1.4899 ms 1.4951 ms]
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
Benchmarking Float Input (No Passthrough)/Single Packet/8: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.4s, enable flat sampling, or reduce sample count to 50.
Float Input (No Passthrough)/Single Packet/8
time: [377.03 µs 378.22 µs 379.47 µs]
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low mild
4 (4.00%) high mild
Benchmarking Float Input (No Passthrough)/Single Packet (No Soft-Clip)/8: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/8
time: [345.77 µs 347.69 µs 349.74 µs]
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low mild
3 (3.00%) high mild
1 (1.00%) high severe
Float Input (No Passthrough)/n=5 Packets/8
time: [1.6381 ms 1.6426 ms 1.6477 ms]
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
Float Input (No Passthrough)/Single Packet/16
time: [452.67 µs 454.89 µs 458.44 µs]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/16
time: [411.25 µs 412.77 µs 414.39 µs]
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
4 (4.00%) high mild
1 (1.00%) high severe
Float Input (No Passthrough)/n=5 Packets/16
time: [1.9613 ms 1.9677 ms 1.9758 ms]
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low mild
4 (4.00%) high mild
3 (3.00%) high severe
Float Input (No Passthrough)/Single Packet/32
time: [583.61 µs 586.02 µs 588.56 µs]
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/32
time: [537.61 µs 540.04 µs 542.72 µs]
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) low mild
5 (5.00%) high mild
1 (1.00%) high severe
Float Input (No Passthrough)/n=5 Packets/32
time: [2.5651 ms 2.5766 ms 2.5886 ms]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
Float Input (No Passthrough)/Single Packet/64
time: [801.72 µs 805.11 µs 809.87 µs]
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/64
time: [835.46 µs 840.49 µs 845.65 µs]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
Float Input (No Passthrough)/n=5 Packets/64
time: [3.8505 ms 3.8632 ms 3.8773 ms]
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
Opus Input (Passthrough)/Single Packet
time: [29.688 µs 30.144 µs 30.623 µs]
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
Opus Input (Passthrough)/n=5 Packets
time: [71.029 µs 71.678 µs 72.337 µs]
Worst-case Track Culling (15 tracks, 5 pkts)
time: [1.3133 ms 1.3173 ms 1.3216 ms]
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
PR #179 Does seem to improve this significantly with per stream CPU usage at 1.6% with 1.9% peaks. Which is much better but still 0.3% Per Stream off from Lavalink though that is probably good enough for most applications.
Thanks for running the benchmarks locally, that's pretty disappointing (but I'm glad #179 has a big effect). I can think of some remaining places that are less than optimal (e.g., event thread wakeups regardless of handler registration). The only other thing that comes to mind is maybe the async->sync adapter adding more memcpys in the Http case. What I can recommend today is setting softclip to false for a marginal improvement -- it's enabled by default to prevent folks from blowing out someone's eardrums by naïvely playing too many audio sources at once.
There are one or two things in symphonia that trip us up here: because it works in planar audio, we're doing a lot of copying on WAV to achieve interleaved (wav) -> planar (mixing) -> interleaved (libopus). I don't think those can be fixed without rearchitecting.
Alright thanks for the advice.
Songbird version:
next
branch and 0.3.2Rust version (
rustc -V
): 1.69Serenity/Twilight version: serenity: 0.11.5
Description: I'm getting really high (relatively) CPU usage when playing back audio using either the
next
branch and songbird 0.3.2 (I haven't tested any other versions). It seems to use about 4 - 10% extra CPU per concurrent stream.Which is a bit of a problem because at that rate you may only get 10 - 20 concurrent streams. I created a flamegraph with
cargo flamegraph
for both thenext
branch and 0.3.2. It seems like a lot of the time is spent on opus encoding (Im using a WAV file as a source). But more generally that a lot of minor inefficiencies seem to be adding up to a larger issues with performance.For a quick comparison with the same load lavalink peaks at around 3.3%.
I have attached the flamegraphs below. It does seem like the
next
branch version is somewhat better at CPU utilization but only by about 2%. I'm measuring CPU usage with thetop
command If that's important. Im using an AWS EC2 Instance with 8GB of memory and four Intel Cores.Steps to reproduce:
NEXT: 0.3.2: