High CPU Usage with Audio playback

maxall41 commented 1 year ago

Songbird version: next branch and 0.3.2

Rust version (rustc -V): 1.69

Serenity/Twilight version: serenity: 0.11.5

Description: I'm getting really high (relatively) CPU usage when playing back audio using either the next branch and songbird 0.3.2 (I haven't tested any other versions). It seems to use about 4 - 10% extra CPU per concurrent stream.

Which is a bit of a problem because at that rate you may only get 10 - 20 concurrent streams. I created a flamegraph with cargo flamegraph for both the next branch and 0.3.2. It seems like a lot of the time is spent on opus encoding (Im using a WAV file as a source). But more generally that a lot of minor inefficiencies seem to be adding up to a larger issues with performance.

For a quick comparison with the same load lavalink peaks at around 3.3%.

I have attached the flamegraphs below. It does seem like the next branch version is somewhat better at CPU utilization but only by about 2%. I'm measuring CPU usage with the top command If that's important. Im using an AWS EC2 Instance with 8GB of memory and four Intel Cores.

Steps to reproduce:

Load serenity/voice example
Create three concurrent streams

NEXT: flamegraph-next-branch 0.3.2: flamegraph-0 3 2-songbird

FelixMcFelix commented 1 year ago

0.3.2 is not worth caring about; it inherited a legacy of using ffmpeg to process audio and then contacting it via IPC.

10% CPU per stream is... remarkable. As a sanity check, you are compiling with --release mode enabled? Playing WAV audio gets me CPU util closer to the below (Ryzen 5900X):

It seems like a lot of the time is spent on opus encoding (Im using a WAV file as a source).

This cannot be avoided unless you feed in an opus source -- libopus costs what it costs. Even so, wav is cheap:

Float Input (No Passthrough)/Single Packet/1
                        time:   [143.37 µs 143.80 µs 144.33 µs]
                        change: [-2.3016% -0.4741% +1.4039%] (p = 0.61 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/1
                        time:   [141.01 µs 141.39 µs 141.82 µs]
                        change: [-1.3875% +0.2363% +1.8364%] (p = 0.78 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

You could try running cargo make bench on the repo and seeing whether your numbers agree?

maxall41 commented 1 year ago

Yeah I made sure to build it with --release. Looking back at it seems like it's actually pretty much always around 3.3% per stream with peaks at 5.6% not 4 - 10 my bad. But that is still pretty high compared to Lavalink at about 1.1% per stream with 1.4% peaks on the same system.

Here is the full benchmark that i got on the EC2 Instance:

Stereo Target/Stereo Source/48000
                        time:   [7.2219 µs 7.2924 µs 7.3579 µs]
Stereo Target/Mono Source/48000
                        time:   [4.7554 µs 4.7973 µs 4.8365 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Stereo Target/Stereo Source (Resample)/44100
                        time:   [69.795 µs 70.010 µs 70.232 µs]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
Stereo Target/Mono Source (Resample)/44100
                        time:   [56.926 µs 57.132 µs 57.358 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

     Running benches/mixing-task.rs (target/release/deps/mixing_task-0044417f66554dc1)
Benchmarking Float Input (No Passthrough)/Single Packet/1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.5s, enable flat sampling, or reduce sample count to 60.
Float Input (No Passthrough)/Single Packet/1
                        time:   [281.13 µs 283.09 µs 285.74 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking Float Input (No Passthrough)/Single Packet (No Soft-Clip)/1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/1
                        time:   [278.90 µs 280.67 µs 282.95 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
Float Input (No Passthrough)/n=5 Packets/1
                        time:   [1.3171 ms 1.3208 ms 1.3248 ms]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking Float Input (No Passthrough)/Single Packet/2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.8s, enable flat sampling, or reduce sample count to 60.
Float Input (No Passthrough)/Single Packet/2
                        time:   [323.97 µs 324.99 µs 326.09 µs]
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe
Benchmarking Float Input (No Passthrough)/Single Packet (No Soft-Clip)/2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.6s, enable flat sampling, or reduce sample count to 60.
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/2
                        time:   [289.49 µs 291.00 µs 293.21 µs]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
Float Input (No Passthrough)/n=5 Packets/2
                        time:   [1.4097 ms 1.4131 ms 1.4168 ms]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
Benchmarking Float Input (No Passthrough)/Single Packet/4: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.4s, enable flat sampling, or reduce sample count to 50.
Float Input (No Passthrough)/Single Packet/4
                        time:   [336.73 µs 338.84 µs 341.63 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking Float Input (No Passthrough)/Single Packet (No Soft-Clip)/4: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.1s, enable flat sampling, or reduce sample count to 50.
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/4
                        time:   [310.19 µs 311.66 µs 313.16 µs]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
Float Input (No Passthrough)/n=5 Packets/4
                        time:   [1.4855 ms 1.4899 ms 1.4951 ms]
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
Benchmarking Float Input (No Passthrough)/Single Packet/8: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.4s, enable flat sampling, or reduce sample count to 50.
Float Input (No Passthrough)/Single Packet/8
                        time:   [377.03 µs 378.22 µs 379.47 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
Benchmarking Float Input (No Passthrough)/Single Packet (No Soft-Clip)/8: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/8
                        time:   [345.77 µs 347.69 µs 349.74 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
Float Input (No Passthrough)/n=5 Packets/8
                        time:   [1.6381 ms 1.6426 ms 1.6477 ms]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
Float Input (No Passthrough)/Single Packet/16
                        time:   [452.67 µs 454.89 µs 458.44 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/16
                        time:   [411.25 µs 412.77 µs 414.39 µs]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe
Float Input (No Passthrough)/n=5 Packets/16
                        time:   [1.9613 ms 1.9677 ms 1.9758 ms]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
Float Input (No Passthrough)/Single Packet/32
                        time:   [583.61 µs 586.02 µs 588.56 µs]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/32
                        time:   [537.61 µs 540.04 µs 542.72 µs]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
Float Input (No Passthrough)/n=5 Packets/32
                        time:   [2.5651 ms 2.5766 ms 2.5886 ms]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Float Input (No Passthrough)/Single Packet/64
                        time:   [801.72 µs 805.11 µs 809.87 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Float Input (No Passthrough)/Single Packet (No Soft-Clip)/64
                        time:   [835.46 µs 840.49 µs 845.65 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Float Input (No Passthrough)/n=5 Packets/64
                        time:   [3.8505 ms 3.8632 ms 3.8773 ms]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Opus Input (Passthrough)/Single Packet
                        time:   [29.688 µs 30.144 µs 30.623 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
Opus Input (Passthrough)/n=5 Packets
                        time:   [71.029 µs 71.678 µs 72.337 µs]

Worst-case Track Culling (15 tracks, 5 pkts)
                        time:   [1.3133 ms 1.3173 ms 1.3216 ms]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

maxall41 commented 1 year ago

PR #179 Does seem to improve this significantly with per stream CPU usage at 1.6% with 1.9% peaks. Which is much better but still 0.3% Per Stream off from Lavalink though that is probably good enough for most applications.

FelixMcFelix commented 1 year ago

Thanks for running the benchmarks locally, that's pretty disappointing (but I'm glad #179 has a big effect). I can think of some remaining places that are less than optimal (e.g., event thread wakeups regardless of handler registration). The only other thing that comes to mind is maybe the async->sync adapter adding more memcpys in the Http case. What I can recommend today is setting softclip to false for a marginal improvement -- it's enabled by default to prevent folks from blowing out someone's eardrums by naïvely playing too many audio sources at once.

There are one or two things in symphonia that trip us up here: because it works in planar audio, we're doing a lot of copying on WAV to achieve interleaved (wav) -> planar (mixing) -> interleaved (libopus). I don't think those can be fixed without rearchitecting.

maxall41 commented 1 year ago

Alright thanks for the advice.

serenity-rs / songbird

High CPU Usage with Audio playback #185