sending samples 1 by 1 consumes lots of cpu and smaller bugs - improvement included

shbgr commented 5 months ago

Hi. I did some experiments with the lib. Works great, but i was wondering about only sending chunks of 4 bytes only to the i2s-driver. My C-experiences are cumbersome and the same with GIT. But managed to get CPU-load down by a factor of 2 by sending 16 i2s-frames instead of the the current single frame in the lib. This is likely due to several checks and creating a semaphore in IDF-lib for each call to i2s_write.

My results on Core 1 for a 320kbit MP3, stereo, 48k sample rate, SD connected with SPI:

single frame send: 65% load (lib 3.0.8)
16 frame send: 32% load

I have attached my modified lib (based on 3.0.8). Seems to run fine, but memory leaks and other codecs than mp3 were off my scope. So better double check. See also some notes incorporated on filters that might get wrong and playSample for 8-bit files. And i added an additional attenuation. Not relevant for the lib.

Maybe this helps for further improvement. Audiosh.zip

schreibfaul1 commented 5 months ago

Thanks for the input, I know that the current state is not ideal. I'll have a look at it in the next few days.

shbgr commented 5 months ago

Looks like you give it a try. Definitely needs some work for production use, but likely worth the trouble.

Some more thoughts:

I had disabled the sending by bluetoth hook (audio_process_i2s). Wouldn't work with 16 frame chunks.
For a more general solution maybe setup an additional "i2s-buffer" with size of the i2s chunk (with my settings 16 frames = 64 bytes) in playSample. And push out to i2s-DMA-Buffer as soon as full. Needs another uint8_t to track the fill state.

Therefor:

playChunk needs a spitOut (somewhat like my playPreparedSamples) at the beginning
playSample always writes to "i2s-buffer" and calls "spitOut" if full
filtering and audio_process_i2s in playSample work on the "i2s-buffer" instead of sending directly and remain unchanged
no samples processed in playChunk if "i2s-buffer" is full; this would adress the filter issue with discarded samples as well
spitOut (only sends if full) would be called from playSample if full and from playChunk at the beginning.
clear the "i2s-buffer" when playback is stopped or a new file starts; its less than 1ms, so not important for resume or anything, just avoid glitches

But have a look at the impact first. Also had some changes regarding the end of the file (idv3.1) and 8-bit-playback, but thats minor.

schreibfaul1 commented 5 months ago

Yes, I have copied the essentials from your source code and integrated them into a branch. Now the test phase begins.

tueddy commented 5 months ago

I have just tested the branch with ESPuino. The CPU time consumed is significantly reduced with these changes!

Current master:

Before

send-I2S-multiple-samples branch:

After

This is a great improvement! Audio plays fine. Unfortunately, Bluetooth Source (audio_process_i2s) is not yet working...

schreibfaul1 commented 5 months ago

Hi tueddy, you're already further along than me. So far I've only put the suggestions from shbgr into a branch and tested whether it works, purely subjectively. The measurements look really good, that brings a lot of speed. Maybe it's possible to put the filters and volume before the MULTISAMPLE array, then the BT function would be preserved.

tueddy commented 5 months ago

I know this is in an early state, I just wanted to share the performance measurement..

shbgr commented 5 months ago

Great. I think you could even reduce the mp3play load if you play with the delays in the task. I had 15 or 20ms if the decoded mp3 buffer (valid samples) couldn’t be completely written to DMA. 1ms otherwise. I monitored this with the audio_sent callback. But definitely has headroom.

The 16*512 buffer holds more than 40ms of samples worst case (48k). Decode for 1152 samples (9 buffers) takes about 5ms@320kbps. If there are valid samples left we should have at least 14 full buffers at the end of the loop (37ms worst case).

Some adaptive delay could be useful here. no idea how much load other codecs than mp3 produce. But for mp3 something like 2/5/7/9 ms delay for each 128 frames in buffer should work well. No need to try when not much work is expected…

Could give the headroom to lower cpu speed or simply allow faster ftp transfer while playing.

schreibfaul1 commented 5 months ago

Previously, the system waited until all decoded data had been written to the DMA buffer. There are now two improvements:

The transfer of the data as a block suggested by @shbgr
if not all the data fits into the DMA buffer, the task is terminated and the remaining data is written at the next call. This avoids waiting times.

It is no longer possible to transfer the samples individually to BT. This is only possible as a block.

@tueddy please check whether this is also possible in ESPuino:

void audio_process_i2s(int16_t* outBuff, uint16_t validSamples, uint8_t bitsPerSample, uint8_t channels, bool *continueI2S){

  log_i("validSamples %i, bitsPerSample %i, channels %i", validSamples, bitsPerSample, channels);

  for(int i = 0; i < validSamples; i++){
      if(channels == 2){
          // outBuff[i]      left channel
          // outBuff[i + 1]  right channel
      }
      if(channels == 1){
          // outbuff[i]  mono
      }
  }

  *continueI2S = true; // also output via I2S
}

tueddy commented 5 months ago

It should definitely be possible to change the audio_process_i2s function for the ESPuino project. Perhaps the Bluetooth Source process could also be optimized by processing several samples. I will test it the next days.

One question: The transferred samples remain the same as the previous version (volume applied etc.)?

schreibfaul1 commented 5 months ago

That looks good. If you can copy a block to BT instead of individual samples, this will save time again. audio_process_i2s is behind the equalizer and the volume control, so that won't be a problem. It would be nice if the BT function could remain in operation. In addition, with a sample rate of 22.1KHz you may be able to write each sample twice or ignore some samples at 48KHz.

tueddy commented 5 months ago

I made a quick test with Bluetooth Source and get basically the same samples as before. BT-Source seems to work fine so far.

Before:

// process audio sample extern (for bluetooth source)
void audio_process_i2s(uint32_t *sample, bool *continueI2S) {
    *continueI2S = !Bluetooth_Source_SendAudioData(sample);
}

New, without any optimization, just tested with channels=2, left/right might be swapped:

void audio_process_i2s(int16_t *outBuff, uint16_t validSamples, uint8_t bitsPerSample, uint8_t channels, bool *continueI2S) {
    Serial.printf("validSamples %i, bitsPerSample %i, channels %i\n", validSamples, bitsPerSample, channels);
    uint32_t sample;
    for (int i = 0; i < validSamples; i++) {
        if (channels == 2) {
            // stereo
            sample = (uint16_t(outBuff[i * 2]) << 16) | uint16_t(outBuff[i * 2 + 1]);
            *continueI2S = !Bluetooth_Source_SendAudioData(&sample);
        }
        if (channels == 1) {
            // mono
            sample = (uint16_t(outBuff[i]) << 16) | uint16_t(outBuff[i]);
            *continueI2S = !Bluetooth_Source_SendAudioData(&sample);
        }
    }
}

I get clear sound on the BT-headset & about 570 samples per call wich is very good for further optimization. A little glitch is conversion between signed / unsigned int16/32 but it can be managed! Best Dirk

schreibfaul1 commented 5 months ago

I have now transferred it to the Master, https://github.com/schreibfaul1/ESP32-audioI2S/pull/764

github-actions[bot] commented 4 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 3 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

schreibfaul1 / ESP32-audioI2S

sending samples 1 by 1 consumes lots of cpu and smaller bugs - improvement included #754