Open leechwort opened 1 year ago
Ah, yes I suspect it’s that not having an fpu makes that delay object too heavy. I’ve been using leaf mostly in stm32f7 and stm32h7, both of which have fpu so it’s based around floats being efficient on the hardware. You can check out some leaf on my other repos, like electrosteel_embedded, in the daisy audio folder
On Fri, Sep 22, 2023 at 1:34 PM Artem Synytsyn @.***> wrote:
Hello everyone! I'm interested in running your framework on my Blackpill board (STM32F411CEU6). I've noticed that I'm experiencing poor performance, and I'm wondering if I might be using your framework incorrectly or if this is expected behavior. Since there are no examples provided for embedded applications, I'm concerned that I may have made a mistake somewhere.
To provide some context, I've connected a PCM5102 DAC along with DMA to ensure that the MCU isn't overwhelmed with a simple I2S operation. Below, I've included the relevant portions of my code from the main.c file, along with comments for clarity
#include "leaf.h"

I2S_HandleTypeDef hi2s1;

// Constants
#define SAMPLERATE 44000
#define LEAF_BUFFER_SIZE (2*44100)
#define BUFFER_SIZE 8192

// Buffers
char mempool[10000]; // LEAF Memory pool
uint16_t samplebuffer[BUFFER_SIZE] = {0}; // Buffer, used for transmission to I2S codec with DMA

// Pointers, used for switching between buffers in DMA transfer
volatile uint16_t *current_buffer_element_ptr = samplebuffer;
volatile size_t current_buffer_element_cntr = 0;

// LEAF objects
LEAF leaf;
tCycle cycle;
tHermiteDelay delay;

// Utility functions
float rnd_func() { return ((float)rand() / (float)(RAND_MAX)); }

// Callbacks used for DMA transfers. When the first part of the buffer has been sent (i2s_transfer_half_complited_callback called)
// I put current_buffer_element_ptr to the beginning and allow LEAF to fill it; in this time the other half of the buffer is sent to the
// DAC. And vice versa
void i2s_transfer_complited_callback(I2S_HandleTypeDef *hi2s) {
  if (current_buffer_element_cntr >= BUFFER_SIZE / 2) {
    current_buffer_element_ptr = samplebuffer + BUFFER_SIZE / 2;
    current_buffer_element_cntr = 0;
  } else {
    printf("buffer overrun");
  }
}

void i2s_transfer_half_complited_callback(I2S_HandleTypeDef *hi2s) {
  if (current_buffer_element_cntr >= BUFFER_SIZE / 2) {
    current_buffer_element_ptr = samplebuffer;
    current_buffer_element_cntr = 0;
  } else {
    printf("buffer overrun");
  }
}

int main(void) {
  // CUBEMX stuff
  HAL_Init();
  SystemClock_Config();
  MX_GPIO_Init();
  MX_DMA_Init();
  MX_I2S1_Init();
  MX_NVIC_Init();

  // Callbacks for DMA transfer, where I switch buffers
  HAL_I2S_RegisterCallback(&hi2s1, HAL_I2S_TX_COMPLETE_CB_ID, &i2s_transfer_complited_callback);
  HAL_I2S_RegisterCallback(&hi2s1, HAL_I2S_TX_HALF_COMPLETE_CB_ID, &i2s_transfer_half_complited_callback);
  HAL_I2S_Transmit_DMA(&hi2s1, samplebuffer, sizeof(samplebuffer)/sizeof(samplebuffer[0]));

  // LEAF stuff init.
  LEAF_init(&leaf, SAMPLERATE, mempool, LEAF_BUFFER_SIZE, &rnd_func);
  tCycle_init(&cycle, &leaf);
  tCycle_setFreq(&cycle, 220);
  tHermiteDelay_init(&delay, 2000, 2500, &leaf);
  tHermiteDelay_setGain(&delay, 0.5f);

  uint64_t counter = 0;
  while (1) {
    // If the DMA controller successfully finished the transfer to the audio codec, we can put new data there.
    // This part works OK when only simple stuff is done here.
    if (current_buffer_element_cntr < BUFFER_SIZE / 2) {
      counter++;

      if ((counter % 100000) == 10000)
        tCycle_setFreq(&cycle, 220);
      else if ((counter % 100000) == 20000)
        tCycle_setFreq(&cycle, 330);
      else if ((counter % 100000) == 30000)
        tCycle_setFreq(&cycle, 220);
      else if ((counter % 100000) == 40000)
        tCycle_setFreq(&cycle, 0);

      float processed_value = tCycle_tick(&cycle);
      //float delayed_value = tHermiteDelay_tick(&delay, processed_value); // <<<< LOOK HERE
      //processed_value = delayed_value; // <<<<< LOOK HERE
      *(current_buffer_element_ptr + current_buffer_element_cntr) = (uint16_t) (0x0fff * (1.0f + processed_value));
      current_buffer_element_cntr++;
    }
  }
}
We can assume that the code is functioning correctly. I have provided a recording of the sound when the sequence is running as expected: Record - sequence, works ok https://recorder.google.com/7a84e457-62ad-42a7-b041-f76fc30f43e0
Next, I uncommented a section marked as "<<<< LOOK HERE." This enabled a delay for the audio, and as a result, the sound became severely distorted: Record - sequence + delay, broken https://recorder.google.com/e2904a43-fd83-4bb8-99f6-43528ba1f1db
I also tested similar code on a host machine (you can find it in my fork and example: https://github.com/leechwort/LEAF/blob/master/Examples/sawtooth-sequence.c) and it worked ok. This leads me to suspect that the issue might be related to performance limitations on the STM32 board.
In summary, I have a few questions:
- Can you suggest what might be causing this behavior on the STM32 board? Is there a specific way I should be using your framework for embedded systems that differs from using it on a host machine?
- Do you have any example projects specifically designed for the STM32 platform that I could refer to for guidance?
- It appears that the FPU (Floating-Point Unit) is not utilized in this framework. Do you have plans to implement FPU support in the future?
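One detail in the posted listing may be worth checking: `mempool` is declared with 10000 bytes, while `LEAF_init` is given `LEAF_BUFFER_SIZE`, which expands to 88200. The following is only a sketch of keeping the two in sync with `sizeof`. The helpers `mempool_size`, `delay_line_bytes` and `fits_in_pool` are hypothetical, and the idea that a delay line stores one `float` per sample of maximum delay is an assumption, not something confirmed against the LEAF sources.

```c
#include <stddef.h>

#define MEMPOOL_BYTES 10000             /* same size as mempool in the post */
static char mempool[MEMPOOL_BYTES];

/* Size to hand to the allocator: always the real size of the array. */
static size_t mempool_size(void)
{
    return sizeof(mempool);
}

/* ASSUMPTION: a delay line costs one float per sample of maximum delay. */
static size_t delay_line_bytes(size_t max_delay_samples)
{
    return max_delay_samples * sizeof(float);
}

/* Non-zero if `needed` bytes plus `overhead` bytes of bookkeeping fit. */
static int fits_in_pool(size_t needed, size_t overhead)
{
    return needed + overhead <= mempool_size();
}
```

Under that assumption, a 2500-sample delay would need 10000 bytes with 4-byte floats, which is the whole pool before any bookkeeping. Passing `sizeof(mempool)` to `LEAF_init` and enlarging the array might be a first thing to try.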
— Reply to this email directly, view it on GitHub https://github.com/spiricom/LEAF/issues/15, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGAY7F652H3K6KMHTXQRJTX3XDZZANCNFSM6AAAAAA5DNO54U . You are receiving this because you are subscribed to this thread.Message ID: @.***>
Oh wait, are you saying there is an fpu on your f4? That’s not enabled as part of leaf library, you just need to enable it as a compiler flag
hi artem and jeff
I believe the F411CE has a 32-bit (single-precision) FPU.
-mfloat-abi=hard -mfpu=fpv4-sp-d16
should turn it on.
using doubles will cause it to slow to a crawl though.
and even with an FPU, integer operations are often much faster.
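A quick way to confirm that the flags actually reached the compiler is to look at its predefined macros. This is a sketch assuming GCC or Clang targeting ARM, where (as far as I know) `__SOFTFP__` marks a soft-float build and `__ARM_FP` is a bitmask with 0x4 for single and 0x8 for double precision. `fpu_build_mode` is a hypothetical helper name.

```c
#include <stdio.h>

/* Describe the floating-point mode this translation unit was built for. */
static const char *fpu_build_mode(void)
{
#if defined(__SOFTFP__)
    return "soft-float: every float operation is a library call";
#elif defined(__ARM_FP) && (__ARM_FP & 0x8)
    return "hardware FPU with double precision";
#elif defined(__ARM_FP) && (__ARM_FP & 0x4)
    return "hardware FPU, single precision only";
#else
    return "not an ARM hardware-float build (or unknown target)";
#endif
}

/* Print the result once at startup, e.g. over a debug UART or semihosting. */
static void report_fpu_build_mode(void)
{
    printf("FP build mode: %s\n", fpu_build_mode());
}
```

On a single-precision FPU, unsuffixed literals such as `0.5` and calls such as `sin()` are double precision and fall back to software routines, which is the slowdown with doubles mentioned above. Writing `0.5f` and `sinf()` keeps the work on the FPU.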
Thanks for your replies, guys! I've re-checked that the FPU is enabled on my board, and even benchmarked a piece of code:
volatile uint32_t start = HAL_GetTick();
for (int i = 0; i < 10000000; i++)
for (int j = 0; i < 10000000; i++)
{
volatile float x = sinf((0.5f*i) * (0.4f*j));
}
volatile uint32_t end = HAL_GetTick() - start;
~19 sec without the FPU and ~6 sec with it. And no change in LEAF performance. I still think I might be missing something. Sure, the F411 has much less performance than the STM32F7, but is the delay really that expensive?
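To judge whether a per-sample delay is affordable, it can help to put a number on the budget. A rough sketch, assuming the core runs at 100 MHz (the F411's rated maximum; the real clock depends on `SystemClock_Config`) and that one tick produces one output sample. `cycles_per_sample` and `cpu_load` are hypothetical helpers.

```c
#include <stdint.h>

/* CPU cycles available to compute one output sample (rounded down). */
static uint32_t cycles_per_sample(uint32_t cpu_hz, uint32_t sample_rate_hz)
{
    return cpu_hz / sample_rate_hz;
}

/* Fraction of the budget used by a routine costing `cycles` per sample. */
static float cpu_load(uint32_t cycles, uint32_t cpu_hz, uint32_t sample_rate_hz)
{
    return (float)cycles / (float)cycles_per_sample(cpu_hz, sample_rate_hz);
}
```

At 100 MHz and 44.1 kHz this gives about 2267 cycles per sample; at the 16 MHz reset default it would be about 362. If the I2S stream is interleaved stereo and every buffer slot gets its own tick, the budget per tick is halved. Measuring the tick cost in cycles and comparing it against this number is likely more telling than a wall-clock loop.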
Also, my mistake on the last question: I meant "CMSIS DSP", not the FPU.
OK, it looks like things get better with larger buffers, but since RAM on the Blackpill is limited, maybe it's time to switch to something with additional onboard RAM :)
I looked over the leaf delay code, and there is nothing particularly slow in it. I can see some places to speed things up: use & with power-of-two buffer sizes for the circular wrap, and hand-optimize the Hermite interpolation to minimize operations.
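The & trick, as a minimal sketch: when the buffer length is a power of two, wrapping an index is a single AND instead of a compare or a modulo. `RingDelay`, `ring_init` and `ring_tick` are hypothetical names, not LEAF's API, and this is a plain integer-sample delay without the Hermite interpolation.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_LEN  4096u                 /* must be a power of two */
#define RING_MASK (RING_LEN - 1u)

typedef struct {
    float    buf[RING_LEN];
    uint32_t write;                     /* next write position */
    uint32_t delay;                     /* delay in samples, 0..RING_LEN-1 */
} RingDelay;

static void ring_init(RingDelay *d, uint32_t delay_samples)
{
    for (size_t i = 0; i < RING_LEN; i++) {
        d->buf[i] = 0.0f;
    }
    d->write = 0;
    d->delay = delay_samples & RING_MASK;
}

/* Store one sample and return the one written `delay` ticks earlier. */
static float ring_tick(RingDelay *d, float in)
{
    d->buf[d->write] = in;
    /* Unsigned subtraction may wrap; the mask folds it back into range. */
    float out = d->buf[(d->write - d->delay) & RING_MASK];
    d->write = (d->write + 1u) & RING_MASK;
    return out;
}
```

The cost of this choice is memory: a 2500-sample delay has to be rounded up to 4096 entries, which matters on a part with limited RAM.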
However, a couple tips that might help speed things up:
Try computing the first half after first half complete, and second half after second half complete. That’ll give you a little more leeway
Here’s what i’m doing with the 12-bit DAC (at 96k, 64 sample blocks)
void HAL_DAC_ConvCpltCallbackCh1(DAC_HandleTypeDef *hdac) { ChaosHalfBlock(32); }
void HAL_DAC_ConvHalfCpltCallbackCh1(DAC_HandleTypeDef* hdac) { ChaosHalfBlock(0); }
Also, I call the audio code in the callback (as you see above). In my code the main while() loop is lower priority.
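A sketch of that pattern applied to a DMA double buffer: each callback refills the half that DMA has just finished sending, while DMA plays the other half. `on_half_complete`, `on_complete` and `render_block` are hypothetical names, and a simple ramp stands in for the LEAF tick so the example can run without hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define BUFFER_SIZE 64u
#define HALF        (BUFFER_SIZE / 2u)

static uint16_t samplebuffer[BUFFER_SIZE];
static uint16_t phase;                  /* stand-in oscillator state */

/* Fill n samples starting at dst; replace the body with the real DSP. */
static void render_block(uint16_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = phase;
        phase = (uint16_t)(phase + 1u);
    }
}

/* DMA finished sending the first half: it is now safe to overwrite it. */
static void on_half_complete(void)
{
    render_block(&samplebuffer[0], HALF);
}

/* DMA finished sending the second half and wraps back to the start. */
static void on_complete(void)
{
    render_block(&samplebuffer[HALF], HALF);
}
```

On the STM32 HAL these would be called from the half-complete and complete callbacks, which run in interrupt context, so each block has to finish before the next callback fires.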
Use at least -O3 with F4 processors, and -Ofast if in a tight spot.
Tom