soul-lang / SOUL

The SOUL programming language and API

Improving the memory model for embedded devices #33

Closed TheSlowGrowth closed 4 years ago

TheSlowGrowth commented 4 years ago

I'm considering using SOUL for an embedded project, but I've found a couple of key issues that greatly hinder the use of SOUL on embedded devices. These issues apply to embedded targets in general. As I understand it, SOUL specifically wants to target these devices, so I think it may be helpful to consider some of the problems early on.

My main issue is the memory model - or the lack thereof. Here's the reality on all embedded devices that I worked with so far:

  1. Memory is limited
  2. Memory is split across discontinuous blocks which each have their own use, speed penalty, access requirements, etc.
  3. Often there's no dynamic memory allocation available.

So far, SOUL pays no attention to the memory it uses. Every buffer and variable is treated equally, and that is a flaw that must be addressed if compatibility with embedded hardware is a priority. I don't understand the inner workings well enough to make pull requests, but I have a couple of suggestions. You may very well already have solutions for all of this coming, but I didn't find any info, so I figured it might be worth writing my thoughts down here.

Regarding the SOUL language itself

  1. A SOUL compiler needs to know how a piece of memory is used in the SOUL code so that it can place that piece of memory where it fits best. While the compiler could in theory use the size of arrays, the number of accesses, etc. to determine the memory location, automatic placement will always produce suboptimal results in some edge cases. Moreover, it leaves the decision to a compiler which must be very smart to understand the performance impacts, and I think it will take a lot of development before that reaches a good state. I think a good solution would be annotations that give some guidance to the compiler. Memory could be annotated in SOUL code as bulkMem or fastMem, where the former would indicate to the compiler that this piece of memory may be placed in a slower bulk storage if required. This should be "indicative", not "imperative", so that in the future, smarter compilers can start to make their own decisions.

  2. When I code delay buffers, I need to specify the size in samples. I want to stay flexible with respect to the sample rate, yet I would like to get the maximum performance with respect to cache misses. It would be great if the SOUL language had an internal constant "processor.maxSampleRate" that would allow me to scale my arrays accordingly. When SOUL code is compiled without knowledge of the actual sample rate, this could be some large value defined by the runtime environment/compiler. When the patch is compiled for a specific sample rate, it could equal the actual sample rate so that the memory footprint is as small as possible. Of course this would have to take any up/downsampling into account, but these things are known at compile time as well. A rough C++ analogy of what I mean is sketched below.
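
To illustrate the idea in point 2, here is a rough C++ analogy (this is not SOUL syntax - the constant kMaxSampleRate and the DelayLine type are hypothetical names): the delay line's capacity is derived at compile time from a maximum sample rate constant, so recompiling with the actual target rate shrinks the memory footprint automatically.

```
#include <array>
#include <cstddef>

// Hypothetical compile-time constant; a build for a fixed-rate target could
// substitute the actual sample rate here instead of a worst-case maximum.
constexpr double kMaxSampleRate = 192000.0;

// Delay line sized for a maximum delay time at the maximum sample rate.
template <int maxDelayMilliseconds>
struct DelayLine
{
    static constexpr std::size_t capacity =
        static_cast<std::size_t> (kMaxSampleRate * maxDelayMilliseconds / 1000.0);

    std::array<float, capacity> buffer {};
    std::size_t writeIndex = 0;

    void push (float sample)
    {
        buffer[writeIndex] = sample;
        writeIndex = (writeIndex + 1) % capacity;
    }
};

// A 50ms delay needs 9600 floats at 192kHz, but only 2400 at 48kHz -
// rebuilding with kMaxSampleRate = 48000.0 shrinks the footprint accordingly.
DelayLine<50> delay;
```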

Regarding the current C++ code generator

Right now, the only way for me to get SOUL code onto an embedded target is the C++ generator. I think even in the future, [SOUL -> C++] + platform-specific startup/glue code may be a great option for many platforms. However, the whole SOUL patch is currently generated as one huge class, which doesn't work with the segmented memory of an embedded device.

Here are my wishes for that:

  1. I would like to specify to the C++ generator how my memory is laid out. This could be akin to a linker script: I would specify which memory segments are available, their size and their speed (abstracted, e.g. 1-5 or simply "fast"/"slow"). The generated C++ code would then contain my main class (which will sit in fast memory and directly contain everything that belongs in fast memory) as well as some nested classes (or just size_t constants) - one for each memory segment. I would then allocate one of each in the corresponding memory segment and hand them over to the main class in a constructor. Something like this:

    #include <new>           // placement new
    #include <type_traits>   // std::aligned_storage_t

    class CppCodeGeneratedFromSoulPatch
    {
    public:
        struct MemSegment1Data
        {
            // ...
        };
        struct MemSegment2Data
        {
            // ...
        };

        // ctor takes preallocated, suitably aligned memory for each segment
        CppCodeGeneratedFromSoulPatch(std::aligned_storage_t<sizeof(MemSegment1Data), alignof(MemSegment1Data)>& mem1,
                                      std::aligned_storage_t<sizeof(MemSegment2Data), alignof(MemSegment2Data)>& mem2)
        {
            // uses placement new() to construct the MemSegmentXData objects in the provided storage
            segment1_ = new (&mem1) MemSegment1Data();
            segment2_ = new (&mem2) MemSegment2Data();
        }

    private:
        MemSegment1Data* segment1_;
        MemSegment2Data* segment2_;
    };

    Not sure if this particular model would work, but SOUL could also use some sort of "pseudo-dynamic" memory allocation internally, where the C++ code for each processor/graph of the patch would get a pointer to the available memory regions, use whatever memory it needs and return a new pointer to the remaining memory for use by the next processor/graph. A minimal sketch of that idea follows after this list.

  2. I would like to be able to generate C++ code for one single sample rate, making use of as much optimization as possible. I would like to have filter coefficients pre-calculated, etc., without hindering code portability to other sample rates and without losing the possibility to develop my SOUL patch in a DAW on my computer.
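
A minimal sketch of the "pseudo-dynamic" allocation idea from point 1, assuming a simple bump allocator; the MemoryRegion type and its interface are hypothetical and not anything that exists in SOUL today:

```
#include <cstddef>
#include <cstdint>
#include <memory>   // std::align
#include <new>      // placement new

// Hypothetical bump allocator: each processor/graph takes the memory it needs
// from a region and leaves the rest for the next one.
struct MemoryRegion
{
    std::uint8_t* next;
    std::size_t   bytesLeft;

    // Returns suitably aligned, default-constructed storage for one T,
    // or nullptr if the region is exhausted.
    template <typename T>
    T* allocate()
    {
        void* ptr = next;
        std::size_t space = bytesLeft;

        if (std::align (alignof (T), sizeof (T), ptr, space) == nullptr)
            return nullptr;

        next = static_cast<std::uint8_t*> (ptr) + sizeof (T);
        bytesLeft = space - sizeof (T);
        return new (ptr) T();
    }
};

// Usage idea (address and size are illustrative only):
//   MemoryRegion sdram { reinterpret_cast<std::uint8_t*> (0xC0000000), 64 * 1024 * 1024 };
//   auto* delayState = sdram.allocate<SomeProcessorState>();
```

Each generated processor would then carve its state out of the region it was assigned to, so the caller keeps full control over which physical memory backs which part of the patch.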

I think SOUL is an amazing project that could really change the audio industry. I would love to hear your thoughts on these topics and I'm looking forward to the future developments.

cesaref commented 4 years ago

Hi, thanks for the comments. Feedback like this is hugely useful as it helps us make informed decisions about what language features to add, and what flexibility will be required in the future.

It would be useful to know what sort of platform you are thinking of with these limitations - I'm assuming some sort of low power device like an AVR or XMOS style processor, say an Arduino or something like that?

I think it's important to separate concerns about the current runtime implementation (especially the C++ backend) from what the language implies. As you can see, the language makes no mention of memory layout or of how the code is organised and executed, and it gives no guarantees about layout - this means we can change these details to meet different requirements.

At present the backend that we ship produces a monolithic implementation of the complete graph, so all of the state members get pulled into a single block of memory. For your application this is clearly not good, and you can imagine other situations where this is equally bad (high core count parallel systems, for example). Expect there to be other options in the future!

I think you are right, ultimately the compiler is not going to be able to be smart enough to make the right decisions about memory segments, so the user must direct it to make the right choices. This will involve something like annotations on members to indicate memory type requirements, or additional information in the patch json indicating how the memory layout should be handled on a particular platform.

TheSlowGrowth commented 4 years ago

It would be useful to know what sort of platform you are thinking of with these limitations

I'm thinking of the Electrosmith Daisy: a Cortex-M7 at 480MHz with L1 cache and 64MB of external SDRAM. About as powerful as MCUs get without an actual OS. Executing DSP with all data in internal SRAM is roughly 3-4 times faster than in external SDRAM (with D-cache enabled), so the placement of delay lines, filter states, etc. is a concern even on this beast of a processor. Sure, many SOUL patches will fit into the internal SRAM. But for more sophisticated reverbs, long delays, loopers, granular processors, etc., even this beast needs the external SDRAM.

My goal is to provide some example code on how to execute complex SOUL patches on this powerful platform. They have faust & gen~ support already. Adding SOUL support would be a great showcase.

like an AVR

I don't think AVR is ever going to be a viable platform for high quality DSP other than with hardcore-optimized, handcrafted code. But a Cortex-M3 or M7 surely is, as the widespread adoption in modular synth equipment shows.

the language itself gives no guarantees about layout

I think that is a good choice, especially with respect to code portability. And I don't think guarantees are strictly necessary, but some guidance for the compilers would surely be helpful. Some abstract annotation like "fast" or "bulk" could be a good solution: it doesn't imply any particular architecture and makes no assumptions about where and how the code will run. But some knowledge about the type of memory is required for pretty much any embedded application that is not an embedded Linux device - FPGAs, dedicated DSPs, Cortex-M parts, etc.
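
As a purely hypothetical illustration of how such an annotation could be realised on a GCC-based target: state marked "bulk" could end up in a linker section that the board's linker script maps to external SDRAM, while everything else stays in internal RAM. The section name below is made up and would depend on the specific linker script.

```
// Hypothetical mapping of a "bulk" annotation to a linker section on a
// GCC-based embedded target; ".sdram_bss" stands in for whatever section
// the board's linker script places in external SDRAM.
#define BULK_MEM __attribute__ ((section (".sdram_bss")))

// "fast" (default) state: small, hot data that should live in internal RAM
static float filterStates[64];

// "bulk" state: a long delay line that can tolerate slower external memory
BULK_MEM static float longDelayBuffer[1 << 20]; // 4MB pushed out to SDRAM by the linker
```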

additional information in the patch json indicating how the memory layout should be handled on a particular platform

I can't imagine how this would work, can you elaborate? IMO the programmer would need to specify memory requirements per variable, which to my understanding makes most sense right in the code itself, not in a global metadata file. The specific memory layout of the actual device should not be relevant to the SOUL code, it should be part of the runtime / compiler. That's why I think annotations would work well. They specify the intention, not the actual realisation. Where the memory will actually be placed is entirely up to the runtime environment. It may just as well ignore the annotation entirely if it thinks that's a good idea. But with some guidance it can make better decisions. For the C++ generator, some sort of "memory map" / "linker file" would ultimately be required so that the generator can know the exact limitations of the target platform. Future JIT compilers for embedded platforms could be tailored to that specific platform directly.
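
For illustration, here is one hypothetical shape such a "memory map" description could take if the C++ generator consumed it directly - none of these types exist in SOUL, and the numbers are only placeholders:

```
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of the target's memory layout, handed to the
// C++ generator much like a linker script describes segments to a linker.
struct MemorySegmentInfo
{
    std::string name;        // e.g. "DTCM", "SRAM1", "SDRAM"
    std::size_t sizeInBytes; // capacity of the segment
    int         speed;       // abstract ranking, e.g. 1 = fastest ... 5 = slowest
};

struct TargetMemoryMap
{
    std::vector<MemorySegmentInfo> segments;
};

// Example for a Daisy-like target (numbers are illustrative only):
const TargetMemoryMap daisyLikeTarget {
    {
        { "DTCM",  128 * 1024,       1 },
        { "SRAM",  512 * 1024,       2 },
        { "SDRAM", 64 * 1024 * 1024, 5 },
    }
};
```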

Expect there to be other options in the future!

That is great to hear! I can't wait to get going with SOUL on embedded devices. It seems so promising for a faster and easier development cycle. I played a lot with Tracktion and SOUL - being able to code while the DAW transport is running is just amazing. I don't even have to leave the code editor to hear my changes right away. Now if only I could use the same code for the embedded platform without rewriting it in C++ myself...

I'm so much looking forward to the coming developments!

TheSlowGrowth commented 4 years ago

Here are some measurements on the STM32H7.

Allocated in internal SRAM, the render time for 1024 frames at 48kHz was 27.1ms. The maximum render time for 100% load would have been 21.3ms. So this is not even realtime capable, which is very disappointing given the simplicity of the reverb algorithm.

Allocated in external SDRAM, the render time for 1024 frames at 48kHz was 29ms. That penalty is much smaller than I've seen in other examples. But given that the overall performance is quite bad, I assume most of the render time is lost in overhead and not actually slowed down much by the memory access times. With less overhead (and more time spent on actual DSP) I would expect the difference to be much bigger.

At -O3 I see the same results.

TheSlowGrowth commented 4 years ago

Some more testing: I rewrote the reverb in plain C++ (copying everything including parameter smoothing, etc.) and these are the results under the same conditions as above:

Allocated in internal SRAM, 1024 samples: 20.4ms (wohoo, realtime requirements met!)
Allocated in external SDRAM, 1024 samples: 21.8ms

I included a parameter to artificially increase the delay line sizes by a constant number of samples. The idea being that this will lead to more cache misses. This should be a better estimate for a more complex algorithm.

With an additional 4096 samples per delay buffer (= +400kB memory footprint) I get:
Allocated in internal SRAM, 1024 samples: 20.4ms (same as before)
Allocated in external SDRAM, 1024 samples: 22.0ms (only +0.2ms)

With an additional 65k samples per delay buffer (= +6.1MB memory footprint) I get:
Allocated in internal SRAM, 1024 samples: not possible, only a total of 512kB of contiguous memory available
Allocated in external SDRAM, 1024 samples: 22.7ms (only +0.9ms)

So it seems the numbers I've heard from other people (a 3-4x difference) are totally off. Or my handwritten code is horribly inefficient. I'm still very surprised to see such poor performance on this simple algorithm. I'm using the official startup files and code stack for the Electrosmith Daisy, so I don't think my config is broken.

For reference, this is my handwritten version of the example reverb:

```
#include <cstdint>
#include <cmath>

constexpr uint32_t sizeBloat = 0; // additional bloat per delay line for more cache misses
constexpr float reverbSampleRate = 48000;

// Integer index that wraps around at 'size', used for the circular buffers below
template <typename IntType, IntType size>
class WrappedInt
{
public:
    WrappedInt(int initialValue = 0) { value_ = getWrapped(initialValue); }
    operator IntType() const { return value_; }
    IntType operator=(const IntType rhs) { value_ = getWrapped(rhs); return value_; }
    WrappedInt operator++(int) { value_++; if (value_ >= size) value_ = 0; return value_; }

private:
    IntType getWrapped(IntType in) { while (in >= size) in -= size; return in; }
    IntType value_;
};

template <uint32_t sizeInSamples>
class CombFilter
{
public:
    CombFilter() : last_(0.0f), readIdx_(0), writeIdx_(sizeInSamples) {}

    void reset()
    {
        for (uint32_t i = 0; i < bufferSize_; i++)
            buffer_[i] = 0.0f;
        readIdx_ = 0;
        writeIdx_ = readIdx_ + sizeInSamples;
        last_ = 0.0f;
    }

    float process(const float input, const float feedback, const float damping)
    {
        const float outSample = buffer_[readIdx_];
        last_ = (outSample * (1.0f - damping)) + (last_ * damping);
        buffer_[writeIdx_] = (gain_ * input) + (last_ * feedback);
        writeIdx_++;
        readIdx_++;
        return outSample;
    }

private:
    static constexpr uint32_t bufferSize_ = sizeInSamples + sizeBloat;
    static constexpr float gain_ = 0.015f;
    float buffer_[bufferSize_];
    float last_;
    WrappedInt<uint32_t, bufferSize_> readIdx_;
    WrappedInt<uint32_t, bufferSize_> writeIdx_;
};

template <uint32_t sizeInSamples>
class AllpassFilter
{
public:
    AllpassFilter() : readIdx_(0), writeIdx_(sizeInSamples) {}

    void reset()
    {
        for (uint32_t i = 0; i < bufferSize_; i++)
            buffer_[i] = 0.0f;
        readIdx_ = 0;
        writeIdx_ = readIdx_ + sizeInSamples;
    }

    float process(const float input)
    {
        const float bufferedSample = buffer_[readIdx_];
        buffer_[writeIdx_] = input + (bufferedSample * 0.5f);
        writeIdx_++;
        readIdx_++;
        return bufferedSample - input;
    }

private:
    static constexpr uint32_t bufferSize_ = sizeInSamples + sizeBloat;
    float buffer_[bufferSize_];
    WrappedInt<uint32_t, bufferSize_> readIdx_;
    WrappedInt<uint32_t, bufferSize_> writeIdx_;
};

// Linear parameter smoothing with a fixed slew rate
class ParameterRamp
{
public:
    ParameterRamp() : targetValue_(0.0f), currentValue_(0.0f), rampIncrement_(0.0f), rampSamples_(0) {}

    void skipToTargetValue() { currentValue_ = targetValue_; rampSamples_ = 0; }

    void setTargetValue(const float newTarget)
    {
        targetValue_ = newTarget;
        const auto diff = targetValue_ - currentValue_;
        const auto rampSeconds = std::abs(diff) / slewRate_;
        rampSamples_ = int (reverbSampleRate * rampSeconds);
        rampIncrement_ = diff / float (rampSamples_);
    }

    float getSmoothedValue()
    {
        if (rampSamples_ > 0)
        {
            currentValue_ += rampIncrement_;
            --rampSamples_;
        }
        return currentValue_;
    }

private:
    static constexpr float slewRate_ = 20.0f;
    float targetValue_;
    float currentValue_;
    float rampIncrement_;
    int rampSamples_;
};

// One reverb tank: 8 parallel combs followed by 4 series allpasses
template <uint32_t offset>
class ReverbChannel
{
public:
    void reset()
    {
        comb1_.reset(); comb2_.reset(); comb3_.reset(); comb4_.reset();
        comb5_.reset(); comb6_.reset(); comb7_.reset(); comb8_.reset();
        ap1_.reset(); ap2_.reset(); ap3_.reset(); ap4_.reset();
    }

    float process(const float input, const float feedback, const float damping)
    {
        float combOut = 0.0f;
        combOut += comb1_.process(input, feedback, damping);
        combOut += comb2_.process(input, feedback, damping);
        combOut += comb3_.process(input, feedback, damping);
        combOut += comb4_.process(input, feedback, damping);
        combOut += comb5_.process(input, feedback, damping);
        combOut += comb6_.process(input, feedback, damping);
        combOut += comb7_.process(input, feedback, damping);
        combOut += comb8_.process(input, feedback, damping);

        float outputSample = combOut;
        outputSample = ap1_.process(outputSample);
        outputSample = ap2_.process(outputSample);
        outputSample = ap3_.process(outputSample);
        outputSample = ap4_.process(outputSample);
        return outputSample;
    }

private:
    AllpassFilter<225 + offset> ap1_;
    AllpassFilter<341 + offset> ap2_;
    AllpassFilter<441 + offset> ap3_;
    AllpassFilter<556 + offset> ap4_;
    CombFilter<1116 + offset> comb1_;
    CombFilter<1188 + offset> comb2_;
    CombFilter<1277 + offset> comb3_;
    CombFilter<1356 + offset> comb4_;
    CombFilter<1422 + offset> comb5_;
    CombFilter<1491 + offset> comb6_;
    CombFilter<1557 + offset> comb7_;
    CombFilter<1617 + offset> comb8_;
};

class ReverbCpp
{
public:
    ReverbCpp()
    {
        setParameters(0.4f,  // dry
                      0.33f, // wet
                      1.0f,  // width
                      0.5f,  // damping
                      0.9f); // room size
        reset();
    }

    void reset()
    {
        tankLeft_.reset();
        tankRight_.reset();
        dryRamp_.skipToTargetValue();
        wet1Ramp_.skipToTargetValue();
        wet2Ramp_.skipToTargetValue();
        dampingRamp_.skipToTargetValue();
        feedbackRamp_.skipToTargetValue();
    }

    void setParameters(const float dry, const float wet, const float width,
                       const float damping, const float roomSize)
    {
        dryRamp_.setTargetValue(dry * dry);
        wet1Ramp_.setTargetValue(0.5f * wet * wetScaleFactor_ * (1.0f + width));
        wet2Ramp_.setTargetValue(0.5f * wet * wetScaleFactor_ * (1.0f - width));
        dampingRamp_.setTargetValue(damping * damping);
        feedbackRamp_.setTargetValue(roomSize * roomScaleFactor_ + roomOffset_);
    }

    void process(float* input, float** outputs, uint32_t numSamples)
    {
        int i = 0;
        while (numSamples)
        {
            const auto dryParam = dryRamp_.getSmoothedValue();
            const auto wet1Param = wet1Ramp_.getSmoothedValue();
            const auto wet2Param = wet2Ramp_.getSmoothedValue();
            const auto dampingParam = dampingRamp_.getSmoothedValue();
            const auto feedbackParam = feedbackRamp_.getSmoothedValue();

            const float drySample = input[i];
            const float wetLSample = tankLeft_.process(drySample, feedbackParam, dampingParam);
            const float wetRSample = tankRight_.process(drySample, feedbackParam, dampingParam);

            const float outputL = dryParam * drySample + wet1Param * wetLSample + wet2Param * wetRSample;
            const float outputR = dryParam * drySample + wet1Param * wetRSample + wet2Param * wetLSample;

            outputs[0][i] = outputL;
            outputs[1][i] = outputR;
            i++;
            numSamples--;
        }
    }

private:
    // Various tuning factors for the reverb
    static constexpr float wetScaleFactor_ = 3.0f;
    static constexpr float dryScaleFactor_ = 2.0f;
    static constexpr float roomScaleFactor_ = 0.28f;
    static constexpr float roomOffset_ = 0.7f;
    static constexpr float dampScaleFactor_ = 0.4f;

    ReverbChannel<0> tankLeft_;
    ReverbChannel<23> tankRight_;
    ParameterRamp dryRamp_;
    ParameterRamp wet1Ramp_;
    ParameterRamp wet2Ramp_;
    ParameterRamp dampingRamp_;
    ParameterRamp feedbackRamp_;
};
```
sletz commented 4 years ago

Interesting: 1) how do you measure process duration? 2) are you sure the code is inlined (calls to filters process and so on)?

TheSlowGrowth commented 4 years ago

@sletz, thank you for the comment. 1) I use a test point that I set to high during the rendering and capture the waveform with an oscilloscope. 2) I had foolishly assumed that for these small functions, the compiler would inline them anyway. Turns out, it didn't, not even the WrappedInt::operator++(int). Adding the inline keyword didn't change anything. When I added inline __attribute__((always_inline)) it was finally inlined.
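
For reference, the forced inlining looked roughly like this (a sketch assuming GCC/Clang; FORCE_INLINE is just a local convenience macro, and the class is reduced to the relevant members):

```
// Hard inlining request for GCC/Clang - a plain "inline" was not enough here
#define FORCE_INLINE inline __attribute__ ((always_inline))

// Reduced sketch: the wrapping index with forced inlining applied to its members
template <typename IntType, IntType size>
class WrappedInt
{
public:
    FORCE_INLINE operator IntType() const { return value_; }

    FORCE_INLINE WrappedInt operator++ (int)
    {
        value_++;
        if (value_ >= size)
            value_ = 0;
        return *this;
    }

private:
    IntType value_ = 0;
};
```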

I forced inlining on CombFilter::process(), AllpassFilter::process() and on all member functions of WrappedInt. Now these are the render times:

No additional delay line length, internal SRAM: 13ms (what a difference!)
No additional delay line length, external SDRAM: 14.5ms
+4096 samples per delay line, internal SRAM: 13.1ms
+4096 samples per delay line, external SDRAM: 15.2ms
+65k samples per delay line, external SDRAM: 15.5ms (interestingly the additional overhead compared to internal SRAM is comparable to the non-inlined code)
Just for fun: +65k samples per delay line, external SDRAM, D-cache disabled: 93ms (good job, D-cache!)

EDIT: I'm not sure why gcc isn't inlining my function calls automatically. I use -O3 so it should prioritize speed over size and -finline-functions should already be enabled. Even if I manually specify -finline-functions -finline-limit=100000 -Winline it won't inline anything without me forcing it. Weird. I had hoped to increase the performance of generated SOUL code with these options.

cesaref commented 4 years ago

I had a quick look at the reverb code running on a small ARM (a Bela board running an A7 @ 1GHz), and I see quite a difference in performance if I strip out the parameter streaming logic. The code in that reverb example is a straight rewrite of the reverb in JUCE, which we used in order to work out what basic DSP performs like when moved across without being smart about it! Our tests show the SOUL JIT runs faster than the JUCE version on all platforms, but that doesn't address whether the original logic was sensible.

For example, I found that the example Reverb (stereo in, stereo out with parameter smoothing) took 17% CPU on the bela board. Stripping the code back by removing the smoothing and running a mono in/out comparison used 3.5%, so it would appear that 10% of the time is spent smoothing parameters rather than actually running reverb. I suggest you try something similar and see if this massively reduces the overhead - it'll give you an idea where to look.

The mono reverb I tested can be found here:

https://soul.dev/lab?id=50586f66116bcd13fcd427c3185d83d3

cesaref commented 4 years ago

I'll close this for now. If you have further questions, it's best to contact us through our Slack channel to discuss them.