Hi, thanks for your good work, and congratulations. I'm trying to figure out whether it would be possible to run vocals/music separation in real time, with acceptable latency, on an ARM microcontroller. From what I read in your description, we need 300 MB of RAM to use the algorithm? Does this include the audio buffers that capture incoming audio while a previous block of data is being processed, or are those 300 MB only for Spleeter? Also, is that 300 Mbits or 300 Mbytes? Thanks, Jerome
Hello Jerome
I'm sorry to say "real time" isn't possible on such hardware. A true real-time implementation would require treating the current frame as the last frame of the CNN input, while treating the recent history frames as frames 1 : last - 1, which means running the CNN for every frame... That way the latency gets down to 23.2 ms (one 1024-sample hop at 44.1 kHz). I tried it on x86 and it works, the latency criterion is satisfied, but the amount of computation is too expensive. Last time I checked, the top-of-the-line TI DSP gets us to 23.2 ms using the above framing scheme, but forget about it, those chips are expensive.
It depends on which ARM microcontroller and what the application is. Consider an Android phone running on an ARM CPU: such devices are able to run SpleeterRT online, though not in real time; at least it would be no problem to play back audio and separate it at the same time. The user will still think we are running in real time, because playback audio is usually heavily pre-buffered.
"Standard" setting roughly cost 300 Mb, reduced setting like T = 64 or T = 128 would reduce to 150 Mb in total.
If memory is a concern, more work is needed: we would have to convert the CNN weights to 8 bit (unsigned char). I'm pretty confident the network won't degrade. There are instructions to accelerate FMA for unsigned char, and a more specialized matrix multiplication library would be appropriate.
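For illustration, a minimal sketch of such an 8-bit conversion (per-layer affine quantization; the helper below is hypothetical, not part of SpleeterRT):

#include <float.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: quantize one layer's float weights to unsigned char.
   Inside GEMM, dequantize on the fly as w ~= zero + scale * q. */
static void quantizeWeights(const float *w, size_t n,
                            unsigned char *q, float *scale, float *zero)
{
	float mn = FLT_MAX, mx = -FLT_MAX;
	for (size_t i = 0; i < n; i++)
	{
		if (w[i] < mn) mn = w[i];
		if (w[i] > mx) mx = w[i];
	}
	*scale = (mx - mn) / 255.0f;   /* one step of the 8-bit grid */
	if (*scale == 0.0f)
		*scale = 1.0f;             /* guard against a constant-weight layer */
	*zero = mn;                    /* value represented by q = 0 */
	for (size_t i = 0; i < n; i++)
		q[i] = (unsigned char)lroundf((w[i] - mn) / *scale);
}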
Thanks for your detailed answer James. The app I have in mind would be a microcontroller with a stereo audio input and 2 stereo separated outputs, one for music, one for voice. Audio quality could be limited to 16-bit / 44.1 kHz; 24-bit / 96 kHz would be ideal, but not required to start.
What I mean by "realtime" is the ability to output the separated voice/music while the input is playing, with no need to record the input to an intermediate file, process it, and then output it. Latency would not be a big issue; I would say a latency of 1 second would still be acceptable.
Sampling at 16-bit / 44100 Hz / stereo requires 88200 audio samples per second, 16 bits each. Those incoming samples could be split into 512-byte chunks (or another size) that are processed and then output while the next 512 bytes are being sampled, using a double-buffering system.
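A minimal ping-pong capture sketch of that scheme (the chunk size is taken from above; the processChunk() hook and buffer names are hypothetical, not SpleeterRT API):

#include <stdint.h>

#define CHUNK_BYTES 512                       /* one chunk, as described above */
static int16_t pingPong[2][CHUNK_BYTES / 2];  /* two 256-sample buffers */
static volatile int filling = 0;              /* buffer the I2S DMA writes into */

extern void processChunk(const int16_t *samples); /* hypothetical DSP hook */

/* Called from the I2S/DMA "buffer full" interrupt: swap buffers so capture
   continues into one buffer while the other is being processed. */
void i2sChunkDone(void)
{
	int full = filling;
	filling ^= 1;
	processChunk(pingPong[full]);
}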
But I have no idea about the number of instructions/cycles and the buffer size required by SpleeterRT to process a chunk of 512 bytes; maybe you have an idea about this?
Again, thanks for your time. Jerome
The number of cycles is highly dependent on the GEMM library.
If you implement your own, you will find that calculating the big-O complexity is relatively simple: GEMM is 3 nested for loops, and each CNN layer has 1 GEMM, so multiply the trip counts of the 3 loops for each layer and then sum over the layers.
You then quickly realize SpleeterRT is mainly doing GEMM, and all the buffering and FFT are negligible parts.
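A sketch of that counting argument (the layer shapes are placeholders, not the real Spleeter dimensions):

/* Each CNN layer lowers to one GEMM: an M x K by K x N multiply,
   i.e. the 3 nested loops do roughly 2*M*N*K floating-point operations. */
typedef struct { long M, N, K; } GemmShape;

static double totalFlops(const GemmShape *layers, int numLayers)
{
	double flops = 0.0;
	for (int l = 0; l < numLayers; l++)
		flops += 2.0 * (double)layers[l].M * layers[l].N * layers[l].K;
	return flops; /* divide by the CPU's sustained FLOP/s for a runtime estimate */
}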
For the double-buffering scheme I implemented in the VST plugin, well, you probably need at least 2 cores.
Processing a frame doesn't mean we must trigger the CNN (GEMM), because we only run the CNN once we have collected enough frames; that's why we can double-buffer the spectrogram and run the CNN in the background.
If you have a dual-core CPU, you can run the main program on core 1, which also does the buffering and FFT, while core 2 is busy running the CNN.
Of course, more than 2 cores gains you more flexibility: for example, you could parallelize the FFTs of the separated sources, or use the extra cores to parallelize the GEMM.
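A minimal pthread sketch of that core split (buffer dimensions and the worker loop are illustrative; SpleeterRT's actual task system is more elaborate):

#include <pthread.h>

/* Two spectrogram blocks: the audio thread fills one while the CNN thread
   consumes the other (dimensions are illustrative). */
static float bufA[64][1024], bufB[64][1024];
static float (*fill)[1024] = bufA, (*work)[1024] = bufB;
static int ready = 0;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

static void *cnnWorker(void *arg) /* core 2: the expensive GEMM work */
{
	(void)arg;
	for (;;)
	{
		pthread_mutex_lock(&mtx);
		while (!ready)
			pthread_cond_wait(&cv, &mtx);
		ready = 0;
		pthread_mutex_unlock(&mtx);
		/* ...run the CNN on 'work' here... */
	}
	return NULL;
}

/* Core 1 (audio thread): once enough frames are collected in 'fill',
   swap buffers and wake the worker, then continue buffering/FFT. */
static void publishBlock(void)
{
	pthread_mutex_lock(&mtx);
	float (*t)[1024] = fill; fill = work; work = t;
	ready = 1;
	pthread_cond_signal(&cv);
	pthread_mutex_unlock(&mtx);
}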
Thanks for your answers James.
I'll try to understand how you handle processing one block of incoming data in your VST example. What I'd like to achieve is simpler: I just want to remove vocals from an audio source in real time; I do not need to extract drums and the other components.
I must say I'm a bit lost for now. Looking into your code, I guess I have to simplify Spleeter4StemsProcessSamples() so it only removes the vocals, and implement my own double-buffering system to process data arriving from my I2S codec in real time.
James, do you have some guidelines for simplifying Spleeter4StemsProcessSamples() to only extract the music? The ideal would be to have some #define in your VST source code, and some #ifdef.
...
..
This would definitely help simplify the whole process.
Sorry for the late reply, I've been quite busy.
If you just need to separate the vocals, then you must modify Spleeter4StemsInit() and LLPAMSProcessNPR(Spleeter4Stems *msr).
There we wake up 4 threads to do the stem processing. If you just need vocal and accompaniment, you can run 1 network alone: take either the vocal or the accompaniment network from the 4-stems model and load it.
If you take the vocal network coefficients, you can directly get the vocal mask function from the CNN output. After that, perform 2 IFFTs (left and right) to get back the separated vocal; don't forget to multiply the original spectrum by the mask function, as I do in LLPAMSProcessNPR(Spleeter4Stems *msr).
You can also compute original left & right FFT spectrum - vocal left & right spectrum to get the accompaniment, then IFFT it the same way as the separated vocal. This saves you half the storage requirement and separates vocal and accompaniment at the same time.
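Per bin, that subtraction amounts to the following (a minimal sketch with illustrative buffer names, one channel per call; SpleeterRT's real layout is the interleaved complexSpectrogram discussed later in this thread):

/* Vocal spectrum = original spectrum * vocal mask;
   accompaniment  = original spectrum - vocal spectrum. */
static void subtractVocal(const float *origRe, const float *origIm,
                          const float *vocalMask, int numBins,
                          float *vocRe, float *vocIm,
                          float *accRe, float *accIm)
{
	for (int i = 0; i < numBins; i++)
	{
		vocRe[i] = origRe[i] * vocalMask[i];
		vocIm[i] = origIm[i] * vocalMask[i];
		accRe[i] = origRe[i] - vocRe[i]; /* same as origRe[i] * (1 - vocalMask[i]) */
		accIm[i] = origIm[i] - vocIm[i];
	}
	/* Then IFFT each of the two sources (left and right) back to the time domain. */
}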
No problem James !
I'm still trying to understand how to simplify the code to remove vocals only. From what I understand so far:
spleeter nn[4]; // OK, this is to load the coefficients corresponding to the 4 stems
float *maskPtr[2][4]; // Here I guess the 2 is left and right channels; what about the 4? Is this the number of stems?
float *complexSpectrogram[2][4]; // Here I guess the 2 is left and right channels; what about the 4? Is this the number of stems?
Those hardcoded numbers used as array lengths are confusing to me (OK, I'm a newbie in the CNN world), but understanding how to modify them would help me simplify the code to only remove the vocals.
Thanks again James
"#define COMPONENTS 8 // Is this number of mono channels to separate ? (2 for vocals, 2 for drums, etc) for a total of 4 CNN ?" Yes. "#define TASK_NB 5 // Is this 1 task + 4 tasks (one for each stem or one for each ANALYSIS_OVERLAP) ?" Yes
spleeter nn[4]; // Ok this is to load coefs corresponding to the 4 stems No. You just need to load 1 stem if want to separate both vocal and accompaniment at the same time.
"float *maskPtr[2][4]; // Here I guess the 2 is Left and Right channels, what about the 4 ? is this for the number of stems ?" Yes.
"float *complexSpectrogram[2][4]; // Here I guess the 2 is Left and Right channels, what about the 4 ? is this for the number of stems" No. [0] Left real. [1] Left imaginary. [2] Right real. [3] Right imaginary. See. https://github.com/james34602/SpleeterRT/blob/9bba26305e12c34dece31fd6cc70d029126151e7/VST/Source/Spleeter4Stems.c#L345
Thanks for your answers James, things become clearer 👍
So in order to implement vocal removal only, I should modify as follows:

#define NB_STEMS 1
#define COMPONENTS (2 * NB_STEMS)
#define TASK_NB (1 + NB_STEMS)

spleeter nn[NB_STEMS];
void *coeffProvPtr[NB_STEMS];
float *maskPtr[2][NB_STEMS];
float *complexSpectrogram[2][4]; // This one remains as it was.

void Spleeter4StemsInit(Spleeter4Stems *msr, int initSpectralBinLimit, int initTimeStep, void *coeffProvider[NB_STEMS]);

I load vocal4stems.dat into coeffProvPtr[0] for a single stem.
About the tasks,
task_start(msr, 1);
task_start(msr, 2);
task_start(msr, 3);
task_start(msr, 4);
Should I replace these with the conditional defines below?
#if TASK_NB > 1
task_start(msr, 1);
#endif
#if TASK_NB > 2
task_start(msr, 2);
#endif
#if TASK_NB > 3
task_start(msr, 3);
#endif
#if TASK_NB > 4
task_start(msr, 4);
#endif
More generally, it would be nice to have some conditional defines in your code so that your source could become more generic.
Yes, I think conditional defines are good, but things get a little complicated when it comes to efficiency and the number of stems: if you need 4 sources, you may need to run just 3 stems and get the remaining source by spectral subtraction, which will save you tons of computation. That's why a few simple C macros alone are not enough.
BTW, I successfully tested your code as a Mac VST plugin using JUCE, and it works very well; congratulations on your really good work. The only problem is the latency before the separation starts: it takes about 16 s for 4 stems, running on a MacBook Pro with an i9 processor and the Intel MKL library installed (I was expecting less according to your readme file?).
I did not tweak FFT_SIZE for now, but the result is very nice.
"256" here is latency parameters.
It control how many frames we collect until perform CNN.
It must be power of 2(It' doesn't alway have to, but please don't try non power of 2 initTimeStep, unless you are research well on Spleeter neural network) and should be always >= 64.
Smaller the initTimeStep, lower the latency and usually reduce separation quality, since all CNN can "see" are initTimeStep frames of spectrogram. At some point when you try to increase initTimeStep, the quality improvement will stop.
FFTSIZE shouldn't be tweaked, unless you have train a new Spleeter model with modified FFT size. The frequency-related parameter that is initSpectralBinLimit, which tell you how many FFT bins we want to feed into CNN, initSpectralBinLimit=1024 & FFTSIZE=4096 mean that CNN will gonna analyse spectrogram from 0 - 10000Hz(Exclude DC bin).
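As a worked example of that bin math (assuming the 44.1 kHz sample rate Spleeter operates at; the init call is the one whose signature was quoted earlier in the thread):

/* The CNN input covers initSpectralBinLimit bins of an FFTSIZE-point FFT,
   i.e. a bandwidth of initSpectralBinLimit * sampleRate / FFTSIZE Hz. */
float bandwidthHz = 1024.0f * 44100.0f / 4096.0f; /* = 11025 Hz, ~0-11 kHz */

/* So the "256" is the initTimeStep argument, e.g. (illustrative call):
   Spleeter4StemsInit(&msr, 1024, 256, coeffProvPtr); */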
Thanks again for the explanations.
If I want to only remove the vocals, I understand that spectral subtraction of the vocals from the original audio is the simplest approach. Can you point out where this should be done in your VST plugin?
In your Executable source, I can see, for 2 stems:
size_t spectralframeCount = stft(st, splittedBuffer[0], splittedBuffer[1], finalSize, &reL, &imL, &reR, &imR);
processMT(framesThreading, analyseBinLimit, timeStep, spectralframeCount, coeffProvPtr1, unaffectedWeight, reL, imL, reR, imR, 0);
size_t outLen = istft(st, reL, imL, reR, imR, spectralframeCount, &out1L, &out1R);
Then the subtraction on the time-domain representation:
for (size_t i = 0; i < outLen; i++)
{
	out2L[i] = splittedBuffer[0][i] - out1L[i];
	out2R[i] = splittedBuffer[1][i] - out1R[i];
}
But in your VST source, it does not seem to be done like this, because the 4 stems are calculated individually.
Maybe I should start from your Executable folder (rather than the VST), but it is not intended for real-time processing of input buffers.
Any help would be greatly appreciated. Thanks for your time.
The VST version is kind of old; not really old, but it is the original design, looking more like the original Spleeter, which separates 4 sources by running 4 networks.
The "offline" version is relatively new; there I use the spectral subtraction technique to make SpleeterRT much faster.
By the "offline" version, do you mean the code under the "Executable" directory?
This code separates audio from a file input. Should I adapt this whole-file processing to real time by processing chunks of audio data and adjusting totalPCMFrameCount according to the amount of incoming data in each chunk?
"What you mean by "offline" version is the code under the "Executable" directory ?" Yes.
You should take just the spectral subtraction idea to real time version.
Use Offline STFT on real time create block artifacts because it breaks the overlap.
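To illustrate what "breaks the overlap" means, here is a minimal overlap-add sketch where the synthesis tail must persist across calls (the 1024/4096 hop and window sizes are assumed from the parameters above; names are illustrative):

#define HOP 1024
#define WND 4096
#define TAIL (WND - HOP)

/* Persistent across calls: running the offline STFT chunk by chunk
   resets this state at every chunk boundary, hence the block artifacts. */
static float tail[TAIL];

/* Overlap-add one synthesis frame (WND windowed samples), emit HOP samples. */
static void overlapAdd(const float *frame, float *outHop)
{
	for (int i = 0; i < HOP; i++)
		outHop[i] = frame[i] + tail[i];
	for (int i = 0; i < TAIL; i++)
		tail[i] = (i + HOP < TAIL ? tail[i + HOP] : 0.0f) + frame[i + HOP];
}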
Hi James, I successfully simplified your VST source to a single stem; loading the vocal model results in only the voice being extracted from the audio input.
I would now like to perform spectral subtraction, to get everything except the voice. I think this subtraction should be done in LLPAMSProcessNPR() in the VST source code.
Can you point out what should be subtracted, and where?
Or maybe subtraction in the time domain is also possible?
Thanks.
James, I spent more time trying to understand where the spectral subtraction should be done. From what I understand so far:
Am I far from what should be done? Sorry for my many questions; I'm not very skilled with spectrograms and am still trying to understand your code. Thanks for your time again. Jerome
Wait a moment. I should have done spectral subtraction for SpleeterRT a long time ago. I can provide some code for that.
Doing it in the time domain is possible; however, you would need to cache tons of time-domain history, which can be memory-unfriendly I guess.
I understood your spectral subtraction from your offline version (I can see the FFT of the input, the processing, then the inverse FFT), but it is difficult to apply to the VST version, as it is not so clear in the code.
"Doing it in the time domain is possible; however, you would need to cache tons of time-domain history, which can be memory-unfriendly I guess."
Yes, I tried, but it requires too much buffering.
"Wait a moment. I should have done spectral subtraction for SpleeterRT a long time ago. I can provide some code for that."
That would be great :-)
int symIdx;
unsigned int bitRevFwd, bitRevSym;
float mask1L, mask1R, mask2L, mask2R;
msr->timeDomainOut[0][0] = 0.0f;
msr->timeDomainOut[1][0] = 0.0f;
msr->timeDomainOut[2][0] = msr->complexSpectrogram[msr->outputFramePtr][0][HALFWNDLEN * msr->nnMaskCursor];
msr->timeDomainOut[3][0] = msr->complexSpectrogram[msr->outputFramePtr][2][HALFWNDLEN * msr->nnMaskCursor];
for (i = 1; i < HALFWNDLEN; i++)
{
	symIdx = FFTSIZE - i;
	bitRevFwd = msr->mBitRev[i];
	bitRevSym = msr->mBitRev[symIdx];
	mask1L = 0.2f, mask1R = 0.2f;
	if (i < msr->analyseBinLimit)
	{
		mask1L = msr->maskPtr[msr->outputFramePtr][0 * (msr->analyseBinLimit * msr->timeStep) + msr->analyseBinLimit * msr->nnMaskCursor + i];
		mask1R = msr->maskPtr[msr->outputFramePtr][1 * (msr->analyseBinLimit * msr->timeStep) + msr->analyseBinLimit * msr->nnMaskCursor + i];
	}
	mask2L = 1.0f - mask1L;
	mask2R = 1.0f - mask1R;
	if (mask2L < 0.0f)
	{
		mask1L -= mask2L;
		mask2L = 0.0f;
	}
	if (mask2R < 0.0f)
	{
		mask1R -= mask2R;
		mask2R = 0.0f;
	}
	float S1leftMaskedReal = msr->complexSpectrogram[msr->outputFramePtr][0][HALFWNDLEN * msr->nnMaskCursor + i] * mask1L;
	float S1leftMaskedImag = msr->complexSpectrogram[msr->outputFramePtr][1][HALFWNDLEN * msr->nnMaskCursor + i] * mask1L;
	float S1rightMaskedReal = msr->complexSpectrogram[msr->outputFramePtr][2][HALFWNDLEN * msr->nnMaskCursor + i] * mask1R;
	float S1rightMaskedImag = msr->complexSpectrogram[msr->outputFramePtr][3][HALFWNDLEN * msr->nnMaskCursor + i] * mask1R;
	float S2leftMaskedReal = msr->complexSpectrogram[msr->outputFramePtr][0][HALFWNDLEN * msr->nnMaskCursor + i] * mask2L;
	float S2leftMaskedImag = msr->complexSpectrogram[msr->outputFramePtr][1][HALFWNDLEN * msr->nnMaskCursor + i] * mask2L;
	float S2rightMaskedReal = msr->complexSpectrogram[msr->outputFramePtr][2][HALFWNDLEN * msr->nnMaskCursor + i] * mask2R;
	float S2rightMaskedImag = msr->complexSpectrogram[msr->outputFramePtr][3][HALFWNDLEN * msr->nnMaskCursor + i] * mask2R;
	msr->timeDomainOut[0][bitRevFwd] = S1leftMaskedReal + S1leftMaskedImag;
	msr->timeDomainOut[0][bitRevSym] = S1leftMaskedReal - S1leftMaskedImag;
	msr->timeDomainOut[1][bitRevFwd] = S1rightMaskedReal + S1rightMaskedImag;
	msr->timeDomainOut[1][bitRevSym] = S1rightMaskedReal - S1rightMaskedImag;
	msr->timeDomainOut[2][bitRevFwd] = S2leftMaskedReal + S2leftMaskedImag;
	msr->timeDomainOut[2][bitRevSym] = S2leftMaskedReal - S2leftMaskedImag;
	msr->timeDomainOut[3][bitRevFwd] = S2rightMaskedReal + S2rightMaskedImag;
	msr->timeDomainOut[3][bitRevSym] = S2rightMaskedReal - S2rightMaskedImag;
}
You can separate both sources using spectral subtraction on the mask function. This way we don't need much extra memory, because the mask functions are required no matter how many sources you want.
Fantastic! I guess this is the code to merge into LLPAMSProcessNPR(Spleeter4Stems *msr). Let me try to merge it. Thanks
It works!!!!
I just had to replace
mask1L = msr->maskPtr[msr->outputFramePtr][0 * (msr->analyseBinLimit * msr->timeStep) + msr->analyseBinLimit * msr->nnMaskCursor + i];
with
mask1L = msr->maskPtr[msr->outputFramePtr][0][0 * (msr->analyseBinLimit * msr->timeStep) + msr->analyseBinLimit * msr->nnMaskCursor + i];
And the same for mask1R.
Thanks James!!!
"I'm sorry to say "real time" isn't possible on such hardware. [...] Last time I checked, the top-of-the-line TI DSP gets us to 23.2 ms using the above framing scheme, but forget about it, those chips are expensive."
I was wondering which TI DSP you succeeded in running the code on, and what the price of such a DSP was? When you mention 23.2 ms, do you mean that is the time required to process one block of 512 bytes @ 44.1 kHz while running on the DSP?
The top-notch one that is supposed to run SAR radar.
I can't show you the code for getting the true sample output latency down to around 35-70 ms, and it doesn't matter, because no known low-power implementation is possible. The code modification needed is also pretty huge: rewriting the whole LLPAMSProcessNPR() is required. I actually did that, and the experiment showed it works; it really outputs samples in true real time.
Low power is important: if the system is real time but consumes an enormous amount of power, then it's a military project, not civilian-friendly.
What I meant by "realtime" is what you have done in your VST source code: the possibility to process the separation without having to save a file, process it, and save the results.
For me, several seconds of latency are acceptable, and when I run your VST code on my Mac it is real time (for me). Would that be possible on a DSP, a high-end ARM, any other CPU, or even a Raspberry Pi 4, with a latency of several seconds, just like your VST plugin running on my Mac?
Last time I checked, we could run the plugin in "real time" on my Samsung Galaxy S5 (Exynos version).
I use the Hiby player as a DSP test framework; if you need a demo, go to Google Play, download the Hiby music player, and contact me on Telegram (@James34602).
So if this works on a Galaxy S5, I think it should work on a Raspberry Pi 4 with OpenBLAS as the GEMM; OpenBLAS is available for ARM and many other CPUs.
James, can you confirm it is worth testing this VST plugin source code on a Raspberry Pi 4 with OpenBLAS if we accept that ~10 s latency, or do you really think there is no way to make this work?
Again, I'm talking about the VST source modified with spectral subtraction on a single stem, and we accept a latency of multiple seconds.
Multiple seconds of latency, yes.
OpenBLAS is exactly what I'm using; I'm confident it will work if you use OpenBLAS.
Plain GEMM (no library) on a Samsung Galaxy S5 works just fine; I think the Samsung CPU I tested is faster than the one in the Raspberry Pi 4.
Great! I have already built my standalone app in a Raspberry Pi virtual machine on my Mac, using OpenBLAS, and it works. I now need to find a Pi 4 board and compile directly on it. I'll let you know if this works. Thanks for your great support, and sorry for my many earlier questions.