ArrayMath.MultiplyAdd on Unity Android

danslobodan commented 4 months ago

Hi and thanks for making this great tool!

I'm using meltysynth to do some real-time rendering of audio in Unity, through the OnAudioFilterRead callback function.

When I compiled the project for Android with IL2CPP and ran it on multiple devices, the audio stuttered horribly.

Upon running some profiling sessions I found that almost all of the CPU load was actually from the Garbage Collector, specifically in this function:

        public static void MultiplyAdd(float a, float[] x, float[] destination)
        {
            var vx = MemoryMarshal.Cast<float, Vector<float>>(x);
            var vd = MemoryMarshal.Cast<float, Vector<float>>(destination);

            var count = 0;

            for (var i = 0; i < vd.Length; i++)
            {
                vd[i] += a * vx[i];
                count += Vector<float>.Count;
            }

            for (var i = count; i < destination.Length; i++)
            {
                destination[i] += a * x[i];
            }
        }

Perhaps the vectorized version is a premature optimization, since changing the code to:

        public static void MultiplyAdd(float a, float[] x, float[] destination)
        {
            for (int i = 0; i < x.Length; i++)
            {
                destination[i] += a * x[i];
            }
        }

seems to have eliminated the load from this function entirely.

Perhaps this is not relevant to your intended use case, but I'd still like to let you know.

sinshu commented 4 months ago

@danslobodan Thanks for the info 😊 I don't know much about Unity, but this should be helpful to Unity users.

danslobodan commented 4 months ago

It seems to create significant overhead even on the PC. Would you perhaps consider sharing your intent with the MemoryMarshal implementation - perhaps we could come up with an alternative solution? I'd be happy to contribute.

sinshu commented 4 months ago

The MemoryMarshal implementation is definitely faster in my environment. Does this issue occur exclusively with IL2CPP? If so, adding an #ifdef to revert to a simple for loop could be a solution.

nickgal commented 4 months ago

This was previously mentioned in https://github.com/homy-game-studio/hgs-unity-tone/issues/4 as well.

I use meltysynth with Unity/IL2CPP with the following alteration:

public static void MultiplyAdd(float a, float[] x, float[] destination)
{
    // use older implementation with il2cpp
    // See also: https://github.com/homy-game-studio/hgs-unity-tone/issues/4
#if ENABLE_IL2CPP
    for (var i = 0; i < destination.Length; i++)
    {
        destination[i] += a * x[i];
    }
    return;
#endif

    var vx = MemoryMarshal.Cast<float, Vector<float>>(x);
    ...
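
For readers who want the whole function in one piece, here is a minimal sketch that combines the snippet above with the vectorized body quoted in the first post. The class name `ArrayMathSketch` is hypothetical, and the `#if`/`#else` layout (instead of an early `return`) is a choice made here so that no unreachable code is left behind the IL2CPP branch. It has not been verified on Unity.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

// Hypothetical wrapper class, not the actual meltysynth source file.
public static class ArrayMathSketch
{
    // Adds a * x[i] to destination[i] for every element of destination.
    public static void MultiplyAdd(float a, float[] x, float[] destination)
    {
#if ENABLE_IL2CPP
        // Plain scalar loop for IL2CPP builds.
        for (var i = 0; i < destination.Length; i++)
        {
            destination[i] += a * x[i];
        }
#else
        // Reinterpret both arrays as spans of Vector<float>.
        var vx = MemoryMarshal.Cast<float, Vector<float>>(x);
        var vd = MemoryMarshal.Cast<float, Vector<float>>(destination);

        var count = 0;

        for (var i = 0; i < vd.Length; i++)
        {
            vd[i] += a * vx[i];
            count += Vector<float>.Count;
        }

        // Scalar tail for the elements that do not fill a whole vector.
        for (var i = count; i < destination.Length; i++)
        {
            destination[i] += a * x[i];
        }
#endif
    }
}
```

Both branches assume `x` is at least as long as `destination`, as the original code does.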

sinshu commented 4 months ago

@nickgal Thanks for the info 😊

@danslobodan If you're curious about the speed of the MemoryMarshal.Cast implementation, I have a benchmark result from when I updated ArrayMath to the current code. The new version speeds up the rendering process by 10% without increasing GC overhead. https://github.com/sinshu/meltysynth-benchmark/commit/2d603e7dd30acd4ea95f34a8c8d4f2b0b6ec9278

danslobodan commented 4 months ago

@sinshu

On the PC, when running the game inside the Unity Editor, my results show

MemoryMarshal with Vector: Total Audio CPU: 28.0%
Simple for loop: Total Audio CPU: 18.8%

So it's a really significant difference. It's much, much worse on Android, where it pretty much doesn't work at all.

What I saw from the call stack is that nearly all of the load comes down to the Garbage Collector. I can't say I understand what's happening behind the scenes, but something is apparently being allocated and collected, despite this seemingly being an allocation-free operation.

We could dive deeper if you'd like.

Note that these are the results in Unity, on PC (not using IL2CPP) and Android (using IL2CPP). When not using IL2CPP on Android it actually runs better, but it still struggles.
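
One way to check the allocation question outside the Unity profiler is to count the bytes allocated on the calling thread around repeated calls. The following is a minimal sketch under stated assumptions: `AllocationProbe`, `MeasureAllocatedBytes`, and `MultiplyAddLoop` are hypothetical names, and `GC.GetAllocatedBytesForCurrentThread` is a Microsoft .NET API whose availability and behavior under Unity's Mono or IL2CPP have not been checked here.

```csharp
using System;

// Hypothetical helper, not part of meltysynth or Unity.
public static class AllocationProbe
{
    // Returns the number of bytes allocated on the current thread
    // while running the action the given number of times.
    public static long MeasureAllocatedBytes(Action action, int iterations)
    {
        // Warm-up call so one-time setup cost is not counted.
        action();

        var before = GC.GetAllocatedBytesForCurrentThread();

        for (var i = 0; i < iterations; i++)
        {
            action();
        }

        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    // The simple for loop from the first post, used as a baseline.
    public static void MultiplyAddLoop(float a, float[] x, float[] destination)
    {
        for (var i = 0; i < x.Length; i++)
        {
            destination[i] += a * x[i];
        }
    }
}
```

Passing `() => AllocationProbe.MultiplyAddLoop(0.5f, x, destination)` and then the equivalent call to the vectorized version gives two numbers to compare. A clearly larger value for one of them would point at hidden allocations in that path on the runtime being tested.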

danslobodan commented 4 months ago

Here's the call stack:

Vector: [image: vector]
For loop: [image: for_loop]

So you can see that the load from the MultiplyAdd function almost vanishes when using the for loop, while with the vector version it accounts for most of the load.

Edit: Note that this is on PC, not Android. On Android the difference is much more drastic than this.

sinshu commented 4 months ago

Thanks for the detailed info 😊

To summarize the current situation:

The MemoryMarshal.Cast implementation is slow in Unity (even slower in IL2CPP).
The MemoryMarshal.Cast implementation is fast in MS's .NET runtime.

What I want:

I want to keep the MemoryMarshal.Cast implementation, as it is faster for my use case.

I've done a bit of research on Unity. I found that there are several ways to add libraries to Unity, not only by copying source code into the project but also by adding compiled DLLs or directly using NuGet packages (right?). This means that a simple #if code switch at compile time might not be suitable.
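
If a compile-time switch is not an option because the library ships as a prebuilt DLL, one alternative is to choose the code path at run time. The sketch below is only an illustration: `ArrayMathRuntimeSwitch` and `UseVectorPath` are hypothetical names, and defaulting the flag to `Vector.IsHardwareAccelerated` rests on the unverified assumption that this property reports `false` on the runtimes where `Vector<float>` is slow.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

// Hypothetical class, not part of meltysynth.
public static class ArrayMathRuntimeSwitch
{
    // Hypothetical setting. The host application can override it,
    // for example by setting it to false before starting audio in Unity.
    public static bool UseVectorPath = Vector.IsHardwareAccelerated;

    public static void MultiplyAdd(float a, float[] x, float[] destination)
    {
        // Index of the first element not handled by the vector path.
        var start = 0;

        if (UseVectorPath)
        {
            var vx = MemoryMarshal.Cast<float, Vector<float>>(x);
            var vd = MemoryMarshal.Cast<float, Vector<float>>(destination);

            for (var i = 0; i < vd.Length; i++)
            {
                vd[i] += a * vx[i];
            }

            start = vd.Length * Vector<float>.Count;
        }

        // Scalar loop: handles the tail, or everything if the vector path is off.
        for (var i = start; i < destination.Length; i++)
        {
            destination[i] += a * x[i];
        }
    }
}
```

The trade-off is one extra branch per call, in exchange for a single binary that can serve both kinds of runtime.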

The problem here is that I don't have the ability to handle Unity's processing system. Since I cannot verify that code changes intended for Unity actually work, I should not add such changes to this repository.

For example, how about creating a fork optimized for Unity? I'm thinking of putting a link to that fork in a prominent position in the README for when Unity users discover this repository.

What do you think?

danslobodan commented 4 months ago

Do you think this change is a valid enough reason to fork? I'm not really experienced in the ways of open source, so I'd rather go with your opinion on the matter.

On the other hand, I grant that it's probably going to be hard to make an implementation that works well on both .NET and Unity.

sinshu / meltysynth

ArrayMath.MultiplyAdd on Unity Android #46