sdatkinson / NeuralAmpModelerCore

Core DSP library for NAM plugins
MIT License

25% performance improvement in large WaveNet models. #126

Open rerdavies opened 3 weeks ago

rerdavies commented 3 weeks ago

I'd like to offer you a piece of work that I have recently been doing for the ToobAmp project. The net benefit: a 25% performance improvement for the most commonly used WaveNet model on ToneHunt at present.

There are a bunch of reasons why you might not want to take this commit (not the least of which is maintainability). And if you take it, I'd like to do some serious cleanup before I push the changes to you. I would completely understand if you don't want the code. At the very least, I feel like I have provided you with a useful reference point for the advantages and disadvantages of avoiding use of MatrixXf's (+25% performance, significant maintenance burden).

The basic premise: if WaveNet can use fixed-size matrices (Eigen::Matrix<float, M, N>) instead of MatrixXf matrices, then the compiler can do a much better job of applying SIMD vectorization to the Eigen matrix operations used by the WaveNet model -- principally by not having to generate code branches to deal with non-multiple-of-four matrices, and by unrolling loops for 8-row and 16-row matrix code as appropriate. (Pretty much all of WaveNet's matrices have 8 or 16 rows, with one 32-row exception.)
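
To illustrate the point (this is a generic Eigen example, not code from my branch): with fixed-size types the dimensions are compile-time constants, so Eigen can emit fully unrolled, branch-free SIMD code; with MatrixXf the dimensions are runtime values and the generated code has to handle arbitrary sizes.

    #include <Eigen/Dense>

    // Dynamic sizes: dimensions are runtime values, so Eigen emits generic loops
    // with remainder handling for sizes that aren't a multiple of the SIMD width.
    Eigen::VectorXf dynamicProduct(const Eigen::MatrixXf& w, const Eigen::VectorXf& x)
    {
      return w * x;
    }

    // Fixed sizes: a 16x8 weight matrix times an 8x1 vector, all known at compile
    // time, so the product can be fully unrolled and vectorized with no branches.
    Eigen::Matrix<float, 16, 1> fixedProduct(const Eigen::Matrix<float, 16, 8>& w,
                                             const Eigen::Matrix<float, 8, 1>& x)
    {
      return w * x;
    }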

The code uses templates extensively. This allows fixed-buffer-size implementations of a variety of different model configurations. If none of the available templated implementations matches the requested model, then the old code path is used to generate the model.
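
As a rough sketch of what the templated classes look like (class and member names here are hypothetical, not the ones in my branch), a layer carries its buffers as fixed-size Eigen types whose dimensions come from template parameters:

    #include <Eigen/Dense>

    // Hypothetical templated layer: CHANNELS is a compile-time constant, so every
    // member matrix has a fixed size and every product below is fully vectorizable.
    template <int CHANNELS>
    class Layer_T
    {
    public:
      Eigen::Matrix<float, CHANNELS, 1> Process(const Eigen::Matrix<float, CHANNELS, 1>& input) const
      {
        return (mWeights * input + mBias).array().tanh().matrix();
      }

    private:
      Eigen::Matrix<float, CHANNELS, CHANNELS> mWeights;
      Eigen::Matrix<float, CHANNELS, 1> mBias;
    };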

The pluses:

Sounds complicated, but I think you'll catch on to what's happening once you look at the code. You can take a look at it here.

Diverting from the default DSP code path for WaveNet is handled with the following code:

    wavenet::WaveNetFactory_T<8, 16, 3> wavenet_factory; // CHANNELS=8, HEADS=16, ...
    if (wavenet_factory.matches(layer_array_params))
    {
      try
      {
        bool noBufferFlipRequired =
          (minBlockSize == maxBlockSize && minBlockSize != -1
           && minBlockSize >= 32 && isPowerOfTwo(maxBlockSize));
        auto result = wavenet_factory.create(layer_array_params, head_scale, with_head,
                                             weights, expectedSampleRate, noBufferFlipRequired);
        if (haveLoudness)
        {
          result->SetLoudness(loudness);
        }
        // &c.
      }
      catch (...)
      {
        // error handling elided in this excerpt
      }
    }
    else
    {
      // ... create a dynamic wavenet model using current code ...
    }
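
For reference, the isPowerOfTwo check used above can be the usual bit trick (the actual helper in my branch may differ):

    static bool isPowerOfTwo(int value)
    {
      // True for 1, 2, 4, 8, ...; false for zero, negatives, and everything else.
      return value > 0 && (value & (value - 1)) == 0;
    }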

So it's trivial to create fixed-matrix-size implementations for other common WaveNet model configurations (if there are any).
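
For example, supporting another configuration would just be another factory instantiation tried in sequence (the alternate template arguments here are made up for illustration):

    wavenet::WaveNetFactory_T<8, 16, 3> standardFactory;
    wavenet::WaveNetFactory_T<16, 8, 3> altFactory; // illustrative values only

    if (standardFactory.matches(layer_array_params))
    {
      // ... create the fixed-size model via standardFactory ...
    }
    else if (altFactory.matches(layer_array_params))
    {
      // ... create the fixed-size model via altFactory ...
    }
    else
    {
      // ... fall back to the dynamic MatrixXf-based WaveNet ...
    }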

By way of risk mitigation, I do have a unit test that compares the outputs of the templated implementations against the MatrixXf-based WaveNet implementation, to ensure that they produce exactly the same output (and likewise for the complete DSP, templated and non-templated).
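
Conceptually the comparison looks like this (a minimal sketch with stand-in names, not the actual test code):

    #include <cassert>
    #include <functional>
    #include <vector>

    // ProcessFn stands in for whatever entry point renders a block of samples
    // through a model; the real test calls the actual NAM processing API.
    using ProcessFn = std::function<void(const std::vector<float>&, std::vector<float>&)>;

    void ExpectIdenticalOutput(ProcessFn templated, ProcessFn reference, const std::vector<float>& input)
    {
      std::vector<float> outTemplated(input.size());
      std::vector<float> outReference(input.size());
      templated(input, outTemplated);
      reference(input, outReference);
      for (size_t i = 0; i < input.size(); ++i)
        assert(outTemplated[i] == outReference[i]); // exact equality, not approximate
    }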

The first step would be to decide whether you want this code. A 25% performance improvement makes the maintenance headache worthwhile for me. Perhaps less so for you. I would completely understand that.

If you do want to take it, the next step would be for me to clean up the code a bit more before sending it to you. The current implementation preserves the arguments of the original WaveNet classes and methods, even though the same information is available through template type parameters. Doing so made it much easier to verify that I had correctly configured the sizes of the hard-coded matrices. It's ugly, but probably safer. We could have a discussion about that.

Let me know how you'd like to proceed.

Kindest Regards,

sdatkinson commented 3 weeks ago

Thanks for this!

I agree wrt the value of this--many NAMs are the same "standard WaveNet", and optimizing their operation would impact "99%" of all NAM processing that's going on.

I need to find some time to look over this more closely but I'm definitely interested, so, again, thanks 🙂