sdatkinson / NeuralAmpModelerCore

Core DSP library for NAM plugins
MIT License

25% performance improvement in large WaveNet models. #126

Open rerdavies opened 3 weeks ago

rerdavies commented 3 weeks ago

I'd like to offer you a piece of work that I have recently been doing for the ToobAmp project. The net benefit: a 25% performance improvement for the most commonly used WaveNet model on ToneHunt at present.

There are a bunch of reasons why you might not want to take this commit (not the least of which is maintainability). And if you take it, I'd like to do some serious cleanup before I push the changes to you. I would completely understand if you don't want the code. At the very least, I feel like I have provided you with a useful reference point for the advantages and disadvantages of avoiding use of MatrixXf's (+25% performance, significant maintenance burden).

The basic premise: if WaveNet can use fixed-size matrices (Eigen::Matrix<float, M, N>) instead of MatrixXf matrices, then the compiler can do a much better job of applying SIMD vectorization to the Eigen matrix operations used by the WaveNet model -- principally by not having to generate code branches to deal with non-multiple-of-four matrices, and by unrolling loops for 8-row and 16-row matrix code as appropriate. (Pretty much all of WaveNet's matrices have 8 or 16 rows, with one 32-row exception.)
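
To illustrate the point (this is a generic Eigen example, not code from my branch): with fixed-size types the dimensions are compile-time constants, so Eigen can emit fully unrolled, branch-free SIMD code; with MatrixXf the dimensions are runtime values and the generated code has to handle arbitrary sizes.

    #include <Eigen/Dense>

    // Dynamic sizes: dimensions are runtime values, so Eigen emits generic loops
    // with remainder handling for sizes that aren't a multiple of the SIMD width.
    Eigen::VectorXf dynamicProduct(const Eigen::MatrixXf& w, const Eigen::VectorXf& x)
    {
      return w * x;
    }

    // Fixed sizes: a 16x8 weight matrix times an 8x1 vector, all known at compile
    // time, so the product can be fully unrolled and vectorized with no branches.
    Eigen::Matrix<float, 16, 1> fixedProduct(const Eigen::Matrix<float, 16, 8>& w,
                                             const Eigen::Matrix<float, 8, 1>& x)
    {
      return w * x;
    }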

The code uses templates extensively. This allows fixed-buffer-size implementations of a variety of different model configurations. If none of the available templated implementations matches the requested model, then the old code path is used to generate the model.
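
As a rough sketch of what the templated classes look like (class and member names here are hypothetical, not the ones in my branch), a layer carries its buffers as fixed-size Eigen types whose dimensions come from template parameters:

    #include <Eigen/Dense>

    // Hypothetical templated layer: CHANNELS is a compile-time constant, so every
    // member matrix has a fixed size and every product below is fully vectorizable.
    template <int CHANNELS>
    class Layer_T
    {
    public:
      Eigen::Matrix<float, CHANNELS, 1> Process(const Eigen::Matrix<float, CHANNELS, 1>& input) const
      {
        return (mWeights * input + mBias).array().tanh().matrix();
      }

    private:
      Eigen::Matrix<float, CHANNELS, CHANNELS> mWeights;
      Eigen::Matrix<float, CHANNELS, 1> mBias;
    };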

The pluses:

Sounds complicated, but I think you'll catch on to what's happening once you look at the code. You can take a look at it here.

Diverting from the default DSP code path for WaveNet is handled with the following code:

    wavenet::WaveNetFactory_T<8, 16, 3> wavenet_factory; // CHANNELS=8, HEADS=16, ...
    if (wavenet_factory.matches(layer_array_params))
    {
      try
      {
        bool noBufferFlipRequired =
          (minBlockSize == maxBlockSize && minBlockSize != -1
           && minBlockSize >= 32 && isPowerOfTwo(maxBlockSize));
        auto result = wavenet_factory.create(layer_array_params, head_scale, with_head,
                                             weights, expectedSampleRate, noBufferFlipRequired);
        if (haveLoudness)
        {
          result->SetLoudness(loudness);
        }
        // &c.
      }
      catch (...)
      {
        // error handling elided in this excerpt
      }
    }
    else
    {
      // ... create a dynamic wavenet model using current code ...
    }
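
For reference, the isPowerOfTwo check used above can be the usual bit trick (the actual helper in my branch may differ):

    static bool isPowerOfTwo(int value)
    {
      // True for 1, 2, 4, 8, ...; false for zero, negatives, and everything else.
      return value > 0 && (value & (value - 1)) == 0;
    }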

So it's trivial to create fixed-matrix-size implementations for other common WaveNet model configurations (if there are any).
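
For example, supporting another configuration would just be another factory instantiation tried in sequence (the alternate template arguments here are made up for illustration):

    wavenet::WaveNetFactory_T<8, 16, 3> standardFactory;
    wavenet::WaveNetFactory_T<16, 8, 3> altFactory; // illustrative values only

    if (standardFactory.matches(layer_array_params))
    {
      // ... create the fixed-size model via standardFactory ...
    }
    else if (altFactory.matches(layer_array_params))
    {
      // ... create the fixed-size model via altFactory ...
    }
    else
    {
      // ... fall back to the dynamic MatrixXf-based WaveNet ...
    }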

By way of risk mitigation, I do have a unit test that compares the outputs of the templated implementations against the MatrixXf-based WaveNet implementation, to ensure that they produce exactly the same output (and likewise for the complete DSP, templated and non-templated).
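
Conceptually the comparison looks like this (a minimal sketch with stand-in names, not the actual test code):

    #include <cassert>
    #include <functional>
    #include <vector>

    // ProcessFn stands in for whatever entry point renders a block of samples
    // through a model; the real test calls the actual NAM processing API.
    using ProcessFn = std::function<void(const std::vector<float>&, std::vector<float>&)>;

    void ExpectIdenticalOutput(ProcessFn templated, ProcessFn reference, const std::vector<float>& input)
    {
      std::vector<float> outTemplated(input.size());
      std::vector<float> outReference(input.size());
      templated(input, outTemplated);
      reference(input, outReference);
      for (size_t i = 0; i < input.size(); ++i)
        assert(outTemplated[i] == outReference[i]); // exact equality, not approximate
    }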

The first step would be to decide whether you want this code. A 25% performance improvement makes the maintenance headache worthwhile for me. Perhaps less so for you. I would completely understand that.

If you do want to take it, the next step would be for me to clean up the code a bit more before sending it to you. The current implementation preserves the arguments of the original WaveNet classes and methods, even though the same information is available through template type parameters. Doing so made it much easier to verify that I had correctly configured the sizes of the hard-coded matrices. It's ugly, but probably safer. We could have a discussion about that.

Let me know how you'd like to proceed.

Kindest Regards,

sdatkinson commented 3 weeks ago

Thanks for this!

I agree wrt the value of this--many NAMs are the same "standard WaveNet", and optimizing their operation would impact "99%" of all NAM processing that's going on.

I need to find some time to look over this more closely but I'm definitely interested, so, again, thanks 🙂