Well-spotted!
First reaction: this may explain #101.
Real quick:
> The fix DOES apply to the Wavenet model used by most models on ToneHunt.
Actually, I don't think so: the "standard" WaveNet configuration uses `"gated": false` (see https://github.com/sdatkinson/neural-amp-modeler/blob/4c982e5b530cd9c849b2b55c0f0da1cf1dab8cdf/nam/train/core.py#L806 etc.), so the branch that gets used is https://github.com/sdatkinson/NeuralAmpModelerCore/blob/cd92997f401105aefcf83cb3a6fbdecc14a1ee0d/NAM/wavenet.cpp#L36, which is ok.
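For context, a rough sketch of the two branches being discussed; the type and member names here are my own stand-ins rather than the actual NAM source, but it shows which activation overload each path ends up calling:

```cpp
#include <Eigen/Dense>

// Illustrative sketch only; names are stand-ins, not the actual NAM source.
struct SketchActivation
{
  // Whole matrix: contiguous storage, one pass over the data.
  void apply(Eigen::MatrixXf& m) { m.array() = m.array().tanh(); }
  // Row-based block of a column-major matrix: storage is not contiguous.
  void apply(Eigen::Block<Eigen::MatrixXf> block) { block.array() = block.array().tanh(); }
};

void layer_activation_sketch(Eigen::MatrixXf& z, Eigen::Index channels, bool gated, SketchActivation& act)
{
  if (!gated)
  {
    // "gated": false (the standard configuration): the whole matrix is handed to the
    // activation, i.e. the wavenet.cpp#L36 branch, which was always correct.
    act.apply(z);
  }
  else
  {
    // gated: the activation is applied to a row-based block of z (the real code also
    // gates with a sigmoid on the bottom rows), which is where the bug lived.
    act.apply(z.topRows(channels));
  }
}
```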
Check out #131 and lmk if you agree that this resolves the issue of incorrectness. (No guarantees on efficiency, but looking at the CPU usage of NeuralAmpModelerPlugin with this, it's in the ballpark of what I'd expect.)
I'm happy to take a separate Issue & PR if you have suggestions to improve performance.
Yes, that resolves the issue.
Performance impact: I feel duty-bound to tell you that there is a significant performance risk, although I, personally, would be inclined to leave it until an actual performance issue comes up.
This is not on the 99% code path. There is a potential fix, which might provide a 30-50% improvement in a model that actually turns on gating (which is going to run twice as slowly anyway). The fix would not improve performance in the 99% case.
This is hotspot code. For the non-gated case, ~13.8% of total execution time is spent in activations::ActivationFastTanh::apply, of which 7.4% goes to the fastTanH function itself (which is properly vectorized). (80% of total execution time is spent in _Layer::process.) Since you asked, and I happened to know. :-) It didn't quite make my cut-list of optimization targets worth pursuing, but there's something odd going on there.
It's hard to account for the remaining 6.4% of non-SIMD overhead. The profiling is for non-gated code, so we're talking about 8 x num_frames and 16 x num_frames matrices, but there's still a LOT of non-SIMD code getting executed there. The risk is that the overhead will get much worse if wavenet is passing 16x1 data.
The fix would be to add something like the following function to the activation interface:

`virtual void apply(Eigen::Block<Eigen::MatrixXf> block)`

which would allow compilers to vectorize much more efficiently. Currently there's a virtual function call inside the loop (activation::apply), across which compilers cannot optimize. The "is it a multiple of 4" check would get run once for each 8 x num_frames or 16 x num_frames block, instead of once for each 8x1 or 16x1 column array.
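To make that concrete, here's a minimal sketch of what such an overload could look like, using made-up class names (ActivationSketch, FastTanhSketch) and a plain std::tanh inner loop rather than NAM's actual interface:

```cpp
#include <Eigen/Dense>
#include <cmath>

// Hedged sketch; names and signatures are assumptions, not NAM's actual interface.
class ActivationSketch
{
public:
  virtual ~ActivationSketch() = default;
  // Existing-style entry point: a contiguous buffer.
  virtual void apply(float* data, long size) = 0;
  // Proposed entry point: a whole (possibly non-contiguous) block.
  // Default implementation: one contiguous apply per column.
  virtual void apply(Eigen::Block<Eigen::MatrixXf> block)
  {
    for (Eigen::Index c = 0; c < block.cols(); c++)
      apply(block.col(c).data(), (long)block.rows());
  }
};

class FastTanhSketch : public ActivationSketch
{
public:
  void apply(float* data, long size) override
  {
    for (long i = 0; i < size; i++)
      data[i] = std::tanh(data[i]); // stand-in for the vectorized fastTanh
  }
  // Overriding the block overload: the inner per-column call is now a direct,
  // non-virtual call, so the compiler can inline and vectorize it across the column
  // loop instead of paying a virtual dispatch per column.
  void apply(Eigen::Block<Eigen::MatrixXf> block) override
  {
    for (Eigen::Index c = 0; c < block.cols(); c++)
      FastTanhSketch::apply(block.col(c).data(), (long)block.rows());
  }
};
```

The derived-class override is where the win would come from: the per-column work becomes an inlinable direct call instead of a virtual dispatch per 8x1 or 16x1 column.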
If you have a model that actually uses gating, I can give you a quick answer as to whether there's a performance problem, if you like.
Thanks; this is great info.
I'd be happy to take PRs to improve the code if you (or I, or anyone else) has bandwidth, but it's not critical to me at the moment.
Thanks again for the original issue; it gave me the hint that got me to resolve #101 🎉
Marking as resolved, since the outputs are no longer incorrect; tracking efficiency improvements in https://github.com/sdatkinson/NeuralAmpModelerCore/issues/133
I'm sorry, but I'm unable to push this change. I have an existing fork, which I'm using as a submodule in PiPedal, and the submodule is fragile, to say the least. Were it not for that, I'd push to a clean fork. My fork has diverged to the point where I'm not entirely comfortable pushing everything I have to you, but I think you need this commit. And GitHub won't let me create a second fork.
Here's the commit for the code. There may be some way to pull just this specific commit, but I am unfortunately git-illiterate; if there's a better way to do this, let me know.
https://github.com/rerdavies/NeuralAmpModelerCore/commit/102968f23560b11bedc940fcafb28cfbe76f21ac
The issue:
wavenet.cpp makes the call in question at wavenet.cpp:50 (and a second, similar call a couple of lines further on). Because this is a row-based Eigen::Block, the block's storage is NOT contiguous in memory (its columns are separated by the parent matrix's column stride), so the existing overload (the first in the original code) must apply the activation column by column. As written, the pattern in which the activations are applied is pretty nonsensical.
There are six other occurrences of calls to activation->apply, but all of them use column-based blocks, whose data is stored in contiguous memory. As a result, they take the second overload of _activation->apply (in the original code), so they all do the right thing, and the revision does not change the behaviour of code in other models.
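For anyone following along, here's a small standalone illustration (not NAM code; the sizes are made up) of why the two kinds of blocks end up in different overloads:

```cpp
#include <Eigen/Dense>
#include <iostream>

// Eigen matrices are column-major by default, so a column-based block (leftCols) is one
// contiguous run of floats, while a row-based block (topRows) is not: its columns are
// separated by the parent matrix's full column height (the outer stride).
int main()
{
  Eigen::MatrixXf z(16, 4); // e.g. 2*channels x num_frames in a gated layer
  z.setZero();

  auto rowBlock = z.topRows(8);  // the kind of block passed at wavenet.cpp:50
  auto colBlock = z.leftCols(2); // the kind of block used at the other call sites

  std::cout << "row-block: rows=" << rowBlock.rows() << ", outer stride=" << rowBlock.outerStride()
            << " -> NOT contiguous (stride > rows)\n";
  std::cout << "col-block: rows=" << colBlock.rows() << ", outer stride=" << colBlock.outerStride()
            << " -> contiguous (stride == rows)\n";
  // An activation handed the row-based block cannot treat .data() as one flat array of
  // rows()*cols() floats; it has to process it column by column (each column IS contiguous).
  return 0;
}
```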
I propose the change rather hesitantly. I think you need to check to see whether the training code for wavenet replicates the bug. If the models are TRAINED with this bug, then applying this fix would probably be a bad thing.
I did try various Wavenet models to see if the fix makes a difference. The fix DOES apply to the Wavenet model used by most models on ToneHunt, but, curiously, it does not change the sound of the amp models significantly, as far as I can tell. So there's no audible difference in the sound output, but the fix is definitely correct.