[FEATURE] More input samples for LSTM architecture and Use of ESR+DC as loss fuction

KaisKermani commented 1 year ago

Hello nice people!

Context

Lately I've been running some tests on the NAM models with the goal of improving the training procedure, optimizing the CPU consumption of the generated models, and ultimately make NAM more accessible on embedded devices. However I believe the findings that I'm sharing here will be useful beyond just NAM on embedded devices ^^

Data

So, I trained different models (LSTMs and WaveNet) using two platforms (NAM and AIDA DSP). I then run the models on test set where none of them was trained. For training datasets I used 10 different captures that we did at MOD audio of multiple amps (fender blues deluxe, marshall jvm, orange rockerverb..) and devided the captures into 3 categories: clean, crunchy, and high_gain.

When doing the evaluation, I made sure to account for:

Time alignment with the target, that gives the best results for each model.
Gain correction that gives the best results.
DC correction that gives the best results.

The different models are: (CPU consumption values on an ARM board with a 1.3GHz CPU)

AIDA LSTM Stdrd (1x16), uses 34% CPU
NAM LSTM Lite (2x8), uses 38% CPU
NAM WaveNet Nano, uses 67% CPU
new LSTM (not implemented yet, but I estimate it to run between 40-50% CPU) more on this later

newplot (25) newplot (26)

Interpretation

First thing I notice is that AIDA LSTM Stdrd is outperforming NAM LSTM Lite (even though the NAM LSTM Lite uses slightly more CPU than AIDA LSTM Stdrd), particularly in the high gain datasets (like the JVM OD, and Blues Deluxe Gainy datasets). Same for NAM WaveNet Nano and "new LSTM".

IMO, the reason for that is mainly the loss function used in the training. As far as I know NAM uses MSE loss in the training, wheras AIDA uses ESR and DC Losses which account for the "energy" in the target signal ESR(ouput, target) = MSE(output, target)/(target^2) This also makes sense as high gain datasets have more "energy" in the signal than clean dataset.

Second thing is that the "new LSTM" architecture has very comparable results to the NAM WaveNet. It even outperforms it in the high gain datasets.

The "new LSTM" architecture is based on the idea of giving more than 1 sample as input to the LSTM Layer. It's this: LSTM(input_size=8, hidden_size=8, num_layers=2) -> Linear layer (no bias) Which is basically just like the NAM LSTM Lite (2x8), with the exception of using an input_size of 8 instead of 1. You can see from the model evaluations that this little tweak makes a big difference in the end results!

Conclusions and suggetions

I'm posting this to give a motive here to test both:

ESR+DC Loss (like suggested in this paper https://arxiv.org/pdf/1911.08922.pdf), which I believe can improve the quality of the models especially in high gain territory.
Using more input samples with the LSTM architectures, particularly testing this architecture LSTM(input_size=8, hidden_size=8, num_layers=2) -> Linear layer (no bias) and ideally make it available in the NAM easy training. As it provides very similar sound quality to the WaveNet Nano for a far lower CPU consumption, which is very valuable on embedded devices like the MOD Dwarf. (from 67% to <50% CPU allows for the use of more effects + higher quality IR cabinets in the pedal chain for example)

I hope this is insightful and can help drive the project in a good direction ^^

PS:

Here's a link to a colab notebook where I run the code for testing and comparing all the outputs of the models. https://colab.research.google.com/drive/1YuVzm6E_KoRwK8AQqWWkm3KxHFj0OeNE?usp=sharing
The way we generated the output of the models was using the MOD platform, where both AIDA-X and NAM run. We can create a dockerfile to automate the generation process if someone has the need for it

mikeoliphant commented 1 year ago

@sdatkinson Any tips on how to best go about allowing a larger receptive field for LSTM models (right now it is hardcoded to 1)? There is an "input_size" property, but that seems to be for adding additional parameters?

sdatkinson commented 1 year ago

Thanks for your very thorough Issue, @KaisKermani 🙂

I see two different topics in here:

Using ESR and DC terms in the loss function, and
making an architecture that pulls more samples into the input of the LSTM.

As far as the first, it's already implemented--see dc_weight in the LossConfig class. I recall checking ESR as a training loss way back, but ultimately decided against using it because (1) it's not an IID loss since the normalizing factor depends on the contents of the rest of the set of samples being regressed against, and (2) the benefit of using it wasn't clear empirically. i.e.

IMO, the reason for that is mainly the loss function used in the training.

it'd be good to demonstrate this by ablating that factor specifically--either train the AIDA model w/ MSE, or train the NAM model with ESR. For NAM, you could do this in a pinch by using ._esr_loss() instead of ._mse_loss() here. Or, you could make changes to LossConfig to make it more "official" 🙂 . Not sure about AIDA.

As far as the second topic, I'm tracking that over in #289. I believe I've got some code sitting around somewhere that does this that I whipped up way back out of curiosity--just needs some additions over here as well as in NeuralAmpModelerCore.

As far as action items, how about this: check the ablation I asked for, and if it looks good, then I'll take a PR to include ESR as an option for the training loss 👍🏻

sdatkinson commented 1 year ago

@mikeoliphant

Any tips on how to best go about allowing a larger receptive field for LSTM models (right now it is hardcoded to 1)? There is an "input_size" property, but that seems to be for adding additional parameters?

That's mostly right--1 is the default, but other values work as well. CatLSTM uses this for e.g. a parametric model.

KaisKermani commented 1 year ago

@sdatkinson I'll do comparisons specifically on the loss function (mostly using the AIDA training, I'm just more used to the code base), and share the results here.

Regarding the issue #289 it looks like it's a different thing. Here I'm not suggesting to add a convolution layer before the LSTM layer. As far as I tried, this architecture (conv -> lstm) doesn't make a significant improvement. What I'm suggestig here, is to directly input the samples to the LSTM layer, which should be simpler in terms of AI architecture. (so not using a convolution layer at all). i.e. each iteration, the lstm layer takes as input the most recent 8 samples, instead of the most recent 1 sample.

I believe this can even be scaled up for larger LSTM models as well, so that model Input could be 32 samples for example. This may potentially give better results than the WaveNet architecture. This is actually what @GuitarML finds out as well when he was experimenting with the same subject. (in this article)

KaisKermani commented 1 year ago

@sdatkinson here's the comparison of the loss functions. The two sets of models have been trained with the same exact parameters (model architecture, datasets, epochs...) except from the loss function.

Note that ESR+DC models have been trained using ESR+DC losses with these coefficients (which experimentally turned out to be the most efficient): {'ESR': 0.75, 'DC': 0.25}

newplot (29) newplot (30)

Here it's clear how the ESR+DC loss function helps the training converge more easily to the optimal solution. Note also that the gap between the MSE and ESR+DC subsets of models is more clear in the high_gain territory of sounds.

I believe that these changes (both 1.increasing the receptive field of the LSTM, and 2.changing the loss function to ESR+DC) will directly improve the quality of NAM models (sound quality and CPU consumption).

yovelop commented 1 year ago

Thanks for sharing your tests. Is this fast results or deep trained models? What settings of lr, lr decay and epochs number was used fr this tests?

ESR itself as a loss function depends on batchsize, on most my cases it gives worse results than mse and mae. But when ESR with little weight comes with MSE, ME or DC - it can give some extra accuracy, but not much. On fast trainig (500-1000epochs) there may be some difference but when we talk about hours of training difference in resulted ESR,MSE,MAE error is minimal.

KaisKermani commented 1 year ago

Hey @yovelop ^^

What settings of lr, lr decay and epochs number was used fr this tests?

Models were trained on the same datasets, for 150 epochs I believe (or 100) with Adam optimization algorithm at lr .01 with no lr decay.

On fast trainig (500-1000epochs) there may be some difference...

Well the results I shared they show that there is indeed a difference in the results by changing the loss function (which makes sense to me). You're welcome to try the same yourself ofc! I also experimented with multiple other loss functions that are time domain based (like SNR) and frequency domain based (like MultiResolutionSTFTLoss), but the combination of ESR+DC loss seems to bring the best results within 200 epochs of training. Note that when I say ESR+DC, ESR is weighted 0.75 and DC is weighted 0.25.

Unless you're running your custom training script, training NAM models usually train from around 200 epochs. Same goes for AIDA DSP models (another platform for neural modeling). And this makes sense especially if we're exposing training scripts for all users so that they're no stuck training a snapshot for an AMP for hours.

Having a faster converging training isn't a bad idea afterall.

ESR itself as a loss function depends on batchsize

I don't see how ESR depends on batchsize (contrarily to MSE ?). Just for reference this is the formula for ESR we're both talking about right? ESR(ouput, target) = mean[(target - output)^2]/mean(target^2) = MSE(output, target)/mean(target^2)

yovelop commented 1 year ago

for 150 epochs I believe (or 100) with Adam optimization algorithm at lr .01 with no lr decay.

Thanks. So your scores, test and conclusions are only for extra-fast training situations.

on very small num of epochs you can try to use RAdam - on my tests it learnings faster in small epochs (~300-500ep) but then slowing down (comparing with Adam and NAdam).

I don't see how ESR depends on batchsize (contrarily to MSE ?). Just for reference this is the formula for ESR we're both talking about right? ESR(ouput, target) = mean[(target - output)^2]/mean(target^2) = MSE(output, target)/mean(target^2)

Yes, i mean that this part: "/mean(target^2)" on small batches make sense, cause some batches can contain only loud parts ater other batches contain only extra quite signal parts. On big batches it varies less.

I will try to find and share my prev tests of loss functions, and test of different settings of WaveNet for light CPU use (less than NANO models). but there often some architectures learn quickly and win strongly after 100-200 epochs, but the final errors after 5000 epochs can be much better for models who were outsiders in 100-200 epochs

p.s. it's better to compare not with same epochs but with same trainig time. I mean if some models trains faster so it 10 min can train 250ep other trains slowly and will be only 120 ep. For example, like "relative" horisontal axis in tensorboard where you can compare different trainings in the same time dimension (look at the step and relative columns):

38github commented 11 months ago

Maybe you know this already but I would like to share something I discovered. Maybe it can be of help to others.

Updated: lstm_tests_2_num_layers_list.pdf

38github commented 11 months ago

Is there any way to try this out by tweaking or is it not supported right now?

38github commented 11 months ago

For NAM, you could do this in a pinch by using ._esr_loss() instead of ._mse_loss()

I tried this with LSTM just to see what would happen and no epochs went below ESR 1.000.

sdatkinson commented 11 months ago

@38github

For NAM, you could do this in a pinch by using ._esr_loss() instead of ._mse_loss()

I tried this with LSTM just to see what would happen and no epochs went below ESR 1.000.

Interesting. I didn't expect it to be that much worse. But this is why I need to see the argument in terms of NAM's code base, not others'. There are plenty of tiny decisions made along the way, and it's not enough to say that some change works with someone else's codebase.

I realize that I made a mistake when I said that

it'd be good to demonstrate this by ablating that factor specifically--either train the AIDA model w/ MSE, or train the NAM model with ESR.

The mistake is that it's really not good enough to demonstrate this with AIDA, because this is an Issue about what to do with this codebase. I really need to see compelling evidence that it's better here, because that's where it would be used.

Coming back again to the second part of the Issue (and for the record I'd really like to see these two things handled separately; this is already a very busy thread), I've included the ability to "register" new model architectures with PR #310.

So for example, you could do something like this at the top of bin/train/main.py:

from nam.models._base import BaseNet

class MyNewModel(BaseNet):
    # Implement...
    def __init__(self, num_input_samples: int):
        # Etc
        ...

# Register it!
from nam.models.base import Model
Model.register_net_initializer("MyNewModel", MyNewModel)

And this allows you to use your model by adapting the model JSON of the CLI trainer like this, e.g. (note the "net" section):

{
    "net": {
        "name": "MyNewModel",
        "config": {
            "num_input_samples": 16
        }
    },
    "loss": {
        "val_loss": "mse",
        "mask_first": 4096,    
        "pre_emph_weight": 1.0,
        "pre_emph_coef": 0.85
    },
    "optimizer": {
        "lr": 0.01
    },
    "lr_scheduler": {
        "class": "ExponentialLR",
        "kwargs": {
            "gamma": 0.995
        }
    }
}

This allows you to quickly implement new models without having to change the nam package itself. If you want to take a swing at implementing the model, you could easily share it as a code snippet and that'd make it a lot easier to vet the idea.

Alternatively, notice that this "plugin-style" feature basically gives you a ton of power to customize the NAM trainer without even needing to fork. So, you're more than welcome to personalize this package yourself in that way. (But you are going to need to implement the changes in the plugin code as well...and for that matter, I shouldn't accept a new model over here without an accompanying plan to make it available in NeuralAmpModelerCore. Otherwise, it wouldn't really make sense for this project! 🙂)

So hopefully this helps illuminate things. This (the model part) is admittedly a rather involved ask because of how many things it touches, and there's a fair bit of responsibility with making sure that it all works given how widely-used this repo is.

So @KaisKermani here's my suggestion for next steps here:

Pick what specific thing you want this Issue to be about and let's narrow the scope to it and leave the others for something separate. 2a. If it's the loss, try to figure out how to make it actually competitive in NAM. I suspect we'll learn a lot by seeing it work here instead of in AIDA. 2b. If it's the architecture, code it up and use the registry functionality to give it a spin. If it works, report back and we'll take it from there.

Sound good?

sdatkinson / neural-amp-modeler