starrytong / SCNet

MIT License

Further training strategies #10

Open ari-ruokamo opened 1 month ago

ari-ruokamo commented 1 month ago

Thank you for your work! I find this model very interesting.

I ran through training of the small configuration (bass, vocals, other, drums) for 200 epochs with musdb18hq, and training reached the expected total SDR of approx. 9.2 on the 50-track musdb18hq validation set.

Further on, hoping to improve the SDR by expanding the dataset, I (naively) set up three additional small datasets (100 tracks with stems per set) for further training, keeping the same 50 musdb18hq tracks for validation. With the first add-on training set, after 40-something epochs I notice that the training loss keeps improving steadily, but the validation SDR and loss struggle and slowly but surely degrade. Overfitting? I guess a number of other things could also be wrong in my setup. Is improving the model impossible? What would be your tips for pushing the model's performance even further?

Thanks!

starrytong commented 1 month ago

How many GPUs did you use during training?

Did you lower the learning rate, for example to 0.0001, after adding additional data?

If you want to maintain stable performance, you can continue training by including musdb and other datasets in the training set.

If you want to optimize the model to improve performance, a simple method is to increase the model size, for example, setting dims to [4, 64, 128, 256] (AMP training might be needed).
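A minimal sketch of the learning-rate change when resuming from a checkpoint, assuming a plain PyTorch optimizer (the model, optimizer and the 5e-4 starting rate here are placeholders; the repo's own config may expose this differently):

import torch

# Placeholder model/optimizer; in practice these are restored from the checkpoint.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

# Lower the learning rate before continuing training on the enlarged dataset.
for group in optimizer.param_groups:
    group["lr"] = 1e-4  # the value suggested above

print(optimizer.param_groups[0]["lr"])  # 0.0001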

ari-ruokamo commented 1 month ago

Thanks for getting back.

I currently have 2 x 24 GB GPUs (RTX 3090), but I think I'll need more because the 200-epoch training took 4 days...

The new datasets contain some musdb18hq tracks - I simply shuffled the total pool of sources randomly and allocated them into the new 100-track datasets.
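(Illustratively, that split amounts to something like the sketch below; the pool directory is a placeholder and only the 100-track chunk size comes from the description above.)

import random
from pathlib import Path

# Pool all track folders, shuffle, and slice into 100-track subsets.
# "data/pool" is a placeholder path.
tracks = sorted(p for p in Path("data/pool").iterdir() if p.is_dir())
random.shuffle(tracks)
subsets = [tracks[i:i + 100] for i in range(0, len(tracks), 100)]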

I'll make a re-run with that learning rate initialization.

By AMP you mean automatic mixed precision (16/32-bit)? Is it easily enabled in Accelerator, or does it require altering the code?

starrytong commented 1 month ago

If you only use two GPUs, 200 epochs might be too many and could lead to overfitting. Perhaps 80-100 epochs would be sufficient.

I have placed the AMP training code in another branch. It can significantly reduce GPU memory usage, allowing you to increase the model size or batch size.
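For reference, enabling AMP with Accelerate generally comes down to something like the sketch below (a generic outline assuming a CUDA GPU, not the actual code from that branch; the model, loss and data here are placeholders):

import torch
from accelerate import Accelerator

# Placeholder model/optimizer/data just to keep the sketch self-contained.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
loss_fn = torch.nn.L1Loss()
data = [(torch.randn(4, 16), torch.randn(4, 16)) for _ in range(8)]

# fp16 autocast and gradient scaling are handled by Accelerate.
accelerator = Accelerator(mixed_precision="fp16")
model, optimizer = accelerator.prepare(model, optimizer)

for mix, sources in data:
    mix, sources = mix.to(accelerator.device), sources.to(accelerator.device)
    optimizer.zero_grad()
    loss = loss_fn(model(mix), sources)
    accelerator.backward(loss)  # loss scaling happens here when fp16 is enabled
    optimizer.step()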

ari-ruokamo commented 1 month ago

I can see from the original training run logs that there was a long plateau with no improvement over many epochs, and then training adjusted and recovered between epochs 161-179, which produced the best result.

Thanks - there are now a couple of routes to experiment with while I look for more GPU power (a 3090 fits my budget).

ari-ruokamo commented 1 month ago

Sorry to bother you with my MSS-newbie questions, but here I go: I added a third GPU (all are RTX 3090 24 GB) to my setup. The configuration is exactly the same as with the 2-GPU setup, except that the third GPU has been added to Accelerate's config.

The (global) batch size is 4, as in the original configuration. I quickly found that 5 is the maximum batch size I can fit on these GPUs; 6 already gives an out-of-memory error.

Anyway, I began the training run from the start just to see the difference. The processing and training loop is naturally faster, but the actual learning/convergence of the model is slower with 3 GPUs - it improves more slowly. Why is that? Any suggestions on how to modify/control the training loop or learning rate?

Thanks again!

starrytong commented 1 month ago

By increasing the batch size, the number of updates per epoch is reduced, which may slow down initial convergence. Simply continue training to compensate for this effect.
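A common rule of thumb (a general heuristic, not something specific to this repo) is to scale the learning rate roughly linearly with the global batch size; working backwards from the 0.000625 / batch-size-5 figures quoted later in this thread, that corresponds to a base rate of 5e-4 at batch size 4:

# Linear LR scaling heuristic: keep lr / batch_size roughly constant.
base_lr, base_batch = 5e-4, 4   # assumed 2-GPU baseline
new_batch = 5                   # 3-GPU run with batch size 5
new_lr = base_lr * new_batch / base_batch
print(new_lr)  # 0.000625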

amirpashamobinitehrani commented 4 weeks ago

Hey everyone,

Thanks for the great work and the discussion in this issue. I tried to reproduce @ari-ruokamo's experiment. My configuration is the same (basically the default config on the main branch). I trained for 200 epochs on 2 x 4090s and got my best NSDR at epoch 122, standing at 8.59.

Is this an expected outcome? How are you hitting 9.2 dB? Are you by any chance using a bigger model?

Thanks in advance!

ari-ruokamo commented 4 weeks ago

Hi! Yes, my 2-GPU (2 x RTX 3090) run yielded an NSDR of 9.2 with the small configuration.

2024-07-16 09:41:34,053 - INFO - Cross validation...
2024-07-16 09:46:28,150 - INFO - Valid Summary | Epoch 179 | Loss=0.1310 | Nsdr=9.201
2024-07-16 09:46:28,151 - INFO - New best valid nsdr 9.2009
2024-07-16 09:46:28,946 - INFO - Learning rate adjusted to 0.00035466088309032284

It seems difficult to get beyond that result. I recently added a third GPU for speedier training. The system is 3 x RTX 3090 with batch size 5 and an initial learning rate of 0.000625. The current run is past 250-something epochs and the total NSDR has stagnated at 8.8, with the best at epoch 240:

2024-08-12 00:53:32,024 - INFO - Valid Summary | Epoch 240 | Loss=0.1358 | Nsdr=8.820 | Nsdr_vocals=9.962 | Nsdr_bass=9.081 | Nsdr_drums=10.204 | Nsdr_other=6.035
2024-08-12 00:53:32,024 - INFO - New best valid nsdr 8.8204

Other than that, my system is Ubuntu 24.04 with Python 3.10, and the rest is pretty much what requirements.txt says. NVIDIA driver info from nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:04:00.0 Off |                  N/A |
|100%   62C    P2             356W / 390W |  21529MiB / 24576MiB |     97%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:0B:00.0  On |                  N/A |
| 71%   66C    P2             328W / 350W |  22661MiB / 24576MiB |     99%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        Off | 00000000:0C:00.0 Off |                  N/A |
| 62%   61C    P2             359W / 390W |  24194MiB / 24576MiB |     99%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

starrytong commented 4 weeks ago

@ari-ruokamo The previous code had bugs in the validation process, so the results might be incorrect. Moreover, using the validation process to test performance is not accurate since it doesn't incorporate overlap. You may need to use tools like museval for testing after inference, or set the overlap in the validation process to 50% or higher.

if not train:
    estimate = apply_model(self.model, mix, split=True, overlap=0)
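(The 50%-overlap option amounts to passing overlap=0.5 in the call above.) For the museval route, a minimal sketch of scoring one track after inference; the file paths and four-stem layout are assumptions, so adapt them to however the separated stems are exported:

import museval
import numpy as np
import soundfile as sf

sources = ["drums", "bass", "other", "vocals"]

# Reference and estimated stems stacked as (nsrc, nsamples, nchannels).
# "SongName" is a placeholder track folder.
references = np.stack([sf.read(f"musdb18hq/test/SongName/{s}.wav")[0] for s in sources])
estimates = np.stack([sf.read(f"separated/SongName/{s}.wav")[0] for s in sources])

# BSS Eval v4 over 1-second windows (museval defaults); report median SDR per stem.
sdr, isr, sir, sar = museval.evaluate(references, estimates)
for name, stem_sdr in zip(sources, sdr):
    print(f"{name}: {np.nanmedian(stem_sdr):.3f} dB")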
ari-ruokamo commented 4 weeks ago

@starrytong thanks for the info. And it seems my local codebase doesn't include the July 26th validation fix.

Edit: after rebasing my local git copy to match the remote, I can see that the validation total NSDR drops to ~8.2 at the current stage of training.