761347 opened this issue 2 months ago
I think it is difficult to answer without more information. SoundStream tries to produce audio whose spectrogram is close to the input's. How do you measure your error? Does the model learn the audio but with a lot of noise on top, or is the output completely random? If it is not random, you can probably start adjusting parameters; if it is random, you should check whether the data is being fed correctly. You could also try my pretrained model on your data to see whether its loss differs from yours.
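To make the "noisy but correlated" vs. "completely random" distinction concrete, one rough check is a log-magnitude-spectrogram distance between the input and the reconstruction. This is my own sketch (the function name and settings are illustrative, not part of the repo): if the error barely changes when you swap in an unrelated input, the output is essentially random; if it is large but clearly smaller than for unrelated audio, the model has learned something and is worth tuning.

```python
import torch

# Illustrative helper (not from the repo): mean absolute difference between
# log-magnitude spectrograms of the original and reconstructed waveforms.
def spectrogram_error(original: torch.Tensor, reconstructed: torch.Tensor) -> float:
    n = min(original.shape[-1], reconstructed.shape[-1])  # align lengths
    window = torch.hann_window(512)

    def logmag(x: torch.Tensor) -> torch.Tensor:
        spec = torch.stft(x[..., :n], n_fft=512, window=window,
                          return_complex=True)
        return torch.log1p(spec.abs())

    return (logmag(original) - logmag(reconstructed)).abs().mean().item()
```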
Hi kaiidams,
First of all, thank you for your quick response. I am measuring the error the same way you do, using the same metrics. The output audio is noise; the frequency and amplitude change, but in essence it is noise. I have created a repository where you can listen to the input and output audio: https://github.com/761347/audio
If I use your pretrained model directly, something similar happens to what I get with the modified models.
I have tried several processes:
In both cases, the predominant loss term is the reconstruction loss (g_rec_loss).
I apologize for the uncertainty; this is the first time I have worked on anything in this area. Thank you in advance!!
ORIGINAL_AUDIO.wav has very low signal levels (< 0.03), while the model expects normalized audio as input.
The code below produces a noisy sound followed by laughter. I think the noisy sound is mainly because the model was trained only on speech. You could probably try normalizing the audio before feeding it to your model:
```python
import torch
import torchaudio

# Load the pretrained 16 kHz SoundStream model.
model = torch.hub.load("kaiidams/soundstream-pytorch", "soundstream_16khz")
x, sr = torchaudio.load('/content/ORIGINAL_AUDIO.wav')
x, sr = torchaudio.functional.resample(x, sr, 16000), 16000
x = 0.9 * x / x.max()  # normalize so the peak is at 0.9
with torch.no_grad():
    y = model.encode(x)
    # y = y[:, :, :4]  # if you want to reduce code size
    z = model.decode(y)
torchaudio.save('output.wav', z, sr)
```
Thank you so much for your quick reply. The mistake I was making was exactly the one you describe: my new dataset normalizes the audio, but when I applied the model to the audio I wasn't normalizing it. Thank you very much!
Now I'm trying to train the network, but I can't get rid of the noise. I have tried different precisions and learning rates (9e-3 to 1e-6). Can you think of anything I could try to improve it? I am training the network with a single audio file.
You can see the results in the following plots: codes entropy, num replaced, and g_rec_loss.
These are numbers for LIBRISPEECH.
| g_stft_loss | g_wave_loss | g_feat_loss | g_rec_loss | q_loss | g_loss | codes_entropy | d_stft_loss | d_wave_loss | d_loss | num_replaced | epoch | step |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8.765625 | 2.03125 | 0.035614 | 13.462036 | 0.385002 | 20.735474 | 6.826962 | 0.0 | 1.387695 | 1.041016 | 0.0 | 24 | 21487 |
Spikes of entropy are expected in your case; it jumps when some of the codes are replaced. Your reconstruction loss is flat after 1.5k steps, and higher than mine. What is the total duration of your training audio? If it is too short, the model may stop learning after only a few steps.
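For intuition, a codes-entropy metric like the one in the table can be understood as the empirical entropy (in bits) of codebook usage. A minimal sketch, assuming a tensor of integer code indices and a 1024-entry codebook (my assumptions, not necessarily the repo's exact implementation): with a single short training clip, few distinct codes get used, so the entropy stays low and learning plateaus early.

```python
import torch

# Empirical entropy (bits) of code usage over a tensor of code indices.
# Illustrative sketch; maximum is log2(codebook_size) for uniform usage.
def codes_entropy(codes: torch.Tensor, codebook_size: int = 1024) -> float:
    counts = torch.bincount(codes.flatten(), minlength=codebook_size).float()
    probs = counts / counts.sum()
    probs = probs[probs > 0]  # drop unused codes to avoid log(0)
    return -(probs * probs.log2()).sum().item()
```

For example, if every code in a 1024-entry codebook is used equally often, the entropy is log2(1024) = 10 bits; if only one code is ever emitted, it is 0.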
Hi kaiidams!!
I'm trying to use your code as a starting point, with non-music, non-speech audio as input, but I'm new to neural networks. In my first attempts the error remains very high and I can't train it correctly. Do you have any advice on how to proceed? Any ideas will help me keep looking for a solution. Although I know the network was not originally intended for this kind of audio, I think I could end up adapting it.
Thank you very much in advance!
Adam