761347 opened this issue 2 months ago
I think it is difficult to answer without more information. SoundStream tries to produce audio whose spectrogram is close to the input's. How do you measure your error? Does the model learn the audio but with a lot of noise on top, or is the output completely random? If it is not random, you can probably start adjusting parameters; if it is random, you should check whether the data is being fed correctly. You could also try my pretrained model on your data to see whether its loss differs from yours.
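To make the "noisy but correlated" vs. "completely random" distinction concrete, one rough check is a log-magnitude-spectrogram distance between the input and the reconstruction. This is my own sketch (the function name and settings are illustrative, not part of the repo): if the error barely changes when you swap in an unrelated input, the output is essentially random; if it is large but clearly smaller than for unrelated audio, the model has learned something and is worth tuning.

```python
import torch

# Illustrative helper (not from the repo): mean absolute difference between
# log-magnitude spectrograms of the original and reconstructed waveforms.
def spectrogram_error(original: torch.Tensor, reconstructed: torch.Tensor) -> float:
    n = min(original.shape[-1], reconstructed.shape[-1])  # align lengths
    window = torch.hann_window(512)

    def logmag(x: torch.Tensor) -> torch.Tensor:
        spec = torch.stft(x[..., :n], n_fft=512, window=window,
                          return_complex=True)
        return torch.log1p(spec.abs())

    return (logmag(original) - logmag(reconstructed)).abs().mean().item()
```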
Hi kaiidams,
First of all, thank you for your quick response. I am measuring the error the same way you do, using the same metrics. The output audio is noise; the frequency and amplitude change, but in essence it is noise. I have created a repository where you can listen to the input and output audio: https://github.com/761347/audio
If I use your pretrained model directly, something similar happens to what I get with the modified models.
I have tried several processes:
In both cases, the predominant loss term is the reconstruction loss (g_rec_loss).
I apologize for the uncertainty; this is the first time I have worked on anything in this area. Thank you in advance!!
ORIGINAL_AUDIO.wav has very low signal levels (< 0.03), while the model expects normalized audio as input.
The code below produces a noisy sound followed by laughter. I think the noisy sound is mainly because the model was trained only on speech. You could probably try normalizing the audio before feeding it to your model:
```python
import torch
import torchaudio

# Load the pretrained 16 kHz SoundStream model.
model = torch.hub.load("kaiidams/soundstream-pytorch", "soundstream_16khz")
x, sr = torchaudio.load('/content/ORIGINAL_AUDIO.wav')
x, sr = torchaudio.functional.resample(x, sr, 16000), 16000
x = 0.9 * x / x.max()  # normalize so the peak is at 0.9
with torch.no_grad():
    y = model.encode(x)
    # y = y[:, :, :4]  # if you want to reduce code size
    z = model.decode(y)
torchaudio.save('output.wav', z, sr)
```
Thank you so much for your quick reply. The mistake I was making was exactly the one you describe: my new dataset normalizes the audio, but when I applied the model to the audio I wasn't normalizing it. Thank you very much!
Now I'm trying to train the network, but I can't get rid of the noise. I have tried different precisions and learning rates (9e-3 to 1e-6). Can you think of anything I could try to improve it? I am training the network with a single audio file.
You can see the results in the following plots: codes entropy, num replaced, and g_rec_loss.
These are numbers for LIBRISPEECH.
| g_stft_loss | g_wave_loss | g_feat_loss | g_rec_loss | q_loss | g_loss | codes_entropy | d_stft_loss | d_wave_loss | d_loss | num_replaced | epoch | step |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8.765625 | 2.03125 | 0.035614 | 13.462036 | 0.385002 | 20.735474 | 6.826962 | 0.0 | 1.387695 | 1.041016 | 0.0 | 24 | 21487 |
Spikes of entropy are expected in your case; it jumps when some of the codes are replaced. Your reconstruction loss is flat after 1.5k steps, and higher than mine. What is the total duration of your training audio? If it is too short, the model may stop learning after only a few steps.
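For intuition, a codes-entropy metric like the one in the table can be understood as the empirical entropy (in bits) of codebook usage. A minimal sketch, assuming a tensor of integer code indices and a 1024-entry codebook (my assumptions, not necessarily the repo's exact implementation): with a single short training clip, few distinct codes get used, so the entropy stays low and learning plateaus early.

```python
import torch

# Empirical entropy (bits) of code usage over a tensor of code indices.
# Illustrative sketch; maximum is log2(codebook_size) for uniform usage.
def codes_entropy(codes: torch.Tensor, codebook_size: int = 1024) -> float:
    counts = torch.bincount(codes.flatten(), minlength=codebook_size).float()
    probs = counts / counts.sum()
    probs = probs[probs > 0]  # drop unused codes to avoid log(0)
    return -(probs * probs.log2()).sum().item()
```

For example, if every code in a 1024-entry codebook is used equally often, the entropy is log2(1024) = 10 bits; if only one code is ever emitted, it is 0.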
Hi kaiidams!!
I'm trying to use your code as a starting point, with non-music, non-speech audio as input, but I'm new to neural networks. In my first attempts the error remains very high and I can't train it correctly. Do you have any advice on how to proceed? Any ideas will help me keep looking for a solution. Although I know the network was not originally intended for this kind of audio, I think I could end up adapting it.
Thank you very much in advance!
Adam