kaiidams / soundstream-pytorch

Unofficial SoundStream implementation in PyTorch with training code and a 16 kHz pretrained checkpoint
MIT License

How to train a new set of data? #1

Open · a897456 opened this issue 10 months ago

a897456 commented 10 months ago

Thanks for your code. I want to learn how to use your model to train on a new dataset. Can you provide a train.py file?

kaiidams commented 10 months ago

Do you mean you want to train the SoundStream model with new training data, or train another model that uses the output of SoundStream as features? In the first case, you can run `python soundstream.py`, which should download LIBRISPEECH under ./data and start training.

a897456 commented 10 months ago

Thank you for your reply. First, I found that your SoundStream training script needs to download data (YESNO or LIBRISPEECH), which is very time-consuming, so I downloaded other data in advance. Second, I mean the first case: I want to use your SoundStream model to train on a dataset with a sample rate of 8 kHz which I have already downloaded, but I don't know how to load it into your model.

a897456 commented 10 months ago

My ultimate goal is low-bit-rate compression. I would like to train on a dataset with a sample rate of 8 kHz with your model, change num_embeddings from 1024 to 256 and num_quantizers from 8 to 6, and see what the end result is.

a897456 commented 10 months ago

[screenshots of the attempted changes]

kaiidams commented 10 months ago

`ds` is not a string of a directory path, but a `torch.utils.data.Dataset`. If you want to train an 8 kHz model with LIBRISPEECH, you can change `sample_rate`. If you want to use your own custom dataset, you can implement your own Dataset, which should not be too difficult.
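
As a starting point, here is a minimal sketch of such a custom Dataset for a folder of 8 kHz WAV files. It assumes the training loop only needs a 1-D waveform tensor per item; the actual item format expected by the collate function in soundstream.py may differ, so adapt the return value accordingly.

```python
# Minimal sketch of a custom Dataset of WAV files (assumption: the training
# loop consumes one mono waveform tensor per item; adapt to soundstream.py).
from pathlib import Path

import torch
import torchaudio


class WavFolderDataset(torch.utils.data.Dataset):
    def __init__(self, root: str, sample_rate: int = 8000):
        self.files = sorted(Path(root).rglob("*.wav"))
        self.sample_rate = sample_rate

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, index: int) -> torch.Tensor:
        waveform, sr = torchaudio.load(str(self.files[index]))
        if sr != self.sample_rate:
            waveform = torchaudio.functional.resample(waveform, sr, self.sample_rate)
        return waveform[0]  # mono waveform, shape (num_samples,)
```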

a897456 commented 10 months ago

Excuse me again. I have successfully started training, and the training data is the same as yours; the difference is that my data was downloaded in advance. During training, when the epoch reached 98, an inexplicable error occurred which seemed to be a problem with the data. However, the data is the same as yours, so I don't understand why this error occurred. Have you encountered it before? [error screenshot]

a897456 commented 10 months ago


[screenshots] I tried to continue training from the checkpoint at epoch 98 and found no errors, so this issue can be considered avoided for now. Second question: I haven't found a testing procedure for the final model. Can you provide guidance?

kaiidams commented 10 months ago

If you just want to hear the output yourself, you can encode the audio file by calling the forward() method: https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L500 . If you want to compute ViSQOL, sorry, there is no implementation for that.

a897456 commented 10 months ago

> If you just want to hear the output yourself, you can encode the audio file by calling the forward() method. If you want to compute ViSQOL, sorry, there is no implementation for that.

First, how do you determine that your model is useful? What are the evaluation metrics? Second, how do I use an output checkpoint such as "epoch=84-step=150000.ckpt" to check whether the model works?

kaiidams commented 10 months ago

You can listen to reconstructed audio here: https://github.com/kaiidams/soundstream-pytorch#sample-audio . You may reconstruct some of your own audio files to judge whether it is good enough for your purpose. The checkpoint is a PyTorch Lightning checkpoint; you can load the model as described at https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html :

 model = StreamableModel.load_from_checkpoint("/path/to/checkpoint.ckpt")
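
A rough end-to-end sketch of reconstructing a file with a loaded checkpoint follows. It assumes StreamableModel can be imported from soundstream.py and that forward() accepts a (batch, samples) waveform at the training sample rate and returns a reconstructed waveform; check soundstream.py around L500 for the actual input and output conventions.

```python
# Hypothetical reconstruction sketch; the exact forward() signature may differ.
import torch
import torchaudio
from soundstream import StreamableModel  # assumes soundstream.py is importable

model = StreamableModel.load_from_checkpoint("/path/to/checkpoint.ckpt")
model.eval()

waveform, sr = torchaudio.load("input.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16000)  # match training rate

with torch.no_grad():
    reconstructed = model(waveform)  # assumed to return a waveform tensor

torchaudio.save("output.wav", reconstructed.reshape(1, -1).cpu(), 16000)
```
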
a897456 commented 10 months ago

> You can listen to reconstructed audio here: https://github.com/kaiidams/soundstream-pytorch#sample-audio . You may reconstruct some of your own audio files to judge whether it is good enough for your purpose. The checkpoint is a PyTorch Lightning checkpoint.

[screenshots] First, I think I have completed the 150 training epochs, as shown in the picture. Second, regarding `model = StreamableModel.load_from_checkpoint("/path/to/checkpoint.ckpt")`: can I load the .ckpt file labeled in the second image to reconstruct my speech signal, in order to verify that the model works? Third, is the .ckpt file labeled in the second image the final trained model? I'm sorry, I'm a novice, so there may be many naive questions.


a897456 commented 10 months ago

First, I think I may have completed the reconstruction of the voice signal; I followed your method and completed the main function. [screenshot] Second, what do you think the PESQ score of the output file is? The input and output files are located below.

https://drive.google.com/drive/folders/1mvyg_CRxI6LGlVXbYu0OHKmhiAV_E-bK?usp=drive_link

First, I'm sorry, please forgive me for being a novice: I hadn't granted access to the audio files earlier, but you can now access them. Second, a few days ago the training stopped with an unknown error at epoch 84; the model generated at that point and the model generated after 150 epochs of training produce very different output files. Is the number of epochs too large?

kaiidams commented 10 months ago

SoundStream has a couple of loss functions. You can use TensorBoard to look at these losses; if some of them behave strangely, you may adjust the parameters.

https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L175C1-L180C1

I used the same strides as the original paper for 16 kHz: 2 × 4 × 5 × 8 = 320 window size. This makes 50 Hz embeddings. The original SoundStream is for 24 kHz, which makes 75 Hz embeddings. So an 8 kHz model has to compress a longer audio window into each embedding.

a897456 commented 10 months ago

> SoundStream has a couple of loss functions. You can use TensorBoard to look at these losses; if some of them behave strangely, you may adjust the parameters. ... So an 8 kHz model has to compress a longer audio window into each embedding.

First, I studied TensorBoard for several hours today, but I haven't made any progress yet. My understanding is this: to use TensorBoard I would train with the model.fit() function, but for now I train with the pl.Trainer.fit() function. Do I need to change the training function if I want to use TensorBoard? How should I use TensorBoard? Second, you mentioned that "the 8 kHz model has to compress a longer audio window into an embedding"; I want to change the window size to 2 × 2 × 5 × 8 = 160. Am I understanding this correctly?

kaiidams commented 10 months ago

It seems that TensorBoard is not enabled by default. If you enable it, you'll find lightning_logs/version_X/event.xxxx.yyyy in your output. You can launch it with `tensorboard --logdir lightning_logs/version_X/`.

https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L663C26-L663C26

Or you can find the CSV file lightning_logs/version_X/metrics.csv.

2 × 2 × 5 × 8 = 160 makes the window size 160. I think it is a good number.
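
For reference, a generic PyTorch Lightning sketch of attaching a TensorBoard logger to the Trainer; this is not the repository's exact wiring, and the save_dir/name values are placeholders.

```python
# Generic PyTorch Lightning sketch (not the repository's exact code) showing
# how to attach a TensorBoard logger to the Trainer used by pl.Trainer.fit().
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

logger = TensorBoardLogger(save_dir="lightning_logs", name="soundstream_8khz")
trainer = pl.Trainer(max_epochs=200, logger=logger)
# trainer.fit(model, train_dataloader, val_dataloader)

# Then inspect the losses with:
#   tensorboard --logdir lightning_logs/soundstream_8khz
```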

a897456 commented 10 months ago

> It seems that TensorBoard is not enabled by default. If you enable it, you'll find lightning_logs/version_X/event.xxxx.yyyy in your output. ... 2 × 2 × 5 × 8 = 160 makes the window size 160. I think it is a good number.

First, after seeing your reply I spent several hours trying to open TensorBoard and found that just setting logger=True suffices. I'm so happy. [TensorBoard screenshots] Second, if I want to achieve a low bit rate such as 1.2 kbps at an 8 kHz sampling rate with a window size of 320, then 1200 × 320 / 8000 = 48 bits per window, which can be covered with six 8-bit codebooks. If we still use six 8-bit codebooks to achieve 0.6 kbps, we need 600 × 640 / 8000 = 48 bits, which means the window size has to change from 320 to 640. So I seem to need to increase the window size; do you agree?

kaiidams commented 10 months ago

If you just want to achieve a low bitrate, you can just reduce the number of quantizers, without retraining. https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L241 Many of the latest neural codecs (SoundStream and Meta's EnCodec, https://github.com/facebookresearch/encodec) adopt a hierarchical (residual) quantized autoencoder so that they can achieve adjustable bit rates. However, note that dropping quantizers doesn't reduce computational cost.

I think a longer window size is difficult to learn, as the audio signal is stationary over short times, but not over longer times.
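
A hypothetical sketch of what "reduce the number of quantizers at inference time" could look like. The encode/decode names, tensor layout, and checkpoint path are assumptions; see soundstream.py around L241 and L264 for the actual API.

```python
# Hypothetical sketch of bitrate scaling by dropping residual quantizers at
# inference time. The encode/decode names and tensor layout are assumptions;
# consult soundstream.py for the real interface.
import torch
from soundstream import StreamableModel  # assumes soundstream.py is importable

model = StreamableModel.load_from_checkpoint("/path/to/checkpoint.ckpt").eval()
waveform = torch.randn(1, 16000)  # one second of dummy 16 kHz audio

n = 4  # keep only the 4 most important codebooks out of 8
with torch.no_grad():
    codes = model.encode(waveform)        # assumed shape: (batch, frames, 8)
    codes_low_rate = codes[:, :, :n]      # residual VQ: the first codes matter most
    audio = model.decode(codes_low_rate)  # reconstruct from fewer codebooks

# At 16 kHz with 320-sample frames (50 frames/s) and 10-bit codebooks:
# bitrate = 50 * 10 * n  ->  n=8: 4 kbps, n=4: 2 kbps, n=2: 1 kbps
```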

a897456 commented 10 months ago

> If you just want to achieve a low bitrate, you can just reduce the number of quantizers, without retraining. ... However, note that dropping quantizers doesn't reduce computational cost. I think a longer window size is difficult to learn, as the audio signal is stationary over short times, but not over longer times.

First, I studied traditional speech compression before, using codebooks for quantization; for example, 48 bits to quantize a feature vector, distributed like this: the number of codebooks is 6, and each codebook uses 8 bits, i.e. 2^8 = 256 code vectors. But I may have misunderstood SoundStream, because I always thought num_quantizers was equal to the number of codebooks in the traditional algorithm and num_embeddings was equal to dim_codebook in the traditional algorithm. Should I instead treat embedding_dim in SoundStream as dim_codebook?

Second, in your model, num_quantizers=6, num_embeddings=1024, embedding_dim=512. How do I calculate the compression bitrate? It is a parameter in kbps, and I want to know what the bit rate is. Can you show me?

Third, I've been following EnCodec, but I haven't started learning it yet. You say SoundStream also adopts a hierarchical quantized autoencoder; is that in your code? If you have implemented that part, can you show me?

Fourth, your reminder is right: the window size cannot be greater than 320, since the short-term stationarity of speech is about 20-30 ms. But I would like to ask: can the multi-frame joint quantization idea from traditional compression algorithms be used in SoundStream?

a897456 commented 10 months ago


    [self.register_buffer("code_count", torch.empty(num_quantizers, num_embeddings))](https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L225)

Regarding the second question I mentioned yesterday, I have some new ideas, and I don't know if they are correct. The bits per frame are num_quantizers × log2(num_embeddings) = 80 bits, and the bit rate is 80 bits × (16000 Hz / 320 = 50 frames/s) = 4 kbps. Am I right?

kaiidams commented 10 months ago

> First, I studied traditional speech compression before, using codebooks for quantization, such as 48 bits to quantize a feature...

Yes, you are right, num_quantizers is the number of codebooks. SoundStream has 8 codebooks and each codebook has 1024 codes. Then one frame is encoded with 8 × log2(1024) = 80 bits. In the original paper, the frame rate is 75 Hz for a 24 kHz sampling rate. This produces 75 × 80 = 6 kbps, or 4 kbps in the case of 16 kHz.

> Third, I've been following EnCodec, but I haven't started learning it yet. You say SoundStream also adopts a hierarchical quantized autoencoder; is that in your code? If you have implemented that part, can you show me?

The algorithm is explained in Algorithm 1: Residual Vector Quantization of https://arxiv.org/pdf/2107.03312.pdf. This produces 8 × 10-bit codes, in which the first code is the most important and the last is the least important. You can reproduce the original vector using only some of the codes, for example the first 5, and then you achieve 5 × 10 × 50 = 2.5 kbps.

At inference time, you can pass n codes, where n is between 1 and 8. https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L264

At training time, it drops the less important codes randomly so that it can reproduce audio with only the important codes. https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L241
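
The arithmetic above can be written out explicitly. This is just a worked example of bitrate = frame_rate × n_codebooks × log2(codebook_size), not code from the repository.

```python
# Worked bitrate example for the numbers discussed in this thread.
import math

def bitrate_bps(sample_rate: int, hop: int, n_codebooks: int, codebook_size: int) -> float:
    """bitrate = frame_rate * n_codebooks * log2(codebook_size)."""
    frame_rate = sample_rate / hop                      # embeddings per second
    bits_per_frame = n_codebooks * math.log2(codebook_size)
    return frame_rate * bits_per_frame

print(bitrate_bps(24000, 320, 8, 1024))  # 6000.0 bps, the original 24 kHz setup
print(bitrate_bps(16000, 320, 8, 1024))  # 4000.0 bps, this repository's 16 kHz model
print(bitrate_bps(16000, 320, 5, 1024))  # 2500.0 bps, keeping only 5 codebooks
```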

a897456 commented 10 months ago

> Yes, you are right, num_quantizers is the number of codebooks. SoundStream has 8 codebooks and each codebook has 1024 codes. Then one frame is encoded with 8 × log2(1024) = 80 bits. ... You can reproduce the original vector using only some of the codes, for example the first 5, and then you achieve 5 × 10 × 50 = 2.5 kbps.

Thank you for your reply; your replies are my motivation to continue studying. My understanding is this: I just load your pre-trained model (soundstream_16khz-20230425.ckpt) and then change the value of n, and I can achieve a variety of bit rates without retraining, e.g. n=4: 4 × 10 × 50 = 2 kbps; n=2: 2 × 10 × 50 = 1 kbps.

> At inference time, you can pass n codes, where n is between 1 and 8. https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L264

I want to achieve a lower speech compression bit rate by changing the sampling rate to 8 kHz (which should be the lowest) and changing the strides to 2 × 4 × 5 × 6 = 240, which corresponds to the 8 kHz sample rate. num_quantizers=8 and num_embeddings=1024 remain unchanged, epoch=200. Then I will compare the results with your 16 kHz model by changing n in the same way.


> At training time, it drops the less important codes randomly so that it can reproduce audio with only the important codes.

Can the value of n equal 1, keeping only the most important code? https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L241

a897456 commented 10 months ago

When I change the strides to 2 × 4 × 5 × 6 = 240, I need to change segment_length from 32270 to 30430, which I calculated so that the strides divide it evenly. So I would like to ask whether the 32270 you set was also chosen to be divisible by the strides. Can I make it bigger or smaller? I wonder whether it could be bigger, because it would include more of the input x. Am I right? [screenshot]

a897456 commented 10 months ago

I am a PhD student and I want to publish a paper based on SoundStream, but I haven't found any innovation yet. Can you give me some guidance on SoundStream? For example, where can I continue to improve it?

At first I wanted to use SoundStream to achieve lower bitrates, but I found that SoundStream already supports that by changing n, or by retraining with new num_quantizers and num_embeddings, so I couldn't find a new idea. Can you suggest something? Thanks.

a897456 commented 10 months ago

[screenshots from the paper] In the paper, the authors mention that, with the coding rate kept the same, different strides will not affect the final score.

So my idea of retraining a new model to achieve a lower bit rate by changing the strides to 2 × 4 × 5 × 6 = 240 might not work.

kaiidams commented 9 months ago

> Your reply is my motivation to continue studying.

Thank you! I'm glad to hear that.

> I need to change segment_length from 32270 to 30430.

32270 is a nice number in that the output length of the decoder is the same as the input length of the encoder; they are sometimes different because of rounding. I think 30430 is a good number for 2 × 4 × 5 × 6.

> For example, where can I continue to improve SoundStream?

I'm not sure, but you may try a variable rate. SoundStream is fixed-rate in time; that might be enough when the audio signal is not so complicated.

BTW, Meta's EnCodec https://github.com/facebookresearch/encodec is almost the same as SoundStream. They claim that using a loss balancer stabilizes training, whereas SoundStream's loss weights are manually tuned.

a897456 commented 9 months ago

> If you just want to hear the output yourself, you can encode the audio file by calling the forward() method. If you want to compute ViSQOL, sorry, there is no implementation for that.

https://github.com/aliutkus/speechmetrics I tested it with PESQ today and found that the PESQ results were not very good. Did you not use ViSQOL or PESQ evaluation tools at the time?

a897456 commented 9 months ago

> For example, where can I continue to improve SoundStream?
>
> I'm not sure, but you may try a variable rate. SoundStream is fixed-rate in time; that might be enough when the audio signal is not so complicated.

Can you be more specific? I am a beginner in audio compression and my research direction is very-low-bit-rate compression. I feel that you are an expert in this field, so I would like to hear your specific opinion.

a897456 commented 9 months ago

> BTW, Meta's EnCodec https://github.com/facebookresearch/encodec is almost the same as SoundStream. They claim that using a loss balancer stabilizes training, whereas SoundStream's loss weights are manually tuned.

I am already following both models, SoundStream and EnCodec, which are very close to my research direction. My plan is this: for my first paper I want to do some research based on SoundStream, but I haven't found a suitable research topic yet; for the second paper I want to do some research based on EnCodec. So I have been studying SoundStream recently and will start studying EnCodec after the New Year.

kaiidams commented 9 months ago

> https://github.com/aliutkus/speechmetrics I tested it with PESQ today and found that the PESQ results were not very good. Did you not use ViSQOL or PESQ evaluation tools at the time?

Thank you for the input. I have used neither of them; I'll look into them.

> Can you be more specific?

This is just a random idea. The codec keeps a constant rate, which is chosen by the user. For example, if you decide to use the first 5 codebooks, then the data rate is constantly 6 kbps × 5 / 8 = 3.75 kbps, even when there is no audio signal. Maybe the codec could decide by itself how many codebooks to use for the given audio, spending more bits on important frames and fewer bits on less important frames.
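
A toy sketch of that idea, purely illustrative and not part of SoundStream or this repository: pick the number of codebooks per frame from a simple importance measure such as frame energy, and transmit n alongside the codes.

```python
# Toy illustration of per-frame variable rate: spend more codebooks on
# high-energy frames and fewer on quiet ones. A sketch of the idea only.
import torch

def choose_codebooks_per_frame(frames: torch.Tensor,
                               n_min: int = 2, n_max: int = 8) -> torch.Tensor:
    """frames: (num_frames, frame_len) waveform frames -> (num_frames,) code counts."""
    energy = frames.pow(2).mean(dim=1)                       # per-frame energy
    # normalize energy to [0, 1] and map it linearly to [n_min, n_max]
    scale = (energy - energy.min()) / (energy.max() - energy.min() + 1e-8)
    return (n_min + scale * (n_max - n_min)).round().long()

frames = torch.randn(100, 320) * torch.linspace(0.1, 1.0, 100).unsqueeze(1)
n_per_frame = choose_codebooks_per_frame(frames)
avg_bps = (n_per_frame.float().mean() * 10 * 50).item()     # 10 bits/code, 50 frames/s
print(n_per_frame[:10], avg_bps)
```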

a897456 commented 9 months ago

You set the strides 2 × 4 × 5 × 8 = 320 and segment_length=32270; then output = Encoder(input), and the output is a tensor of shape (32, 102, 512).

When I set the strides 2 × 4 × 5 × 6 = 240 and segment_length=30430, the output is a tensor of shape (32, 127, 512).

So I want to ask two questions:

First, can I increase the segment_length, like 5000? Because I noticed that this value is related to the valid content of the input.

Second, if I increase segment_length and decrease the strides, the second dimension of the output tensor also increases, for example from your 102 to my 127 or more, and I know that the second dimension is related to the number of codebooks, which is 1024. What is the effect of increasing this second value?

kaiidams commented 9 months ago

> First, can I increase the segment_length, like 5000? Because I noticed that this value is related to the valid content of the input.

Yes, you can change the segment size, as long as it is not too short. 5,000 might be too short, as training is computationally inefficient when the segment size is too small.

> I know that the second dimension is related to the number of codebooks.

The second dimension of an encoded tensor is the time axis. If the output is a tensor of shape (32, 102, 512), it encodes 102 × 320 = 32640 samples, ignoring padding. The third dimension is the hidden dimension. We have 102 embedding vectors along time; each embedding vector is quantized using 8 codebooks, which produces 102 × 8 codes, i.e. a tensor of shape (32, 102, 8).
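
The shape bookkeeping above can be checked with a few lines; the segment lengths and strides are the values discussed in this thread, and the exact frame counts depend on how the encoder pads the edges.

```python
# Frame-count arithmetic for the shapes discussed above; exact counts depend
# on encoder padding (the thread reports 102 and 127 frames).
def num_frames(segment_length: int, strides: tuple) -> int:
    hop = 1
    for s in strides:
        hop *= s
    return segment_length // hop  # floor, ignoring padding

print(num_frames(32270, (2, 4, 5, 8)))  # 100 at hop 320; 102 once padding is included
print(num_frames(30430, (2, 4, 5, 6)))  # 126 at hop 240; 127 once padding is included
```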

a897456 commented 9 months ago

> Yes, you can change the segment size, as long as it is not too short. 5,000 might be too short.

I made a mistake: I wanted to ask about 50000, not 5000, because I thought the larger the number, the more of x it represents. In fact, I found that increasing the number doubles the training time, so I'll pick an appropriate integer around the 32270 you set, not too big or too small, such as 30000-35000.

> We have 102 embedding vectors along time; each embedding vector is quantized using 8 codebooks, which produces 102 × 8 codes, i.e. a tensor of shape (32, 102, 8).

Thanks, I get it. My idea is to reduce the strides while increasing segment_length, keep the number of codebooks the same (because I think changing these three parameters may help with lower-bit-rate compression), and look at the PESQ score.

a897456 commented 9 months ago

https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L647 Is the weight here computed from only one batch? Is there a missing for loop here, like "for batch in iterator" or something? My understanding is this: the weight is the initial codebook, and all the data needs to be clustered, for example into 1024 categories, to form the 1024 codebook entries. Did I get it wrong?

kaiidams commented 9 months ago

    torch.nn.init.normal_(model.quantizer.weight, mean=mean, std=std)

This initializes the codebook. The weight of the codebook is 8 × 1024 × 512: the number of codebooks is 8 and the number of codes in one codebook is 1024. At the beginning of training, we want this to be close to the distribution of the encoder's outputs. On page 5 of the paper, it says:

> for initialization of the codebook vectors, we run the k-means algorithm on the first training batch and use the learned centroids as initialization

I skipped this and just initialized with a Gaussian fitted to the first training batch, as I didn't want to run k-means in the training code. The code calls the model, but this is required only once.
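
For comparison, here is a rough sketch of the paper's k-means initialization, assuming you can get the encoder outputs of the first batch as a (num_vectors, embedding_dim) tensor named `features`, and that quantizer.weight has the (num_quantizers, num_embeddings, embedding_dim) layout described above. This is not the repository's code.

```python
# Sketch of the paper's k-means codebook initialization (not the repository's
# code, which fits a Gaussian to the first batch instead).
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def kmeans_init_codebooks(model, features: torch.Tensor) -> None:
    """features: encoder outputs of the first batch, shape (num_vectors, embedding_dim)."""
    num_quantizers, num_embeddings, _ = model.quantizer.weight.shape
    kmeans = KMeans(n_clusters=num_embeddings, n_init=10).fit(features.cpu().numpy())
    centroids = torch.from_numpy(kmeans.cluster_centers_).to(model.quantizer.weight)
    # The paper initializes the residual codebooks from data; reusing the same
    # centroids for every quantizer is a simplification of that idea.
    for q in range(num_quantizers):
        model.quantizer.weight[q].copy_(centroids)
```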

a897456 commented 9 months ago

[reconstruction loss formula from the SoundStream paper]

First, do you know what the second part of the reconstruction loss formula in the paper is? Why take the log of the mel spectrogram?

https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L385 Second, your code uses the STFT instead of the mel spectrogram. Is the STFT you use an intermediate quantity in the computation of the mel spectrogram?

https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L383 https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L347 Third, do the loss of the STFT discriminator and the STFT reconstruction loss count as duplicates?

kaiidams commented 9 months ago

> Why take the log of the mel spectrogram?

The log mel-spectrogram is believed to be close to human perception: S(x) above is linear in power, but human perception is closer to the log of power. Probably the second part (5) is what we really want to minimize.

> Second, your code uses the STFT instead of the mel spectrogram. Is the STFT you use an intermediate quantity in the computation of the mel spectrogram?

I didn't understand what 'intermediate parameter' means, but the loss formula originally comes from https://arxiv.org/pdf/2008.01160.pdf, where they use the STFT rather than the mel-spectrogram of the SoundStream paper.

> Third, do the loss of the STFT discriminator and the STFT reconstruction loss count as duplicates?

The STFT discriminator is based on the GAN technique, while STFT reconstruction is a regression loss; they are different. GAN losses are used for audio generation (and image generation) because there are many possible outputs that sound good to humans but differ greatly under a regression loss, for example because of a shifted phase in the audio (or a shifted image). However, a GAN alone doesn't generate output close to the original, so the weaker regression losses are used to help audio generation.
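
As a small sketch of the kind of spectral reconstruction loss being discussed here, the snippet below combines a linear-magnitude term and a log-magnitude term over mel spectrograms; the window sizes, epsilon, and weighting are illustrative placeholders, not the values used in soundstream.py.

```python
# Illustrative multi-scale mel reconstruction loss (linear + log magnitude).
# Hyperparameters are placeholder choices, not the repository's values.
import torch
import torchaudio

def mel_reconstruction_loss(x: torch.Tensor, y: torch.Tensor,
                            sample_rate: int = 16000) -> torch.Tensor:
    loss = x.new_zeros(())
    for n_fft in (512, 1024, 2048):
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=n_fft,
            hop_length=n_fft // 4, n_mels=64).to(x.device)
        sx, sy = mel(x), mel(y)
        loss = loss + (sx - sy).abs().mean()                       # linear term
        loss = loss + (sx.clamp(min=1e-5).log()
                       - sy.clamp(min=1e-5).log()).pow(2).mean()   # log term
    return loss
```

Clamping before the log is what keeps the log term from producing NaN on silent frames, which relates to the NaN question raised later in this thread.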

a897456 commented 9 months ago

> I didn't understand what 'intermediate parameter' means.

The MFCC pipeline I have learned is as follows: for the speech signal, after windowing, get the energy distribution over frequency through the FFT (just like your STFT?), then get the power spectrum as the squared magnitude, then get the mel spectrogram through the mel filterbank, then get the Fbank through the log (just like the second part of the reconstruction loss formula?), then get the MFCC through the DCT.

So I guess the STFT your code uses is the 'intermediate quantity', i.e. the energy distribution over frequency, and the log term in the second part of the reconstruction loss formula is the Fbank, which is indeed a feature close to human perception.

kaiidams commented 9 months ago

Yes, you're right. The STFT is used to get the mel-spectrogram. The SoundStream paper uses the mel-spectrogram, and https://arxiv.org/pdf/2008.01160.pdf uses the STFT.

a897456 commented 9 months ago

Sorry to bother you again. I found that if I change batch_size from 32 to 16, the training speed increases by 10 times, but I haven't finished the training yet, so I don't know whether training this fast means the training is incomplete or ineffective. Also, why did you set batch_size to 32 in the first place?

kaiidams commented 9 months ago

Usually you want to use the largest batch size your GPU allows, for efficiency. Doubling the batch size shouldn't increase the step time by more than twice. Also, a larger batch size generally gives more stable results because larger batches have less variance. I don't know why batch_size=16 is 10 times faster, but it is great if it is still stable.

a897456 commented 8 months ago

Excuse me, have you ever tried to reconstruct the speech signal from MFCCs? Or have you seen anyone else do it?

kaiidams commented 8 months ago

I haven't tried MFCC myself. I think MFCC is not as popular as the mel-spectrogram for voice features, because deep-learning based models are strong enough; for example, HiFi-GAN and MelGAN use the mel-spectrogram. But MFCC might (or might not) be good for calculating the reconstruction loss of codecs.

a897456 commented 8 months ago

https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L408 https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L409

[loss screenshots]

Excuse me for bothering you again. May I ask why there are two losses at this place? I found that rec_loss is very large, and g_loss is also large, which is how I noticed the two losses here.

kaiidams commented 8 months ago

In my case g_rec_loss is around 10. Do you see other abnormalities?

| g_stft_loss | g_wave_loss | g_feat_loss | g_rec_loss | q_loss | g_loss | codes_entropy | d_stft_loss | d_wave_loss | d_loss | num_replaced | epoch | step |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8.765625 | 2.03125 | 0.035614 | 13.462036 | 0.385002 | 20.735474 | 6.826962 | 0.0 | 1.387695 | 1.041016 | 0.0 | 24 | 21487 |

a897456 commented 8 months ago

[formula screenshot] Did you change the mel-spectrogram to the STFT at the beginning because there are many negative numbers in the mel-spectrogram? If the log operation is performed according to the formula in the figure above, the loss has the problem of becoming NaN.

a897456 commented 8 months ago

> In my case g_rec_loss is around 10. Do you see other abnormalities?

I am trying to replace the discriminator in your code with the MSD and MPD modules of HiFi-GAN, but it has not been successful: the output speech after training is white noise. I have been looking for the reason and think the loss cannot converge, which is why the loss values you and I see look different.

In addition, I heard that HiFi-GAN's discriminator is the most useful discriminator at present, and I want to add it to your code. I have finished adding it and the code runs, but the output speech after training is always white noise. I can't find the problem; can you help me?

a897456 commented 8 months ago

Previous test results (at first the loss was over 100, but as the step increased it dropped to 20):

[loss curve screenshots]

Current test results (at the beginning it was over 100, but it did not decrease as the step increased):

[loss curve screenshot]

The g_rec_loss component of g_loss does not converge.

kaiidams commented 8 months ago

I'm not sure about the reason why.

The HiFi-GAN paper uses a large lambda for the generator's reconstruction loss: https://arxiv.org/pdf/2010.05646.pdf . You could probably try tweaking the hyperparameters, or load a known-good state_dict into the model to check whether the loss values look reasonable.
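
For reference, a sketch of the HiFi-GAN-style generator loss weighting that this alludes to; lambda_mel = 45 and lambda_fm = 2 are the values reported in the HiFi-GAN paper, and the individual loss terms are placeholder tensors rather than variables from soundstream.py.

```python
# HiFi-GAN-style generator loss weighting (lambda values from the HiFi-GAN
# paper); adv_loss / feat_match_loss / mel_loss are placeholders here.
import torch

def generator_loss(adv_loss: torch.Tensor,
                   feat_match_loss: torch.Tensor,
                   mel_loss: torch.Tensor,
                   lambda_fm: float = 2.0,
                   lambda_mel: float = 45.0) -> torch.Tensor:
    return adv_loss + lambda_fm * feat_match_loss + lambda_mel * mel_loss

# Example: the large mel weight keeps the reconstruction term dominant early on.
print(generator_loss(torch.tensor(1.0), torch.tensor(0.5), torch.tensor(0.3)))
```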
