ioyy900205 / MFNet

This repo provides the processed samples for the manuscript "A Mask Free Neural Network for Monaural Speech Enhancement", which was accepted at INTERSPEECH 2023.
MIT License

Extremely large loss when training on the VCTK dataset #4

Open taqta opened 10 months ago

taqta commented 10 months ago

Hello, your work has been a great inspiration to me, thank you very much! I wanted to test your model on the VoiceBank-DEMAND dataset. Following the paper, I trained with init-lr=0.0034 and cosine annealing plus warmup, reaching the maximum learning rate at the 5th epoch, and I built the model the way you suggested in https://github.com/ioyy900205/MFNet/issues/1. I found that the loss very easily blows up during training. Here is a screenshot of the training log:

[screenshot of training log]

As you can see, the loss reaches the order of e+17 in the 10th epoch. Did you run into this problem during training, and if so, how did you resolve it?

ioyy900205 commented 9 months ago


Hello, thank you for your attention. I checked the training code on my end, and the learning rate in my paper is indeed a mistake: it is actually 3e-4, but I wrote 0.0034, and I apologize for that. The training strategy is a single cosine annealing run that reaches the maximum learning rate at the 5th epoch; there is no issue there. I did not encounter extremely large loss values during training, and I have also confirmed that I did not use any gradient clipping (the model contains no LSTM). The training loss at my best result was 0.0134.
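The schedule described above (warmup to 3e-4 at the 5th epoch, then a single cosine annealing) can be sketched as a plain function. This is an illustrative sketch, not the author's code; in particular, the total epoch count and the linear shape of the warmup are assumptions:

```python
import math

def lr_at_epoch(epoch, max_lr=3e-4, warmup_epochs=5, total_epochs=100):
    """Linear warmup to max_lr at `warmup_epochs`, then a single
    cosine-annealing decay towards zero at `total_epochs`."""
    if epoch < warmup_epochs:
        # linear ramp: the first epoch starts at max_lr / warmup_epochs
        return max_lr * (epoch + 1) / warmup_epochs
    # single cosine decay over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```

In PyTorch the same shape is usually built with `torch.optim.lr_scheduler.LambdaLR` wrapping a function like this one.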

I noticed that your loss is very small, which is inconsistent with mine. My second observation is that there seems to be a problem with your PESQ results. In my experience, this usually indicates a problem in the signal-processing part. If you are using a third-party DCTCRN repo, its implementation of the forward and inverse DCT may be faulty. Specifically, you can check as follows: keep the clean wav unchanged (do not pass it through the forward and inverse DCT), pass the noisy signal through the forward and inverse DCT, then compute the SNR against the clean signal and see what you get.
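The suggested diagnostic can be sketched as follows. This is a minimal, self-contained illustration: a naive orthonormal DCT-II matrix stands in for whichever DCT/IDCT implementation is actually under test, and all names are illustrative. A correct transform pair should leave the noisy/clean SNR essentially unchanged after the round trip; a large drop points at the DCT/IDCT code.

```python
import math, random

def dct_matrix(n):
    """Orthonormal DCT-II matrix (row k, column i); its transpose is its inverse."""
    m = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        m.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return m

def apply(mat, vec):
    return [sum(r * v for r, v in zip(row, vec)) for row in mat]

def transpose(mat):
    return [list(col) for col in zip(*mat)]

def snr_db(clean, processed):
    sig = sum(c * c for c in clean)
    err = sum((c - p) ** 2 for c, p in zip(clean, processed))
    return 10.0 * math.log10(sig / err) if err > 0 else float("inf")

random.seed(0)
clean = [random.uniform(-1, 1) for _ in range(64)]
noisy = [c + random.gauss(0, 0.1) for c in clean]

# clean stays untouched; only noisy goes through forward + inverse DCT
D = dct_matrix(64)
roundtrip = apply(transpose(D), apply(D, noisy))

print(round(snr_db(clean, noisy), 2), round(snr_db(clean, roundtrip), 2))
```

With a correct orthonormal pair the two printed SNRs differ only by floating-point error.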

taqta commented 9 months ago

Thank you very much for the suggestion. The problem is indeed in the signal part: I was using the 'ortho' normalization when computing the DCT, as shown below:

[screenshot of the DCT code]

Because of this, even the MSE loss between noisy and clean is only on the order of e-5; without this normalization it is on the order of e-2. However, when I use the STFT as input/output, the performance on VCTK is also rather poor, with PESQ peaking at only 3.0:

[screenshot of evaluation results]

The model follows the design in https://github.com/ioyy900205/MFNet/issues/1#issuecomment-1805387135. To align lengths when processing the input, I randomly crop 2 s of data; if the audio is shorter than that, it is handled with the strategy shown below:

[screenshot of the padding strategy]

This strategy follows CMGAN's processing. Could this kind of processing cause the low PESQ?
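The scale effect of 'ortho' normalization described above can be demonstrated with a naive DCT-II (a sketch; the two conventions mirror scipy's `norm='ortho'` and its unnormalized default). With orthonormal scaling the spectral MSE equals the time-domain MSE (Parseval), while the unnormalized convention inflates it by roughly 2N, which is why the two settings differ by about three orders of magnitude at typical frame sizes:

```python
import math, random

def dct2(x, ortho):
    """DCT-II of x. With ortho=True, orthonormal scaling is applied;
    otherwise the unnormalized convention X[k] = 2 * sum_n x[n] cos(...)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(x))
        s *= (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)) if ortho else 2.0
        out.append(s)
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
n = 256
clean = [random.uniform(-1, 1) for _ in range(n)]
noisy = [c + random.gauss(0, 0.1) for c in clean]

m_time  = mse(clean, noisy)
m_ortho = mse(dct2(clean, True),  dct2(noisy, True))
m_plain = mse(dct2(clean, False), dct2(noisy, False))

# ortho spectral MSE matches the time-domain MSE; the unnormalized
# convention is larger by roughly a factor of 2 * n
print(m_time, m_ortho, m_plain / m_ortho)
```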

ioyy900205 commented 9 months ago

Hello, I guess that is done to make all utterances the same length during training. I am not sure how it affects PESQ, but I don't think the impact would be large. If you use the STDCT feature, I suggest using the compressed spectrum, which gives better results. If you use STFT features, I think a magnitude loss is essential. I don't know what your network predicts; I believe the choice of processing has a fairly large impact on the results.

taqta commented 9 months ago

Got it, thanks for the guidance! I'll give it another try.

taqta commented 9 months ago


Hello, may I ask roughly what PESQ you got after the first training epoch? When I train CMGAN on VCTK, the test-set PESQ already reaches 2.7 after the first epoch, but with MFNet, whether I use STFT or STDCT as input, the PESQ after the first epoch stays around 1.5. Does that mean some part of my code is wrong?

ioyy900205 commented 9 months ago

Hello, with warmup the learning rate in the first epoch is very small, so the result will be on the low side. For a fair comparison, you could test both models under the same learning-rate schedule. I have not trained on this dataset myself, so I can't give you an exact number.

taotaowang97479 commented 9 months ago


after 1 epochs, evaluation on test: Avg_loss: 0.02159, STOI: 0.9252, SNR: 16.1562, PESQ: 2.6024

You can use this as a reference. From my discussion with the author, the problem is 90% likely in the IDCT reconstruction.

taqta commented 9 months ago

Got it, thanks! I checked the IDCT; it does recover the original signal. Have you managed to reproduce the results?

taotaowang97479 commented 9 months ago


You can see my reproduction here: https://github.com/taotaowang97479/MFNet-SpeechEnhancement

taqta commented 9 months ago

Thanks! My results match yours, but this performance is rather poor compared to the SOTA results on this dataset, so I wonder whether there is still a problem somewhere.


taotaowang97479 commented 9 months ago

I don't think this way of processing the data has much impact either. Within a batch I pad everything with zeros to the length of the longest utterance (capped at 4 s). How are your DCT results, similar to mine?
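The zero-padding strategy described above can be sketched as a collate function. This is an illustrative sketch, not the actual training code; the sampling rate, the random crop for clips longer than the cap, and all names are assumptions:

```python
import random

SR = 16000  # assumed sampling rate

def collate(batch, max_len=4 * SR):
    """Zero-pad every utterance in the batch to the longest one,
    capped at `max_len` samples (4 s); longer clips get a random crop."""
    target = min(max(len(x) for x in batch), max_len)
    out = []
    for x in batch:
        if len(x) > target:
            start = random.randint(0, len(x) - target)
            x = x[start:start + target]
        out.append(x + [0.0] * (target - len(x)))
    return out

# three clips of 0.5 s, 1.25 s, and 4.375 s; all come out at the 4 s cap
batch = [[1.0] * 8000, [1.0] * 20000, [1.0] * 70000]
padded = collate(batch)
print([len(x) for x in padded])
```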

taqta commented 9 months ago

The best PESQ I reproduced with DCT is also 3.0, but judging from MFNet's performance on the DNS dataset, it should score higher on VCTK; after all, other models reach PESQ 3.3-3.5 on the VCTK dataset, so I keep feeling something is still off somewhere.