jackhwalters / DCS-Net

Implementation of paper "DCS-Net: Deep Complex Subtractive Neural Network for Monaural Speech Enhancement"

about hop_length #1

Open ioyy900205 opened 2 years ago

ioyy900205 commented 2 years ago

Hello author, in the config.py file, should hop_length be 256 instead of 32?

jackhwalters commented 2 years ago

Hi, the hop_length variable in config.py is set to 32; in past commits it has been higher though

ioyy900205 commented 2 years ago

I mean, considering a frame length of 32 ms and a frame shift of 16 ms, shouldn't hop_length be set to 256? Also, I found a problem in the mag_phase_2_wav() function: after my tests, the pad value should not be (0, 0, 0, 1) but (0, 0, 1, 0). Secondly, in the istft call inside that function, you need to pass the window to recover the audio perfectly
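For context, the suggested 256 comes from standard frame arithmetic, assuming a 16 kHz sample rate (an assumption; the thread does not state the rate):

```python
# Frame arithmetic behind the question, assuming a 16 kHz sample rate
# (an assumption: the thread does not state the rate).
sample_rate = 16000
fft_size = sample_rate * 32 // 1000   # 32 ms frame length -> 512 samples
hop_16ms = sample_rate * 16 // 1000   # 16 ms frame shift  -> 256 samples
```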

jackhwalters commented 2 years ago

I've made sure to keep the hop_length as a power of 2 less than fft_size so I am at liberty to experiment with various window types. I'd be reluctant to make hop_length 256, as that would come at the cost of time resolution.

I padded (0, 0, 0, 1) so that I'm padding DC with 0. I'd rather do this than add 0 to the top frequency bin, which would alter the signal. Great spot on the missing window in the ISTFT; I'm implementing that now.

ioyy900205 commented 2 years ago

Thank you for your reply, I understand what you mean. But I have experimented with the padding in your code. Because you remove the DC component after the STFT, you need to add the DC component back before the ISTFT, but the correct padding is (0, 0, 1, 0). With the (0, 0, 0, 1) you provided I could not recover the audio correctly, whereas with (0, 0, 1, 0) I recovered it perfectly. Perhaps the problem is that the top of the tensor the ISTFT receives is the low-frequency part
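A minimal sketch of the fix being discussed. The repo's actual mag_phase_2_wav isn't quoted in the thread, so the body below is illustrative; an fft_size of 512 and the repo's hop_length of 32 are assumed:

```python
import torch
import torch.nn.functional as F

n_fft, hop_length = 512, 32  # assumed fft_size; hop_length from config.py
window = torch.hann_window(n_fft)

def mag_phase_2_wav(mag, phase, length=None):
    # mag/phase arrive with the DC bin removed: shape (n_fft // 2, frames).
    # F.pad pairs apply to dimensions last-to-first, so (0, 0, 1, 0) adds one
    # zero row at the TOP of the tensor, which torch.istft reads as the DC
    # (lowest-frequency) bin.  (0, 0, 0, 1) would wrongly pad the highest bin.
    mag = F.pad(mag, (0, 0, 1, 0))
    phase = F.pad(phase, (0, 0, 1, 0))
    spec = torch.polar(mag, phase)  # mag * exp(i * phase)
    # Passing the analysis window to istft is required for clean recovery.
    return torch.istft(spec, n_fft, hop_length=hop_length, window=window,
                       length=length)

x = torch.randn(16000)
spec = torch.stft(x, n_fft, hop_length=hop_length, window=window,
                  return_complex=True)
mag, phase = spec.abs()[1:], spec.angle()[1:]   # drop the DC bin
y = mag_phase_2_wav(mag, phase, length=x.shape[-1])

# Reference: the same signal with the DC bin explicitly zeroed in place;
# y should match it, confirming the zero row landed on the DC bin.
spec_ref = spec.clone()
spec_ref[0] = 0
y_ref = torch.istft(spec_ref, n_fft, hop_length=hop_length, window=window,
                    length=x.shape[-1])
```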

jackhwalters commented 2 years ago

I just checked this and you're right. I was assuming padding_bottom from the documentation would pad the bottom of the STFT, but it pads the bottom of the tensor, which is the reverse! I assume you calculated something like the L1 distance between the waveforms to show the difference?

ioyy900205 commented 2 years ago

Hi, I plotted spectrograms of the original and transformed speech to check whether they could be perfectly reconstructed. I found another problem: please check your complex_adaptive_max_pool2d function, it is currently identical to complex_adaptive_avg_pool2d. :( I think this repo will help you: https://github.com/TVS-AI/Pytorch_rppgs/blob/52568e4cc4a4069007bb4f65a6963901b0d0df91/nets/funcs/complexFunctions.py

jackhwalters commented 2 years ago

I've been using https://github.com/wavefrontshaping/complexPyTorch/blob/master/complexPyTorch/complexFunctions.py as a reference, which looks similar to what you've sent. Good spot though; I must not have changed the function names when copying them over from complex_adaptive_avg_pool2d. I'm not convinced that max pooling the magnitude of intermediate feature maps and then recombining it with the layer input's phase is the best way to go, considering one of the advantages of complex networks is phase prediction from complex-valued feature maps. However, I can see why they wanted to perform max pooling exclusively on non-negative feature maps. I will try both!
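For reference, a complex max pool along the lines discussed (pool the magnitude, then carry over the complex values at the selected positions) might look like the sketch below. The function names mirror the repo's, but the bodies are illustrative, not the repo's actual code:

```python
import torch
import torch.nn.functional as F

def complex_adaptive_avg_pool2d(x, output_size):
    # Average-pool the real and imaginary parts independently.
    return torch.complex(
        F.adaptive_avg_pool2d(x.real, output_size),
        F.adaptive_avg_pool2d(x.imag, output_size),
    )

def complex_adaptive_max_pool2d(x, output_size):
    # Max-pool the (non-negative) magnitude, then use the returned indices
    # to pick out the corresponding complex elements, preserving phase.
    mag = x.abs()
    _, idx = F.adaptive_max_pool2d(mag, output_size, return_indices=True)
    flat_r = x.real.flatten(-2)   # flatten the H x W plane for gather
    flat_i = x.imag.flatten(-2)
    idx_f = idx.flatten(-2)
    out_shape = (*x.shape[:-2], *idx.shape[-2:])
    return torch.complex(
        flat_r.gather(-1, idx_f).reshape(out_shape),
        flat_i.gather(-1, idx_f).reshape(out_shape),
    )
```

Because the selection is driven by magnitude alone, the output's magnitude equals a plain adaptive max pool over `x.abs()`, while each selected element keeps its original phase.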

ioyy900205 commented 2 years ago

The reason for using maxpooling is to implement the CBAM attention mechanism. There are too many attention mechanisms out there now, and your attempt will make a great contribution.

I also have a doubt: there is no way to use DDP acceleration with the current X.real, X.imag form of the data, because the NCCL backend does not support PyTorch's complex arithmetic. I don't know whether PyTorch Lightning supports it

ioyy900205 commented 2 years ago

DCS-Net network diagram (1) I drew a diagram to describe your network for the case where the input is (1, 256, 1876)

jackhwalters commented 2 years ago

Yes, my use of max pooling is inspired by the channel attention modules in https://arxiv.org/abs/2102.01993, which are themselves inspired by CBAM. As for DDP, I ran into the same issue of not being able to use it with complex tensors; I actually created an issue in NCCL's repo. I considered using torch.view_as_real throughout the repo to enable DDP, but it didn't really seem worth it. That diagram is correct, except that the input is 256x256 to account for variable sample length.
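The torch.view_as_real workaround mentioned above, sketched with illustrative shapes (NCCL cannot all-reduce complex tensors, but it can handle the equivalent real view):

```python
import torch

# NCCL can't communicate complex dtypes, so the usual workaround is to
# operate on the (..., 2) real view and view it back afterwards.
spec = torch.randn(4, 256, 256, dtype=torch.complex64)  # illustrative shape

as_real = torch.view_as_real(spec)     # (4, 256, 256, 2), float32, zero-copy
# in a real DDP setup, the collective would run here, e.g.:
# torch.distributed.all_reduce(as_real)
back = torch.view_as_complex(as_real)  # zero-copy view back to complex64
```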