jeonchangbin49 / MedleyVox


Usage of SFSRNET #1

Closed faroit closed 1 year ago

faroit commented 1 year ago

Hi @jeonchangbin49 thanks for this great paper. I have a question regarding the use of the upsampling model:

As I see from the code, all backbone models are (in the second phase) initialized at 24 kHz, and the same goes for SFSRNET. So does that mean the actual upsampling is only effective when using the pretrained 16 kHz Conv-TasNet model?

Maybe I don't understand this paradigm well yet, but where does the downsampling actually happen? In the SFSRNet paper, they used the stride in the SepFormer to reduce the sequence length and thus improved performance. I don't see that here, but maybe I'm missing something?

Thanks for your help

jeonchangbin49 commented 1 year ago

Hi, @faroit! Thanks for your interest in our work. It's so great to meet you online!

I think the confusion might come from the naming of the original super-resolution net (SRNet, https://ojs.aaai.org/index.php/AAAI/article/view/21372). In that paper, super-resolution refers to the framework used to enhance the original estimates of the SepFormer (which stands for the "SF" of SFSRNet), rather than to upsampling (the common, general meaning of super-resolution in the signal processing field).

Below is from the original SFSRNet paper, 'Introduction - Basic idea' section: "The proposed SR network is different from common upsampling tasks as follows. Since the goal of the upsampling in this case is to return to the original audio signal frequency, it is not necessary to generate any new information. Instead, aside from the downsampled estimations, the input audio mixture in its original sampling rate is used as an additional input to improve the upsampling process."

So, the SRNet (or iSRNet) is attached to a backbone model running at the same sample rate. Although we didn't use any downsampling in our STFT/iSTFT-based Conv-TasNet, we still used the term "iSRNet" because it follows the same concept as the SRNet proposed in the SFSRNet paper (enhancing the original estimate).
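To illustrate the two-stage concept described above, here is a minimal numpy sketch (with hypothetical stand-in functions, not the MedleyVox implementation): a backbone produces an initial estimate with degraded high frequencies, and an SRNet-style stage takes both that estimate and the full-band mixture, at the same sample rate, to restore the high-frequency content.

```python
import numpy as np

def backbone_separate(mix: np.ndarray) -> np.ndarray:
    """Placeholder backbone: returns an initial estimate. Here we crudely
    zero the upper half of the spectrum to mimic an estimate whose high
    frequencies are degraded."""
    spec = np.fft.rfft(mix)
    spec[len(spec) // 2:] = 0.0  # simulated high-frequency loss
    return np.fft.irfft(spec, n=len(mix))

def srnet_refine(mix: np.ndarray, estimate: np.ndarray) -> np.ndarray:
    """Placeholder 'super-resolution' stage: uses the full-band mixture as a
    side input to improve the high-frequency content of the estimate."""
    spec_est = np.fft.rfft(estimate)
    spec_mix = np.fft.rfft(mix)
    spec_est[len(spec_est) // 2:] = spec_mix[len(spec_mix) // 2:]
    return np.fft.irfft(spec_est, n=len(mix))

# Toy single-source "mixture": the refinement stage can recover it exactly
mix = np.random.randn(2048)
estimate = backbone_separate(mix)   # high frequencies missing
refined = srnet_refine(mix, estimate)
```

In the real model the refinement is of course learned rather than a spectral copy; the sketch only shows why running both stages at the same sample rate and feeding the mixture as a side input makes sense.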

Hope you find this answer helpful!

faroit commented 1 year ago

@jeonchangbin49 Thanks for this comment.

So the backbone model runs at the same sample rate as SRNet, and the purpose is to enhance (potentially) missing high-frequency content?

If one wanted to use, say, an 8 kHz SepFormer, upsampling would be needed first to connect it to the SRNet, assuming the purpose is to speed up the backbone model while still outputting frequencies above 8 kHz.

For this case (I think that's the last section in the original SRNet paper), a stride of 8 is used to reduce the sequence length of the SepFormer output. So how would you do this for an STFT-based backbone model?

jeonchangbin49 commented 1 year ago

I'm not sure if I understand your questions correctly, so please feel free to leave any corrections or further questions.

Q: So the backbone model runs at the same sample rate as SRNet, and the purpose is to enhance (potentially) missing high-frequency content?

A: Yes. The backbone model runs at the same sample rate as SRNet. In the original paper, they used the downsampled signal as an intermediate loss for each SepFormer block (I had forgotten about this detail), but the final output has the same sample rate as the input of SRNet. The purpose is to enhance or improve the high-frequency content of the outputs. I think Figure 4 of the original SFSRNet paper shows an exact example of the purpose of SRNet.

Q: For this case (I think that's the last section in the original SRNet paper), a stride of 8 is used to reduce the sequence length of the SepFormer output. So how would you do this for an STFT-based backbone model?

A: I suppose 'a stride of 8' refers to the stride of the encoder (in the encoder-separator-decoder framework that is standard in the speech separation field). In the original SFSRNet paper, a learnable-basis encoder and decoder (1D conv and transposed conv) were used. Here, on the other hand, we used STFT/iSTFT with a hop size of 512 and an n_fft of 2048. I think the confusion comes from the use of the words 'downsampling' and 'upsampling' in these papers (SFSRNet, SepFormer, SuDo-RM-RF). It can be misleading because it differs from the general meaning of resampling in signal processing (changing the sample rate of a signal, say from 8 kHz to 16 kHz).
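The sequence-length point above can be made concrete with some quick arithmetic (the 3-second input duration is a hypothetical choice; the hop/n_fft values are the ones mentioned in this thread): a learnable encoder with stride 8 and a center-padded STFT with hop 512 produce very different separator sequence lengths from the same 24 kHz signal.

```python
# Hypothetical 3-second input at 24 kHz
sample_rate = 24_000
num_samples = 3 * sample_rate  # 72,000 samples

# Learnable-basis encoder (SepFormer/SFSRNet style): stride shortens the sequence
encoder_stride = 8
encoder_frames = num_samples // encoder_stride  # one step every 8 samples

# STFT front end (as used here): one frame per hop (center-padded convention)
hop_size = 512
stft_frames = num_samples // hop_size + 1

print(encoder_frames)  # 9000
print(stft_frames)     # 141
```

So even though neither front end changes the sample rate of the waveform, both drastically shorten the sequence the separator has to process, which is the "downsampling" the papers refer to.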

Q: If one wanted to use, say, an 8 kHz SepFormer, upsampling would be needed first to connect it to the SRNet, assuming the purpose is to speed up the backbone model while still outputting frequencies above 8 kHz.

A: I'm not sure about this question. Do you mean using an 8 kHz SepFormer with a 16 kHz SRNet? As far as I know, there are no such experiments in the original SFSRNet paper, but it sounds really interesting. Because SepFormer is really costly to train, it would be useful if this were possible. Maybe I should try it someday.

faroit commented 1 year ago

@jeonchangbin49 sorry for the confusion. As I understand it, downsampling and upsampling here basically refer to linear interpolation in combination with the neural network layers...

I am referring to this section:

[screenshot of the quoted section from the SFSRNet paper]

So in my understanding it doesn't really matter how the upsampling and downsampling are realized. For time-domain networks this could be done by just subsampling the sequence, but in general a real resampling could indeed be used as well (as long as it's differentiable).

here is a very basic example:

def forward(self, mix: torch.Tensor) -> dict[str, torch.Tensor]:
    # reduce the sequence length before the (expensive) separator
    downsampled_mix = self.downsample(mix)
    X = self.encoder(downsampled_mix)
    Q = self.separator(X)
    S = self.masker(Q) * X  # apply the estimated masks
    y_downsampled = self.decoder(S)
    # return to the original rate, then refine with the full-band mixture
    y = self.upsample(y_downsampled)
    return self.sfsrnet(mix, y)

maybe @j-rixen can clarify it a bit more?

jeonchangbin49 commented 1 year ago

Hmm... I still think that the downsampling here is just applying the encoder with a stride, which results in a reduced sequence length. Below is from the last section of the original SepFormer paper (https://arxiv.org/abs/2010.13154):

[screenshot of the quoted passage from the SepFormer paper]

However, in the U-ConvBlock used in SuDo-RM-RF, downsampling and upsampling are exactly what you said. Below is from the original SuDo-RM-RF paper (https://arxiv.org/abs/2007.06833):

[screenshot of the quoted passage from the SuDo-RM-RF paper]

By the way, your example seems really intriguing, but I'm not sure that SFSRNet can work that way. I also wonder what Joel thinks about this.

j-rixen commented 1 year ago

downsampling here is just applying the encoder with a stride

This is correct. The concept of the SFSRNet is to produce estimations using a slightly changed SepFormer model and then correct them with the SR model. Using convolutional layers for the downsampling and upsampling significantly increases the accuracy of these initial estimations, so using just linear interpolation would make the SFSRNet much less accurate.
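For readers following along, here is a small numpy sketch contrasting the two notions of "downsampling" in this thread (hypothetical sizes and random weights, not the SFSRNet implementation): a strided convolution shortens the sequence while mixing information through a learnable kernel, whereas plain linear-interpolation subsampling involves no learned mixing at all.

```python
import numpy as np

def strided_conv1d(x: np.ndarray, kernel: np.ndarray, stride: int) -> np.ndarray:
    """Learnable-encoder style downsampling: a strided 1D convolution that
    shortens the sequence while mixing samples through the kernel."""
    k = len(kernel)
    n_out = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride : i * stride + k], kernel)
                     for i in range(n_out)])

def linear_interp_subsample(x: np.ndarray, factor: int) -> np.ndarray:
    """Plain interpolation-based subsampling: shortens the sequence but
    learns nothing."""
    idx = np.arange(0, len(x), factor)
    return np.interp(idx, np.arange(len(x)), x)

x = np.random.randn(1024)
kernel = np.random.randn(16)  # stands in for learned encoder weights

enc = strided_conv1d(x, kernel, stride=8)
sub = linear_interp_subsample(x, factor=8)

print(len(enc), len(sub))  # both roughly len(x) // 8
```

Both reduce the sequence length by about the stride factor; the difference Joel points out is that the convolutional version has trainable weights, so the network can learn a representation suited to the separation task rather than a fixed interpolation.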

jeonchangbin49 commented 1 year ago

Wow, it's been 3 weeks and I forgot to leave a comment on this... Thank you Joel @j-rixen for the clarification and thank you @faroit for your questions and comments!!