pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

New interface for MVDR beamforming #2158

Open Emrys365 opened 2 years ago

Emrys365 commented 2 years ago

🚀 The feature

It would be very helpful to provide the following interface for the beamforming module (torchaudio.transforms.MVDR):

forward(specgram: torch.Tensor, psd_s: torch.Tensor, psd_n: torch.Tensor) → torch.Tensor

and maybe add some high-level glue functions that take the masks as input but need only a few lines of code.

Motivation, pitch

The current forward method of torchaudio.transforms.MVDR only accepts spectrogram and masks as input, and calculates the PSD matrices internally.

The current design is easy to use mainly for mask-based beamforming, but it may lose the flexibility for the users to:

Alternatives

No response

Additional context

No response

nateanl commented 2 years ago

@mthrok What do you think? I feel like the current torchaudio.transforms.MVDR can be used as the high-level module, which calls another module with specgram, psd_speech, and psd_noise as input arguments in the forward method.

Maybe we should rename the current MVDR module to a generalized MaskBasedBeamforming module which accepts any kind of beamforming method, similar to DNN_Beamformer in ESPNet, but without trainable parameters. Then every time we add a new beamforming module, we can add support for it in MaskBasedBeamforming.

mthrok commented 2 years ago

@nateanl Can you make a diagram of the changes you propose, with call graph and signatures? (see the following for inspiration)

nateanl commented 2 years ago

Here is the diagram for changing the API of MVDR.

[Image: MVDR (2)]

mthrok commented 2 years ago

Here is the diagram for changing the API of MVDR.

  • [ ] Move the get_mvdr_vector and apply_beamforming_vector to torchaudio.functional, and let the current MVDR module call the methods under torchaudio.functional. This aligns with ESPNet and pb_bss's function design.
  • [ ] Add a new module called MVDRBeamformer (for example). Use specgram, psd_s, psd_n, and ref_mic as the input arguments of the forward method, since, as @popcornell mentioned, the reference microphone might be dynamic (estimated via a neural network). This design aligns with Asteroid.
  • [ ] Add a warning in the current MVDR module notifying users to use MVDRBeamformer instead.
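
To make the checklist concrete, here is a rough sketch of how the split could look. The names get_mvdr_vector, apply_beamforming_vector, and MVDRBeamformer come from the list above; the bodies, the tensor shapes, the eps argument, and the choice of the reference-channel MVDR solution are my assumptions and not the actual torchaudio code. ref_mic is taken to be a one-hot (or soft) vector over channels so that it can come from a network.

```python
import torch


def get_mvdr_vector(psd_s, psd_n, ref_mic, eps=1e-8):
    """Reference-channel MVDR weights (sketch).

    Assumed shapes: psd_s, psd_n (..., freq, channel, channel),
    ref_mic (..., channel). Returns (..., freq, channel).
    """
    numerator = torch.linalg.solve(psd_n, psd_s)           # psd_n^{-1} @ psd_s
    trace = numerator.diagonal(dim1=-2, dim2=-1).sum(-1)   # (..., freq)
    ws = numerator / (trace[..., None, None] + eps)
    # Select (or mix) the columns according to the reference vector.
    return torch.einsum("...fce,...e->...fc", ws, ref_mic.to(ws.dtype))


def apply_beamforming_vector(beamform_vector, specgram):
    """y[f, t] = sum_c conj(w[f, c]) * x[c, f, t]."""
    return torch.einsum("...fc,...cft->...ft", beamform_vector.conj(), specgram)


class MVDRBeamformer(torch.nn.Module):
    """Takes precomputed PSD matrices instead of masks (sketch)."""

    def forward(self, specgram, psd_s, psd_n, ref_mic):
        w = get_mvdr_vector(psd_s, psd_n, ref_mic)
        return apply_beamforming_vector(w, specgram)
```

With a rank-1 speech PSD and an identity noise PSD, the output should reproduce the target component at the selected reference channel, which is a handy sanity check for the distortionless constraint.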


Sounds good.

nateanl commented 2 years ago

What happens to the PSD(specgram, mask) module?

It will be outside of the new MVDR module. When users want to apply MVDR beamforming, they need to compute the PSD matrices, either with the PSD module and a mask or by estimating them in other ways. We can also add a function like compute_power_spectral_density_matrix under torchaudio.functional.

I prefer using a prefix other than get, for example compute. Thoughts?
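
For reference, a minimal sketch of what such a function might compute. Only the name compute_power_spectral_density_matrix is from this thread; the signature, the shapes, and the time-normalised mask weighting are assumptions for illustration.

```python
import torch


def compute_power_spectral_density_matrix(specgram, mask=None, eps=1e-10):
    """Mask-weighted spatial covariance per frequency bin (sketch).

    Assumed shapes: specgram (..., channel, freq, time), mask (..., freq, time).
    Returns (..., freq, channel, channel).
    """
    x = specgram.transpose(-3, -2)  # (..., freq, channel, time)
    if mask is None:
        # Plain time average of the outer products x x^H.
        return torch.einsum("...fct,...fet->...fce", x, x.conj()) / x.shape[-1]
    # Normalise the mask over time so the weights sum to (almost) one.
    weights = mask / (mask.sum(dim=-1, keepdim=True) + eps)
    return torch.einsum(
        "...fct,...ft,...fet->...fce", x, weights.to(x.dtype), x.conj()
    )
```

An all-ones mask gives the same result as the unmasked time average, and the output is Hermitian in the last two dimensions.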

Sounds good to me. Since there are three solutions to MVDR, we can name those methods as:

However, apply_beamforming may only apply to traditional beamforming methods. For newly proposed ones like multi-frame MVDR or WPD, we need to think of a new API. Any thoughts? @Emrys365 @popcornell @boeddeker

popcornell commented 2 years ago

For multi-frame MWF as defined in https://arxiv.org/abs/1911.07953, I think there won't be a problem, as the other frames are basically treated as additional microphone channels.

I think it should be fine to have one function that handles 90% of the use cases out there. We might also want to design it to handle time-varying beamforming filters. It should not be too hard.
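
A possible shape for such a function, as a sketch only: the name apply_beamforming is from the discussion above, while the shape convention and the dimension check used to detect a time axis are assumptions (it expects the filter and the spectrogram to share the same leading batch dimensions).

```python
import torch


def apply_beamforming(beamform_vector, specgram):
    """Apply a time-invariant or time-varying filter (sketch).

    Assumed shapes:
      specgram:        (..., channel, freq, time)
      beamform_vector: (..., freq, channel)        time-invariant
                       (..., freq, time, channel)  time-varying
    Returns (..., freq, time).
    """
    if beamform_vector.dim() == specgram.dim():
        # The filter has its own time axis: one weight vector per frame.
        return torch.einsum(
            "...ftc,...cft->...ft", beamform_vector.conj(), specgram
        )
    return torch.einsum("...fc,...cft->...ft", beamform_vector.conj(), specgram)
```

Repeating a time-invariant filter along a new time axis should give exactly the same output as the time-invariant path.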

boeddeker commented 2 years ago

I would say, try to keep it simple and "low level" (i.e. one equation, one function). With apply_beamforming you cover, as popcornell said, most users, and maybe there is no generalization. The WPD beamformer (probably also multi-frame MWF) can be formulated such that apply_beamforming can be used (i.e. manipulate specgram and stack the history onto the channel dimension). Since it is essentially just an einsum, it should be easy for a user to implement alternatives.
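
To illustrate the "stack the history onto the channel dimension" idea, a small sketch. The helper name stack_history, the frame ordering, and the zero padding at the start are hypothetical choices for illustration; a real WPD or multi-frame formulation may lay out the frames differently.

```python
import torch


def stack_history(specgram, num_frames):
    """Stack delayed copies of the spectrogram as extra channels (sketch).

    (..., channel, freq, time) -> (..., channel * num_frames, freq, time)
    Block k of the output holds the input delayed by k frames,
    with zeros where no past frame exists.
    """
    time = specgram.shape[-1]
    delayed = []
    for k in range(num_frames):
        shifted = torch.zeros_like(specgram)
        shifted[..., k:] = specgram[..., : max(time - k, 0)]
        delayed.append(shifted)
    return torch.cat(delayed, dim=-3)
```

The stacked tensor can then go through the same einsum as the single-frame case, with a filter of shape (..., freq, channel * num_frames).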

I have a question about compute_mvdr_vector_rtf. From the name, I thought it would take the RTF (relative transfer function) as input. I think that function combines two things: the calculation of the mvdr_vector and the calculation of the rtf.
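
What I have in mind is roughly the following separation. This is a sketch under assumptions: compute_rtf_evd is a hypothetical name, the RTF is estimated here as the principal eigenvector of the speech PSD (one of several possible estimators), and the signatures and shapes are illustrative.

```python
import torch


def compute_rtf_evd(psd_s):
    """Estimate the RTF as the principal eigenvector of psd_s (sketch).

    psd_s: (..., freq, channel, channel) -> (..., freq, channel)
    """
    # eigh returns eigenvalues in ascending order; take the last eigenvector.
    _, eigenvectors = torch.linalg.eigh(psd_s)
    return eigenvectors[..., -1]


def compute_mvdr_vector_rtf(rtf, psd_n, ref_channel=0, eps=1e-8):
    """MVDR weights from a given RTF: psd_n^{-1} d / (d^H psd_n^{-1} d).

    rtf: (..., freq, channel), psd_n: (..., freq, channel, channel)
    """
    numerator = torch.linalg.solve(psd_n, rtf.unsqueeze(-1)).squeeze(-1)
    denominator = torch.einsum("...c,...c->...", rtf.conj(), numerator)
    w = numerator / (denominator.real[..., None] + eps)
    # Rescale so the target at the reference channel is left undistorted.
    return w * rtf[..., ref_channel, None].conj()
```

With this split, the second function really does take the RTF as input, and the estimator can be swapped without touching the weight computation.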

nateanl commented 2 years ago

We can split it into these functions:

On the module-based design, for example torchaudio.transforms.RTFMVDRBeamformer, shall we include the computation of the rtf and the mvdr vector together, or shall we also take rtf as an argument, the same as compute_mvdr_vector_rtf?