Emrys365 opened this issue 2 years ago
@mthrok What do you think? I feel like the current torchaudio.transforms.MVDR can be used as the high-level module, which calls another module with specgram, psd_speech, and psd_noise as input arguments in the forward method.
Maybe we should rename the current MVDR module to a generalized MaskBasedBeamforming module which accepts any kind of beamforming method, similar to DNN_Beamformer in ESPNet, but without trainable parameters. Then every time we add a new beamforming module, we can add such support in MaskBasedBeamforming.
@nateanl Can you make a diagram of the changes you propose, with the call graph and signatures? (see the following for inspiration)
Here is the diagram for changing the API of MVDR (a rough code sketch of the proposed pieces follows the list).
- [ ] Move get_mvdr_vector and apply_beamforming_vector to torchaudio.functional, and let the current MVDR module call the methods under torchaudio.functional. This aligns with ESPNet and pb_bss's function design. https://github.com/pytorch/audio/pull/2181
- [ ] Add a new module called MVDRBeamformer (for example). Use specgram, psd_s, psd_n, and ref_mic as the input arguments of the forward method, as @popcornell mentioned the reference microphone might be dynamic (estimated via a neural network). This design aligns with Asteroid.
- [ ] Add a warning in the current MVDR module notifying users to use MVDRBeamformer instead.
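For concreteness, here is a rough sketch of how that split could look. The helper names, shape conventions, and the Souden-style solution used inside are assumptions for illustration, not the final torchaudio API:

```python
import torch
from torch import Tensor


def compute_mvdr_vector_souden(psd_s: Tensor, psd_n: Tensor, ref_channel: int = 0) -> Tensor:
    """Souden-style MVDR weights: w = (Phi_n^-1 Phi_s / trace(Phi_n^-1 Phi_s)) u_ref.

    psd_s, psd_n: (..., freq, channel, channel), complex
    returns:      (..., freq, channel), complex
    """
    numerator = torch.linalg.solve(psd_n, psd_s)            # Phi_n^-1 Phi_s
    trace = numerator.diagonal(dim1=-2, dim2=-1).sum(-1)    # (..., freq)
    scaled = numerator / (trace[..., None, None] + 1e-15)
    return scaled[..., ref_channel]                         # pick the ref_channel column


def apply_beamforming_vector(bf_vector: Tensor, specgram: Tensor) -> Tensor:
    """y(f, t) = w(f)^H x(f, t), computed with a single einsum."""
    return torch.einsum("...fc,...cft->...ft", bf_vector.conj(), specgram)


class MVDRBeamformer(torch.nn.Module):
    """Thin module whose forward takes the PSD matrices directly, per the list above."""

    def forward(self, specgram: Tensor, psd_s: Tensor, psd_n: Tensor, ref_mic: int = 0) -> Tensor:
        bf_vector = compute_mvdr_vector_souden(psd_s, psd_n, ref_mic)
        return apply_beamforming_vector(bf_vector, specgram)
```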
Sounds good.
What happens to PSD(specgram, mask) module?
I prefer using a prefix other than get_. Say, for example, compute_. Thoughts?

> What happens to PSD(specgram, mask) module?
It will be outside of the new MVDR module. When users want to apply MVDR beamforming, they need to compute the psd matrices either by PSD module with the mask, or estimate them in other ways.
We can also add a function like compute_power_spectral_density_matrix under torchaudio.functional.
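A minimal sketch of what such a function might look like, assuming the same mask-weighted outer-product definition as the existing PSD transform; the name and the normalization details are illustrative:

```python
import torch
from torch import Tensor


def compute_power_spectral_density_matrix(
    specgram: Tensor, mask: Tensor, normalize: bool = True
) -> Tensor:
    """Mask-weighted PSD matrix: Phi(f) = sum_t m(f, t) x(f, t) x(f, t)^H.

    specgram: (..., channel, freq, time), complex STFT
    mask:     (..., freq, time), real-valued time-frequency mask
    returns:  (..., freq, channel, channel), complex
    """
    masked = specgram * mask[..., None, :, :]
    psd = torch.einsum("...cft,...eft->...fce", masked, specgram.conj())
    if normalize:
        psd = psd / (mask.sum(dim=-1)[..., None, None] + 1e-15)
    return psd
```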
> I prefer using a prefix other than get_. Say, for example, compute_. Thoughts?
Sounds good to me. Since there are three solutions to MVDR, we can name those methods as:
compute_mvdr_vector_souden(psd_speech, psd_noise, ref_channel)
compute_mvdr_vector_rtf(psd_speech, psd_noise, ref_channel, solution="power")
compute_gev_vector(psd_speech, psd_noise, ref_channel)
apply_beamforming(bf_vector, specgram)
However, apply_beamforming may only apply to traditional beamforming methods. For the newly proposed ones like multi-frame MVDR or WPD, we need to think of a new API. Any thoughts? @Emrys365 @popcornell @boeddeker
For multi-frame MWF as defined in https://arxiv.org/abs/1911.07953, I think there won't be a problem, as basically the other frames are treated as other microphone channels.
I think it should be fine to have one function which can handle 90% of the use cases out there. We might want to design it to handle time-varying beamforming filters too. It should not be too hard.
I would say, try to keep it simple and "low level" (i.e. one equation, one function). With apply_beamforming you cover, as popcornell said, most users, and maybe there is no generalization. The WPD beamformer (probably also the multi-frame MWF) can be formulated such that apply_beamforming can be used (i.e. manipulate specgram and stack the history into the channel dimension). Since it is essentially just an einsum, it should be easy for a user to implement alternatives.
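A small hypothetical helper (not an existing torchaudio function) illustrating the "stack the history into the channel dimension" idea; a multi-frame filter of matching size could then be applied with the same apply_beamforming einsum:

```python
import torch
from torch import Tensor


def stack_frame_history(specgram: Tensor, num_taps: int) -> Tensor:
    """Stack the last `num_taps` STFT frames into the channel dimension.

    specgram: (..., channel, freq, time) -> (..., channel * num_taps, freq, time)
    """
    # Left-pad the time axis with zeros so every frame has a full history window.
    pad = specgram.new_zeros(*specgram.shape[:-1], num_taps - 1)
    padded = torch.cat([pad, specgram], dim=-1)
    # Sliding windows over time: (..., channel, freq, time, num_taps).
    frames = padded.unfold(-1, num_taps, 1)
    # Move the tap axis next to the channel axis and merge the two.
    return frames.movedim(-1, -3).flatten(-4, -3)
```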
I have a question about compute_mvdr_vector_rtf. From the name, I thought it would take the RTF (relative transfer function) as input. I think that function combines two things: the calculation of the mvdr_vector and the calculation of the rtf.
We can split it into these functions:
compute_mvdr_vector_rtf(rtf, psd_noise, ref_channel)
compute_rtf_power(psd_speech, psd_noise, ref_channel)
compute_rtf_evd(psd_speech)
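For illustration, here is a rough sketch of two of these pieces; the eigendecomposition-based RTF estimate and the reference-channel scaling are assumptions for the sake of the example, not a settled API:

```python
import torch
from torch import Tensor


def compute_rtf_evd(psd_s: Tensor) -> Tensor:
    """Estimate the RTF as the principal eigenvector of the speech PSD matrix.

    psd_s: (..., freq, channel, channel) -> rtf: (..., freq, channel)
    """
    _, v = torch.linalg.eigh(psd_s)   # eigenvalues in ascending order
    return v[..., -1]                 # eigenvector of the largest eigenvalue


def compute_mvdr_vector_rtf(rtf: Tensor, psd_n: Tensor, ref_channel: int = 0) -> Tensor:
    """RTF-based MVDR weights: w = Phi_n^-1 d / (d^H Phi_n^-1 d), steered to ref_channel."""
    numerator = torch.linalg.solve(psd_n, rtf.unsqueeze(-1)).squeeze(-1)  # Phi_n^-1 d
    denominator = torch.einsum("...c,...c->...", rtf.conj(), numerator)   # d^H Phi_n^-1 d
    w = numerator / (denominator[..., None] + 1e-15)
    # Rescale so the output matches the speech image at the reference channel.
    return w * rtf[..., ref_channel, None].conj()
```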
On the module-based design, for example torchaudio.transforms.RTFMVDRBeamformer, shall we include the computation of the rtf and the mvdr vector together, or shall we also use rtf as the argument, same as compute_mvdr_vector_rtf?
The feature
It would be very helpful to provide the following interface for the beamforming module (torchaudio.transforms.MVDR):
and maybe add some high-level glue functions that take the masks as input, but have only a few lines of code.
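For example, such a glue function could be just a few lines on top of the lower-level pieces sketched earlier in this thread (the names are hypothetical):

```python
from torch import Tensor

# Reuses the hypothetical helpers sketched earlier in this thread:
# compute_power_spectral_density_matrix, compute_mvdr_vector_souden, apply_beamforming_vector.

def mvdr_with_masks(specgram: Tensor, mask_s: Tensor, mask_n: Tensor, ref_channel: int = 0) -> Tensor:
    """Mask-based MVDR in a few lines, built on the lower-level functional pieces."""
    psd_s = compute_power_spectral_density_matrix(specgram, mask_s)
    psd_n = compute_power_spectral_density_matrix(specgram, mask_n)
    bf_vector = compute_mvdr_vector_souden(psd_s, psd_n, ref_channel)
    return apply_beamforming_vector(bf_vector, specgram)
```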
Motivation, pitch
The current forward method of torchaudio.transforms.MVDR only accepts the spectrogram and masks as input, and calculates the PSD matrices internally. The current design is easy to use mainly for mask-based beamforming, but it may limit the flexibility for users to:
Alternatives
No response
Additional context
No response