Add support for spatial feature extraction on multi-microphone data

🚀 The feature

In addition to the spectral features that torchaudio already provides (https://pytorch.org/audio/stable/transforms.html#feature-extractions), I would like to request support for extracting spatial features from multi-microphone (multi-channel) speech data.

There are many commonly used spatial features in the literature, including:

Inter-channel Phase Difference (IPD) [1]: based on this, we can calculate cosIPD and sinIPD

[1] J. Benesty, J. Chen, and Y. Huang, “Microphone array signal processing,” Springer Science & Business Media, vol. 1, 2008.
Inter-channel Level Difference (ILD) [1]
Generalized Cross-Correlation (GCC) [2], GCC with Phase Transform (GCC-PHAT) [2] and Steered-Response Power Phase Transform (SRP-PHAT) [3]

[2] C. Knapp and G. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–327, 1976.
[3] J. H. DiBiase, H. F. Silverman, and M. S. Brandstein, “Robust localization in reverberant rooms,” in Microphone Arrays: Signal Processing Techniques and Applications. Berlin, Germany: Springer, 2001, pp. 157–180.
Steering vector
Angle feature [4] and directional feature [5]

[4] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong, “Multi-channel overlapped speech recognition with location guided speech extraction network,” in Proc. IEEE SLT, 2018, pp. 558–565.
[5] Z.-Q. Wang and D. Wang, “On spatial features for supervised speech separation and its application to beamforming and robust ASR,” in Proc. IEEE ICASSP, 2018, pp. 5709–5713.
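
To make the request more concrete, here is a minimal sketch of how three of the features listed above (cosIPD/sinIPD, ILD, and GCC-PHAT) could be computed with plain PyTorch. The function names, signatures, default STFT settings, and the sign convention (channel `other` relative to channel `ref`) are all assumptions made for illustration. None of this is an existing torchaudio API, and the angle and directional features are not covered.

```python
import torch


def multichannel_stft(wav: torch.Tensor, n_fft: int = 512, hop_length: int = 128) -> torch.Tensor:
    """Complex STFT of a (channels, samples) waveform -> (channels, freq, frames)."""
    window = torch.hann_window(n_fft, dtype=wav.dtype, device=wav.device)
    return torch.stft(wav, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)


def ipd_features(spec: torch.Tensor, ref: int = 0, other: int = 1):
    """Hypothetical helper: cosIPD and sinIPD of channel `other` w.r.t. channel `ref`."""
    # The phase of X_other * conj(X_ref) is the inter-channel phase difference,
    # already wrapped to (-pi, pi].
    ipd = torch.angle(spec[other] * spec[ref].conj())
    return torch.cos(ipd), torch.sin(ipd)


def ild_feature(spec: torch.Tensor, ref: int = 0, other: int = 1, eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical helper: ILD in dB, the per-bin ratio of STFT magnitudes."""
    return 20.0 * torch.log10((spec[other].abs() + eps) / (spec[ref].abs() + eps))


def gcc_phat(x_ref: torch.Tensor, x_other: torch.Tensor, max_lag: int) -> torch.Tensor:
    """Hypothetical helper: GCC-PHAT for lags -max_lag..max_lag (in samples).

    Both inputs are 1-D waveforms; assumes max_lag >= 1.
    """
    n = x_ref.shape[-1] + x_other.shape[-1]  # zero-pad to avoid circular wrap-around
    cross = torch.fft.rfft(x_other, n=n) * torch.fft.rfft(x_ref, n=n).conj()
    cross = cross / cross.abs().clamp_min(1e-8)  # PHAT weighting: keep the phase only
    cc = torch.fft.irfft(cross, n=n)
    # Reorder so that index 0 is lag -max_lag and index max_lag is lag 0.
    return torch.cat([cc[-max_lag:], cc[: max_lag + 1]])


# Toy two-channel signal: channel 1 is channel 0 attenuated and delayed by 5 samples.
torch.manual_seed(0)
x = torch.randn(1024)
delay = 5
wav = torch.stack([x, 0.5 * torch.cat([torch.zeros(delay), x[:-delay]])])

spec = multichannel_stft(wav)
cos_ipd, sin_ipd = ipd_features(spec)      # each of shape (freq, frames)
ild = ild_feature(spec)                    # shape (freq, frames)
cc = gcc_phat(wav[0], wav[1], max_lag=16)  # shape (2 * max_lag + 1,)
est_delay = int(cc.argmax()) - 16          # lag (in samples) of the GCC-PHAT peak
```

In this sketch a positive lag means that channel `other` lags behind channel `ref`. A unified implementation would also need to settle how microphone pairs are selected and how the features are batched, which is left open here.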

Motivation, pitch

torchaudio already supports extracting various commonly used spectral features, such as the spectrogram, mel spectrogram, MFCC, and LFCC: https://pytorch.org/audio/stable/transforms.html#feature-extractions.

Research on multi-microphone (multi-channel) data, on the other hand, often exploits spatial features in addition to the conventional spectral features [R1] [R2] [R3] [R4] [R5]. These works report significant performance improvements from the additional spatial features. Adding support for spatial feature extraction would therefore be very helpful to researchers in this field.

[R1] Y. Jiang, D. Wang, R. Liu, and Z. Feng, “Binaural classification for reverberant speech segregation using deep neural networks,” IEEE/ACM Trans. ASLP., vol. 22, no. 12, pp. 2112–2121, 2014.

[R2] X. Zhang, and D. Wang, “Deep learning based binaural speech separation in reverberant environments,” IEEE/ACM Trans. ASLP., vol. 25, no. 5, pp. 1075–1084, 2017.

[R3] Z.-Q. Wang, and D. Wang, “On spatial features for supervised speech separation and its application to beamforming and robust ASR,” in Proc. IEEE ICASSP, 2018, pp. 5709–5713.

[R4] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong, "Multi-channel overlapped speech recognition with location guided speech extraction network," in Proc. IEEE SLT, 2018, pp. 558–565.

[R5] R. Gu, L. Chen, S.-X. Zhang, J. Zheng, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, “Neural spatial filter: Target speaker speech separation assisted with directional information,” in Proc. ISCA Interspeech, 2019, pp. 4290–4294.

Alternatives

I can find some codebases that provide functions for extracting specific spatial features:

Additional context

Our team from ESPnet is interested in implementing a spatial-feature-driven multi-channel speech enhancement/separation model, in addition to the existing neural beamformer and FaSNet-based models.

Given that there is not a unified and torch-friendly implementation of the aforementioned spatial features, it would be great if torchaudio could provide these features.

Many thanks to @zqwang7 and @popcornell for the helpful discussions!
