pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License
2.49k stars 643 forks

Add support for spatial feature extraction on multi-microphone data #2577

Open Emrys365 opened 2 years ago

Emrys365 commented 2 years ago

🚀 The feature

In addition to the readily available spectral features (https://pytorch.org/audio/stable/transforms.html#feature-extractions), I would like to request support for extracting spatial features from multi-microphone (multi-channel) speech data.

There are many commonly used spatial features in the literature, including:

  1. Inter-channel Phase Difference (IPD) [1]: based on this, we can calculate cosIPD and sinIPD

    [1] J. Benesty, J. Chen, and Y. Huang, “Microphone array signal processing,” Springer Science & Business Media, vol. 1, 2008.

  2. Inter-channel Level Difference (ILD) [1]
  3. Generalized Cross-Correlation (GCC) [2], GCC with Phase Transform (GCC-PHAT) [2] and Steered-Response Power Phase Transform (SRP-PHAT) [3]

    [2] C. Knapp and G. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–327, 1976.

    [3] J. H. DiBiase, H. F. Silverman, and M. S. Brandstein, “Microphone Arrays Signal Processing Techniques and Applications: Robust Localization in Reverberant Rooms,” Ed. Berlin, Germany: Springer, 2001, pp. 157–180.

  4. Steering vector
  5. Angle feature [4] and directional feature [5]

    [4] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong, "Multi-channel overlapped speech recognition with location guided speech extraction network," in Proc. IEEE SLT, 2018, pp. 558–565.

    [5] Z.-Q. Wang and D. Wang, “On spatial features for supervised speech separation and its application to beamforming and robust ASR,” in Proc. IEEE ICASSP, 2018, pp. 5709–5713.
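To make the request concrete, here is a minimal sketch of three of the features listed above (IPD with its cos/sin variants, ILD, and GCC-PHAT). It is not an existing torchaudio API: the function names `ipd`, `ild`, `dft`, and `gcc_phat` are hypothetical, and the conventions (IPD as the phase of `Y_i * conj(Y_j)`, ILD as a magnitude ratio in dB) are assumptions, since the literature uses several variants. It is written in dependency-free Python with a naive DFT for readability; a torch version would presumably vectorize the same formulas over complex STFT tensors.

```python
import cmath
import math
import random


def ipd(y_i, y_j):
    """Inter-channel phase difference in radians, wrapped to [-pi, pi], per bin."""
    return [cmath.phase(a * b.conjugate()) for a, b in zip(y_i, y_j)]


def ild(y_i, y_j, eps=1e-8):
    """Inter-channel level difference in dB per bin (assumed magnitude-ratio form)."""
    return [20.0 * math.log10((abs(a) + eps) / (abs(b) + eps))
            for a, b in zip(y_i, y_j)]


def dft(x, inverse=False):
    """Naive O(N^2) DFT, used only to keep the sketch free of dependencies."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def gcc_phat(ref, sig, eps=1e-12):
    """Estimate the (circular) lag of `sig` relative to `ref` in samples."""
    ref_f, sig_f = dft(ref), dft(sig)
    # Cross-power spectrum; the PHAT weighting keeps only its phase.
    cross = [s * r.conjugate() for s, r in zip(sig_f, ref_f)]
    phat = [c / (abs(c) + eps) for c in cross]
    corr = [v.real for v in dft(phat, inverse=True)]
    n = len(corr)
    peak = max(range(n), key=lambda i: corr[i])
    # Map indices in the upper half of the buffer to negative lags.
    lag = peak if peak <= n // 2 else peak - n
    return lag, corr


# Toy complex STFT bins for two channels (made-up values for illustration).
y1 = [1 + 0j, 0 + 2j, -1 + 1j]
y2 = [0 + 1j, 0 + 1j, -1 + 1j]
phase_diff = ipd(y1, y2)
cos_ipd = [math.cos(p) for p in phase_diff]
sin_ipd = [math.sin(p) for p in phase_diff]
level_diff = ild(y1, y2)

# Toy time-domain pair: `sig` is `ref` circularly delayed by 3 samples.
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(32)]
delay = 3
sig = ref[-delay:] + ref[:-delay]
lag, _ = gcc_phat(ref, sig)
```

In this toy setup the estimated `lag` should match the circular `delay`. Real recordings would need framing, zero-padding, and possibly sub-sample interpolation, all of which are omitted here.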

Motivation, pitch

torchaudio already supports extracting various commonly used spectral features such as spectrogram, mel spectrogram, MFCC, and LFCC: https://pytorch.org/audio/stable/transforms.html#feature-extractions.

On the other hand, research on multi-microphone (multi-channel) data often exploits spatial features in addition to the conventional spectral features [R1] [R2] [R3] [R4] [R5]. The use of additional spatial features has been shown to bring significant performance improvements. Therefore, adding support for spatial feature extraction would be very helpful to researchers in this field.

[R1] Y. Jiang, D. Wang, R. Liu, and Z. Feng, “Binaural classification for reverberant speech segregation using deep neural networks,” IEEE/ACM Trans. ASLP., vol. 22, no. 12, pp. 2112–2121, 2014.

[R2] X. Zhang and D. Wang, “Deep learning based binaural speech separation in reverberant environments,” IEEE/ACM Trans. ASLP., vol. 25, no. 5, pp. 1075–1084, 2017.

[R3] Z.-Q. Wang and D. Wang, “On spatial features for supervised speech separation and its application to beamforming and robust ASR,” in Proc. IEEE ICASSP, 2018, pp. 5709–5713.

[R4] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong, "Multi-channel overlapped speech recognition with location guided speech extraction network," in Proc. IEEE SLT, 2018, pp. 558–565.

[R5] R. Gu, L. Chen, S.-X. Zhang, J. Zheng, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, “Neural spatial filter: Target speaker speech separation assisted with directional information,” in Proc. ISCA Interspeech, 2019, pp. 4290–4294.

Alternatives

I can find some codebases that provide functions for extracting specific spatial features:

  1. IPD:
  2. GCC, GCC-PHAT and SRP-PHAT:
  3. Steering vector:
  4. Directional / Angle feature:

Additional context

Our team from ESPnet is interested in implementing a spatial-feature-driven multi-channel speech enhancement/separation model, in addition to the existing neural beamformer or FaSNet based models.

Given that there is not a unified and torch-friendly implementation of the aforementioned spatial features, it would be great if torchaudio could provide these features.

Many thanks to @zqwang7 and @popcornell for the helpful discussions!

nateanl commented 2 years ago

Hi @Emrys365, the spatial feature proposal LGTM; those features are very standard for multi-channel processing.

Currently I don't have the bandwidth to work on it, but open-source contributions are welcome. We can use setk as the reference library for verifying the correctness. cc @funcwj