Open · Emrys365 opened this issue 2 years ago
Hi @Emrys365, the spatial feature proposal LGTM, those features are very standard for multi-channel processing.
Currently I don't have the bandwidth to work on it, but open-source contributions are welcome. We can use setk
as the reference library for verifying the correctness. cc @funcwj
🚀 The feature
In addition to the readily available spectral features (https://pytorch.org/audio/stable/transforms.html#feature-extractions), I would like to propose support for extracting spatial features from multi-microphone (multi-channel) speech data.
There are many commonly used spatial features in the literature, including:
Motivation, pitch
torchaudio already supports extracting various commonly used spectral features such as spectrogram, mel spectrogram, MFCC, and LFCC: https://pytorch.org/audio/stable/transforms.html#feature-extractions.
On the other hand, research on multi-microphone (multi-channel) data often exploits spatial features in addition to the conventional spectral features [R1] [R2] [R3] [R4] [R5]. The use of additional spatial features has been shown to bring significant performance improvements. Therefore, adding support for spatial feature extraction would be very helpful to researchers in this field.
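To make the request concrete, here is a minimal sketch of one commonly used spatial feature, the inter-channel phase difference (IPD), computed from a multi-channel STFT with plain torch. The function name, shapes, and cos/sin encoding are illustrative assumptions, not an existing torchaudio API.

```python
import torch

def ipd_features(waveform, n_fft=512, hop_length=256, ref_channel=0):
    """Sketch of inter-channel phase difference (IPD) extraction.

    waveform: (channels, time) multi-channel signal.
    Returns cos/sin-encoded IPD of shape (channels - 1, 2, freq, frames).
    """
    # Complex STFT per channel: (channels, freq, frames)
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)
    # Phase difference of each non-reference channel vs. the reference channel
    ref_phase = spec[ref_channel].angle()
    others = torch.cat([spec[:ref_channel], spec[ref_channel + 1:]])
    ipd = others.angle() - ref_phase
    # cos/sin encoding avoids phase-wrapping discontinuities
    return torch.stack([ipd.cos(), ipd.sin()], dim=1)

# Example: 4-channel, 1-second signal at 16 kHz
x = torch.randn(4, 16000)
feat = ipd_features(x)
print(feat.shape)  # torch.Size([3, 2, 257, 63])
```

Other spatial features (e.g. inter-channel level differences) follow the same pattern: per-channel STFT first, then a pairwise comparison against a reference channel.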
Alternatives
I can find some codebases that provide functions for extracting specific spatial features:
Additional context
Our team from ESPnet is interested in implementing a spatial-feature-driven multi-channel speech enhancement/separation model, in addition to the existing neural beamformer and FaSNet-based models.
Given that there is not a unified and torch-friendly implementation of the aforementioned spatial features, it would be great if torchaudio could provide these features.
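As a purely hypothetical illustration of what a unified interface could look like, the sketch below wraps the IPD computation in an nn.Module, following the convention of the existing torchaudio.transforms classes. The class name and signature are invented for this example and do not exist in torchaudio.

```python
import torch
from torch import nn

class InterChannelPhaseDifference(nn.Module):
    """Hypothetical transform computing cos/sin IPD, styled after
    torchaudio.transforms (this class does NOT exist in torchaudio)."""

    def __init__(self, n_fft=512, hop_length=256, ref_channel=0):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.ref_channel = ref_channel
        # Register the window as a buffer so it follows .to(device)
        self.register_buffer("window", torch.hann_window(n_fft))

    def forward(self, waveform):
        # waveform: (channels, time) -> complex STFT (channels, freq, frames)
        spec = torch.stft(waveform, n_fft=self.n_fft,
                          hop_length=self.hop_length,
                          window=self.window, return_complex=True)
        phase = spec.angle()
        # IPD against the reference channel (reference row is all zeros)
        ipd = phase - phase[self.ref_channel]
        return torch.stack([ipd.cos(), ipd.sin()], dim=1)

transform = InterChannelPhaseDifference()
feat = transform(torch.randn(4, 16000))
print(feat.shape)  # torch.Size([4, 2, 257, 63])
```

Packaging the feature as an nn.Module would let it compose with existing transforms in nn.Sequential and run on GPU like the current spectral transforms.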
Many thanks to @zqwang7 and @popcornell for the helpful discussions!