pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

🚀 Feature Request: Add Kaldi Pitch Feature #686

Closed: mthrok closed this 3 years ago

mthrok commented 4 years ago

🚀 Feature

Add a feature equivalent to Kaldi's compute-kaldi-pitch-feats.

Motivation

From https://github.com/pytorch/audio/issues/679#issuecomment-638446056

We found that the pitch feature always improved performance for several tonal languages (e.g., Chinese) and did not degrade performance for other languages. So espnet1 decided to use log Mel filterbank + pitch features as the default. However, pitch feature extraction is rather complicated, and we had some difficulty implementing it entirely with torch functions. So espnet2 decided to use only log Mel filterbank features instead. We still observe a slight degradation in ASR performance, but it can be mitigated with some tuning. We are now moving to espnet2, so we won't need this in the long term, but it would probably be quite beneficial in the short term, or for people who continue to use espnet1.

mthrok commented 4 years ago

@sw005320 For reference, could you point me to ESPNet1's implementation of pitch?

sw005320 commented 4 years ago

We simply call Kaldi's pitch extraction. We don't have our own implementation.

mthrok commented 4 years ago

I see, thanks!

mthrok commented 4 years ago

Some thoughts on the spec:

Interface

from torch import Tensor


def compute_pitch_feats(
        waveform: Tensor,
        delta_pitch: float = 0.005,
        frame_length: float = 25.,          # milliseconds
        frame_shift: float = 10.,           # milliseconds
        frames_per_chunk: int = 0,
        lowpass_cutoff: float = 1000.,      # Hz
        lowpass_filter_width: int = 1,
        max_f0: float = 400.,               # Hz
        max_frames_latency: int = 0,
        min_f0: float = 50.,                # Hz
        nccf_ballast: float = 7000.,
        nccf_ballast_online: bool = False,
        penalty_factor: float = 0.1,
        recompute_frame: int = 500,
        resample_frequency: float = 4000,   # Hz
        sample_frequency: float = 16000,    # Hz
        simulate_first_pass_online: bool = False,
        snip_edges: bool = True,
        soft_min_f0: float = 10.,           # Hz
        upsample_filter_width: int = 5,
) -> Tensor:
    """Equivalent of Kaldi's compute-kaldi-pitch-feats (interface only, not implemented)."""
    ...
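The frame_length, frame_shift and snip_edges parameters together determine how many output frames a waveform produces. As a rough illustration, the sketch below follows the general framing rule used by Kaldi's feature extractors (NumFrames in feature-window.cc). Kaldi's pitch extractor applies a related rule on the resampled signal with additional lag handling, so treat this as an approximation rather than the exact pitch frame count. The helper name num_frames is hypothetical.

```python
def num_frames(
    num_samples: int,
    sample_frequency: float = 16000.,
    frame_length: float = 25.,   # milliseconds
    frame_shift: float = 10.,    # milliseconds
    snip_edges: bool = True,
) -> int:
    """Hypothetical helper: frame count under Kaldi's generic framing rule."""
    # Convert window length and shift from milliseconds to samples, truncating like Kaldi.
    window_size = int(sample_frequency * 0.001 * frame_length)
    window_shift = int(sample_frequency * 0.001 * frame_shift)
    if snip_edges:
        # Only frames that fit entirely inside the signal are kept.
        if num_samples < window_size:
            return 0
        return 1 + (num_samples - window_size) // window_shift
    # Without snipping, the edges are padded, so the count depends only on the
    # shift: num_samples / window_shift rounded to the nearest integer.
    return (num_samples + window_shift // 2) // window_shift


print(num_frames(16000))                    # 98  (1 s at 16 kHz)
print(num_frames(16000, snip_edges=False))  # 100
```

With the defaults, one second of 16 kHz audio yields 98 frames when edges are snipped and 100 otherwise, which is one reason snip_edges needs to match between torchaudio and Kaldi in compatibility tests.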

Implementation

Test

  1. A new test suite in the Kaldi compatibility tests
  2. A set of parameters to be tested, similar to #689
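For the compatibility suite, each parameter set has to be passed both to the Python function and to Kaldi's compute-kaldi-pitch-feats binary. A minimal sketch, assuming the Python keyword names map one-to-one onto Kaldi's hyphenated option names (for example sample_frequency to --sample-frequency), that builds a small grid of parameter sets, each varying one option from its default, and renders each set as command-line flags. The DEFAULTS subset, the grid values, and both helper names are hypothetical choices for illustration, not the contents of #689 or of the final test suite.

```python
from typing import Dict, List, Union

Value = Union[bool, int, float]

# Kaldi's defaults for a subset of the options (illustrative subset only).
DEFAULTS: Dict[str, Value] = {
    "sample_frequency": 16000.,
    "frame_length": 25.,
    "frame_shift": 10.,
    "min_f0": 50.,
    "max_f0": 400.,
    "snip_edges": True,
}

# Hypothetical alternative values, tried one option at a time.
VARIATIONS: Dict[str, List[Value]] = {
    "sample_frequency": [8000.],
    "min_f0": [70.],
    "max_f0": [300.],
    "snip_edges": [False],
}


def make_param_sets() -> List[Dict[str, Value]]:
    """Defaults first, then one set per (option, value) variation."""
    sets = [dict(DEFAULTS)]
    for name, values in VARIATIONS.items():
        for value in values:
            params = dict(DEFAULTS)
            params[name] = value
            sets.append(params)
    return sets


def to_kaldi_args(params: Dict[str, Value]) -> List[str]:
    """Render keyword arguments as Kaldi-style --option=value flags."""
    args = []
    for name, value in params.items():
        flag = "--" + name.replace("_", "-")
        # Booleans are written as the strings "true"/"false" for Kaldi.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        args.append(f"{flag}={text}")
    return args


# Input/output specifiers (wav rspecifier, feats wspecifier) are omitted here.
for params in make_param_sets():
    print(" ".join(["compute-kaldi-pitch-feats", *to_kaldi_args(params)]))
```

Varying one option at a time keeps the number of Kaldi runs linear in the number of options and makes a mismatch easy to attribute to a single parameter.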

mthrok commented 4 years ago

@sw005320 I am looking at the Kaldi implementation and wondering whether we can limit the number of parameters we expose. For example, I do not think we need the parameters for online feature extraction. Is there a set of parameters you expect users to change?

https://kaldi-asr.org/doc/pitch-functions_8h_source.html#l00042

sw005320 commented 4 years ago

Sorry for my late response. We usually change only the sampling frequency (yes, that one is necessary) and keep the other parameters at their defaults, and it works robustly across various ASR tasks.

Also, I have not tried the online pitch feature, so I cannot comment on that part.
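One way to act on this feedback is to expose only the offline options and pin the online-only ones (frames_per_chunk, max_frames_latency, simulate_first_pass_online, recompute_frame, nccf_ballast_online) at their Kaldi defaults. A minimal sketch using a frozen dataclass; the class name PitchOptions, the helper full_kaldi_options, and the exact split between public and pinned options are assumptions for illustration, not the API that was eventually merged.

```python
from dataclasses import asdict, dataclass, fields, replace


@dataclass(frozen=True)
class PitchOptions:
    """Hypothetical reduced option set: offline parameters only, Kaldi defaults."""
    sample_frequency: float = 16000.   # Hz; the option users usually change
    frame_length: float = 25.          # milliseconds
    frame_shift: float = 10.           # milliseconds
    min_f0: float = 50.                # Hz
    max_f0: float = 400.               # Hz
    soft_min_f0: float = 10.           # Hz
    penalty_factor: float = 0.1
    lowpass_cutoff: float = 1000.      # Hz
    resample_frequency: float = 4000.  # Hz
    delta_pitch: float = 0.005
    nccf_ballast: float = 7000.
    lowpass_filter_width: int = 1
    upsample_filter_width: int = 5
    snip_edges: bool = True


# Online-only options, pinned at Kaldi's defaults instead of being exposed.
FIXED_ONLINE_OPTIONS = {
    "frames_per_chunk": 0,
    "max_frames_latency": 0,
    "simulate_first_pass_online": False,
    "recompute_frame": 500,
    "nccf_ballast_online": False,
}


def full_kaldi_options(opts: PitchOptions) -> dict:
    """Merge the public options with the pinned online ones."""
    return {**asdict(opts), **FIXED_ONLINE_OPTIONS}


# Typical use: only the sampling frequency differs from the defaults.
opts = replace(PitchOptions(), sample_frequency=8000.)
print(opts.sample_frequency, len(fields(PitchOptions)), len(full_kaldi_options(opts)))
```

This keeps the public surface at 14 options while the merged dictionary still covers all 19 from the original spec, so the pinned values can still be forwarded to a Kaldi-compatible backend.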

mthrok commented 3 years ago

The Kaldi pitch feature was added in #1243 and will be released as a beta feature in the upcoming 0.8.0 release. We welcome feedback on it.