Closed helia95 closed 4 years ago
Hi. Thank you for your interest in my work. I think that's because you do not normalize the input features. You can use the code below:
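    import librosa
    import numpy as np
    from python_speech_features import fbank

    # filename and sample_rate (e.g. 16000) are set by the caller
    audio, sr = librosa.load(filename, sr=sample_rate, mono=True)
    filter_banks, energies = fbank(audio, samplerate=sample_rate, nfilt=40, winlen=0.025)
    filter_banks = 20 * np.log10(np.maximum(filter_banks, 1e-5))
    feature = normalize_frames(filter_banks, Scale=False)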
Here, the function "normalize_frames" is as below:
    def normalize_frames(m, Scale=True):
        if Scale:
            return (m - np.mean(m, axis=0)) / (np.std(m, axis=0) + 2e-12)
        else:
            return (m - np.mean(m, axis=0))
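To check what the normalization does without needing an audio file, here is a minimal sketch that runs the same two steps (dB conversion, then per-filter mean subtraction) on random data. The array `fake_fbank` and its shape of 200 frames by 40 filters are assumptions made for illustration only; they stand in for the output of `fbank` and are not real features.

```python
import numpy as np

def normalize_frames(m, Scale=True):
    # m has shape (num_frames, num_filters); statistics are taken over frames
    centered = m - np.mean(m, axis=0)
    if Scale:
        return centered / (np.std(m, axis=0) + 2e-12)
    return centered

# Hypothetical stand-in for filter bank energies: 200 frames x 40 filters
rng = np.random.default_rng(0)
fake_fbank = rng.uniform(1e-3, 10.0, size=(200, 40))

# Same dB conversion as in the snippet above, with the 1e-5 floor
log_fbank = 20 * np.log10(np.maximum(fake_fbank, 1e-5))

feature = normalize_frames(log_fbank, Scale=False)  # mean subtraction only
scaled = normalize_frames(log_fbank, Scale=True)    # mean and variance

print(feature.shape)
print(np.allclose(feature.mean(axis=0), 0.0))  # each filter is zero-mean over time
print(np.allclose(scaled.std(axis=0), 1.0))    # unit variance only when Scale=True
```

With `Scale=False`, as used above, each of the 40 filter channels is centered over the utterance but keeps its original variance.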
Hello, thanks for this great tutorial! I'm not able to reproduce the feature extraction step. Can you please point me in the right direction?
Right now I'm using logfbank from the python_speech_features library, with a sample rate of 16000 and 40 filters.
Many thanks!