tensorflow / models

Models and examples built with TensorFlow

How can we relate the parameters defined in the VGGish network to nperseg and NFFT? #8539

Closed tusharpoddar closed 4 years ago

tusharpoddar commented 4 years ago
I have been working on changing the VGGish model for my research. In order to get the spectrograms that the vggish_input file provides, I am using the scipy function called specgram, which takes as inputs the wav file and parameters like nperseg and NFFT. I got a really good picture of my spectrogram with nperseg 128 and NFFT 512. How should I think of these two values in terms of the parameters defined in the vggish_params file?
tusharpoddar commented 4 years ago

file_len=10 -----> length of the wav file that I am using, in minutes
slice_len=25 -----> size of each window, in milliseconds
step_size=0.9 -----> hop
spec_window=128 -----> nperseg
NFFT=512

How do I convert these to the parameters given in the vggish_params file?

dpwe commented 4 years ago

I assume you're using scipy.signal.spectrogram (specgram is defined in matplotlib.pyplot, but doesn't have an nperseg argument).

vggish takes log-mel-spectrum slices as input, as calculated by log_mel_spectrogram.

This calculates a spectrogram internally with:

  spectrogram = stft_magnitude(
      data,
      fft_length=fft_length,
      hop_length=hop_length_samples,
      window_length=window_length_samples)

I think you can replace this with a call to scipy.signal.spectrogram like:

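  # Needs `import numpy as np` and `import scipy.signal` at the top of the file.
  # Note: the default mode="psd" returns power; pass mode="magnitude" for magnitudes.
  spectrogram = np.abs(scipy.signal.spectrogram(
      data,
      nperseg=window_length_samples,
      # Overlap between consecutive windows = window length - hop.
      noverlap=window_length_samples - hop_length_samples,
      # [2] selects the spectrogram array; transpose puts time on axis 0.
      nfft=fft_length)[2]).transpose()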

This uses a different window than the built-in stft_magnitude (Tukey with shape parameter 0.25 vs. periodic Hann), but the results will be pretty similar.
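
If a closer match to the built-in is wanted, one option is to hand scipy a Hann window and undo its scaling. The following is only a sketch under assumptions: `scipy_stft_magnitude` is a hypothetical helper name, the built-in is assumed to take the plain rfft magnitude of Hann-windowed frames with no detrending, and the 16 kHz rate in the usage lines is assumed rather than taken from this thread.

```python
import numpy as np
import scipy.signal


def scipy_stft_magnitude(data, fft_length, hop_length_samples,
                         window_length_samples):
    """Hypothetical helper: STFT magnitudes via scipy, time on axis 0."""
    # get_window returns a periodic window by default (fftbins=True).
    window = scipy.signal.get_window("hann", window_length_samples)
    _, _, sxx = scipy.signal.spectrogram(
        data,
        window=window,
        nperseg=window_length_samples,
        noverlap=window_length_samples - hop_length_samples,
        nfft=fft_length,
        detrend=False,       # the default removes each frame's mean
        scaling="spectrum",  # magnitudes come back divided by window.sum()
        mode="magnitude")    # the default "psd" returns power instead
    # Undo scipy's normalisation, then put time on axis 0.
    return (sxx * window.sum()).transpose()


# Usage: one second of noise at an assumed 16 kHz sample rate.
data = np.random.default_rng(0).standard_normal(16000)
mag = scipy_stft_magnitude(data, fft_length=512, hop_length_samples=160,
                           window_length_samples=400)
print(mag.shape)  # (98, 257): 98 frames, 512 // 2 + 1 frequency bins
```

Whether this matches stft_magnitude exactly depends on how that function frames and windows the signal, so it is worth comparing the two outputs on a short clip before relying on it.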

window_length_samples is the number of samples in params.STFT_WINDOW_LENGTH_SECONDS (i.e., the integer closest to the time in seconds multiplied by the sample rate), and hop_length_samples is the number of samples in params.STFT_HOP_LENGTH_SECONDS. If you alter these from their defaults of 0.025 and 0.010, the input to the classifier won't match the training data, and the results may not be particularly meaningful. It's just an embedding, though, so maybe it will still be useful.
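
The seconds-to-samples arithmetic can be sketched as below. Two things here are assumptions and not stated in this thread: the 16 kHz sample rate (check vggish_params for the real value) and the choice of fft_length as the next power of two at or above the window length. The helper names are hypothetical.

```python
# Defaults quoted in the thread; the sample rate is an assumption.
SAMPLE_RATE = 16000                  # assumed, see vggish_params
STFT_WINDOW_LENGTH_SECONDS = 0.025
STFT_HOP_LENGTH_SECONDS = 0.010


def seconds_to_samples(seconds, sample_rate):
    """Integer number of samples closest to `seconds` of audio."""
    return int(round(seconds * sample_rate))


def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


window_length_samples = seconds_to_samples(STFT_WINDOW_LENGTH_SECONDS, SAMPLE_RATE)
hop_length_samples = seconds_to_samples(STFT_HOP_LENGTH_SECONDS, SAMPLE_RATE)
# Assumption: the FFT size is the window length rounded up to a power of two.
fft_length = next_power_of_two(window_length_samples)

# The same quantities in scipy.signal.spectrogram terms.
nperseg = window_length_samples                        # 400
noverlap = window_length_samples - hop_length_samples  # 240
nfft = fft_length                                      # 512

# Going the other way: nperseg=128 at the assumed rate is an 8 ms window.
window_seconds = 128 / SAMPLE_RATE
print(nperseg, noverlap, nfft, window_seconds)
```

Under these assumptions, NFFT=512 already agrees with the default FFT size, while nperseg=128 corresponds to a window of about 0.008 s instead of the default 0.025 s.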

tusharpoddar commented 4 years ago

Also, you said that if I change the parameters, the input to the classifier won't match. How do I train my own model? That is, how do I give it my own training data?