VGGish spectrogram feature extractor has invalid shapes, compared to the paper.

mostafa-mahmoud commented 3 years ago

The output of the feature extractor in mel_features.py does not match the shape described in the paper.

Section 3.1 of the paper states that non-overlapping frames of 0.96 s each yield a log-mel spectrogram of shape (96, 64). However, the available code produces a spectrogram of shape (94, 64) for a 0.96 s input.

Steps to reproduce

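import numpy as np
from mel_features import log_mel_spectrogram  # mel_features.py from the VGGish code

parameters = {
  'audio_sample_rate': 16000,
  'log_offset': 0.01,
  'window_length_secs': 0.025,
  'hop_length_secs': 0.010,
  'num_mel_bins': 64,
}
# 0.96 s of random noise at 16 kHz (15360 samples)
data = np.random.normal(size=(int(parameters['audio_sample_rate'] * 0.96),))
spectrogram = log_mel_spectrogram(data, **parameters)
print(spectrogram.shape)  # (94, 64), not the (96, 64) given in the paper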

Here is a gist that reproduces the issue. The parameters match the defaults here.

The shape given in the paper suggests that some kind of padding was applied. To reach 96 frames, the total padding would need a length in [0.015, 0.025) seconds (corresponding to [240, 400) samples at 16 kHz), typically with half before the track and half after.
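The numbers above can be checked with a short calculation. This is a minimal sketch that assumes frames are taken with no padding and that any incomplete final frame is dropped, which is consistent with the observed 94 frames; `num_frames` is a hypothetical helper written for this illustration, not a function from mel_features.py.

```python
def num_frames(num_samples, window_length, hop_length):
    """Count full frames, assuming no padding (hypothetical helper)."""
    if num_samples < window_length:
        return 0
    return 1 + (num_samples - window_length) // hop_length


sample_rate = 16000
window_length = int(round(sample_rate * 0.025))  # 400 samples
hop_length = int(round(sample_rate * 0.010))     # 160 samples
num_samples = int(round(sample_rate * 0.96))     # 15360 samples

frames = num_frames(num_samples, window_length, hop_length)
print(frames)  # 94

# Shortest and longest input lengths that give exactly 96 frames.
min_len = window_length + 95 * hop_length  # 15600 samples
max_len = min_len + hop_length - 1         # 15759 samples

pad_min = min_len - num_samples  # 240 samples = 0.015 s
pad_max = max_len - num_samples  # 399 samples, just under 0.025 s
print(pad_min, pad_max)  # 240 399
```

Under these assumptions, 94 frames is simply what a 25 ms window with a 10 ms hop gives for 15360 samples, and the [240, 400) range is the amount of extra signal needed for two more frames.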

Why does this occur? Is it a padding issue?
If so, what exactly is the padding length, and what is the padding mode ("CONSTANT", "REFLECT", or "SYMMETRIC")?

mostafa-mahmoud commented 3 years ago

After examining vggish_input.py, it seems that a full track is first converted to a spectrogram, which is then split into non-overlapping examples of 96 frames each; any remaining frames at the end (fewer than 96) are discarded.
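The behaviour described above can be sketched as follows. This is an illustration only: `split_into_examples` is a hypothetical helper, not the code in vggish_input.py, and it assumes the examples do not overlap.

```python
import numpy as np


def split_into_examples(log_mel, frames_per_example=96):
    """Split a (num_frames, num_bins) spectrogram into non-overlapping
    examples, dropping leftover frames at the end (hypothetical helper)."""
    num_examples = log_mel.shape[0] // frames_per_example
    trimmed = log_mel[: num_examples * frames_per_example]
    return trimmed.reshape(num_examples, frames_per_example, log_mel.shape[1])


# A 10 s track at 16 kHz gives 1 + (160000 - 400) // 160 = 998 frames.
log_mel = np.zeros((998, 64))
examples = split_into_examples(log_mel)
print(examples.shape)  # (10, 96, 64); the last 38 frames are dropped
```

With this helper, a (94, 64) spectrogram from a 0.96 s input yields zero examples, so the (96, 64) shape only appears when the track is long enough to contain at least 96 frames.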

tensorflow / models

VGGish spectrogram feature extractor has invalid shapes, compared to the paper. #9938