The output of the feature extractor mel_features.py does not match the details in the paper.
Section 3.1 of the paper states that non-overlapping examples of 0.96 s each yield a log-mel spectrogram of shape (96, 64). However, the available code produces spectrograms of shape (94, 64).
Here is a gist that generates the issue. The parameters match the defaults here.
The shape reported in the paper suggests that some kind of padding was applied. The total padding would have to be in the range [0.015, 0.025) seconds (corresponding to [240, 400) samples), typically with half of it added before the track and half after.
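The arithmetic behind these numbers can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes a 16 kHz sample rate, a 25 ms window, a 10 ms hop, and a frame count of `1 + floor((num_samples - window) / hop)` with no padding. The helper `num_frames` is a hypothetical name introduced only for this example.

```python
# Assumed parameters (not stated in this issue): 16 kHz audio,
# 25 ms analysis window, 10 ms hop.
SAMPLE_RATE = 16000
WINDOW = int(round(0.025 * SAMPLE_RATE))  # 400 samples
HOP = int(round(0.010 * SAMPLE_RATE))     # 160 samples


def num_frames(num_samples, window=WINDOW, hop=HOP):
    """Count the complete windows that fit in the signal, without padding."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


example_samples = int(round(0.96 * SAMPLE_RATE))  # 15360 samples
frames_without_padding = num_frames(example_samples)

# Smallest total padding that gives 96 frames, and the smallest padding
# that would already give 97 frames (exclusive upper bound).
min_pad = WINDOW + 95 * HOP - example_samples
max_pad = WINDOW + 96 * HOP - example_samples

print(frames_without_padding)                        # 94
print(min_pad, max_pad)                              # 240 400
print(min_pad / SAMPLE_RATE, max_pad / SAMPLE_RATE)  # 0.015 0.025
```

Under these assumptions, any total padding of 240 to 399 samples turns the 94 frames into exactly 96.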
Why does this occur? Is it a padding issue?
If so, what exactly is the padding length, and what is the padding mode ("CONSTANT", "REFLECT", or "SYMMETRIC")?
After examining vggish_input.py, it seems that the full track is first converted to a spectrogram, which is then split into examples of 96 frames each; any remaining frames at the end (fewer than 96) are discarded.
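The behaviour described above can be illustrated with a small sketch. It is only an approximation of what vggish_input.py appears to do, under the same assumed parameters (16 kHz audio, 25 ms window, 10 ms hop, no padding); `count_examples` is a hypothetical helper, not a function from the repository.

```python
def count_examples(track_seconds, sample_rate=16000,
                   window_s=0.025, hop_s=0.010, example_frames=96):
    """Return (spectrogram_frames, full_examples, discarded_frames).

    The whole track is framed once, then cut into non-overlapping groups
    of `example_frames` frames; the incomplete group at the end is dropped.
    All default parameter values are assumptions made for illustration.
    """
    window = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    num_samples = int(round(track_seconds * sample_rate))
    if num_samples < window:
        frames = 0
    else:
        frames = 1 + (num_samples - window) // hop
    return frames, frames // example_frames, frames % example_frames


print(count_examples(10.0))  # (998, 10, 38)
print(count_examples(0.96))  # (94, 0, 94)
```

If this reading is right, a track of exactly 0.96 s would yield 94 frames and therefore no complete example, while each example cut from a longer track would span 400 + 95 * 160 = 15600 samples (0.975 s) of audio.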