Closed jampeg closed 4 years ago
Hi,
Is your input file ~19 seconds long? That would explain the size. For a single file the output is `[n_seconds, features]`, whereas a batched input would be `[n_audio_files, n_seconds, features]`.
Hope that makes sense.
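The shape convention above can be sketched with NumPy arrays standing in for the embeddings (the array contents here are placeholders, not real VGGish output):

```python
import numpy as np

# One ~19-second file: VGGish yields one 128-d vector per ~1 s of audio.
single_file = np.zeros((19, 128))   # [n_seconds, features]

# A hypothetical batch of 4 such files stacks a leading axis.
batch = np.zeros((4, 19, 128))      # [n_audio_files, n_seconds, features]

print(single_file.shape)  # (19, 128)
print(batch.shape)        # (4, 19, 128)
```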
Hi, thank you for your reply. Yes, the output audio embedding has shape [19, 128] when the input file is about 19 seconds long. However, I want an output audio embedding of shape [1, 128]. (I need a single 128-dimensional audio embedding for each input audio file, no matter how many seconds long the file is.) Do I need to apply some fusion step after obtaining the [19, 128] output embedding?
Best Regards.
Hi,
I have the same doubt as the one @jampeg raised.
VGGish is supposed to give a 128x1 embedding for an audio file, if I am not wrong? So how do we fuse the `[n_seconds, features]` output to be `[1, features]`?
Thanks
Edit: I missed a detail while reading about VGGish. It doesn't give 128 features for each audio file, but for each ~1-second audio segment. So I guess we will get a `[n_seconds, features]` vector for an input audio file. But my question still remains: if I were to do classification using VGGish as a feature extractor, how would I fuse the output for an audio file to get 1 x 128 features? And this has to be consistent for every audio file, regardless of its duration.
I understand now.
@jampeg, @SrijithBalachander - I don't know the best way of achieving this, but I would recommend you take a look at work on temporal aggregation, such as ActionVLAD - particularly the section on "Modelling long-term temporal structure" (http://openaccess.thecvf.com/content_cvpr_2017/papers/Girdhar_ActionVLAD_Learning_Spatio-Temporal_CVPR_2017_paper.pdf). I am not an expert on this, but that should be a good starting point for solving your problem. Alternatively, you could just take the mean of your 19 feature vectors, but you will lose some resolution this way I imagine.
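The mean-pooling option mentioned above can be sketched in a few lines of NumPy (the `embeddings` array here is a random placeholder for a real [19, 128] VGGish output):

```python
import numpy as np

# Placeholder for VGGish output of one ~19 s file: [n_seconds, features].
embeddings = np.random.rand(19, 128)

# Simple temporal aggregation: average over the time axis.
# keepdims=True preserves a leading dim, giving [1, 128] instead of [128].
clip_embedding = embeddings.mean(axis=0, keepdims=True)

print(clip_embedding.shape)  # (1, 128)
```

This gives a fixed-size [1, 128] vector for any clip length, at the cost of discarding temporal structure; learned aggregation schemes like ActionVLAD aim to keep more of it.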
I'll close the issue now if that's all. Please @ me if you need any further advice. Thanks.
Hello,
I tried this implementation following the Usage instructions. The size of the output tensor is [19, 128]. Do I need to fuse the output tensor in order to convert it from [19, 128] to [1, 128]?
Is the audio embedding obtained per audio file, or per batch?
In my understanding, it is obtained per audio file.