harritaylor / torchvggish

PyTorch port of Google Research's VGGish model, used for extracting audio features.
Apache License 2.0

Output size #17

Closed · jampeg closed this issue 4 years ago

jampeg commented 4 years ago

Hello,

I tried this implementation following the Usage instructions. The size of the output tensor is [19, 128]. Do I need to fuse the output tensor in order to convert it from [19, 128] to [1, 128]? Is the audio embedding obtained per audio file, or per batch?

In my understanding, it should be one embedding per audio file.

harritaylor commented 4 years ago

Hi, is your input file ~19 seconds long? That would explain the size. A single file yields [n_seconds, features], whereas a batched input would be [n_audio_files, n_seconds, features].

Hope that makes sense.
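For reference, here is a minimal sketch of how I'd check the shapes, based on the torch.hub usage in the README (the .wav path is just a placeholder):

```python
import torch

# Load the pretrained VGGish port via torch.hub (usage from the README)
model = torch.hub.load('harritaylor/torchvggish', 'vggish')
model.eval()

# forward() accepts a path to a wav file and returns one 128-d
# embedding per ~1-second frame of audio.
embeddings = model.forward('example.wav')  # placeholder path
print(embeddings.shape)  # e.g. torch.Size([19, 128]) for a ~19 s file
```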

jampeg commented 4 years ago

Hi, thank you for your reply. Yes, the output audio embedding is [19, 128] when the input file is about 19 seconds long. However, I want an output audio embedding of shape [1, 128]. (I need a single 128-dimensional embedding for each input audio file, regardless of how many seconds long it is.) Do I need a fusion step after getting the [19, 128] output?

Best regards.

SrijithBalachander commented 4 years ago

Hi, I have the same question as the one @jampeg raised. VGGish is supposed to give a 128x1 embedding for an audio file, if I am not wrong? So how do we fuse the [n_seconds, features] output into [1, features]?

Thanks

Edit: I missed a detail while reading about VGGish. It doesn't give 128 features for each audio file, but for each ~1-second audio segment. So we get a [n_seconds, features] tensor for an input audio file. But my question still remains: if I were to do classification using VGGish as a feature extractor, how would I fuse the output for an audio file to get 1 x 128 features? And this has to be consistent for each audio file, no matter the duration.

harritaylor commented 4 years ago

> Hi, thank you for your reply. Yes, the output audio embedding is [19, 128] when the input file is about 19 seconds long. However, I want an output audio embedding of shape [1, 128]. (I need a single 128-dimensional embedding for each input audio file, regardless of how many seconds long it is.) Do I need a fusion step after getting the [19, 128] output?

I understand now.

@jampeg, @SrijithBalachander - I don't know the best way of achieving this, but I would recommend taking a look at work on temporal aggregation, such as ActionVLAD, particularly the section on "Modelling long-term temporal structure" (http://openaccess.thecvf.com/content_cvpr_2017/papers/Girdhar_ActionVLAD_Learning_Spatio-Temporal_CVPR_2017_paper.pdf). I am not an expert on this, but that should be a good starting point for solving your problem. Alternatively, you could simply take the mean of your 19 feature vectors, though I imagine you will lose some temporal resolution that way; see the sketch below.
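For the quick mean-pooling option, a one-line sketch (assuming `embeddings` is the [n_seconds, 128] tensor the model returns):

```python
import torch

# embeddings: [n_seconds, 128] VGGish output for a single file (assumed)
# Average over the time axis to obtain a fixed-size clip-level embedding.
clip_embedding = embeddings.float().mean(dim=0, keepdim=True)  # -> [1, 128]
```

This gives a consistent [1, 128] vector regardless of clip duration, at the cost of discarding temporal ordering.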

I'll close the issue now if that's all. Please @ me if you need any further advice. Thanks.