Jerry2001 opened 1 year ago
Also, for FSD50K: for sound files of different lengths in the same batch, should I zero-pad them to the same length before passing them to the model?
Hi! Thank you for your interest!
The problem is that the model you've loaded was trained on 10-second clips (AudioSet and cropped FSD50K), and the audio file you're processing is longer than 10 seconds (around 13.5 seconds, judging from the error), so there are not enough trained time positional encodings to cover the 13.5 seconds.
The `get_scene_embeddings` function takes care of this by checking whether the audio is longer than the largest length the model can handle here.
This is only a problem for inputs longer than 10 seconds; the model can handle shorter clips here by cropping the time positional encodings to match the input. If you use batched inputs, you can pad the shorter clips. If you're doing inference one clip at a time, the only constraint is having enough time positional encodings to cover the whole input.
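For the batched case, padding to the longest clip in the batch can be sketched like this (a minimal illustration; `pad_batch`, the 32 kHz sample rate, and the clip lengths are my own assumptions, not the repo's code):

```python
import torch

def pad_batch(waveforms):
    """Zero-pad a list of 1-D waveforms to the longest length and stack them."""
    max_len = max(w.shape[-1] for w in waveforms)
    batch = torch.zeros(len(waveforms), max_len)
    for i, w in enumerate(waveforms):
        # copy the clip into the front of its row; the tail stays zero
        batch[i, : w.shape[-1]] = w
    return batch

# e.g. 3 s, 5 s, and 10 s clips at an assumed 32 kHz sample rate
clips = [torch.randn(32000 * s) for s in (3, 5, 10)]
batch = pad_batch(clips)
print(batch.shape)  # torch.Size([3, 320000])
```

The padded batch can then be passed to the model in one call instead of clip by clip.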
One possible workaround is to take (overlapping) windows of 10 seconds and average the resulting embeddings; this is done here.
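The overlapping-window idea can be sketched as follows, with a toy stand-in for the embedding call (`windowed_embedding`, `embed_fn`, and the window/hop lengths are assumptions for illustration, not the repo's API):

```python
import torch

def windowed_embedding(waveform, embed_fn, sr=32000, win_s=10.0, hop_s=5.0):
    """Embed a long clip by averaging embeddings of overlapping windows.

    `embed_fn` maps a (1, num_samples) tensor to a (1, dim) embedding.
    """
    win, hop = int(win_s * sr), int(hop_s * sr)
    if waveform.shape[-1] <= win:
        # short enough: a single forward pass suffices
        return embed_fn(waveform.unsqueeze(0)).squeeze(0)
    # slice overlapping windows; the last one may be shorter than `win`
    chunks = [waveform[start : start + win]
              for start in range(0, waveform.shape[-1] - hop, hop)]
    embeds = [embed_fn(c.unsqueeze(0)).squeeze(0) for c in chunks]
    return torch.stack(embeds).mean(dim=0)

# toy stand-in for the model: mean absolute amplitude as a 1-D "embedding"
fake_embed = lambda x: x.abs().mean(dim=-1, keepdim=True)
clip = torch.randn(int(13.5 * 32000))  # a 13.5-second clip, as in the error
emb = windowed_embedding(clip, fake_embed)
print(emb.shape)  # torch.Size([1])
```

In practice `embed_fn` would be the model's embedding call; the averaging keeps every window within the trained positional-encoding length.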
During training, I'm cropping and padding the raw waveforms with zeros here.
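That crop-or-pad step can be sketched like this (the function name and the 10-second target at an assumed 32 kHz are mine, not the repo's exact code):

```python
import torch

def crop_or_pad(waveform, target_len=320000):
    """Crop a 1-D waveform to `target_len` samples, or zero-pad it on the right."""
    n = waveform.shape[-1]
    if n >= target_len:
        return waveform[..., :target_len]
    # right-pad with zeros up to the fixed training length
    return torch.nn.functional.pad(waveform, (0, target_len - n))

y = crop_or_pad(torch.randn(500000))  # longer clip gets cropped
print(y.shape)  # torch.Size([320000])
```

Every training example then has the same fixed length, which is what makes the 10-second positional-encoding budget sufficient.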
I hope this helps.
Hello,
First of all, thank you for the awesome and very well-written paper and repo.
I currently want to use the embeddings of these pre-trained models for my project. The following is the inference code I wrote for FSD50K.
When I do `embed.shape` I get `torch.Size([3, 1295])`, so I basically get what I need already. But when I double-check by trying to get the logits through `model()`, it gives me the following error. I tried a few other audio files from FSD50K; some give me logits and the correct prediction, but others just give errors like this. What could the issue be? Do I need to worry about it, or can I just use the embeddings? My other question is whether the input batch size is fixed. For the model I loaded, I have to input a batch of 3 audio clips. Is there a way to use a different batch size?