Closed mulatikhr closed 2 weeks ago
Hi, thank you for your interest in our work!
While the provided model checkpoints can technically handle audio clips of 3-4 minutes during inference, their performance may decline with longer inputs, as they were trained on 10-second audio segments across all datasets. To achieve reliable results with longer audio, retraining or fine-tuning the model is recommended.
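If you want to try the released checkpoints on longer clips without retraining, a common workaround is sliding-window inference: split the long waveform into the 10-second segments the model was trained on, run each segment through the model, and average the clip-level logits. Here is a minimal sketch of that idea, assuming a `model` callable that maps one fixed-length waveform to logits; the function names (`chunk_waveform`, `predict_long_clip`) are hypothetical and not part of this repository:

```python
import numpy as np

def chunk_waveform(wav: np.ndarray, sr: int, win_s: float = 10.0, hop_s: float = 10.0):
    """Split a 1-D waveform into fixed-length windows, zero-padding the last one."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    chunks = []
    for start in range(0, max(len(wav), 1), hop):
        seg = wav[start:start + win]
        if len(seg) == 0:
            break
        if len(seg) < win:
            seg = np.pad(seg, (0, win - len(seg)))  # zero-pad the tail segment
        chunks.append(seg)
    return chunks

def predict_long_clip(model, wav: np.ndarray, sr: int) -> np.ndarray:
    """Run the 10-second model on each window and average the per-window logits."""
    logits = [model(seg) for seg in chunk_waveform(wav, sr)]
    return np.mean(logits, axis=0)
```

Averaging logits over windows is only a heuristic; events shorter than one window can get diluted, which is why fine-tuning on longer segments is still the more reliable option.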
To help you get started, here are some resources in the repository that may be useful:
- `examples/inference` — guidance on running inference with an existing model checkpoint. This can help you form a better understanding of the critical model arguments when working with external datasets.
- For retraining or fine-tuning, the scripts under the `exps/` folder could be helpful. For instance:
  - `exps/vggsound/aum-base_scratch-vggsound.sh` --> for training a model from scratch
  - `exps/vggsound/aum-base_audioset-vggsound.sh` --> for fine-tuning an already trained model (here, pretrained on AudioSet) on another dataset (here, VGGSound)
- The `src/run.py` and `src/dataloader.py` files may help you better understand data-related factors such as loading and processing.
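To illustrate the fixed-length preprocessing discussed above, here is a small hedged sketch of what such a pipeline typically does before feature extraction; the function names and constants are hypothetical and are not taken from `src/dataloader.py`:

```python
import numpy as np

def fix_length(wav: np.ndarray, sr: int, target_s: float = 10.0) -> np.ndarray:
    """Truncate or zero-pad a waveform to exactly target_s seconds,
    matching the fixed 10-second training segments mentioned above."""
    n = int(target_s * sr)
    return wav[:n] if len(wav) >= n else np.pad(wav, (0, n - len(wav)))

def peak_normalize(wav: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Scale the waveform so its peak magnitude is 1 (a common, but
    dataset-dependent, preprocessing step)."""
    return wav / (np.abs(wav).max() + eps)
```

When adapting to an external dataset, the sample rate, segment length, and normalization should match whatever the checkpoint was trained with, which is exactly what those two source files document.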
I very much hope to hear back from you; I'm very interested in this paper of yours.