Closed mulatikhr closed 2 weeks ago
Hi, thank you for your interest in our work!
While the provided model checkpoints can technically handle audio clips of 3-4 minutes during inference, their performance may decline with longer inputs, as they were trained on 10-second audio segments across all datasets. To achieve reliable results with longer audio, retraining or fine-tuning the model is recommended.
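If you want to try the released checkpoints on longer clips without retraining, a common workaround is sliding-window inference: split the long waveform into the 10-second segments the model was trained on, run each segment through the model, and average the clip-level logits. Here is a minimal sketch of that idea, assuming a `model` callable that maps one fixed-length waveform to logits; the function names (`chunk_waveform`, `predict_long_clip`) are hypothetical and not part of this repository:

```python
import numpy as np

def chunk_waveform(wav: np.ndarray, sr: int, win_s: float = 10.0, hop_s: float = 10.0):
    """Split a 1-D waveform into fixed-length windows, zero-padding the last one."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    chunks = []
    for start in range(0, max(len(wav), 1), hop):
        seg = wav[start:start + win]
        if len(seg) == 0:
            break
        if len(seg) < win:
            seg = np.pad(seg, (0, win - len(seg)))  # zero-pad the tail segment
        chunks.append(seg)
    return chunks

def predict_long_clip(model, wav: np.ndarray, sr: int) -> np.ndarray:
    """Run the 10-second model on each window and average the per-window logits."""
    logits = [model(seg) for seg in chunk_waveform(wav, sr)]
    return np.mean(logits, axis=0)
```

Averaging logits over windows is only a heuristic; events shorter than one window can get diluted, which is why fine-tuning on longer segments is still the more reliable option.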
To help you get started, here are some resources in the repository that may be useful:
- `examples/inference` — guidance on running inference with an existing model checkpoint. This can help you form a better understanding of the critical model arguments when working with external datasets.
- For retraining or fine-tuning, the scripts under the `exps/` folder could be helpful. For instance:
  - `exps/vggsound/aum-base_scratch-vggsound.sh` --> for training a model from scratch
  - `exps/vggsound/aum-base_audioset-vggsound.sh` --> for fine-tuning an already trained model (here, pretrained on AudioSet) on another dataset (here, VGGSound)
- The `src/run.py` and `src/dataloader.py` files may help you better understand data-related factors such as loading and processing.
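To illustrate the fixed-length preprocessing discussed above, here is a small hedged sketch of what such a pipeline typically does before feature extraction; the function names and constants are hypothetical and are not taken from `src/dataloader.py`:

```python
import numpy as np

def fix_length(wav: np.ndarray, sr: int, target_s: float = 10.0) -> np.ndarray:
    """Truncate or zero-pad a waveform to exactly target_s seconds,
    matching the fixed 10-second training segments mentioned above."""
    n = int(target_s * sr)
    return wav[:n] if len(wav) >= n else np.pad(wav, (0, n - len(wav)))

def peak_normalize(wav: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Scale the waveform so its peak magnitude is 1 (a common, but
    dataset-dependent, preprocessing step)."""
    return wav / (np.abs(wav).max() + eps)
```

When adapting to an external dataset, the sample rate, segment length, and normalization should match whatever the checkpoint was trained with, which is exactly what those two source files document.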
I very much hope to hear back from you; I'm very interested in this paper of yours.