- 13 October 2024: AuM is accepted at SPL (Signal Processing Letters): https://ieeexplore.ieee.org/document/10720871
- 26 June 2024: Code cleanup, enhancements and improvements!
- 16 June 2024: Added more details for EPIC-SOUNDS!
- 11 June 2024: Training scripts released!
- 10 June 2024: Setup guide released!
- 07 June 2024: Code released! (Initial release; further setup and cleaning in progress.)
- 06 June 2024: Checkpoints released!
- 05 June 2024: ArXiv Preprint released: https://arxiv.org/abs/2406.03344
- 22 April 2024: OpenReview Preprint released: https://openreview.net/forum?id=RZu0ZlQIUI

This repository contains the implementation of Audio-Mamba (AuM), a generic, self-attention-free and purely state space model designed for audio classification. It provides the necessary code for training and evaluating the model across various audio classification benchmarks. AuM builds on AST and ViM, and it uses Hugging Face's Accelerate library for efficient multi-GPU training.
Please run the following commands to set up the repository:
```bash
conda create -n aum python=3.10.13
conda activate aum
conda install nvidia/label/cuda-11.8.0::cuda-nvcc
conda install nvidia/label/cuda-11.8.0::cuda
```

For cuDNN, try:

```bash
conda install anaconda::cudnn
```

or, if that fails:

```bash
conda install -c conda-forge cudnn
```

Then install the Python dependencies:

```bash
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install causal_conv1d==1.1.3.post1 mamba_ssm==1.1.3.post1
```
To integrate the modifications for supporting bidirectional processing, copy the `mamba_ssm` folder to the `site-packages` directory of the Python installation within the Conda environment. This folder is directly borrowed from the ViM repository.

```bash
cp -rf vim-mamba_ssm/mamba_ssm $CONDA_PREFIX/lib/python3.10/site-packages
```
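After the copy step, a quick sanity check is to import the key packages from Python inside the activated environment. This is only an illustrative sketch, not part of the repository's tooling:

```python
# Sanity check for the setup above: confirm the CUDA build of PyTorch and the
# copied mamba_ssm fork are importable. Purely illustrative.
import torch
import mamba_ssm
import causal_conv1d

print("torch:", torch.__version__)                    # expected 2.1.1+cu118
print("CUDA available:", torch.cuda.is_available())
print("mamba_ssm loaded from:", mamba_ssm.__file__)   # should point into site-packages
```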
An example notebook for inference is provided in the `examples/inference` directory. The notebook gives a minimal example of loading a trained model and performing inference on a sample audio file.
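For orientation, a rough sketch of such a pipeline is shown below. The torchaudio feature-extraction calls are real, but the model class name, checkpoint filename, and number of classes are placeholders rather than the repository's actual API; refer to the notebook for the real entry points.

```python
# Hedged inference sketch; model-related lines are placeholders and kept commented out.
import torch
import torchaudio

waveform, sr = torchaudio.load("sample.wav")   # placeholder filename
waveform = waveform[:1]                        # keep a single channel
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    sample_frequency=sr,
    num_mel_bins=128,                          # AST-style 128 mel bins (the full AST
    frame_shift=10,                            # preprocessing is more involved)
)                                              # shape: (num_frames, 128)

# model = AudioMamba(num_classes=527)          # hypothetical class; 527 = AudioSet labels
# state = torch.load("aum_checkpoint.pth", map_location="cpu")
# model.load_state_dict(state, strict=False)
# model.eval()
# with torch.no_grad():
#     logits = model(fbank.unsqueeze(0))       # add a batch dimension
```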
Each dataset folder within the `exps/` directory includes an example evaluation script for AuM (`aum_eval.sh`).
Each dataset's training scripts and relevant files are located in their respective folders under the `exps/` directory.
To execute the training scripts:

1. Change into the desired dataset's folder (e.g., `exps/vggsound/`).
2. Run the desired training script (e.g., `bash aum-base_scratch-vggsound.sh`).

Note: The scripts are prepared for execution but require modification of paths (such as experiment directories) to fit your specific setup.
For training on multiple GPUs:

1. Set the `CUDA_VISIBLE_DEVICES` environment variable to the GPUs you want to use (e.g., `CUDA_VISIBLE_DEVICES=0,1,2,...`).
2. Set the `batch_size` argument in the script to the desired batch size per GPU.

Note: To maintain the effective batch size from single-GPU training, divide that batch size by the number of GPUs, as illustrated below.
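A concrete illustration of that note (the numbers are made up for the example):

```python
# Keeping the effective batch size constant when moving to multiple GPUs.
single_gpu_batch_size = 32                 # illustrative single-GPU setting
num_gpus = 4                               # e.g., CUDA_VISIBLE_DEVICES=0,1,2,3
per_gpu_batch_size = single_gpu_batch_size // num_gpus
print(per_gpu_batch_size)                  # 8 -> pass this as the batch_size argument
```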
The EPIC-SOUNDS dataset has a distinct training structure:

- The `epic-sounds/` directory contains only training scripts.
- The `config_default.yaml` file, located in the `src/epic_sounds/epic_data/` directory, includes paths to the dataset folder, training splits, and other relevant default settings; please modify these variables according to your setup (see the inspection sketch below).
- Inside the `run.py` file, some of the variables from this config file are overridden by the command-line arguments.

For the full reference regarding this dataset, please refer to the EPIC-SOUNDS repository.
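Before editing, it can help to list what the default config actually contains. The sketch below simply loads and prints the file; it assumes PyYAML is available and makes no assumption about the specific keys:

```python
# Print the EPIC-SOUNDS defaults so the path-related entries are easy to locate.
import yaml

with open("src/epic_sounds/epic_data/config_default.yaml") as f:
    cfg = yaml.safe_load(f)

for key, value in cfg.items():             # top-level settings only
    print(f"{key}: {value}")
```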
The model checkpoints are available for the following experiments:
These are the checkpoints for the base models with the Fo-Bi (b) variant, trained from scratch.

| Dataset | #Params | Performance | Checkpoint |
|---|---|---|---|
| AudioSet (mAP) | 92.1M | 32.74 | Link |
| AS-20K (mAP) | 92.1M | 14.05 | Link |
| VGGSound (Acc) | 91.9M | 42.97 | Link |
| VoxCeleb (Acc) | 92.7M | 33.12 | Link |
| Speech Commands V2 (Acc) | 91.4M | 94.44 | Link |
| Epic Sounds (Acc) | 91.7M | 44.92 | Link |
These are the checkpoints for the small models with the Bi-Bi (c) variant, initialized with ImageNet pretrained weights.

| Dataset | #Params | Performance | Checkpoint |
|---|---|---|---|
| AudioSet (mAP) | 25.5M | 39.74 | Link |
| AS-20K (mAP) | 25.5M | 29.17 | Link |
| VGGSound (Acc) | 25.5M | 49.61 | Link |
| VoxCeleb (Acc) | 25.8M | 41.78 | Link |
| Speech Commands V2 (Acc) | 25.2M | 97.61 | Link |
| Epic Sounds (Acc) | 25.4M | 53.45 | Link |
These are the checkpoints for the base models with the Fo-Bi (b) variant, initialized with AudioSet pretrained weights.

| Dataset | #Params | Performance | Checkpoint |
|---|---|---|---|
| VGGSound (Acc) | 91.9M | 46.78 | Link |
| VoxCeleb (Acc) | 92.7M | 41.82 | Link |
| Speech Commands V2 (Acc) | 91.4M | 94.82 | Link |
| Epic Sounds (Acc) | 91.7M | 48.31 | Link |
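If you want to cross-check a downloaded checkpoint against the parameter counts in the tables above, a rough count over its state dict is usually enough. The filename and the possible `model` wrapper key below are assumptions about how the weights are stored:

```python
# Rough parameter count for a downloaded checkpoint (illustrative; adjust the path).
import torch

ckpt = torch.load("aum_checkpoint.pth", map_location="cpu")
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt   # unwrap if wrapped
num_params = sum(t.numel() for t in state.values() if torch.is_tensor(t))
print(f"{num_params / 1e6:.1f}M parameters")
```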
If you find this work useful, please consider citing us:
```
@article{erol2024audio,
  title={Audio Mamba: Bidirectional State Space Model for Audio Representation Learning},
  author={Erol, Mehmet Hamza and Senocak, Arda and Feng, Jiu and Chung, Joon Son},
  journal={arXiv preprint arXiv:2406.03344},
  year={2024}
}
```