kaistmm / Audio-Mamba-AuM

Official Implementation of the work "Audio Mamba: Bidirectional State Space Model for Audio Representation Learning"
audio audio-classification audio-mamba deep-learning mamba pytorch representation-learning speaker-identification speech-classification state-space-model
# Audio-Mamba (AuM)
## Bidirectional State Space Model for Audio Representation Learning
ArXiv Preprint: [https://arxiv.org/abs/2406.03344](https://arxiv.org/abs/2406.03344)

Overview

This repository contains the implementation of Audio-Mamba (AuM), a generic, self-attention-free model for audio classification built purely on state space models (SSMs). It provides the code needed to train and evaluate the model on various audio classification benchmarks. AuM builds on AST and ViM, and uses Hugging Face's Accelerate library for efficient multi-GPU training.
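
To make "bidirectional" concrete, here is a toy sketch in plain Python. It is not Mamba's selective scan and not this repository's implementation: it assumes a scalar, time-invariant recurrence `h_t = a*h_{t-1} + b*x_t`, `y_t = c*h_t`. The function names `scan` and `bidirectional_scan`, and the choice to sum the two directions, are illustrative assumptions.

```python
from typing import List, Sequence


def scan(x: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Run the toy recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t left to right."""
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t
        y.append(c * h)
    return y


def bidirectional_scan(x: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Combine a left-to-right scan with a right-to-left scan (assumption: by summing)."""
    forward = scan(x, a, b, c)
    # Scan the reversed sequence, then flip the result back to the original order.
    backward = scan(list(reversed(x)), a, b, c)[::-1]
    return [f + r for f, r in zip(forward, backward)]


x = [0.0, 0.0, 1.0]  # a single impulse at the last position
print(scan(x, 0.5, 1.0, 1.0))                # [0.0, 0.0, 1.0]
print(bidirectional_scan(x, 0.5, 1.0, 1.0))  # [0.25, 0.5, 2.0]
```

With the forward scan alone, the first two positions never see the impulse at the end of the sequence. The backward pass lets every position receive information from later positions as well.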

Pipeline

Setting Up the Repository

Please run the following commands to set up the repository:

Create a Conda Environment

conda create -n aum python=3.10.13
conda activate aum

Setting Up CUDA and CuDNN

conda install nvidia/label/cuda-11.8.0::cuda-nvcc
conda install nvidia/label/cuda-11.8.0::cuda

First, try installing cuDNN from the anaconda channel:
conda install anaconda::cudnn
If that fails, install it from conda-forge instead:
conda install -c conda-forge cudnn

Installing PyTorch and Other Dependencies

pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

Installing Mamba Related Packages

pip install causal_conv1d==1.1.3.post1 mamba_ssm==1.1.3.post1
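
After installation, it can be useful to confirm that the environment matches the pinned versions. The sketch below uses only the standard library (`importlib.metadata`) and does not import the packages themselves. The helper names (`installed_version`, `matches_pin`, `check_pins`) are hypothetical, and the pins are copied from the pip commands above. Wheels from the PyTorch index typically report a local suffix such as `+cu118`, so the comparison ignores anything after `+`.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Pins copied from the pip commands above.
PINNED = {
    "torch": "2.1.1",
    "torchvision": "0.16.1",
    "torchaudio": "2.1.1",
    "causal_conv1d": "1.1.3.post1",
    "mamba_ssm": "1.1.3.post1",
}


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed: Optional[str], pinned: str) -> bool:
    """Compare versions, ignoring a local suffix such as '+cu118'."""
    if installed is None:
        return False
    return installed.split("+", 1)[0] == pinned


def check_pins(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str], bool]]:
    """Map each distribution name to (pinned, installed, matches)."""
    report = {}
    for name, pinned in pins.items():
        found = installed_version(name)
        report[name] = (pinned, found, matches_pin(found, pinned))
    return report


if __name__ == "__main__":
    for name, (pinned, found, ok) in check_pins(PINNED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: pinned {pinned}, installed {found} [{status}]")
```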

Enabling Bidirectional SSM Processing

To add support for bidirectional processing, copy the modified mamba_ssm folder (provided under vim-mamba_ssm/) into the site-packages directory of the Conda environment's Python installation, overwriting the pip-installed mamba_ssm package. This folder is borrowed directly from the ViM repository.

cp -rf vim-mamba_ssm/mamba_ssm $CONDA_PREFIX/lib/python3.10/site-packages
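
The command above hardcodes `python3.10` in the destination path. As a hedged sketch, the snippet below prints the site-packages directory of the running interpreter and the location from which `mamba_ssm` would be imported, so you can confirm the copy landed in the right place. It uses only the standard library; the helper names `site_packages_dir` and `package_location` are hypothetical.

```python
import importlib.util
import sysconfig
from pathlib import Path
from typing import Optional


def site_packages_dir() -> Path:
    """Return the pure-Python site-packages directory of the running interpreter."""
    return Path(sysconfig.get_paths()["purelib"])


def package_location(name: str) -> Optional[Path]:
    """Return the directory a top-level package would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


if __name__ == "__main__":
    print("site-packages:", site_packages_dir())
    # Expected to sit inside the directory printed above once the copy is done.
    print("mamba_ssm resolves to:", package_location("mamba_ssm"))
```

Run it with the Conda environment activated so that the interpreter being inspected is the one the training scripts will use.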

Inference

Example Inference

An example notebook for inference is provided in the examples/inference directory. The notebook demonstrates a minimal example of how to load a trained model and perform inference on a sample audio file.

Evaluation Scripts

Each dataset folder within the exps/ directory includes an example evaluation script for AuM (aum_eval.sh).

Training

Overview

Each dataset's training scripts and relevant files are located within their respective folders under the exps/ directory. These folders include:

Executing Training Scripts

To execute the training scripts:

  1. Navigate to the dataset's directory (e.g., exps/vggsound/).
  2. Run the corresponding script (e.g., bash aum-base_scratch-vggsound.sh).

Note: The scripts are prepared for execution but require modification of paths (such as experiment directories) to fit your specific setup.

Multiple GPU Training

For training on multiple GPUs:

  1. Set GPU IDs: List the GPU IDs in the CUDA_VISIBLE_DEVICES environment variable (e.g., CUDA_VISIBLE_DEVICES=0,1,2,...).
  2. Adjust Batch Size: Set the batch_size argument in the script to the desired batch size per GPU.

Note: To keep the effective (total) batch size the same as in single-GPU training, divide the single-GPU batch size by the number of GPUs and use the result as the per-GPU batch_size.
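
The arithmetic in the note can be sketched as follows. The helper names and the numbers (a single-GPU batch size of 48 on 4 GPUs) are illustrative assumptions, not values taken from the training scripts.

```python
from typing import List


def visible_gpu_ids(cuda_visible_devices: str) -> List[str]:
    """Split a CUDA_VISIBLE_DEVICES value such as '0,1,2,3' into individual IDs."""
    return [tok.strip() for tok in cuda_visible_devices.split(",") if tok.strip()]


def per_gpu_batch_size(single_gpu_batch_size: int, num_gpus: int) -> int:
    """Return the per-GPU batch size that keeps the effective batch size unchanged."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if single_gpu_batch_size % num_gpus != 0:
        raise ValueError("batch size is not evenly divisible by the number of GPUs")
    return single_gpu_batch_size // num_gpus


gpu_ids = visible_gpu_ids("0,1,2,3")          # hypothetical CUDA_VISIBLE_DEVICES value
print(len(gpu_ids))                           # 4
print(per_gpu_batch_size(48, len(gpu_ids)))   # 12
```

Here 4 GPUs each processing 12 samples per step give the same effective batch size of 48 as the single-GPU run.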

EPIC-SOUNDS Dataset

The EPIC-SOUNDS dataset has a distinct training structure:

For the full reference regarding this dataset, please refer to the EPIC-SOUNDS repository.

Model Checkpoints

The model checkpoints are available for the following experiments:

Base Scratch

These are the checkpoints for the base models with the variant Fo-Bi (b), trained from scratch.

| Dataset | #Params | Performance | Checkpoint |
|---|---|---|---|
| Audioset (mAP) | 92.1M | 32.74 | Link |
| AS-20K (mAP) | 92.1M | 14.05 | Link |
| VGGSound (Acc) | 91.9M | 42.97 | Link |
| VoxCeleb (Acc) | 92.7M | 33.12 | Link |
| Speech Commands V2 (Acc) | 91.4M | 94.44 | Link |
| Epic Sounds (Acc) | 91.7M | 44.92 | Link |

Small ImageNet

These are the checkpoints for the small models with the variant Bi-Bi (c), initialized with ImageNet pretrained weights.

| Dataset | #Params | Performance | Checkpoint |
|---|---|---|---|
| Audioset (mAP) | 25.5M | 39.74 | Link |
| AS-20K (mAP) | 25.5M | 29.17 | Link |
| VGGSound (Acc) | 25.5M | 49.61 | Link |
| VoxCeleb (Acc) | 25.8M | 41.78 | Link |
| Speech Commands V2 (Acc) | 25.2M | 97.61 | Link |
| Epic Sounds (Acc) | 25.4M | 53.45 | Link |

Base AudioSet

These are the checkpoints for the base models with the variant Fo-Bi (b), initialized with AudioSet pretrained weights.

| Dataset | #Params | Performance | Checkpoint |
|---|---|---|---|
| VGGSound (Acc) | 91.9M | 46.78 | Link |
| VoxCeleb (Acc) | 92.7M | 41.82 | Link |
| Speech Commands V2 (Acc) | 91.4M | 94.82 | Link |
| Epic Sounds (Acc) | 91.7M | 48.31 | Link |

Citation

If you find this work useful, please consider citing us:

@article{erol2024audio,
  title={Audio Mamba: Bidirectional State Space Model for Audio Representation Learning},
  author={Erol, Mehmet Hamza and Senocak, Arda and Feng, Jiu and Chung, Joon Son},
  journal={arXiv preprint arXiv:2406.03344},
  year={2024}
}