PAM: Prompting Audio-Language Models for Audio Quality Assessment

[Paper] [data]

PAM is a no-reference metric for assessing audio quality for different audio processing tasks. It prompts Audio-Language Models (ALMs) using an antonym prompt strategy to calculate an audio quality score. It does not require reference data or task-specific models and correlates well with human perception.
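
To illustrate the antonym prompt idea, here is a minimal sketch. It assumes the ALM returns one similarity between the audio embedding and each of two opposing text prompts (one describing clean audio, one describing degraded audio), and it turns that pair into a score between 0 and 1 with a softmax. The function name, the `scale` parameter, and the example similarity values are hypothetical; the prompts and model calls used in this repository may differ.

```python
import math


def antonym_prompt_score(sim_good: float, sim_bad: float, scale: float = 1.0) -> float:
    """Return the softmax probability of the 'good quality' prompt.

    sim_good / sim_bad are assumed to be audio-text similarities for the
    two antonym prompts; scale is a hypothetical temperature-like factor.
    """
    logits = [scale * sim_good, scale * sim_bad]
    # Subtract the maximum logit for numerical stability.
    m = max(logits)
    exp_good = math.exp(logits[0] - m)
    exp_bad = math.exp(logits[1] - m)
    return exp_good / (exp_good + exp_bad)


# Hypothetical similarities, for illustration only.
score = antonym_prompt_score(0.3, 0.1, scale=10.0)
print(f"quality score: {score:.3f}")
```

A score above 0.5 means the audio is closer to the "good quality" prompt than to its antonym; equal similarities give exactly 0.5.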

News

[Jul 24] Improved human correlation across tasks [commit]
[Mar 24] PAM is accepted at INTERSPEECH 2024

Setup

Open the Anaconda terminal and run:

> git clone https://github.com/soham97/PAM.git
> cd PAM 
> conda create -n pam python=3.10
> conda activate pam
> pip install -r requirements.txt

Compute PAM

Folder evaluation

To compute PAM on a folder containing audio files, run:

> python run.py --folder {folder_path}

The symbol {..} indicates user input.

Custom evaluation

To compute PAM on a hierarchy of folders or on multiple directories, we recommend creating a custom dataset:

- In dataset.py, create a custom dataset by inheriting from AudioDataset, similar to ExampleDataset
- Modify the get_filelist function to fit your directory structure
- Update run.py to use your custom dataset and make changes to the evaluation if needed
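
As a rough sketch of the file-listing step, the helper below walks a folder hierarchy recursively and collects audio files. It is written as a standalone function for illustration; in the repository, get_filelist belongs to the dataset class, and its actual signature and return type may differ. The `AUDIO_EXTENSIONS` set is an assumption.

```python
from pathlib import Path

# Assumed set of audio extensions; adjust to match your data.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3"}


def get_filelist(root, extensions=AUDIO_EXTENSIONS):
    """Recursively collect audio file paths under root.

    Paths are returned as strings, sorted so the order is reproducible.
    """
    root = Path(root)
    return sorted(
        str(p)
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
```

Calling `get_filelist("{folder_path}")` would then return every matching file in all subdirectories, which can be handed to a custom dataset.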

Data

The manuscript uses data from multiple sources, which can be obtained as follows:

- For text-to-audio and text-to-music generation, we conducted the human listening test using Amazon Mechanical Turk. The audio generated by the models and the human listening scores are available at: Zenodo
- For text-to-music generation with FAD comparison (Figure 6), we used the data and human listening scores from Adapting Frechet Audio Distance for Generative Music Evaluation (ICASSP 24). The website is here
- For text-to-speech generation, we used the data and human listening scores from Evaluating speech synthesis by training recognizers on synthetic speech (2023)
- For distortions (Figure 4), we sourced the data from NISQA. The data with human listening scores can be downloaded from the GitHub repo: here.
- For voice conversion, we used the voice conversion subset of the VoiceMOS Challenge data. The data can be downloaded at: Zenodo

Paper reproduction

This section covers reproducing the numbers for text-to-audio and text-to-music. First, download the human listening test data by following the instructions in the Data section above. The download should contain a folder titled human_eval.

Then run the following command:

> python pcc.py --folder {folder_path}

where {folder_path} points to the human_eval folder.
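
For orientation, the sketch below shows how a Pearson correlation coefficient between metric scores and human ratings can be computed. It assumes that "pcc" in pcc.py stands for Pearson correlation coefficient, which is inferred from the script name only; it is not the repository's implementation, and the function name is hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centered products and squares (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold the metric scores and `ys` the corresponding human listening scores, one pair per audio file or per system.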

Citation

@article{deshmukh2024pam,
  title={PAM: Prompting Audio-Language Models for Audio Quality Assessment},
  author={Soham Deshmukh and Dareen Alharthi and Benjamin Elizalde and Hannes Gamper and Mahmoud Al Ismail and Rita Singh and Bhiksha Raj and Huaming Wang},
  journal={arXiv preprint arXiv:2402.00282},
  year={2024}
}
