Author: Irene Martin, Tampere University.
pip install -r requirements.txt
python extract_features.py
python task4b.py (or ./task4b.py)
To set up an Anaconda environment for the system, use the following:
conda create --name dcase-t4b python=3.6
conda activate dcase-t4b
conda install numpy
conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
pip install torchinfo
pip install librosa
pip install pandas
pip install scikit-learn
pip install sed_eval
pip install dcase_util
pip install sed_scores_eval
This is the baseline system for subtask B of the Sound Event Detection task (Task 4) of the Detection and Classification of Acoustic Scenes and Events 2023 (DCASE2023) challenge. The system is intended to provide a simple entry-level approach that gives reasonable results. The baseline system is built on the dcase_util toolbox (version >= 0.2.16).
Participants can build their own systems by extending the provided baseline system. The system is deliberately simple: it does not handle dataset download, but simple feature extraction code is provided. The baseline system is a good starting point, especially for entry-level researchers, to become familiar with the soft-label scenario, in which labels are values between 0 and 1 rather than binary.
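A soft label can be turned into a conventional hard label by thresholding; a minimal sketch in plain Python (the 0.5 threshold is an illustrative choice, not a value prescribed by the task):

```python
# Convert per-frame soft labels (annotator agreement in [0, 1])
# into hard 0/1 decisions by thresholding.
def soft_to_hard(soft_labels, threshold=0.5):
    """Values >= threshold become active (1), the rest inactive (0)."""
    return [1 if v >= threshold else 0 for v in soft_labels]

# Example: per-frame soft activity of one event class.
soft = [0.0, 0.25, 0.5, 0.8, 1.0]
print(soft_to_hard(soft))  # [0, 0, 1, 1, 1]
```

Systems may instead predict the soft values directly and leave thresholding to evaluation, which is what the optimal-threshold metric below measures.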
If participants plan to publish their code to the DCASE community after the challenge, building their approach on top of the baseline system could make their code more accessible to the community. The DCASE organizers strongly encourage participants to share their code in any form after the challenge.
`extract_features.py` # Code to extract features from the development files
MAESTRO Real (Multi-Annotator Estimated Strong Labels) is used as the development dataset for this task.
This task is a subtopic of the Sound Event Detection Task 4, which provides three kinds of data for training: weakly labeled data (without timestamps), strongly labeled data (with timestamps), and unlabeled data. The target of the systems is to provide not only the event class but also the event time localization, given that multiple events can be present in an audio recording.
This subtask is concerned with another type of training data: soft labels.
The task-specific baseline system is implemented in the file model.py.
The system implements a convolutional recurrent neural network (CRNN) based approach, with three CNN layers and one bi-directional gated recurrent unit (GRU) layer. As input, the model uses mel-band energies extracted using a hop length of 200 ms and 64 mel filter banks.
Input shape: sequence_length * 64
Architecture:
Learning (epochs: 150, batch size: 32, data shuffling between epochs)
Model selection:
Network summary
```
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 1, 200, 64)]      0
conv2d                       (None, 128, 200, 64)      1280
batch_normalization          (None, 128, 200, 64)      256
max_pooling2d                (None, 128, 200, 12)      0
dropout                      (None, 128, 200, 12)      0
conv2d_1                     (None, 128, 200, 12)      147584
batch_normalization_1        (None, 128, 200, 12)      256
max_pooling2d_1              (None, 128, 200, 6)       0
dropout_1                    (None, 128, 200, 6)       0
conv2d_2                     (None, 128, 200, 6)       147584
batch_normalization_2        (None, 128, 200, 6)       256
max_pooling2d_2              (None, 128, 200, 3)       0
dropout_2                    (None, 128, 200, 3)       0
permute                      (None, 200, 128, 3)       0
reshape_1                    (None, 200, 384)          0
bidirectional                (None, 200, 64)           80256
Linear_1                     (None, 200, 32)           2080
Linear_2                     (None, 200, 17)           561
=================================================================
```
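The summary above can be reproduced with a PyTorch sketch along these lines (layer sizes follow the table; the ReLU activations, dropout rate of 0.3, and sigmoid output are assumptions here, and model.py remains the authoritative implementation):

```python
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    """Sketch of the baseline CRNN: three conv blocks (conv + batch norm +
    max pooling over frequency + dropout), one bidirectional GRU, and two
    linear layers. Activation choices are assumptions, not from model.py."""
    def __init__(self, n_classes=17):
        super().__init__()
        def block(in_ch, pool):
            return nn.Sequential(
                nn.Conv2d(in_ch, 128, kernel_size=3, padding=1),
                nn.BatchNorm2d(128),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=(1, pool)),  # pool frequency axis only
                nn.Dropout(0.3),
            )
        # Frequency axis: 64 -> 12 -> 6 -> 3, as in the summary table.
        self.cnn = nn.Sequential(block(1, 5), block(128, 2), block(128, 2))
        self.gru = nn.GRU(input_size=128 * 3, hidden_size=32,
                          bidirectional=True, batch_first=True)
        self.fc1 = nn.Linear(64, 32)
        self.fc2 = nn.Linear(32, n_classes)

    def forward(self, x):            # x: (batch, 1, time=200, mel=64)
        x = self.cnn(x)              # (batch, 128, 200, 3)
        x = x.permute(0, 2, 1, 3)    # (batch, 200, 128, 3)
        x = x.reshape(x.shape[0], x.shape[1], -1)  # (batch, 200, 384)
        x, _ = self.gru(x)           # (batch, 200, 64)
        x = torch.relu(self.fc1(x))
        return torch.sigmoid(self.fc2(x))  # per-frame soft activity per class

model = CRNNSketch()
out = model(torch.zeros(2, 1, 200, 64))
print(out.shape)  # torch.Size([2, 200, 17])
```

The parameter counts of this sketch match the table exactly (e.g. 80256 for the bidirectional GRU, 2080 and 561 for the linear layers).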
A cross-validation setup is used to evaluate the performance of the baseline system. Micro-averaged scores (ER_m, F1_m) and the macro-averaged score (F1_M) are calculated with the sed_eval toolbox using segment-based metrics with a 1-second segment length. The class-wise macro-averaged score with optimal threshold (F1_{th_op}) is calculated with sed_scores_eval, also segment-based with 1-second segments.
| | ER_m | F1_m | F1_M | F1_{th_op} |
|----------|--------------------|---------------------|---------------------|---------------------|
| Baseline | 0.487 (+/-0.009) | 70.34% (+/-0.766) | 35.83% (+/-0.660) | 42.87% (+/-0.840) |
Note: The reported system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results. The results in the table are obtained by training and testing the model 10 times; the mean and standard deviation of the performance over these 10 independent trials are shown.
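For reference, the micro-averaged segment-based scores reduce to simple count-based formulas; a plain-Python sketch over hypothetical per-segment counts (sed_eval performs this accumulation over the actual 1-second segments):

```python
# Segment-based micro-averaged metrics as defined in sed_eval:
# per segment, substitutions S = min(FN, FP), deletions D = FN - S,
# insertions I = FP - S; then ER = (sum S + sum D + sum I) / sum N and
# F1 = 2*TP / (2*TP + FP + FN), accumulated over all segments.
def micro_metrics(segments):
    """segments: list of (tp, fp, fn, n_ref) counts, one tuple per segment."""
    TP = FP = FN = S = D = I = N = 0
    for tp, fp, fn, n in segments:
        s = min(fp, fn)
        S += s
        D += fn - s
        I += fp - s
        TP += tp; FP += fp; FN += fn; N += n
    er = (S + D + I) / N
    f1 = 2 * TP / (2 * TP + FP + FN)
    return er, f1

# Toy example: three segments with (TP, FP, FN, active reference events).
er, f1 = micro_metrics([(3, 1, 0, 3), (2, 0, 1, 3), (4, 1, 1, 5)])
print(round(er, 3), round(f1, 3))  # 0.273 0.818
```

A lower ER and a higher F1 are better; note that ER can exceed 1.0 for systems with many insertions.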
To run the CRNN model:

- `extract_features.py`: first extract mel-band features and normalize the data
- `task4b.py`: the DCASE2023 baseline for Task 4B

The code is built on the dcase_util toolbox; see its manual for tutorials. The machine learning part of the code is built on PyTorch (v1.10.2).
```
.
├── task4b.py                 # Baseline system for subtask B
|
├── utils.py                  # Common functions shared between tasks
├── data_generator.py         # File for the dataset
├── extract_features.py       # Functions to extract mel-band features and normalize
├── config.py                 # Common parameters
├── evaluate.py               # Perform model evaluation, sed-eval segment-based
├── model.py                  # CRNN model implementation
|
├── development_folds         # Folder with the splits for 5-CV
|   - fold1_train.csv
|   - fold1_val.csv
|   - fold1_test.csv
|   - ...
├── metadata
|   - development_metadata.csv  # File duration information to calculate sed-scores-eval
|   - gt_dev.csv                # Ground truth labels (hard labels)
|
├── development_split.csv     # Lists all the files
├── README.md                 # This file
└── requirements.txt          # External module dependencies
```