Author: Irene Martin, Tampere University. Adapted from the original DCASE2020 Task 1 code by Toni Heittola, Tampere University.
pip install -r requirements.txt
python prepare_data.py
python create_h5.py --dataset_file='/TAUUrbanAcousticScenes_2022_Mobile_DevelopmentSet/meta.csv' --workspace='path' --data_type='dev'
python task1.py
or ./task1.py
To set up an Anaconda environment for the system, use the following:
conda create --name tf2-dcase python=3.6
conda activate tf2-dcase
conda install ipython
conda install numpy
conda install tensorflow-gpu=2.1.0
conda install -c anaconda cudatoolkit
conda install -c anaconda cudnn
pip install librosa
pip install absl-py==0.9.0
pip install sed_eval
pip install pyyaml==5.4
pip install dcase_util
pip install pandas
pip install pyparsing==2.2.0
This is the baseline system for the Low-Complexity Acoustic Scene Classification task in the Detection and Classification of Acoustic Scenes and Events 2022 (DCASE2022) challenge. The system is intended to provide a simple entry-level approach that gives reasonable results. The baseline system is built on the dcase_util toolbox (version >= 0.2.16).
Participants can build their own systems by extending the provided baseline system. The system is deliberately simple: it does not handle dataset download or feature extraction, but loads the data from an .h5 structure. The modular structure of the system enables participants to modify the system to their needs. The baseline system is a good starting point, especially for entry-level researchers, to familiarize themselves with the acoustic scene classification problem.
If participants plan to publish their code to the DCASE community after the challenge, building their approach on the baseline system can make their code more accessible to the community. DCASE organizers strongly encourage participants to share their code in any form after the challenge.
|
├── task1_features.yaml # Parameters for the prepare_data.py file
├── prepare_data.py # Code to extract features from 1-second files
└── create_h5.py # Code to create the features_all.h5 file
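As an illustration of what prepare_data.py computes, log mel-band energies for a 1-second segment can be extracted roughly as below. This is a minimal sketch: the sampling rate, FFT size, and hop length are assumptions chosen to reproduce the (40, 51) input shape of the baseline network; the actual values come from task1_features.yaml, and 'audio/example.wav' is a hypothetical file name.

```python
import numpy as np
import librosa

# Hedged sketch: 40 log mel-band energies for a 1-second clip.
# Parameter values are assumptions, not the task1_features.yaml settings.
y, sr = librosa.load('audio/example.wav', sr=44100)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=882, n_mels=40)
log_mel = np.log(mel + 1e-10)  # small offset avoids log(0)
print(log_mel.shape)  # (40, 51): 40 mel bands x 51 frames per second
```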
TAU Urban Acoustic Scenes 2022 Mobile Development dataset is used as development dataset for this task.
This subtask is concerned with the basic problem of acoustic scene classification, in which it is required to classify a test audio recording into one of ten known acoustic scene classes. This task targets generalization properties of systems across a number of different devices, and will use audio data recorded and simulated with a variety of devices.
Recordings in the dataset were made with three devices (A, B and C) that captured audio simultaneously and 6 simulated devices (S1-S6). Each acoustic scene has 14400 segments recorded with device A (main device) and 1080 segments of parallel audio each recorded with devices B, C, and S1-S6. The dataset contains in total 64 hours of audio. For a more detailed description see DCASE Challenge task description.
The task targets low-complexity solutions for the classification problem in terms of model size and computational cost. The original recordings were captured with the main device (device A, 48 kHz / 24 bit / stereo), and the data covers the ten acoustic scene classes used in this task. For a more detailed description see the DCASE Challenge task description.
The computational complexity will be measured in terms of parameter count and MMACs (million multiply-accumulate operations).
See the DCASE Challenge task description for details on how to calculate the model size. Model size and MACs calculation for TFLite models is implemented using NeSsi.
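For intuition, the MACs of a convolutional layer can be estimated from its hyperparameters as K_h x K_w x C_in x C_out x H_out x W_out. The sketch below is not the official NeSsi implementation, only an illustration of the arithmetic; the kernel size (7x7) is inferred from the parameter counts in the network summary, and the results match the per-layer MACs in the operator execution schedule further down.

```python
# Hedged sketch of per-layer MACs estimation for convolutions
# (NeSsi computes the official numbers; this illustrates the arithmetic).

def conv2d_macs(kernel_h, kernel_w, c_in, c_out, out_h, out_w):
    """Multiply-accumulate operations of one Conv2D layer."""
    return kernel_h * kernel_w * c_in * c_out * out_h * out_w

# Baseline conv layers (7x7 kernels, shapes from the network summary):
print(conv2d_macs(7, 7, 1, 16, 40, 51))   # 1,599,360
print(conv2d_macs(7, 7, 16, 16, 40, 51))  # 25,589,760
print(conv2d_macs(7, 7, 16, 32, 8, 10))   # 2,007,040
```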
The task-specific baseline system is implemented in the file task1.py.
The system implements a convolutional neural network (CNN) based approach, where log mel-band energies are first extracted for each 1-second signal, and a network consisting of three CNN layers and one fully connected layer is trained to assign scene labels to the audio signals. The model size of the baseline, using TFLite quantization, is 46.51 KB, and the MACs count is 29.23 M.
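The architecture can be reconstructed from the summary below. The following Keras sketch reproduces the layer shapes and parameter counts; the kernel sizes, pool sizes, and dropout rates are inferred or assumed (7x7 kernels from the parameter counts, dropout 0.3 as a placeholder), so consult task1.py for the authoritative definition.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_baseline(n_classes=10, input_shape=(40, 51, 1)):
    """Hedged reconstruction of the baseline CNN (see task1.py)."""
    return models.Sequential([
        layers.Conv2D(16, (7, 7), padding='same', input_shape=input_shape),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Conv2D(16, (7, 7), padding='same'),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.MaxPooling2D(pool_size=(5, 5)),
        layers.Dropout(0.3),   # rate assumed, not taken from task1.py
        layers.Conv2D(32, (7, 7), padding='same'),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.MaxPooling2D(pool_size=(4, 10)),
        layers.Dropout(0.3),   # rate assumed
        layers.Flatten(),
        layers.Dense(100, activation='relu'),
        layers.Dropout(0.3),   # rate assumed
        layers.Dense(n_classes, activation='softmax'),
    ])

build_baseline().summary()  # parameter counts match the summary below
```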
Network summary
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (None, 40, 51, 16) 800
_________________________________________________________________
batch_normalization_1 (None, 40, 51, 16) 64
_________________________________________________________________
activation_1 (None, 40, 51, 16) 0
_________________________________________________________________
conv2d_2 (None, 40, 51, 16) 12560
_________________________________________________________________
batch_normalization_2 (None, 40, 51, 16) 64
_________________________________________________________________
activation_2 (None, 40, 51, 16) 0
_________________________________________________________________
max_pooling2d_1 (None, 8, 10, 16) 0
_________________________________________________________________
dropout_1 (None, 8, 10, 16) 0
_________________________________________________________________
conv2d_3 (None, 8, 10, 32) 25120
_________________________________________________________________
batch_normalization_3 (None, 8, 10, 32) 128
_________________________________________________________________
activation_3 (None, 8, 10, 32) 0
_________________________________________________________________
max_pooling2d_2 (None, 2, 1, 32) 0
_________________________________________________________________
dropout_2 (None, 2, 1, 32) 0
_________________________________________________________________
flatten_1 (None, 64) 0
_________________________________________________________________
dense_1 (None, 100) 6500
_________________________________________________________________
dropout_3 (None, 100) 0
_________________________________________________________________
dense_2 (None, 10) 1010
=================================================================
Input shape : (None, 40, 51, 1)
Output shape : (None, 10)
The cross-validation setup provided with the TAU Urban Acoustic Scenes 2022 Mobile Development dataset is used to evaluate the performance of the baseline system. Results were calculated using TensorFlow in GPU mode (on an NVIDIA Tesla V100 GPU card). Because results produced with a GPU card are generally non-deterministic, the system was trained and tested 10 times; the mean and standard deviation of the performance from these 10 independent trials are shown in the results table. For each scene, the overall log loss is followed by the device-wise log loss (devices A, B, C, S1-S6) and the class-wise accuracy.
Scene label | Log Loss | A | B | C | S1 | S2 | S3 | S4 | S5 | S6 | Accuracy |
---|---|---|---|---|---|---|---|---|---|---|---|
Airport | 1.534 | 1.165 | 1.439 | 1.475 | 1.796 | 1.653 | 1.355 | 1.608 | 1.734 | 1.577 | 39.4% |
Bus | 1.758 | 1.073 | 1.842 | 1.206 | 1.790 | 1.580 | 1.681 | 2.202 | 2.152 | 2.293 | 29.3% |
Metro | 1.382 | 0.898 | 1.298 | 1.183 | 2.008 | 1.459 | 1.288 | 1.356 | 1.777 | 1.166 | 47.9% |
Metro station | 1.672 | 1.582 | 1.641 | 1.833 | 2.010 | 1.857 | 1.613 | 1.643 | 1.627 | 1.247 | 36.0% |
Park | 1.448 | 0.572 | 0.513 | 0.725 | 1.615 | 1.130 | 1.678 | 2.314 | 1.875 | 2.613 | 58.9% |
Public square | 2.265 | 1.442 | 1.862 | 1.998 | 2.230 | 2.133 | 2.157 | 2.412 | 2.831 | 3.318 | 20.8% |
Shopping mall | 1.385 | 1.293 | 1.291 | 1.354 | 1.493 | 1.292 | 1.424 | 1.572 | 1.245 | 1.497 | 51.4% |
Pedestrian street | 1.822 | 1.263 | 1.731 | 1.772 | 1.540 | 1.805 | 1.869 | 2.266 | 1.950 | 2.205 | 30.1% |
Traffic street | 1.025 | 0.830 | 1.336 | 1.023 | 0.708 | 1.098 | 1.147 | 0.957 | 0.634 | 1.489 | 70.6% |
Tram | 1.462 | 0.973 | 1.434 | 1.169 | 1.017 | 1.579 | 1.098 | 1.805 | 2.176 | 1.903 | 44.6% |
------------- | -------- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ------- |
Average | 1.575 (+/-0.018) | 1.109 | 1.439 | 1.374 | 1.621 | 1.559 | 1.531 | 1.813 | 1.800 | 1.931 | 42.9% (+/-0.770) |
Note: The reported system performance is not exactly reproducible due to varying setups; however, you should be able to obtain very similar results.
TFLite acoustic model
Tensor information (weights excluded, grouped by layer type):
Id | Tensor | Shape | Size in RAM (B) |
---|---|---|---|
0 | Identity_int8 | (1, 10) | 10 |
1 | conv2d_input_int8 | (1, 40, 51, 1) | 2,040 |
2 | sequential/activation/Relu | (1, 40, 51, 16) | 32,640 |
3 | sequential/activation_1/Relu | (1, 40, 51, 16) | 32,640 |
4 | sequential/activation_2/Relu | (1, 8, 10, 32) | 2,560 |
13 | sequential/dense/Relu | (1, 100) | 100 |
14 | sequential/dense_1/BiasAdd | (1, 10) | 10 |
17 | sequential/max_pooling2d/MaxPool | (1, 8, 10, 16) | 1,280 |
18 | sequential/max_pooling2d_1/MaxPool | (1, 2, 1, 32) | 64 |
19 | conv2d_input | (1, 40, 51, 1) | 8,160 |
20 | Identity | (1, 10) | 40 |
Operator execution schedule:
Operator (output name) | Tensors in memory (IDs) | Memory use (B) | MACs | Weight size (B) |
---|---|---|---|---|
conv2d_input_int8 | [1, 19] | 10,200 | 0 | 0 |
sequential/activation/Relu | [1, 2] | 34,680 | 1,599,360 | 848 |
sequential/activation_1/Relu | [2, 3] | 65,280 | 25,589,760 | 12,608 |
sequential/max_pooling2d/MaxPool | [3, 17] | 33,920 | 32,000 | 0 |
sequential/activation_2/Relu | [4, 17] | 3,840 | 2,007,040 | 25,216 |
sequential/max_pooling2d_1/MaxPool | [4, 18] | 2,624 | 2,560 | 0 |
sequential/dense/Relu | [13, 18] | 164 | 3,200 | 6,800 |
sequential/dense_1/BiasAdd | [13, 14] | 110 | 1,000 | 1,040 |
Identity_int8 | [0, 14] | 20 | 0 | 0 |
Identity | [0, 20] | 50 | 0 | 0 |
Total MACs: 29,234,920
Total weight size: 46,512 B
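A quantized model with this structure is typically produced with TFLite post-training quantization. The sketch below shows a plausible conversion flow, assuming an already trained Keras model `keras_model`; the exact converter options used in task1.py may differ.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Calibration samples shaped like the model input (1, 40, 51, 1).
    # Random placeholders here; in practice, patches from training data.
    for _ in range(100):
        yield [np.random.rand(1, 40, 51, 1).astype(np.float32)]

# 'keras_model' is assumed to be the trained baseline network.
converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()

with open('model_task1.tflite', 'wb') as f:
    f.write(tflite_model)
```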
Energy consumption for one run on an NVIDIA V100-PCIE-16GB, for a training phase and an inference phase on the development set. Calculated using CodeCarbon.
| Training (kWh) | Dev-test (kWh) |
---|---|---|
Baseline | 0.210 | 0.068 |
For the task the following files are provided:

task1.py, the DCASE2022 baseline for Task 1, with TFLite model quantization.

In order to account for potential hardware differences, the participants have to report the energy consumption measured while loading the baseline code, loading the eval files, and getting predictions (on their hardware). Therefore, we provide a .tflite model trained with all the development data, model_task1.tflite.

task1_inference.py loads the given model and runs inference on the eval data.
from codecarbon import EmissionsTracker

# Create the CodeCarbon tracker instance and start the count;
# path_codecarbon is the output directory for the CodeCarbon logs.
tracker_test_eval = EmissionsTracker("DCASE Task 1 EVAL", output_dir=path_codecarbon)
tracker_test_eval.start()
...
# your code (e.g. load model, run inference)
...
tracker_test_eval.stop()  # Stop the counter
tracker_test_eval._total_energy.kWh  # Get the measured energy value in kWh
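Inside the tracked region, loading the provided model and predicting with the TFLite interpreter could look roughly like this; a minimal sketch, with feature loading omitted and shapes taken from the model itself (the actual pipeline in task1_inference.py may differ):

```python
import numpy as np
import tensorflow as tf

# Load the provided quantized model and run one forward pass.
interpreter = tf.lite.Interpreter(model_path='model_task1.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Placeholder input shaped as the model expects, (1, 40, 51, 1);
# in task1_inference.py the features come from the eval .h5 file.
features = np.zeros(input_details[0]['shape'], dtype=input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], features)
interpreter.invoke()
probabilities = interpreter.get_tensor(output_details[0]['index'])
predicted_class = int(np.argmax(probabilities))
```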
The evaluation data has to be downloaded beforehand and the feature extraction step has to be performed:

python create_h5.py --dataset_file='/TAUUrbanAcousticScenes_2022_Mobile_EvaluationSet/meta.csv' --workspace='path' --data_type='eval'

It takes about 2 h 40 min to go through all the evaluation files.
The code is built on the dcase_util toolbox; see its manual for tutorials. The machine learning part of the code is built on TensorFlow (v2.1.0).
.
├── task1.py # Baseline system for Task 1
├── task1.yaml # Configuration file for task1.py
|
├── utils.py # Common functions shared between tasks
├── config.py # Feature parameters and data path
├── TAUUrbanAcousticScenes_2022_Mobile_DevelopmentSet.py # File for the dataset
├── prepare_data.py # File to perform feature extraction
├── task1_features.yaml # Configuration file for prepare_data.py
├── create_h5.py # File to create .h5 with the extracted features
|
├── task1_inference.py # File to load baseline-trained model
├── model_task1.tflite # baseline-trained model
|
|
├── README.md # This file
└── requirements.txt # External module dependencies
This software is released under the terms of the MIT License.