Important update
8th September 2020: the code associated with the papers ss_sed_paper and sed_paper is available in the branch papers_code.
Use the scripts to download the recorded data and the DESED_synth_dcase20_train_jams.tar archive from DESED_synthetic, and comment out the reverb step since we do not use it for the baseline.

Requirements: Python >= 3.6, pytorch >= 1.0, cudatoolkit >= 9.0, pandas >= 0.24.1, scipy >= 1.2.1, pysoundfile >= 0.10.2, scaper >= 1.3.5, librosa >= 0.6.3, youtube-dl >= 2019.4.30, tqdm >= 4.31.1, ffmpeg >= 4.1, dcase_util >= 0.2.5, sed-eval >= 0.2.1, psds-eval >= 0.1.0, desed >= 1.4.0
A simplified installation procedure example is provided below for a Python 3.6 based Anaconda distribution on a Linux system: conda_create_environment.sh (recommended to run line by line).

If you are on this repo and searching for the code associated with:
Please go to the branch papers_code.
Please check the submission_page.
The evaluation data is on this eval_zenodo_repo.
Before submitting, please check your task 4 submission folder with the dedicated script:
python validate_submissions.py -i <path to task 4 submission folder>
This year, a sound separation model is used: see the sound-separation folder, which is the fuss_repo integrated as a git subtree.
More info in Original FUSS model repo.
More info in the baseline folder.
The baseline combining SS and SED uses late integration.
The sound separation baseline has been trained using 3 sources, so it returns:
In our case, we use only the output of the second source.
To get the predictions of the combination of SED and SS, we proceed as follows:
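As an illustration of such a late integration, here is a minimal sketch assuming the SED model is run on both the original mixture and the separated second source, and the frame-level class posteriors are simply averaged before thresholding (the function name, weight and threshold below are illustrative, not the exact baseline settings):

```python
import numpy as np

def late_integration(probs_mixture, probs_source2, weight=0.5, threshold=0.5):
    """Combine frame-level SED posteriors from the mixture and from the second
    separated source by weighted averaging, then threshold to binary decisions.

    Both inputs are arrays of shape (n_frames, n_classes).
    """
    combined = weight * probs_mixture + (1.0 - weight) * probs_source2
    return combined >= threshold

# Example with random posteriors for 10 frames and 10 classes.
decisions = late_integration(np.random.rand(10, 10), np.random.rand(10, 10))
print(decisions.shape)
```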
System performance is reported in terms of the event-based F-score [1] with a 200 ms collar on onsets and a 200 ms / 20% of the event length collar on offsets.
Additionally, the PSDS [2] performance is reported.
F-scores are computed using a single operating point (threshold = 0.5), while the PSDS values are computed using 50 operating points (thresholds linearly spaced from 0.01 to 0.99).
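For reference, the event-based F-score with these collars can be computed with the sed-eval package roughly as sketched below (the event dictionaries follow dcase_util's MetaDataContainer keys, which is an assumption to check against your sed-eval version):

```python
import dcase_util
import sed_eval

# Reference and estimated event lists for one file (toy values).
reference = dcase_util.containers.MetaDataContainer([
    {"filename": "YOTsn73eqbfc_10.000_20.000.wav", "event_label": "Alarm_bell_ringing",
     "onset": 0.163, "offset": 0.665},
])
estimated = dcase_util.containers.MetaDataContainer([
    {"filename": "YOTsn73eqbfc_10.000_20.000.wav", "event_label": "Alarm_bell_ringing",
     "onset": 0.20, "offset": 0.70},
])

# 200 ms collar on onsets, 200 ms / 20% of the event length collar on offsets.
event_based_metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=reference.unique_event_labels,
    t_collar=0.200,
    percentage_of_length=0.20,
)
event_based_metrics.evaluate(reference_event_list=reference, estimated_event_list=estimated)

# Macro (class-wise average) F-score.
print(event_based_metrics.results_class_wise_average_metrics()["f_measure"]["f_measure"])
```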
SED baseline:

| | Macro F-score Event-based | PSDS macro F-score | PSDS | PSDS cross-trigger | PSDS macro |
|---|---|---|---|---|---|
| Validation | 34.8 % | 60.0 % | 0.610 | 0.524 | 0.433 |
SED + SS baseline:

| | Macro F-score Event-based | PSDS macro F-score | PSDS | PSDS cross-trigger | PSDS macro |
|---|---|---|---|---|---|
| Validation | 35.6 % | 60.5 % | 0.626 | 0.546 | 0.449 |
Validation ROC curves: the PSDS ROC curve, PSDS cross-trigger curve and PSDS macro curve are shown for the SED baseline and the SED + SS baseline.
Please refer to the PSDS paper [2] for more information. The parameters used for the PSDS computation are:
The differences between the 3 reported PSDS values are:
| | alpha_ct | alpha_st |
|---|---|---|
| PSDS | 0 | 0 |
| PSDS cross-trigger | 1 | 0 |
| PSDS macro | 0 | 1 |
alpha_ct is the cost of cross-trigger, alpha_st is the cost of instability across classes.
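As a sketch of how these three values can be obtained with the psds-eval package (the detection-tolerance thresholds and the toy ground-truth/detection tables below are illustrative placeholders, not necessarily the baseline settings):

```python
import pandas as pd
from psds_eval import PSDSEval

# Ground truth and detections as DataFrames with columns
# "filename", "onset", "offset", "event_label"; metadata lists each file's duration.
ground_truth = pd.DataFrame({
    "filename": ["YOTsn73eqbfc_10.000_20.000.wav"],
    "onset": [0.163], "offset": [0.665], "event_label": ["Alarm_bell_ringing"],
})
metadata = pd.DataFrame({"filename": ["YOTsn73eqbfc_10.000_20.000.wav"], "duration": [10.0]})

psds_eval = PSDSEval(
    dtc_threshold=0.5, gtc_threshold=0.5, cttc_threshold=0.3,  # illustrative thresholds
    ground_truth=ground_truth, metadata=metadata,
)

# Add one detection table per operating point (e.g. thresholds from 0.01 to 0.99).
detections = pd.DataFrame({
    "filename": ["YOTsn73eqbfc_10.000_20.000.wav"],
    "onset": [0.2], "offset": [0.7], "event_label": ["Alarm_bell_ringing"],
})
psds_eval.add_operating_point(detections)

print("PSDS              :", psds_eval.psds(alpha_ct=0, alpha_st=0, max_efpr=100).value)
print("PSDS cross-trigger:", psds_eval.psds(alpha_ct=1, alpha_st=0, max_efpr=100).value)
print("PSDS macro        :", psds_eval.psds(alpha_ct=0, alpha_st=1, max_efpr=100).value)
```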
See baseline folder.
All the scripts to get the data (soundbank, generated, separated) are in the scripts folder, and they use python files from the data_generation folder.
In the scripts/ folder, you can find the different steps to:
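For instance, re-generating the synthetic soundscapes from the JAMS annotations distributed in DESED_synth_dcase20_train_jams.tar can be done with scaper roughly as follows (the folder paths are illustrative; the actual procedure is implemented by the scripts mentioned above):

```python
import glob
import os
import scaper

# Illustrative paths: adjust to where the soundbank and JAMS archives were extracted.
fg_path = "synthetic/audio/train/soundbank/foreground"
bg_path = "synthetic/audio/train/soundbank/background"
out_dir = "synthetic/audio/train/synthetic20/soundscapes"
os.makedirs(out_dir, exist_ok=True)

for jams_file in glob.glob("DESED_synth_dcase20_train_jams/*.jams"):
    audio_file = os.path.join(out_dir, os.path.splitext(os.path.basename(jams_file))[0] + ".wav")
    # Re-synthesize the soundscape described by the JAMS annotation.
    scaper.generate_from_jams(jams_file, audio_file, fg_path=fg_path, bg_path=bg_path)
```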
It is likely that you'll have download issues with the real recordings. At the end of the download, please send a mail with the TSV files created in the missing_files directory to Nicolas Turpault and Romain Serizel.
However, if none of the audio files have been downloaded, it is probably due to an internet or proxy problem. See the Desed repo or the Desed_website for more info.
The sound separation code is in the sound-separation/ folder (the FUSS repo integrated here using a git subtree).
The separation inference script is sound-separation/models/dcase2020_fuss_baseline/inference.py.
The dataset for sound event detection of DCASE2020 task 4 is composed of:
Note: the reverberated data (see scripts) are not used for the baseline.
The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab separated csv file (.tsv) under the following format:
[filename (string)][tab][event_labels (strings)]
For example:
Y-BJNMHMZDcU_50.000_60.000.wav Alarm_bell_ringing,Dog
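A minimal sketch of reading this file with pandas (the path is illustrative):

```python
import pandas as pd

# Read the weak annotations and split the comma-separated labels into a list per clip.
weak = pd.read_csv("metadata/train/weak.tsv", sep="\t")  # illustrative path
weak["event_labels"] = weak["event_labels"].str.split(",")
print(weak.head())
```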
Synthetic subset and validation set have strong annotations.
The minimum length for an event is 250 ms. The minimum duration of the pause between two events from the same class is 150 ms. When the silence between two consecutive events from the same class was less than 150 ms, the events have been merged into a single event. The strong annotations are provided in a tab separated csv file (.tsv) under the following format:
[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][event_label (strings)]
For example:
YOTsn73eqbfc_10.000_20.000.wav 0.163 0.665 Alarm_bell_ringing
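As an illustration of the 150 ms merging rule described above (a sketch, not the actual annotation tool):

```python
def merge_close_events(events, min_gap=0.150):
    """Merge consecutive same-class events separated by less than `min_gap` seconds.

    `events` is a list of (onset, offset, event_label) tuples for a single file.
    """
    merged = []
    for onset, offset, label in sorted(events):
        if merged and merged[-1][2] == label and onset - merged[-1][1] < min_gap:
            prev_onset, prev_offset, _ = merged[-1]
            merged[-1] = (prev_onset, max(prev_offset, offset), label)
        else:
            merged.append((onset, offset, label))
    return merged

# Two Alarm_bell_ringing events 100 ms apart become a single event.
print(merge_close_events([(0.163, 0.665, "Alarm_bell_ringing"),
                          (0.765, 1.200, "Alarm_bell_ringing")]))
```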
The free universal sound separation (FUSS) dataset [3] contains mixtures of arbitrary sources of different types for use in training sound separation models. Each 10-second mixture contains between 1 and 4 sounds.
The source clips for the mixtures are from a prerelease of FSD50k [4, 5], which is composed of Freesound content annotated with labels from the AudioSet Ontology. Using the FSD50k labels, the sound source files have been screened such that they likely only contain a single type of sound. Labels are not provided for these sound source files and are not considered part of the challenge, although they will become available when FSD50k is released.
Train:
Validation:
| Author | Affiliation |
|---|---|
| Nicolas Turpault | INRIA |
| Romain Serizel | University of Lorraine |
| Scott Wisdom | Google Research |
| John R. Hershey | Google Research |
| Hakan Erdogan | Google Research |
| Justin Salamon | Adobe Research |
| Dan Ellis | Google Research |
| Prem Seetharaman | Northwestern University |
If you have any problems, feel free to contact Nicolas (and Romain).
Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM International Conference on Multimedia, 411–412. ACM, 2013.