osuossu8 opened 2 years ago
TLDR
model: ensemble of several CNNs
input:
augmentation
Code Pipeline and data setup
Binary classifier
Bird classifier
trained on 30-second random crops of the train_short data
To match the 5-second snippet format of the test data, we reshaped each 30-second crop into 6x 5-second parts before feeding them through the backbone
After the backbone we reshaped the data again, concatenating the respective time segments to restore the 30-second representation
then used simple pooling over the time and frequency dimensions
a simple one-layer head which gave us the 398 bird classes
naively used the union of primary and secondary labels as the target
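The 30-second-to-6x-5-second reshape trick above can be sketched roughly as follows. All names, shapes, and the toy backbone here are assumptions for illustration, not the authors' actual code:

```python
import torch
import torch.nn as nn

class SegmentWiseModel(nn.Module):
    """Sketch: feed a 30 s crop through the backbone as 6x 5 s parts, then
    concatenate the segment features back along time before pooling."""

    def __init__(self, backbone: nn.Module, num_classes: int = 398,
                 feat_dim: int = 8, n_segments: int = 6):
        super().__init__()
        self.backbone = backbone              # any CNN returning a (B, C', F', T') map
        self.n_segments = n_segments
        self.head = nn.Linear(feat_dim, num_classes)  # simple one-layer head

    def forward(self, spec):                  # spec: (B, 1, n_mels, T), T covers 30 s
        b, c, f, t = spec.shape
        seg_t = t // self.n_segments
        # split the 30 s crop into 6x 5 s parts, stacked into the batch dim
        x = spec[..., : seg_t * self.n_segments]
        x = x.reshape(b, c, f, self.n_segments, seg_t).permute(0, 3, 1, 2, 4)
        x = x.reshape(b * self.n_segments, c, f, seg_t)
        x = self.backbone(x)                  # (B*6, C', F', T')
        # re-arrange back: concatenate the segments along time -> 30 s representation
        _, c2, f2, t2 = x.shape
        x = x.reshape(b, self.n_segments, c2, f2, t2).permute(0, 2, 3, 1, 4)
        x = x.reshape(b, c2, f2, self.n_segments * t2)
        # simple pooling over the frequency and time dimensions
        x = x.mean(dim=(2, 3))                # (B, C')
        return self.head(x)                   # (B, 398) logits
```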
resnet34, tf_efficientnetv2_s_in21k, tf_efficientnetv2_m_in21k, eca_nfnet_l0
BCE loss
Use the rating to weight each recording's contribution to the loss
The assumption is that recordings with a lower rating have worse quality, both in audio and labels, and should contribute less to model training
weight each sample by rating / max(ratings)
label smoothing to account for noisy annotations and the absence of birds in “unlucky” 30-second crops.
For background noise
Ensembling
Post processing
A percentile-based thresholding approach
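The write-up gives no exact formula for the percentile thresholding, so the following is only one plausible variant: derive each class's cut-off from a percentile of its own score distribution rather than using a fixed global threshold.

```python
import numpy as np

def percentile_threshold(probs, q=95.0):
    """Sketch: binarize per-class probabilities with percentile-derived thresholds.

    probs: (n_rows, num_classes) predicted probabilities.
    For each class, the q-th percentile of its predictions across all rows
    becomes that class's threshold, adapting the cut-off per class.
    """
    thresholds = np.percentile(probs, q, axis=0, keepdims=True)  # (1, num_classes)
    return probs >= thresholds
```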
Didn't work
Codes
https://www.kaggle.com/datasets/christofhenkel/kaggle-birdclef2021-2nd-place-github
Model
Augmentation
Other
Summary
Explanation
Training on a clip of 20 seconds and not 5 seconds
Model
SED model (seresnet50, EfficientNetB2, EfficientNetB3)
the backbone has two outputs
An attention block computes the final output (clip-level prediction), which is needed for training.
The attention block is applied to the first output of the backbone (its dimension should be (BS, 4, num_class)).
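An attention block of this kind is typically a PANNs-style attention pooling over the time segments. The sketch below is modeled on that pattern, not on this author's exact code; the class name, `in_features`, and operating on features rather than the (BS, 4, num_class) tensor directly are assumptions:

```python
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """Sketch of SED-style attention pooling over T time segments (T = 4 here)."""

    def __init__(self, in_features: int, num_class: int):
        super().__init__()
        self.att = nn.Linear(in_features, num_class)  # attention weight per segment
        self.cla = nn.Linear(in_features, num_class)  # segment-wise class score

    def forward(self, x):                             # x: (BS, T, in_features)
        att = torch.softmax(self.att(x), dim=1)       # normalize over the T segments
        cla = torch.sigmoid(self.cla(x))              # segment-wise probabilities
        clipwise = (att * cla).sum(dim=1)             # (BS, num_class) clip prediction
        return clipwise, cla
```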
During the training
During inference
Different CNNs have been trained such as SeResNet50, EfficientNetB2, EfficientNetB3, EfficientNetB4, EfficientNetB5, EfficientNetB6, EfficientNetB7
I did not train them on all folds because of hardware constraints (training took too long). After that, I simply ensembled these models by CNN type.
Post processing Inference
Summary
Code
Preprocess
```python
import torch.nn as nn
from torchaudio.transforms import AmplitudeToDB, MelSpectrogram

# NormalizeMelSpec is a custom layer from the solution code
# (per-spectrogram standardization), defined elsewhere in the repository.
logmelspec_extractor = nn.Sequential(
    MelSpectrogram(
        sample_rate=32000,
        n_mels=128,
        f_min=20,
        n_fft=2048,
        hop_length=512,
        normalized=True,
    ),
    AmplitudeToDB(top_db=80.0),
    NormalizeMelSpec(),
)
```
Modeling
Training
clipwise_pred is optimized directly
use a secondary label
trained for 40–50 epochs
Pseudo labeling
30s finetuning
Larger models
Checkpoint selection
Inference with global information
Ensemble
Post-processing with location and date
A list of birds observed within 450 meters of each site and a list of birds appearing in each month were made, and birds not appearing were removed from the submission.
Here, because the two sites "COR" and "COL" are relatively close, a bird was removed only if it was absent from both sites' lists.
I also quadrupled the thresholds for some rare classes of birds observed in the vicinity of the sites.
This post-processing gave a consistent improvement of a little under 0.01 on both train_soundscape and the public LB.
Competition link
https://www.kaggle.com/c/birdclef-2021
Evaluation
Row-wise micro-averaged F1 score.
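The row-wise micro-averaged F1 computes a micro F1 per row (each row being one 5-second window's label set) and then averages over rows. A sketch, with the empty-row convention being an assumption:

```python
import numpy as np

def row_wise_micro_f1(y_true, y_pred):
    """Sketch: micro F1 per row, then the mean over rows.

    y_true, y_pred: binary arrays of shape (n_rows, n_classes).
    A row with no true and no predicted labels counts as F1 = 1 here
    (assumed convention for this sketch).
    """
    scores = []
    for t, p in zip(y_true, y_pred):
        tp = np.sum((t == 1) & (p == 1))
        fp = np.sum((t == 0) & (p == 1))
        fn = np.sum((t == 1) & (p == 0))
        if tp + fp + fn == 0:
            scores.append(1.0)
        else:
            # micro F1 for this row
            scores.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))
```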
top10 solutions
1st 2nd 3rd 4th [5th] 6th [7th] [8th] [9th] [10th]
Other (if any)
submission format