osuossu8 opened 2 years ago
TLDR
model: ensemble of several CNNs
input:
augmentation
Code Pipeline and data setup
Binary classifier
Bird classifier
trained on 30-second random crops of the train_short data
To match the 5-second snippet format of the test data, we reshaped each 30-second crop into 6x 5-second parts before feeding them through the backbone
After the backbone we reshaped the data again, concatenating the respective time segments to restore the 30-second representation
then used simple pooling over the time and frequency dimensions
a simple one-layer head which gave us the 398 bird classes
naively used the union of primary and secondary labels as the target
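The 30-second-to-6x-5-second reshape trick above can be sketched roughly as follows. All names, shapes, and the toy backbone here are assumptions for illustration, not the authors' actual code:

```python
import torch
import torch.nn as nn

class SegmentWiseModel(nn.Module):
    """Sketch: feed a 30 s crop through the backbone as 6x 5 s parts, then
    concatenate the segment features back along time before pooling."""

    def __init__(self, backbone: nn.Module, num_classes: int = 398,
                 feat_dim: int = 8, n_segments: int = 6):
        super().__init__()
        self.backbone = backbone              # any CNN returning a (B, C', F', T') map
        self.n_segments = n_segments
        self.head = nn.Linear(feat_dim, num_classes)  # simple one-layer head

    def forward(self, spec):                  # spec: (B, 1, n_mels, T), T covers 30 s
        b, c, f, t = spec.shape
        seg_t = t // self.n_segments
        # split the 30 s crop into 6x 5 s parts, stacked into the batch dim
        x = spec[..., : seg_t * self.n_segments]
        x = x.reshape(b, c, f, self.n_segments, seg_t).permute(0, 3, 1, 2, 4)
        x = x.reshape(b * self.n_segments, c, f, seg_t)
        x = self.backbone(x)                  # (B*6, C', F', T')
        # re-arrange back: concatenate the segments along time -> 30 s representation
        _, c2, f2, t2 = x.shape
        x = x.reshape(b, self.n_segments, c2, f2, t2).permute(0, 2, 3, 1, 4)
        x = x.reshape(b, c2, f2, self.n_segments * t2)
        # simple pooling over the frequency and time dimensions
        x = x.mean(dim=(2, 3))                # (B, C')
        return self.head(x)                   # (B, 398) logits
```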
resnet34, tf_efficientnetv2_s_in21k, tf_efficientnetv2_m_in21k, eca_nfnet_l0
BCE loss
Use the rating to weight each recording's contribution to the loss
The assumption is that recordings with a lower rating have worse quality, both in audio and labels, and should contribute less to model training
weight each sample by rating / max(ratings)
label smoothing to account for noisy annotations and the absence of birds in “unlucky” 30-second crops.
For background noise
Ensembling
Post processing
A percentile-based thresholding approach
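The write-up gives no exact formula for the percentile thresholding, so the following is only one plausible variant: derive each class's cut-off from a percentile of its own score distribution rather than using a fixed global threshold.

```python
import numpy as np

def percentile_threshold(probs, q=95.0):
    """Sketch: binarize per-class probabilities with percentile-derived thresholds.

    probs: (n_rows, num_classes) predicted probabilities.
    For each class, the q-th percentile of its predictions across all rows
    becomes that class's threshold, adapting the cut-off per class.
    """
    thresholds = np.percentile(probs, q, axis=0, keepdims=True)  # (1, num_classes)
    return probs >= thresholds
```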
Didn't work
Codes
https://www.kaggle.com/datasets/christofhenkel/kaggle-birdclef2021-2nd-place-github
Model
Augmentation
Other
Summary
Explanation
Training on a clip of 20 seconds and not 5 seconds
Model
SED model (seresnet50, EfficientNetB2, EfficientNetB3)
the backbone has two outputs
An attention block computes the final output (clip-level prediction), which is needed for training.
The attention block is applied to the first output of the backbone (its dimension should be (BS, 4, num_class)).
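An attention block of this kind is typically a PANNs-style attention pooling over the time segments. The sketch below is modeled on that pattern, not on this author's exact code; the class name, `in_features`, and operating on features rather than the (BS, 4, num_class) tensor directly are assumptions:

```python
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """Sketch of SED-style attention pooling over T time segments (T = 4 here)."""

    def __init__(self, in_features: int, num_class: int):
        super().__init__()
        self.att = nn.Linear(in_features, num_class)  # attention weight per segment
        self.cla = nn.Linear(in_features, num_class)  # segment-wise class score

    def forward(self, x):                             # x: (BS, T, in_features)
        att = torch.softmax(self.att(x), dim=1)       # normalize over the T segments
        cla = torch.sigmoid(self.cla(x))              # segment-wise probabilities
        clipwise = (att * cla).sum(dim=1)             # (BS, num_class) clip prediction
        return clipwise, cla
```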
During the training
During inference
Different CNNs have been trained such as SeResNet50, EfficientNetB2, EfficientNetB3, EfficientNetB4, EfficientNetB5, EfficientNetB6, EfficientNetB7
I did not train them on all folds because of hardware constraints (training took too long). After that, I simply ensembled these models by CNN type.
Post processing Inference
Summary
Code
Preprocess
```python
import torch.nn as nn
from torchaudio.transforms import AmplitudeToDB, MelSpectrogram

# NormalizeMelSpec is a custom layer from the solution code
# (per-spectrogram standardization), defined elsewhere in the repository.
logmelspec_extractor = nn.Sequential(
    MelSpectrogram(
        sample_rate=32000,
        n_mels=128,
        f_min=20,
        n_fft=2048,
        hop_length=512,
        normalized=True,
    ),
    AmplitudeToDB(top_db=80.0),
    NormalizeMelSpec(),
)
```
Modeling
Training
clipwise_pred is optimized directly
use a secondary label
trained for 40–50 epochs
Pseudo labeling
30s finetuning
Larger models
Checkpoint selection
Inference with global information
Ensemble
Post-processing with location and date
A list of birds observed within 450 meters of each site and a list of birds appearing in each month were made, and birds not appearing were removed from the submission.
Here, because the two sites "COR" and "COL" are relatively close, a bird was removed only if it was absent from both sites' lists.
I also quadrupled the thresholds for some rare classes of birds observed in the vicinity of the sites.
This post-processing gave a consistent improvement of a little under 0.01 on both train_soundscape and the public LB.
Competition link
https://www.kaggle.com/c/birdclef-2021
Evaluation
Row-wise micro-averaged F1 score.
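The row-wise micro-averaged F1 computes a micro F1 per row (each row being one 5-second window's label set) and then averages over rows. A sketch, with the empty-row convention being an assumption:

```python
import numpy as np

def row_wise_micro_f1(y_true, y_pred):
    """Sketch: micro F1 per row, then the mean over rows.

    y_true, y_pred: binary arrays of shape (n_rows, n_classes).
    A row with no true and no predicted labels counts as F1 = 1 here
    (assumed convention for this sketch).
    """
    scores = []
    for t, p in zip(y_true, y_pred):
        tp = np.sum((t == 1) & (p == 1))
        fp = np.sum((t == 0) & (p == 1))
        fn = np.sum((t == 1) & (p == 0))
        if tp + fp + fn == 0:
            scores.append(1.0)
        else:
            # micro F1 for this row
            scores.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))
```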
top10 solutions
1st 2nd 3rd 4th [5th] 6th [7th] [8th] [9th] [10th]
Other (if any)
submission format