No external data.
I noticed that the default SED model had over 80 million parameters, so I switched all my models to use a pretrained densenet121 model as the CNN feature extractor and reduced the attention block size to 1024.
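Roughly, that change might look like the sketch below: a timm densenet121 backbone (1024 output channels) feeding a PANNs-style attention head. This is my reconstruction under those assumptions, not the author's exact code.

import timm
import torch
import torch.nn as nn

class AttBlock(nn.Module):
    # PANNs-style attention pooling over the time axis.
    def __init__(self, in_features, out_features):
        super().__init__()
        self.att = nn.Conv1d(in_features, out_features, kernel_size=1)
        self.cla = nn.Conv1d(in_features, out_features, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, time)
        att = torch.softmax(torch.tanh(self.att(x)), dim=-1)
        cla = torch.sigmoid(self.cla(x))
        return (att * cla).sum(dim=-1), cla  # clipwise, framewise

class SedDenseNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # densenet121 as the CNN feature extractor, classifier head removed;
        # its final feature map has 1024 channels, matching the smaller
        # attention block size mentioned above.
        self.encoder = timm.create_model(
            "densenet121", pretrained=True, num_classes=0, global_pool="")
        self.fc = nn.Linear(1024, 1024)
        self.att_block = AttBlock(1024, num_classes)

    def forward(self, mel):  # mel: (batch, 3, n_mels, time)
        x = self.encoder(mel)                # (batch, 1024, f, t)
        x = x.mean(dim=2)                    # average over frequency -> (batch, 1024, t)
        x = torch.relu(self.fc(x.transpose(1, 2))).transpose(1, 2)
        clipwise, framewise = self.att_block(x)
        return clipwise, framewise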
4-fold models without mixup
4-fold models with mixup
5-fold models without mixup
import torch
import torch.nn as nn

EPSILON_FP16 = 1e-5  # clamp bound for numerical stability in fp16 (assumed value)

class SedScaledPosNegFocalLoss(nn.Module):
    def __init__(self, gamma=0.0, alpha_1=1.0, alpha_0=1.0, secondary_factor=1.0):
        super().__init__()
        self.loss_fn = nn.BCELoss(reduction='none')
        self.secondary_factor = secondary_factor
        self.gamma = gamma
        self.alpha_1 = alpha_1  # weight on positive positions
        self.alpha_0 = alpha_0  # weight on negative positions
        self.loss_keys = ["bce_loss", "F_loss", "FScaled_loss", "F_loss_0", "F_loss_1"]

    def forward(self, y_pred, y_target):
        y_true = y_target["all_labels"]
        y_sec_true = y_target["secondary_labels"]
        bs, s, o = y_true.shape

        # Sigmoid has already been applied in the model
        y_pred = torch.clamp(y_pred, min=EPSILON_FP16, max=1.0 - EPSILON_FP16)
        y_pred = y_pred.reshape(bs * s, o)
        y_true = y_true.reshape(bs * s, o)
        y_sec_true = y_sec_true.reshape(bs * s, o)

        with torch.no_grad():
            # 1 where any label (primary or secondary) is present, 0 elsewhere
            y_all_ones_mask = torch.ones_like(y_true, requires_grad=False)
            y_all_zeros_mask = torch.zeros_like(y_true, requires_grad=False)
            y_all_mask = torch.where(y_true > 0.0, y_all_ones_mask, y_all_zeros_mask)
            # scale positions carrying a secondary label by secondary_factor
            y_ones_mask = torch.ones_like(y_sec_true, requires_grad=False)
            y_sec_scale_mask = torch.ones_like(y_sec_true, requires_grad=False) * self.secondary_factor
            y_secondary_mask = torch.where(y_sec_true > 0.0, y_sec_scale_mask, y_ones_mask)

        bce_loss = self.loss_fn(y_pred, y_true)
        pt = torch.exp(-bce_loss)
        F_loss_0 = (self.alpha_0 * (1 - y_all_mask)) * (1 - pt) ** self.gamma * bce_loss
        F_loss_1 = (self.alpha_1 * y_all_mask) * (1 - pt) ** self.gamma * bce_loss
        F_loss = F_loss_0 + F_loss_1
        FScaled_loss = (y_secondary_mask * F_loss).mean()
        return FScaled_loss, {"bce_loss": bce_loss.mean(),
                              "F_loss_1": F_loss_1.mean(),
                              "F_loss_0": F_loss_0.mean(),
                              "F_loss": F_loss.mean(),
                              "FScaled_loss": FScaled_loss}
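For reference, a hypothetical call showing the expected shapes (the class count and hyperparameters here are illustrative, not from the original notebook):

criterion = SedScaledPosNegFocalLoss(gamma=2.0, secondary_factor=0.5)
y_pred = torch.rand(8, 4, 264, requires_grad=True)  # post-sigmoid framewise predictions
y_all = torch.randint(0, 2, (8, 4, 264)).float()    # primary + secondary multi-hot labels
y_sec = torch.randint(0, 2, (8, 4, 264)).float()    # secondary-only multi-hot labels
loss, logs = criterion(y_pred, {"all_labels": y_all, "secondary_labels": y_sec})
loss.backward()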
nb: https://www.kaggle.com/vlomme/surfin-bird-2nd-place
git: https://github.com/vlomme/Birdcall-Identification-competition/blob/master/train.py
discussion: https://www.kaggle.com/c/birdsong-recognition/discussion/183199
nb: https://www.kaggle.com/theoviel/training-a-winning-model?scriptVersionId=42814701
git: https://github.com/TheoViel/kaggle_birdcall_identification
Data augmentation is the key to reducing the discrepancy between train and test. We start by randomly cropping 5 seconds of the audio and then add aggressive noise augmentations:
Gaussian noise
With a signal-to-noise ratio of up to 0.5.
Background noise
We randomly choose 5 seconds of a sample from the background dataset available here. This dataset contains samples without birdcalls taken from the example test audios in the competition data, plus some manually selected samples from the freesound bird detection challenge.
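A minimal sketch of both noise augmentations, assuming raw waveforms as NumPy arrays; the interpretation of the 0.5 ratio and the background mixing level are my assumptions:

import numpy as np

def add_gaussian_noise(audio, max_snr=0.5):
    # One reading of "ratio up to 0.5": noise std is a random fraction
    # (up to max_snr) of the signal std.
    ratio = np.random.uniform(0.0, max_snr)
    return audio + np.random.randn(len(audio)) * audio.std() * ratio

def add_background_noise(audio, bg_pool, level=0.5):
    # bg_pool: list of background clips (nocall / freesound samples), each
    # assumed at least as long as `audio`; mix in a random slice at a fixed level.
    bg = bg_pool[np.random.randint(len(bg_pool))]
    start = np.random.randint(0, len(bg) - len(audio) + 1)
    return audio + level * bg[start:start + len(audio)]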
Modified Mixup
Mixup creates a combination of a batch x1 and its shuffled version x2: x = a * x1 + (1 - a) * x2, where a is sampled from a beta distribution.
Then, instead of using the classical mixup objective, we define the target associated with x as the union of the original targets.
This forces the model to correctly predict both labels.
Mixup is applied with probability 0.5, and I used 5 as the parameter of the beta distribution, which forces a to be close to 0.5.
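A short sketch of this union-target mixup (the function and argument names are mine):

import numpy as np
import torch

def mixup_union(x, y, alpha=5.0, p=0.5):
    # x: batch of waveforms/spectrograms, y: (batch, num_classes) multi-hot targets
    if np.random.rand() > p:
        return x, y
    lam = float(np.random.beta(alpha, alpha))  # alpha=5 concentrates lam near 0.5
    perm = torch.randperm(x.size(0))
    x = lam * x + (1.0 - lam) * x[perm]
    y = torch.clamp(y + y[perm], max=1.0)      # union of labels, not a weighted mix
    return x, y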
Improved cropping
Instead of selecting crops at random, crops were also selected based on out-of-fold confidence. The confidence at time t is the probability of the ground-truth class predicted on the 5-second crop starting at t.
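One way this could look (my interpretation; oof_conf is a hypothetical array of out-of-fold probabilities per candidate start second, and 32 kHz is an assumed sampling rate):

import numpy as np

def pick_crop(audio, oof_conf, sr=32000, crop_sec=5):
    # oof_conf[t]: out-of-fold probability of the ground-truth class for the
    # 5-second window starting at second t; sample starts proportionally to it.
    probs = np.asarray(oof_conf, dtype=np.float64)
    probs = probs / probs.sum()
    t = np.random.choice(len(probs), p=probs)
    return audio[t * sr:(t + crop_sec) * sr]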
Competition link
https://www.kaggle.com/c/birdsong-recognition
Evaluation
Row-wise micro-averaged F1 score
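That is, predictions and ground truth are compared as label sets per row, an F1 is computed for each row, and the row scores are averaged. A small illustrative implementation (set-based, treating "nocall" as an ordinary label so the sets are non-empty):

def row_f1(true_set, pred_set):
    # Per-row F1 between the predicted and true label sets.
    tp = len(true_set & pred_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(true_set)
    return 2 * precision * recall / (precision + recall)

def competition_score(rows):
    # rows: iterable of (true_set, pred_set) pairs; the score is the mean row F1.
    rows = list(rows)
    return sum(row_f1(t, p) for t, p in rows) / len(rows)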
Top 10 solutions
1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
Other (if any)
Submission format