AudioCLIP eval mode gives different output

winstonww commented 2 years ago

I was looking at the AudioCLIPEncoder and tried running the AudioCLIP demo here:

https://github.com/AndreyGuzhov/AudioCLIP/blob/master/demo/AudioCLIP.ipynb

I found that if we replace this line:

aclp = AudioCLIP(pretrained=f'../assets/{MODEL_FILENAME}')

with

aclp = AudioCLIP(pretrained=f'../assets/{MODEL_FILENAME}').eval()

the results returned are different and accuracy is lower in eval mode.

Unfortunately, since the original (training) mode does not support an input of batch size 1, we are currently using eval mode in our AudioCLIPEncoder.

I will file an issue in the AudioCLIP repo, but also filing an issue here to keep track of things.

winstonww commented 2 years ago

Oops, it seems the AudioCLIP repo does not support filing issues, so we might have to find a workaround for this.

winstonww commented 2 years ago

Looking into this further, it seems that eval mode is the expected mode. In training mode, the model's prediction for an input varies with the batch it is in. For instance, with the model in training mode:

                Filename, Audio                 Textual Label (Confidence)

           cat_3-95694-A-5.wav ->                           cat (99.95%),        car horn (00.04%),          speech (00.01%),    thunderstorm (00.00%),     alarm clock (00.00%),        coughing (00.00%)
     coughing_1-58792-A-24.wav ->                      coughing (99.35%),        car horn (00.55%),             cat (00.03%),     alarm clock (00.03%),    thunderstorm (00.02%),          speech (00.02%)
 alarm_clock_3-120526-B-37.wav ->                   alarm clock (99.87%),        car horn (00.09%),    thunderstorm (00.02%),             cat (00.01%),        coughing (00.00%),          speech (00.00%)
     thunder_3-144891-B-19.wav ->                  thunderstorm (99.35%),        car horn (00.38%),             cat (00.17%),     alarm clock (00.06%),        coughing (00.03%),          speech (00.01%)
     car_horn_1-24074-A-43.wav ->                      car horn (96.10%),    thunderstorm (02.17%),        coughing (01.13%),             cat (00.52%),     alarm clock (00.05%),          speech (00.02%)

And if we place thunder_3-144891-B-19.wav in a different query batch:

              Filename, Audio                 Textual Label (Confidence)

     thunder_6-144891-B-19.wav ->                           cat (48.35%),    thunderstorm (23.08%),        car horn (21.84%),        coughing (03.48%),     alarm clock (02.79%),          speech (00.46%)
     thunder_5-144891-B-19.wav ->                           cat (48.35%),    thunderstorm (23.08%),        car horn (21.84%),        coughing (03.48%),     alarm clock (02.79%),          speech (00.46%)
     thunder_4-144891-B-19.wav ->                           cat (48.35%),    thunderstorm (23.08%),        car horn (21.84%),        coughing (03.48%),     alarm clock (02.79%),          speech (00.46%)
     thunder_3-144891-B-19.wav ->                           cat (48.35%),    thunderstorm (23.08%),        car horn (21.84%),        coughing (03.48%),     alarm clock (02.79%),          speech (00.46%)
     car_horn_1-24074-A-43.wav ->                      car horn (87.79%),             cat (08.73%),        coughing (01.58%),     alarm clock (01.47%),    thunderstorm (00.33%),          speech (00.09%)

thunder_3-144891-B-19.wav yields completely different output labels if it is placed in different query batches.

If the model is in eval mode, the output labels of a sample are invariant to the query batch the sample is in, and this should be the expected behavior.
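To illustrate the batch dependence described above, here is a minimal pure-Python sketch of BatchNorm-style normalization on a single scalar feature. It is not AudioCLIP code: the function `batch_norm`, the toy batches, and the running statistics (mean 0, variance 1) are all assumptions made for the example, and the affine scale and shift terms are left out.

```python
import math

def batch_norm(batch, running_mean, running_var, training, eps=1e-5):
    """Normalize a batch of scalar features, BatchNorm-style (no affine terms)."""
    if training:
        # Training mode: statistics come from the current batch,
        # so each output depends on the other samples in the batch.
        mean = sum(batch) / len(batch)
        var = sum((x - mean) ** 2 for x in batch) / len(batch)
    else:
        # Eval mode: fixed statistics accumulated during training.
        mean, var = running_mean, running_var
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

sample = 2.0
batch_a = [sample, 0.0, 4.0]    # same sample, first batch
batch_b = [sample, 10.0, 12.0]  # same sample, different neighbours

# Output for `sample` in each batch and each mode.
train_a = batch_norm(batch_a, 0.0, 1.0, training=True)[0]
train_b = batch_norm(batch_b, 0.0, 1.0, training=True)[0]
eval_a = batch_norm(batch_a, 0.0, 1.0, training=False)[0]
eval_b = batch_norm(batch_b, 0.0, 1.0, training=False)[0]

print("training mode:", train_a, train_b)  # differ between batches
print("eval mode:    ", eval_a, eval_b)    # identical
```

In training mode the same sample is normalized differently in each batch, while in eval mode its output does not depend on its neighbours, which matches what the tables above show for thunder_3-144891-B-19.wav.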

tadejsv commented 2 years ago

@winstonww This is the expected behavior. The model has several layers (for example Dropout and BatchNorm) that alter the output in training mode even without backpropagation changing their weights. They are regularization layers; this is their job.

When the model is put into evaluation mode, these layers stop doing that (BatchNorm switches to its fixed running statistics, and Dropout is turned off completely).
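As a rough illustration of the Dropout half of this, here is a minimal pure-Python sketch of inverted dropout. The function `dropout`, the drop probability, and the seed are assumptions chosen for the example; whether Dropout layers are actually active in the AudioCLIP forward pass is not confirmed in this thread.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout: zero each value with probability p while training."""
    if not training:
        # Eval mode: the layer is an identity function.
        return list(values)
    # Training mode: surviving values are scaled by 1 / (1 - p)
    # so the expected value of each output matches its input.
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

features = [1.0, 2.0, 3.0, 4.0]
rng = random.Random(0)  # seeded only to make the example repeatable

train_out = dropout(features, p=0.5, training=True, rng=rng)
eval_out = dropout(features, p=0.5, training=False, rng=rng)

print("training mode:", train_out)  # some entries zeroed, the rest doubled
print("eval mode:    ", eval_out)   # unchanged
```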

I would also not expect this to lead to lower accuracy, as there is no real change in model weights (and the model was trained to be as invariant to random changes in the regularization layers as possible). What method are you using for evaluation?

As for supporting a batch size of one, what did you mean by this? The input shape that the model supports stays the same in train and eval mode.

winstonww commented 2 years ago

> As for supporting batch size of one - what did you mean by this?

@tadejsv If you run the code in the notebook mentioned in the description (https://github.com/AndreyGuzhov/AudioCLIP/blob/master/demo/AudioCLIP.ipynb), you will find that it raises an error if you limit the number of inputs (audio tracks) to one, since BatchNorm in training mode does not support a single-sample input.
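A small pure-Python sketch of why a single sample is a problem for batch statistics: the batch mean equals the sample itself and the variance is zero, so every input would normalize to 0. The function `normalize_with_batch_stats`, its `allow_single` switch, and the error text are hypothetical and only model the kind of check described above; they are not the library's code.

```python
import math

def normalize_with_batch_stats(batch, eps=1e-5, allow_single=False):
    """Training-mode normalization of one channel using batch statistics."""
    if len(batch) < 2 and not allow_single:
        # Hypothetical guard, modeled on the error described above.
        raise ValueError("expected more than 1 value per channel when training")
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

# With the guard disabled, any single input collapses to 0.0,
# so the layer would discard all information about the sample.
collapsed = [
    normalize_with_batch_stats([x], allow_single=True)[0]
    for x in (-3.0, 0.5, 42.0)
]

# With the guard enabled, a batch of one is rejected.
try:
    normalize_with_batch_stats([0.5])
    single_sample_error = None
except ValueError as exc:
    single_sample_error = str(exc)

print(collapsed)
print(single_sample_error)
```

In eval mode the layer uses stored statistics instead of batch statistics, so a batch of one does not hit this case.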

> I would also not expect this to lead to lower accuracy, as there is no real change in model weights (and the model was trained to be as invariant to random changes in the regularization layers as possible).

Well, the results (encodings) differ between training and eval mode with the given demo code. As described above, if you run the code with eval, you will see lower accuracy on the given example. For your reference, I have extracted the demo code below. Note the line

# Try removing .eval() below; you will see different results.
aclp = AudioCLIP(pretrained=f'assets/{MODEL_FILENAME}').eval()

To reproduce, download the AudioCLIP model into the assets directory, copy the following into the AudioCLIP repo, and run:

import os
import sys
import glob

import librosa

import numpy as np

import torch

import matplotlib.pyplot as plt

sys.path.append(os.path.abspath(f'{os.getcwd()}'))

from model import AudioCLIP
from utils.transforms import ToTensor1D

torch.set_grad_enabled(False)

MODEL_FILENAME = 'AudioCLIP-Full-Training.pt'
SAMPLE_RATE = 44100
LABELS = ['cat', 'thunderstorm', 'coughing', 'alarm clock', 'car horn']

# Try removing .eval() below; you will see different results.
aclp = AudioCLIP(pretrained=f'assets/{MODEL_FILENAME}').eval()

audio_transforms = ToTensor1D()
paths_to_audio = glob.glob('audio/*.wav')
text = [[label] for label in LABELS]

# load every track at the common sample rate
audio = []
for path_to_audio in paths_to_audio:
    track, _ = librosa.load(path_to_audio, sr=SAMPLE_RATE, dtype=np.float32)
    audio.append(track)
# torch.stack requires all tracks to have the same number of samples
audio = torch.stack([audio_transforms(track.reshape(1, -1)) for track in audio])
((audio_features, _, _), _), _ = aclp(audio=audio)

((_, _, text_features), _), _ = aclp(text=text)

audio_features = audio_features / torch.linalg.norm(audio_features, dim=-1, keepdim=True)
text_features = text_features / torch.linalg.norm(text_features, dim=-1, keepdim=True)
print(audio_features)
print(text_features)

scale_audio_text = torch.clamp(aclp.logit_scale_at.exp(), min=1.0, max=100.0)
logits_audio_text = scale_audio_text * audio_features @ text_features.T

print('\t\tFilename, Audio\t\t\tTextual Label (Confidence)', end='\n\n')

# calculate model confidence
confidence = logits_audio_text.softmax(dim=1)
for audio_idx in range(logits_audio_text.shape[0]):
    # rank all labels by confidence (k = len(LABELS))
    conf_values, ids = confidence[audio_idx].topk(len(LABELS))

    # format output strings
    query = f'{os.path.basename(paths_to_audio[audio_idx]):>30s} ->\t\t'
    results = ', '.join([f'{LABELS[i]:>15s} ({v:06.2%})' for v, i in zip(conf_values, ids)])

    print(query + results)
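For readers who want to see the confidence step of the script above in isolation, here is a pure-Python sketch: cosine similarity between unit vectors, multiplied by a scale, then a softmax over the labels. The function `confidences`, the 2-d toy vectors, and the two scale values are made up for illustration; only the upper clamp of 100.0 is taken from the script.

```python
import math

def confidences(audio_vec, text_vecs, scale):
    """Softmax over scaled cosine similarities between one audio vector and the labels."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = unit(audio_vec)
    logits = [scale * sum(x * y for x, y in zip(a, unit(t))) for t in text_vecs]
    # Subtract the maximum logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

audio_vec = [1.0, 0.2]                 # toy audio embedding
label_vecs = [[1.0, 0.0], [0.0, 1.0]]  # toy text embeddings

sharp = confidences(audio_vec, label_vecs, scale=100.0)  # scale at the clamp maximum
flat = confidences(audio_vec, label_vecs, scale=1.0)     # scale at the clamp minimum

print(sharp)
print(flat)
```

With a large scale, even modest differences in cosine similarity produce confidences close to 100%, so the percentages in the tables are very sensitive to small shifts in the encodings.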

The result with and without eval is as follows.

With eval:

           cat_3-95694-A-5.wav ->                           cat (100.00%),        car horn (00.00%),    thunderstorm (00.00%),     alarm clock (00.00%),        coughing (00.00%)
     coughing_1-58792-A-24.wav ->                           cat (51.26%),        car horn (23.23%),        coughing (14.29%),     alarm clock (05.65%),    thunderstorm (05.58%)
 alarm_clock_3-120526-B-37.wav ->                   alarm clock (46.50%),             cat (27.93%),        car horn (18.48%),    thunderstorm (04.46%),        coughing (02.63%)
     thunder_3-144891-B-19.wav ->                      car horn (40.28%),             cat (38.32%),    thunderstorm (13.92%),        coughing (04.50%),     alarm clock (02.97%)
     car_horn_1-24074-A-43.wav ->                      car horn (51.59%),             cat (33.20%),    thunderstorm (07.57%),        coughing (04.24%),     alarm clock (03.40%)

Without eval:

           cat_3-95694-A-5.wav ->                           cat (99.96%),        car horn (00.04%),    thunderstorm (00.00%),     alarm clock (00.00%),        coughing (00.00%)
     coughing_1-58792-A-24.wav ->                      coughing (99.36%),        car horn (00.55%),             cat (00.03%),     alarm clock (00.03%),    thunderstorm (00.02%)
 alarm_clock_3-120526-B-37.wav ->                   alarm clock (99.87%),        car horn (00.09%),    thunderstorm (00.02%),             cat (00.01%),        coughing (00.00%)
     thunder_3-144891-B-19.wav ->                  thunderstorm (99.36%),        car horn (00.38%),             cat (00.17%),     alarm clock (00.06%),        coughing (00.03%)
     car_horn_1-24074-A-43.wav ->                      car horn (96.12%),    thunderstorm (02.17%),        coughing (01.13%),             cat (00.52%),     alarm clock (00.05%)

tadejsv commented 2 years ago

I see, thanks for the explanation. I will also look into this a bit later; it's very surprising behavior.

jina-ai / executors

AudioCLIP eval mode gives different output #263