Open winstonww opened 2 years ago
Oops, it seems the AudioCLIP repo does not support filing issues, so we might have to find a workaround for this.
Looking into this further, it seems that eval mode is the expected mode. In training mode, the model's prediction for an input varies with the batch it is in. For instance, when the model is in training mode:
Filename, Audio Textual Label (Confidence)
cat_3-95694-A-5.wav -> cat (99.95%), car horn (00.04%), speech (00.01%), thunderstorm (00.00%), alarm clock (00.00%), coughing (00.00%)
coughing_1-58792-A-24.wav -> coughing (99.35%), car horn (00.55%), cat (00.03%), alarm clock (00.03%), thunderstorm (00.02%), speech (00.02%)
alarm_clock_3-120526-B-37.wav -> alarm clock (99.87%), car horn (00.09%), thunderstorm (00.02%), cat (00.01%), coughing (00.00%), speech (00.00%)
thunder_3-144891-B-19.wav -> thunderstorm (99.35%), car horn (00.38%), cat (00.17%), alarm clock (00.06%), coughing (00.03%), speech (00.01%)
car_horn_1-24074-A-43.wav -> car horn (96.10%), thunderstorm (02.17%), coughing (01.13%), cat (00.52%), alarm clock (00.05%), speech (00.02%)
And if we place thunder_3-144891-B-19.wav in a different query batch:
Filename, Audio Textual Label (Confidence)
thunder_6-144891-B-19.wav -> cat (48.35%), thunderstorm (23.08%), car horn (21.84%), coughing (03.48%), alarm clock (02.79%), speech (00.46%)
thunder_5-144891-B-19.wav -> cat (48.35%), thunderstorm (23.08%), car horn (21.84%), coughing (03.48%), alarm clock (02.79%), speech (00.46%)
thunder_4-144891-B-19.wav -> cat (48.35%), thunderstorm (23.08%), car horn (21.84%), coughing (03.48%), alarm clock (02.79%), speech (00.46%)
thunder_3-144891-B-19.wav -> cat (48.35%), thunderstorm (23.08%), car horn (21.84%), coughing (03.48%), alarm clock (02.79%), speech (00.46%)
car_horn_1-24074-A-43.wav -> car horn (87.79%), cat (08.73%), coughing (01.58%), alarm clock (01.47%), thunderstorm (00.33%), speech (00.09%)
thunder_3-144891-B-19.wav yields completely different output labels when it is placed in different query batches.
If the model is in eval mode, the output labels of a sample are invariant to the query batch the sample is in, and this should be the expected behavior.
@winstonww This is the expected behavior. The model has several layers (for example Dropout and BatchNorm) which alter the output in training mode even without backpropagation changing their weights. They are regularization layers; this is their job.
When the model is put into evaluation mode, these layers stop doing that (BatchNorm becomes fixed, using its stored running statistics instead of per-batch ones, and Dropout is turned off completely).
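To make the BatchNorm part of this concrete, here is a minimal pure-Python sketch of batch normalization over scalar features. It is not PyTorch's implementation and not AudioCLIP code: the `batch_norm` helper, the sample values and the running statistics are all made up for illustration, and the learnable scale and shift are left out.

```python
import math


def batch_norm(batch, running_mean, running_var, training, eps=1e-5):
    """Normalize a list of scalar features, BatchNorm-style (no affine terms)."""
    if training:
        # Training mode: statistics come from the current batch.
        mean = sum(batch) / len(batch)
        var = sum((x - mean) ** 2 for x in batch) / len(batch)
    else:
        # Eval mode: statistics are fixed values stored during training.
        mean, var = running_mean, running_var
    return [(x - mean) / math.sqrt(var + eps) for x in batch]


sample = 2.0
batch_a = [sample, 0.0, 1.0, 3.0]    # the sample next to small values
batch_b = [sample, 2.0, 2.0, 10.0]   # the same sample next to larger values

# Hypothetical running statistics, standing in for values learned in training.
RUNNING_MEAN, RUNNING_VAR = 1.0, 4.0

train_a = batch_norm(batch_a, RUNNING_MEAN, RUNNING_VAR, training=True)[0]
train_b = batch_norm(batch_b, RUNNING_MEAN, RUNNING_VAR, training=True)[0]
eval_a = batch_norm(batch_a, RUNNING_MEAN, RUNNING_VAR, training=False)[0]
eval_b = batch_norm(batch_b, RUNNING_MEAN, RUNNING_VAR, training=False)[0]

print(f"training mode: {train_a:+.3f} vs {train_b:+.3f}")
print(f"eval mode:     {eval_a:+.3f} vs {eval_b:+.3f}")
```

In training mode the same sample is normalized differently depending on its neighbours in the batch, while in eval mode the two values are identical. This matches the batch-dependent labels reported above, although it does not by itself say which mode gives the better labels.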
I would also not expect this to lead to lower accuracy, as there is no real change in model weights (and the model was trained to be as invariant to random changes in the regularization layers as possible). What method are you using for evaluation?
As for supporting a batch size of one, what did you mean by this? The input shape that the model supports stays the same in train and eval mode.
As for supporting batch size of one - what did you mean by this?
@tadejsv If you actually run the code in the notebook mentioned in the description (https://github.com/AndreyGuzhov/AudioCLIP/blob/master/demo/AudioCLIP.ipynb), you will find that, since BatchNorm in training mode does not support a single-sample input, it raises an error if you limit the number of inputs (audio tracks) to one.
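A plausible reason for the single-sample restriction can be sketched in pure Python. With one value per feature, the batch mean equals the value and the batch variance is zero, so training-mode normalization maps every input to the same output. The `train_mode_batch_norm` helper below is hypothetical and only mimics the arithmetic; it is not the PyTorch layer, which (as reported above) raises an error instead of computing this.

```python
import math


def train_mode_batch_norm(batch, eps=1e-5):
    """Normalize scalar features using the statistics of the batch itself."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]


# Three very different inputs, each fed as a batch of size one.
single_outputs = [train_mode_batch_norm([x])[0] for x in (-3.0, 0.5, 42.0)]
print(single_outputs)

# With more than one sample the inputs stay distinguishable.
print(train_mode_batch_norm([-3.0, 0.5, 42.0]))
```

Every single-sample batch collapses to zero, so all information about the input is lost at that layer. Rejecting such input in training mode is therefore a reasonable safeguard, assuming this is indeed the layer that fails in the demo.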
I would also not expect this to lead to lower accuracy, as there is no real change in model weights (and the model was trained to be as invariant to random changes in the regularization layers as possible).
Well, the results (encodings) are observed to differ between training and eval mode with the given demo code. As already described above, if you run the code with eval, you will see lower accuracy on the given example. For your reference, I have extracted the demo code below (notice the line aclp = AudioCLIP(pretrained=f'assets/{MODEL_FILENAME}').eval() under the comment # TRY TO REMOVE .eval() BELOW YOU WILL SEE DIFFERENT RESULTS). Download the AudioCLIP model into the assets directory, copy the following into the AudioCLIP repo and run:
import os
import sys
import glob
import librosa
import numpy as np
import torch
import matplotlib.pyplot as plt
sys.path.append(os.path.abspath(f'{os.getcwd()}'))
from model import AudioCLIP
from utils.transforms import ToTensor1D
torch.set_grad_enabled(False)
MODEL_FILENAME = 'AudioCLIP-Full-Training.pt'
SAMPLE_RATE = 44100
LABELS = ['cat', 'thunderstorm', 'coughing', 'alarm clock', 'car horn']
# TRY TO REMOVE .eval() BELOW YOU WILL SEE DIFFERENT RESULTS
aclp = AudioCLIP(pretrained=f'assets/{MODEL_FILENAME}').eval()
audio_transforms = ToTensor1D()
paths_to_audio = glob.glob('audio/*.wav')
text = [[label] for label in LABELS]
audio = []
for path_to_audio in paths_to_audio:
    track, _ = librosa.load(path_to_audio, sr=SAMPLE_RATE, dtype=np.float32)
    audio.append(track)
audio = torch.stack([audio_transforms(track.reshape(1, -1)) for track in audio])
((audio_features, _, _), _), _ = aclp(audio=audio)
((_, _, text_features), _), _ = aclp(text=text)
audio_features = audio_features / torch.linalg.norm(audio_features, dim=-1, keepdim=True)
text_features = text_features / torch.linalg.norm(text_features, dim=-1, keepdim=True)
print(audio_features)
print(text_features)
scale_audio_text = torch.clamp(aclp.logit_scale_at.exp(), min=1.0, max=100.0)
logits_audio_text = scale_audio_text * audio_features @ text_features.T
print('\t\tFilename, Audio\t\t\tTextual Label (Confidence)', end='\n\n')
# calculate model confidence
confidence = logits_audio_text.softmax(dim=1)
for audio_idx in range(logits_audio_text.shape[0]):
    # rank all labels by confidence, highest first
    conf_values, ids = confidence[audio_idx].topk(len(LABELS))
    # format output strings
    query = f'{os.path.basename(paths_to_audio[audio_idx]):>30s} ->\t\t'
    results = ', '.join([f'{LABELS[i]:>15s} ({v:06.2%})' for v, i in zip(conf_values, ids)])
    print(query + results)
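The confidence values printed by this script come from a softmax over scaled cosine similarities between one audio embedding and each text embedding. The following pure-Python sketch reproduces that arithmetic on made-up three-dimensional embeddings. The vectors and the helper names (`normalize`, `softmax`, `confidences`) are assumptions for illustration; only the scale of 100.0 is taken from the upper clamp bound in the code above.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidences(audio_vec, text_vecs, scale):
    """Softmax over scaled cosine similarities between one audio vector and each text vector."""
    a = normalize(audio_vec)
    logits = [scale * sum(x * y for x, y in zip(a, normalize(t))) for t in text_vecs]
    return softmax(logits)


# Made-up embeddings; real ones come from the model.
audio = [1.0, 0.2, 0.0]
texts = {
    'cat': [0.9, 0.1, 0.0],
    'thunderstorm': [0.0, 1.0, 0.0],
    'car horn': [0.0, 0.0, 1.0],
}

conf = confidences(audio, list(texts.values()), scale=100.0)
ranked = sorted(zip(texts, conf), key=lambda pair: pair[1], reverse=True)
print(', '.join(f'{label} ({value:06.2%})' for label, value in ranked))

# The same embeddings with a scale of 1.0 give a much flatter distribution.
conf_low = confidences(audio, list(texts.values()), scale=1.0)
```

Because the scale can be as large as 100, small differences in cosine similarity turn into confidences close to 100%. The softmax is also taken over the supplied labels only, so the percentages depend on which labels are in the list.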
The results with and without eval are as follows.
With eval:
cat_3-95694-A-5.wav -> cat (100.00%), car horn (00.00%), thunderstorm (00.00%), alarm clock (00.00%), coughing (00.00%)
coughing_1-58792-A-24.wav -> cat (51.26%), car horn (23.23%), coughing (14.29%), alarm clock (05.65%), thunderstorm (05.58%)
alarm_clock_3-120526-B-37.wav -> alarm clock (46.50%), cat (27.93%), car horn (18.48%), thunderstorm (04.46%), coughing (02.63%)
thunder_3-144891-B-19.wav -> car horn (40.28%), cat (38.32%), thunderstorm (13.92%), coughing (04.50%), alarm clock (02.97%)
car_horn_1-24074-A-43.wav -> car horn (51.59%), cat (33.20%), thunderstorm (07.57%), coughing (04.24%), alarm clock (03.40%)
Without eval:
cat_3-95694-A-5.wav -> cat (99.96%), car horn (00.04%), thunderstorm (00.00%), alarm clock (00.00%), coughing (00.00%)
coughing_1-58792-A-24.wav -> coughing (99.36%), car horn (00.55%), cat (00.03%), alarm clock (00.03%), thunderstorm (00.02%)
alarm_clock_3-120526-B-37.wav -> alarm clock (99.87%), car horn (00.09%), thunderstorm (00.02%), cat (00.01%), coughing (00.00%)
thunder_3-144891-B-19.wav -> thunderstorm (99.36%), car horn (00.38%), cat (00.17%), alarm clock (00.06%), coughing (00.03%)
car_horn_1-24074-A-43.wav -> car horn (96.12%), thunderstorm (02.17%), coughing (01.13%), cat (00.52%), alarm clock (00.05%)
I see, thanks for the explanation. I will also look into this a bit later; it's very surprising behavior.
I was looking at the AudioCLIPEncoder and tried running the AudioCLIP demo here: https://github.com/AndreyGuzhov/AudioCLIP/blob/master/demo/AudioCLIP.ipynb
I found that if we replace this line:
aclp = AudioCLIP(pretrained=f'../assets/{MODEL_FILENAME}')
with
aclp = AudioCLIP(pretrained=f'../assets/{MODEL_FILENAME}').eval()
the results returned are different and accuracy is lower in eval mode.
Unfortunately, since the original (training) mode does not support an input of batch size 1, we are currently using eval mode in our AudioCLIPEncoder. I will file an issue in the AudioCLIP repo, but I am also filing one here to keep track of things.