jina-ai / executors


fix(audio-clip-encoder): remove .eval(), which messes with the model outputs #315

Closed · samsja closed this 2 years ago

samsja commented 2 years ago

Related to this issue in Jina core: https://github.com/jina-ai/jina/issues/4163

We (@JohannesMessner and I) removed the .eval() method call, as we saw that it breaks the model outputs, and we had to make a small change to the encoding to get it to work.

The author does not call this method in his demo: https://github.com/AndreyGuzhov/AudioCLIP/blob/master/demo/AudioCLIP.ipynb

cristianmtr commented 2 years ago

Blocking this PR, because this does not make much sense.

Prediction should not be affected by batch normalization.

We had the same problem in the fashion search workstream in the beginning. See here. Is this related? When in evaluation (prediction without learning) mode you don't want BatchNorm, right?
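
To make the train/eval distinction concrete, here is a minimal standalone PyTorch sketch (illustrative only, not the executor code): in train mode a BatchNorm layer normalizes with batch statistics, so an item's output depends on its batch companions; in eval mode it uses the stored running statistics, so it doesn't.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)

x = torch.randn(1, 4)      # the item we care about
noise = torch.randn(3, 4)  # arbitrary batch companions

# train mode (the default): normalization uses batch statistics,
# so x's output changes with its companions
bn.train()
out_a = bn(torch.cat([x, noise]))[0]
out_b = bn(torch.cat([x, 10 * noise]))[0]
print(torch.allclose(out_a, out_b))  # False

# eval mode: normalization uses the stored running statistics,
# so x's output is independent of its companions
bn.eval()
out_a = bn(torch.cat([x, noise]))[0]
out_b = bn(torch.cat([x, 10 * noise]))[0]
print(torch.allclose(out_a, out_b))  # True
```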

JoanFM commented 2 years ago

> Blocking this PR, because this does not make much sense. Prediction should not be affected by batch normalization.
>
> We had the same problem in the fashion search workstream in the beginning. See here. Is this related? When in evaluation (prediction without learning) mode you don't want BatchNorm, right?

Yes, the point here is that we are suggesting that using eval is the source of the problem.

If it is true that the paper author does not use eval in their notebook, we should ask on their GitHub repo why.

After all, at inference time, batched inference should give the same vectors as encoding one by one.
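
A quick way to check that expectation would be a sketch along these lines (the model interface here is an illustrative assumption, not the executor's actual API):

```python
import torch

def embeds_consistently(model, batch, atol=1e-6):
    """Check that encoding a batch gives the same vectors
    as encoding each item one by one."""
    with torch.no_grad():
        batched = model(batch)
        one_by_one = torch.cat([model(item.unsqueeze(0)) for item in batch])
    return torch.allclose(batched, one_by_one, atol=atol)
```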

JohannesMessner commented 2 years ago

> Blocking this PR, because this does not make much sense. Prediction should not be affected by batch normalization.
>
> We had the same problem in the fashion search workstream in the beginning. See here. Is this related? When in evaluation (prediction without learning) mode you don't want BatchNorm, right?
>
> Yes, the point here is that we are suggesting that using eval is the source of the problem.
>
> If it is true that the paper author does not use eval in their notebook, we should ask on their GitHub repo why.
>
> After all, at inference time, batched inference should give the same vectors as encoding one by one.

We don't really understand what is going on in .eval(), but we don't think it is just batch norm; it seems that calling it completely changes the behaviour of forward(). The author doesn't seem to implement eval() himself, so it probably has something to do with one of the models nested inside his model. We can ask him. By the way, the implementation in general seems a bit dodgy, so we don't have a great feeling about it even if we fix this particular problem.
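
For what it's worth, .eval() itself only sets self.training = False recursively on every submodule; it is each submodule's forward() that decides what to do with the flag. A small helper like this (hypothetical, not code from this repo) lists the standard layers known to branch on that flag, which narrows down the suspects:

```python
import torch.nn as nn

# standard layers whose forward() branches on self.training
MODE_SENSITIVE = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d,
                  nn.Dropout, nn.Dropout2d, nn.Dropout3d)

def find_mode_sensitive_modules(model: nn.Module):
    # named_modules() walks the full submodule tree recursively
    return [
        (name, type(module).__name__)
        for name, module in model.named_modules()
        if isinstance(module, MODE_SENSITIVE)
    ]
```

If the outputs change far more than the layers in that list would explain, something else in the model must also be reading self.training.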

JoanFM commented 2 years ago

> Blocking this PR, because this does not make much sense. Prediction should not be affected by batch normalization.
>
> We had the same problem in the fashion search workstream in the beginning. See here. Is this related? When in evaluation (prediction without learning) mode you don't want BatchNorm, right?
>
> Yes, the point here is that we are suggesting that using eval is the source of the problem. If it is true that the paper author does not use eval in their notebook, we should ask on their GitHub repo why. After all, at inference time, batched inference should give the same vectors as encoding one by one.
>
> We don't really understand what is going on in .eval(), but we don't think it is just batch norm; it seems that calling it completely changes the behaviour of forward(). The author doesn't seem to implement eval() himself, so it probably has something to do with one of the models nested inside his model. We can ask him. By the way, the implementation in general seems a bit dodgy, so we don't have a great feeling about it even if we fix this particular problem.

Have we tried to run his notebook where both the audio and the text embedding are done without batching, and with/without eval?

nan-wang commented 2 years ago

I would like to reject this PR, because removing .eval() leads to inconsistent embeddings. This is an upstream issue; let's raise it with the upstream author. Here is a code snippet from @winstonww that shows the bug.

```python
import torch

from model import AudioCLIP  # model definition from the AudioCLIP repo

# load the pretrained checkpoint shipped with the AudioCLIP repo
aclp = AudioCLIP(pretrained='assets/AudioCLIP-Full-Training.pt')

a1 = torch.zeros(1000)
a2 = torch.rand(1000)
a3 = torch.rand(1000)

# a1 appears in both batches, paired with different companions
batch1 = torch.stack([a1, a2])
batch2 = torch.stack([a1, a3])

((batch1_output, _, _), _), _ = aclp(audio=batch1)
((batch2_output, _, _), _), _ = aclp(audio=batch2)

# a1's embedding should be the same regardless of batching,
# but without .eval() this assertion fails
assert batch1_output[0].equal(batch2_output[0])
```

reference: https://jinaai.slack.com/archives/C0183GCN8AK/p1643031021046800?thread_ts=1643019655.039700&cid=C0183GCN8AK

cristianmtr commented 2 years ago

Just FYI, the author doesn't have Issues enabled on their repo.