ygfrancois opened 1 year ago

Hi, thanks a lot for sharing your solid work; I have learned a lot from your paper and code. I still have a question about the temporal modeling part. I saw that you compared TimeSformer and XCLIP and found that TimeSformer works better, but the XCLIP paper initializes from pretrained CLIP weights, and XCLIP found a trade-off between preserving the performance of the pretrained CLIP weights and adding temporal modeling. Have you tested XCLIP initialized with pretrained CLIP weights? And did you find a way to use both TimeSformer's temporal modeling and CLIP pretrained weights, which I think would beat XCLIP in theory? 😊
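(For reference, one generic way to get that trade-off — keeping the pretrained CLIP behavior exactly at initialization while adding temporal modeling — is a residual temporal attention whose output projection is zero-initialized. This is a sketch of the general technique, not XCLIP's exact architecture; all names below are illustrative.)

```python
import torch
import torch.nn as nn

class ZeroInitTemporalAttention(nn.Module):
    """Residual attention across frames; identity function at initialization."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)  # residual branch contributes nothing at init,
        nn.init.zeros_(self.proj.bias)    # so the model starts as frame-level CLIP

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) frame embeddings from the CLIP visual encoder
        h = self.norm(x)
        h, _ = self.attn(h, h, h)   # attention along the time axis
        return x + self.proj(h)     # identity at init, temporal mixing after training
```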
Hi @ygfrancois , thanks for your questions. Yes, I agree with you. We did try to initialize the visual encoder and text encoder with CLIP's pretrained weights, but we ran into an engineering problem that we haven't solved yet.
The original CLIP model converts its weights to FP16 (https://github.com/openai/CLIP/blob/main/clip/model.py#L434). If we keep this line for pretraining and disable mixed-precision training, the loss becomes NaN at some point. If we remove it and train with FP32 or mixed precision, performance on the video retrieval task is very poor.
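For concreteness, here is a minimal sketch of what "remove it and train with mixed precision" looks like, assuming PyTorch's `torch.cuda.amp` and the `openai/CLIP` package; the dataloader, learning rate, and contrastive loss below are placeholders, not our actual pretraining setup:

```python
import torch
import torch.nn.functional as F
import clip

# Load CLIP without JIT; on GPU, CLIP ships its weights in FP16.
model, _ = clip.load("ViT-B/32", device="cuda", jit=False)
model.float()  # undo the FP16 conversion; keep FP32 master weights
# from clip.model import convert_weights; convert_weights(model)  # <- the line in question

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
scaler = torch.cuda.amp.GradScaler()  # loss scaling avoids FP16 gradient underflow

for images, texts in loader:  # placeholder dataloader; texts pre-tokenized with clip.tokenize
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # FP16 compute on top of FP32 weights
        logits_per_image, logits_per_text = model(images.cuda(), texts.cuda())
        labels = torch.arange(images.size(0), device="cuda")
        loss = (F.cross_entropy(logits_per_image, labels)
                + F.cross_entropy(logits_per_text, labels)) / 2
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```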
To verify that `convert_weights(model)` is the problem, we also tried finetuning CLIP with CLIP4Clip's codebase with and without this line: performance with FP32/mixed precision is ~3% worse than with FP16 on the MSR-VTT dataset. We also posted this issue on CLIP4Clip's GitHub (https://github.com/ArrowLuo/CLIP4Clip/issues/96), but there has been no response yet.
However, this does not seem to be an issue for video classification work like XCLIP, which removed this line.
Please let me know if you have any thoughts!
FP16 and FP32 can behave quite differently when a temperature is applied (the logit scale before the softmax). Maybe check the difference between your temperature setting and pretrained CLIP's?
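As a concrete check — a sketch assuming the `openai/CLIP` package; the clamp value follows the CLIP paper, which caps the logit scale at 100 during training — the temperature lives in `model.logit_scale` as a learnable log-scale:

```python
import math
import torch
import clip

model, _ = clip.load("ViT-B/32", jit=False)

# CLIP stores the temperature as a learnable log-scale, initialized to
# log(1/0.07); in the released checkpoints it has saturated near log(100).
print("logit_scale (log):", model.logit_scale.item())
print("effective scale  :", model.logit_scale.exp().item())

# CLIP clips the scale so logits are never multiplied by more than 100;
# a diverging or much larger scale is a common source of FP16/FP32 gaps.
with torch.no_grad():
    model.logit_scale.clamp_(max=math.log(100))
```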