Open HellwayXue opened 1 year ago
Cool model, I've contributed X-CLIP in the past: https://huggingface.co/docs/transformers/model_doc/xclip which is an extension of CLIP for video-language pretraining. Looks like CLIP-ViP focuses more on retrieval.
Looks like a great candidate for a first model contribution as the implementation is already in HF format.
Hi, I'd like to help out getting this model integrated.
I have a general question about unit testing. The implementation guidelines indicate that the HF implementation should align with the reference model to a tolerance of .001, but I don't see that tested in all models.
In my PR I'll include an integration test analogous to clip's test.
But I've noticed some model's don't seem to do this kind of integration test. (For example, I don't see an analogous test in gptneo
Out of curiosity, why do some models not have these kinds of integration tests?
Yes, ideally GPT-Neo also has integration tests that test exact logit values. However you can see here that expected output IDs and generated texts are tested.
But in any case it's always best to have an expected slice of the logits in the integration test.
Thanks for the clarification!
I have another question, the reference implementation reuses CLIPConfig, CLIPTextConfig, and CLIPVisionConfig directly.
Can we reuse them (via importing) in the PR directly as well? Or should we copy-paste these files and comment "Copied from transformers.clip..." at the top?
In that case, you can copy the classes, call them CLIPVipConfig
, CLIPVipTextConfig
, etc. and add Copied from
on top of them. If you then run make fix-copies
from the root of the repository, all files will automatically be copied to make sure they stay consistent.
Note that you can copy entire classes, like so:
# Copied from transformers.models.clip.configuration_clip.CLIPConfig
class CLIPVipConfig(...)
but also place them on top of methods, in case only a method is the same but the class is different:
class CLIPVipConfig(...)
# Copied from transformers.models.clip.configuration_clip.CLIPConfig.__init__
def __init__(config):
Got it, thanks for the quick response!
Any updates here? Looking for a good video embedding model!
Any news on this?
I had a PR I got started on but I'll be too busy to get it merged. When I last worked on it I think I was pretty close, but I had some CLIP docs I still had to update to be specific to CLIPViP.
But it has been long enough that the code would require major reworking to work with the changes to transformers
Is anyone working on this? Looks like there hasn't been any activity in the branch recently...
@imran-4 I don't believe anyone is currently working on this. In terms of prioritisation, we will review the first PR ready for review first, rather than comments on the issue. As #23802 is closed with inactivity, anyone in the community is free to pick this up
Model description
CLIP-ViP is a video-language model which is based on a pre-trained image-text model CLIP then further pre-trained (post-pretraining) on a large-scale video-text dataset HD-VILA-100M. This work is accepted by ICLR 2023.
Open source status
Provide useful links for the implementation
The official implementation This repo has model implementation and pre-trained weights. @hellwayxue