huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add CLIP-ViP #22829

Open HellwayXue opened 1 year ago

HellwayXue commented 1 year ago

Model description

CLIP-ViP is a video-language model based on the pre-trained image-text model CLIP, which is then further pre-trained (post-pretraining) on the large-scale video-text dataset HD-VILA-100M. The work was accepted at ICLR 2023.
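
As a rough illustration of the retrieval setup (a minimal sketch with placeholder shapes and names, not the reference implementation's API): the model embeds videos and captions into a shared space, and retrieval ranks candidates by cosine similarity.

import torch

# Placeholder embeddings standing in for the video and text encoder outputs;
# the batch sizes and embedding dimension are arbitrary for illustration.
video_embeds = torch.randn(4, 512)   # (num_videos, embed_dim)
text_embeds = torch.randn(8, 512)    # (num_captions, embed_dim)

# Normalize so the dot product becomes a cosine similarity, as in CLIP.
video_embeds = video_embeds / video_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

# (num_videos, num_captions) similarity matrix; the best caption for each
# video is the argmax along the caption axis.
similarity = video_embeds @ text_embeds.T
best_caption_per_video = similarity.argmax(dim=-1)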

Open source status

Provide useful links for the implementation

The official implementation: this repo contains the model implementation and pre-trained weights. @hellwayxue

NielsRogge commented 1 year ago

Cool model! I've contributed X-CLIP in the past (https://huggingface.co/docs/transformers/model_doc/xclip), which is an extension of CLIP for video-language pretraining. CLIP-ViP looks like it focuses more on retrieval.

Looks like a great candidate for a first model contribution as the implementation is already in HF format.

tensorpro commented 1 year ago

Hi, I'd like to help out getting this model integrated.

tensorpro commented 1 year ago

I have a general question about unit testing. The implementation guidelines indicate that the HF implementation should match the reference model to within a tolerance of 0.001, but I don't see that tested in all models.

In my PR I'll include an integration test analogous to CLIP's.

But I've noticed that some models don't seem to have this kind of integration test (for example, I don't see an analogous test in gptneo).

Out of curiosity, why do some models not have these kinds of integration tests?

NielsRogge commented 1 year ago

Yes, ideally GPT-Neo would also have integration tests that check exact logit values. However, you can see here that expected output IDs and generated texts are tested.

But in any case, it's always best to include an expected slice of the logits in the integration test.
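
For illustration, a minimal sketch of what such a check could look like (the class name, output attribute, and expected values below are placeholders, not taken from the reference implementation):

import torch
import unittest


class CLIPVipModelIntegrationTest(unittest.TestCase):  # hypothetical name
    def test_inference_logits(self):
        # `model` and `inputs` would be built from the released checkpoint and a
        # fixed video/text pair, as in CLIP's integration test.
        model = ...
        inputs = ...
        with torch.no_grad():
            outputs = model(**inputs)
        # Placeholder values: the real test hard-codes a slice of the logits produced
        # by the reference implementation and checks it to a 1e-3 tolerance.
        expected_slice = torch.tensor([[0.1234, -0.5678, 0.9012]])
        self.assertTrue(torch.allclose(outputs.logits_per_video[:1, :3], expected_slice, atol=1e-3))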

tensorpro commented 1 year ago

Thanks for the clarification!

I have another question: the reference implementation reuses CLIPConfig, CLIPTextConfig, and CLIPVisionConfig directly.

Can we reuse them (via importing) in the PR directly as well? Or should we copy-paste these files and comment "Copied from transformers.clip..." at the top?

NielsRogge commented 1 year ago

In that case, you can copy the classes, call them CLIPVipConfig, CLIPVipTextConfig, etc. and add a Copied from comment on top of them. If you then run make fix-copies from the root of the repository, the code is automatically re-copied so the files stay consistent.

Note that you can place the comment on top of an entire class, like so:

# Copied from transformers.models.clip.configuration_clip.CLIPConfig
class CLIPVipConfig(...):
    ...

but you can also place it on top of individual methods, in case only the method is the same while the class differs:

class CLIPVipConfig(...):

    # Copied from transformers.models.clip.configuration_clip.CLIPConfig.__init__
    def __init__(self, config):
        ...
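
A related sketch (assuming the copy checker's rename syntax, which is worth double-checking): the Copied from comment can also carry a rename suffix, so the copied body stays consistent without manual renaming:

from transformers import PretrainedConfig

# Copied from transformers.models.clip.configuration_clip.CLIPTextConfig with CLIP->CLIPVip
class CLIPVipTextConfig(PretrainedConfig):
    ...

Running make fix-copies would then re-copy the body of CLIPTextConfig and apply the CLIP->CLIPVip substitution automatically.
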
tensorpro commented 1 year ago

Got it, thanks for the quick response!

AmanKishore commented 9 months ago

Any updates here? Looking for a good video embedding model!

da2r-20 commented 4 months ago

Any news on this?

tensorpro commented 4 months ago

I had started a PR but I'll be too busy to get it merged. When I last worked on it I think I was pretty close; I still had some CLIP docs to update to be specific to CLIP-ViP.

But it has been long enough that the code would need major reworking to keep up with the changes to transformers.

muhammad-i-imran commented 3 months ago

Is anyone working on this? Looks like there hasn't been any activity in the branch recently...

amyeroberts commented 3 months ago

@imran-4 I don't believe anyone is currently working on this. In terms of prioritisation, we will review whichever PR is ready for review first, rather than comments on the issue. As #23802 was closed due to inactivity, anyone in the community is free to pick this up.