microsoft / VideoX

VideoX: a collection of video cross-modal models

Adding X-CLIP to HuggingFace Transformers #61

Closed NielsRogge closed 2 years ago

NielsRogge commented 2 years ago

Hi,

I've implemented X-CLIP as a fork of 🤗 HuggingFace Transformers, and we are planning to add it to the library soon (see https://github.com/huggingface/transformers/pull/18852). Here's a notebook that illustrates inference with it: https://colab.research.google.com/drive/1upFMg-FPNP_D8dxeYWTju6lpYldZk8AJ?usp=sharing
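In a nutshell, inference looks like this (a minimal sketch along the lines of the notebook; the dummy frames and label texts are just placeholders, and the repo id assumes the checkpoint lives under the microsoft organisation as discussed below):

```python
import numpy as np
from transformers import XCLIPProcessor, XCLIPModel

processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch32")
model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch32")

# Dummy clip: this checkpoint expects 8 frames; in practice, sample 8 frames
# evenly from a real video instead of using random noise.
video = list(np.random.randint(0, 256, size=(8, 224, 224, 3), dtype=np.uint8))

inputs = processor(
    text=["playing sports", "eating spaghetti"],
    videos=video,
    return_tensors="pt",
    padding=True,
)
outputs = model(**inputs)
probs = outputs.logits_per_video.softmax(dim=1)  # similarity of the clip to each text
print(probs)
```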

I really like the simplicity of X-CLIP, which is the main reason I decided to add it :)

As you may or may not know, each model on the HuggingFace hub has its own git repository. For example, the xclip-base-patch32 checkpoint can be found here. If you check the "files and versions" tab, you can find the converted weights of the model. The model hub uses git-LFS (Large File Storage) so that Git can handle large files such as model weights. This means that every model has its own Git commit history!
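Concretely, you can pull a whole model repo, including the LFS-tracked weights, with the huggingface_hub library (a minimal sketch; the repo id again assumes the checkpoint under the microsoft organisation):

```python
from huggingface_hub import snapshot_download

# Each model is a Git repo; this fetches a snapshot of it, including the
# LFS-tracked weight files visible under "files and versions".
local_dir = snapshot_download(repo_id="microsoft/xclip-base-patch32")
print(local_dir)  # local path containing config.json, processor files, weights, ...
```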

A model card can also be added to the repo, which is just a README.

If you haven't done so already, would you be interested in joining the Microsoft organisation on the hub, so that we can store all model checkpoints there (rather than under my username)? This also gives you (and your co-authors) write access to the X-CLIP models on the hub, so you can edit the model cards, add new models, etc.

Let me know!

Kind regards,

Niels
ML Engineer @ HuggingFace

penghouwen commented 2 years ago

Hi Niels,

Thanks for your interest in our X-CLIP work. We're happy to see it integrated into HuggingFace! Really appreciate your kind support!

Putting the models into the Microsoft organization is fine with us. Thank you!

If there are any questions or you need any support, please feel free to ping me.

Best Regards,
Houwen Peng, Researcher @ Microsoft Research

NielsRogge commented 2 years ago

X-CLIP is now available: https://huggingface.co/docs/transformers/main/en/model_doc/xclip.

All model checkpoints are on the hub: https://huggingface.co/models?other=xclip
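A quick sketch for listing them programmatically with a recent version of huggingface_hub (the "xclip" tag should match the ?other=xclip filter in the URL above):

```python
from huggingface_hub import list_models

# Enumerate the X-CLIP checkpoints on the hub by their "xclip" tag.
for model in list_models(filter="xclip"):
    print(model.id)
```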

It would be nice to mention this in the main README so people know about it ;)

zyhzyh88 commented 1 year ago

Dear author: Thanks for your promising work. We followed your code to run the zero-shot evaluation on UCF-101, where the test set has only one category. However, as training progresses, the test performance gradually decreases from the 1st epoch (attached is our training log). We would like to seek your help. Thank you!

kan-bayashi commented 1 year ago

@NielsRogge Hi, thank you for developing useful code in Hugging Face. It is very helpful. I think the .bin checkpoints were converted from the .pth files in this repo. Could you share the code or a snippet for the conversion?

NielsRogge commented 1 year ago

Hi,

The conversion script can be found here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/x_clip/convert_x_clip_original_pytorch_to_hf.py

kan-bayashi commented 1 year ago

Thank you so much for the prompt reply, @NielsRogge!

MengHao666 commented 1 year ago

When I use the inference code with another model, "microsoft/xclip-base-patch16-hmdb-16-shot", I get the following error:

```
Traceback (most recent call last):
  File "E:\PycharmProjects\VideoX\X-CLIP\demo.py", line 56, in <module>
    outputs = model(**inputs)
  File "C:\ProgramData\Anaconda3\envs\open-mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\open-mmlab\lib\site-packages\transformers\models\x_clip\modeling_x_clip.py", line 1573, in forward
    vision_outputs = self.vision_model(
  File "C:\ProgramData\Anaconda3\envs\open-mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\open-mmlab\lib\site-packages\transformers\models\x_clip\modeling_x_clip.py", line 1018, in forward
    encoder_outputs = self.encoder(
  File "C:\ProgramData\Anaconda3\envs\open-mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\open-mmlab\lib\site-packages\transformers\models\x_clip\modeling_x_clip.py", line 959, in forward
    layer_outputs = encoder_layer(
  File "C:\ProgramData\Anaconda3\envs\open-mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\open-mmlab\lib\site-packages\transformers\models\x_clip\modeling_x_clip.py", line 441, in forward
    msg_token = msg_token.view(batch_size, self.num_frames, hidden_size)
RuntimeError: shape '[0, 32, 768]' is invalid for input of size 6144
```

I use a customized video which is 1000×1000. I don't know the reason; could you give some help? Thanks
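A likely cause, inferred from the shapes in the traceback rather than confirmed by the maintainers: the input size 6144 equals 8 × 768, i.e. 8 frames were passed, while the reshape uses self.num_frames = 32, so batch_size comes out as 8 // 32 = 0. The hmdb-16-shot checkpoint expects 32 frames per clip, and the processor resizes frames anyway, so the 1000×1000 resolution should not be the problem. A sketch with a matching frame count (dummy frames and label text as placeholders):

```python
import numpy as np
from transformers import XCLIPProcessor, XCLIPModel

model_name = "microsoft/xclip-base-patch16-hmdb-16-shot"
processor = XCLIPProcessor.from_pretrained(model_name)
model = XCLIPModel.from_pretrained(model_name)

# 32 frames, matching model.config.vision_config.num_frames for this checkpoint;
# the processor handles resizing, so the source resolution does not matter.
frames = list(np.random.randint(0, 256, size=(32, 224, 224, 3), dtype=np.uint8))

inputs = processor(text=["brush hair"], videos=frames, return_tensors="pt", padding=True)
outputs = model(**inputs)  # no shape error once the frame count matches the model
```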