[New model] ImageBind: One Embedding Space To Bind Them All

xenova commented 1 year ago

Model description

As stated in their blog post,

"[ImageBind is] the first AI model capable of binding information from six modalities. The model learns a single embedding, or shared representation space, not just for text, image/video, and audio, but also for sensors that record depth (3D), thermal (infrared radiation), and inertial measurement units (IMU), which calculate motion and position."
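To make the "shared representation space" idea concrete, here is a minimal sketch of cross-modal retrieval by cosine similarity. It does not use the ImageBind code or weights: the 3-dimensional vectors, the file names, and the `retrieve` helper are all hypothetical stand-ins for embeddings that a model of this kind might produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, candidates):
    """Return the candidate name whose embedding is closest to the query.

    Hypothetical helper for illustration; not part of any ImageBind API.
    """
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))


# Toy, hand-written vectors (assumption): in practice these would come from
# the audio and image encoders of a model trained into one embedding space.
audio_query = [0.9, 0.1, 0.0]  # stands in for a "dog barking" audio clip
image_embeddings = {
    "dog.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.1, 0.9, 0.2],
    "bird.jpg": [0.0, 0.2, 0.9],
}

best_match = retrieve(audio_query, image_embeddings)
print(best_match)
```

Because every modality maps into the same space, the same comparison would apply to any pair of modalities (for example text to depth), with no pair-specific model.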

Open source status

[X] The model implementation is available
[X] The model weights are available

Provide useful links for the implementation

GitHub repo: https://github.com/facebookresearch/ImageBind
Paper: https://facebookresearch.github.io/ImageBind/paper
Blog: https://ai.facebook.com/blog/imagebind-six-modalities-binding-ai/
Demo: https://imagebind.metademolab.com/
Video: https://dl.fbaipublicfiles.com/imagebind/imagebind_video.mp4
Weights: https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth (currently only 1 that I can see)

shehanmunasinghe commented 1 year ago

Hi @xenova, I would like to work on implementing this model.

xenova commented 1 year ago

> Hi @xenova, I would like to work on implementing this model.

Sweet!

dg845 commented 1 year ago

Hi, since the PR for this model (#23284) appears to have been closed, I would be interested in opening a new PR to implement the ImageBind model :)

dg845 commented 1 year ago

I have opened a new PR to implement the ImageBind model: #26310.

huggingface / transformers

[New model] ImageBind: One Embedding Space To Bind Them All #23240
