huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add X-Decoder Model #22003

Open ChanBong opened 1 year ago

ChanBong commented 1 year ago

Model description

X-Decoder is a generalized decoding pipeline that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks.

The model exhibits strong transferability to a wide range of downstream tasks in both zero-shot and fine-tuning settings, achieving state-of-the-art open-vocabulary segmentation and referring segmentation on 10 settings of 7 datasets. It should be a valuable addition to the transformers library.

Open source status

Provide useful links for the implementation

Paper: https://arxiv.org/pdf/2212.11270.pdf
Code: https://github.com/microsoft/X-Decoder
Weights: https://huggingface.co/spaces/xdecoder/Demo/blob/main/xdecoder_focalt_last.pt

Author: @eltociear Cc: @NielsRogge @alaradirik

alaradirik commented 1 year ago

Hi @ChanBong, thanks for opening the issue!

You can expect to see an X-Decoder PR in the next two weeks :)

atharvakavitkar commented 1 year ago

Hi @alaradirik, can we please collaborate in adding this model?

alaradirik commented 1 year ago

Hi @atharvakavitkar, the PR is almost done but won't include the referring image editing task, which requires integration with Stable Diffusion inpainting. Perhaps you could create a tutorial or demo for this task?

atharvakavitkar commented 1 year ago

Hi @alaradirik, thank you for reaching out to me. I must admit that I have not yet added a model to HuggingFace. But I really want to learn how to do it. Would creating this tutorial be the right step? Or should I search for a simpler model to implement from scratch?