A Generalist Framework for Panoptic Segmentation of Images and Videos

345ishaan commented 2 years ago

Model/Pipeline/Scheduler description

This work (https://arxiv.org/pdf/2210.06366.pdf) shows how advances in diffusion modelling can be applied to generate panoptic masks for images and videos, conditioned on an input image. Most diffusion modelling work parametrizes the noise and the output in a continuous space. To solve the panoptic task, the authors bring in the concept of analog bits, which lets them keep the same continuous parameterization while still outputting discrete instance labels per pixel.
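To make the analog bits idea concrete, here is a minimal sketch of how an integer label can be written as real-valued bits and recovered by thresholding. It is not the authors' implementation: the function names, the 8-bit width, the `scale` parameter, and the toy 2x2 mask are all assumptions for illustration, and the bounded random noise only stands in for the imperfect continuous output of a diffusion model.

```python
import random


def int_to_analog_bits(label, num_bits=8, scale=1.0):
    """Encode a non-negative integer label as real-valued bits in {-scale, +scale}.

    Hypothetical helper: most significant bit first, bit 1 -> +scale, bit 0 -> -scale.
    """
    if not 0 <= label < 2 ** num_bits:
        raise ValueError("label does not fit in num_bits")
    return [
        scale if (label >> shift) & 1 else -scale
        for shift in range(num_bits - 1, -1, -1)
    ]


def analog_bits_to_int(analog_bits):
    """Decode real-valued bits back to an integer by thresholding each value at 0."""
    label = 0
    for value in analog_bits:
        label = (label << 1) | (1 if value > 0 else 0)
    return label


# Toy 2x2 "panoptic mask" of instance ids (assumed data, for illustration only).
mask = [[0, 1], [2, 255]]
encoded = [[int_to_analog_bits(pixel) for pixel in row] for row in mask]

# Perturb the continuous values; the noise is bounded by 0.5, so no sign flips.
rng = random.Random(0)
noisy = [
    [[value + rng.uniform(-0.5, 0.5) for value in bits] for bits in row]
    for row in encoded
]

decoded = [[analog_bits_to_int(bits) for bits in row] for row in noisy]
assert decoded == mask  # thresholding recovers the discrete labels
```

The point of the sketch is that the model only ever has to produce continuous values, and the discrete label per pixel falls out of a sign check at the end.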

The authors have also built this work on their previous approach, which models object detection as a token generation task (https://ai.googleblog.com/2022/04/pix2seq-new-language-interface-for.html).

All in all, this is very cool work. I feel that, when grounded with other modalities, it has the potential to give better few-shot performance on perception tasks. It would be nice to see whether we can leverage features from HF to reproduce this work and, if possible, to present results on recent AV datasets like nuScenes or WOD.

Open source status

[ ] The model implementation is available
[ ] The model weights are available (Only relevant if addition is not a scheduler).

Provide useful links for the implementation

No response

patrickvonplaten commented 2 years ago

Hey @345ishaan,

Thanks a lot for this new model description (adding a label now). Do you know if the authors released the weights by any chance?

345ishaan commented 2 years ago

> Hey @345ishaan,
>
> Thanks a lot for this new model description (adding a label now). Do you know if the authors released the weights by any chance?

I have not been able to find the authors' implementation yet. The code and model for pix2seq, which is used as the pretrained model, are available though.

huggingface / diffusers