hkchengrex / Cutie

[CVPR 2024 Highlight] Putting the Object Back Into Video Object Segmentation
https://hkchengrex.com/Cutie/
MIT License
579 stars 60 forks

Can Cutie be extended to real-time stream inference? #39

Closed CuriousTank closed 6 months ago

CuriousTank commented 6 months ago

Thank you for such an excellent project. As a newcomer to the field of video segmentation, I have two questions I'd like to ask you:

  1. From a design perspective, can Cutie be extended to real-time stream inference?
  2. Does the video input require setting the number of objects to be tracked in advance?

I would greatly appreciate an answer. Thanks!

CuriousTank commented 6 months ago

I think the answer to my second question is that it is necessary. Interestingly, I found that even without clicking on an object to be segmented, as long as the number of objects is set, automatic segmentation can still be achieved.

hkchengrex commented 6 months ago
  1. Yes. In the evaluation script, we loop over the frames and generate the corresponding segmentation. That loop can be replaced with a camera feed input (see the sketch after this list).
  2. It is not necessary. For example, in YouTubeVOS evaluation, we can add new objects that do not appear on the first frame. Although we do not need to know the number of objects in advance, we do need a mechanism to tell which object is new and which is not. This is given in YouTubeVOS, or has to be inferred automatically. We have a project DEVA that specifically deals with the automatic setting but those functionalities are not implemented in this repo.
  3. That is kind of a side effect of us not training on "empty" sequences. The model has never seen a memory bank with no object before and its behavior under that setting happens to be close to the "automatic" segmentation you described.
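A minimal sketch of what "replacing the loop with a camera feed" could look like. This is not code from the repo: the module paths, `get_default_model`, and the `InferenceCore.step()` interface are assumed to follow the repo's scripting example, and `get_initial_mask` is a hypothetical helper standing in for whatever supplies the first-frame annotation (user clicks, an external detector, etc.).

```python
# Sketch: live webcam inference with Cutie (assumed API, verify against the repo)
import cv2
import torch
from torchvision.transforms.functional import to_tensor

from cutie.inference.inference_core import InferenceCore      # assumed module path
from cutie.utils.get_default_model import get_default_model   # assumed helper

cutie = get_default_model()
processor = InferenceCore(cutie, cfg=cutie.cfg)

cap = cv2.VideoCapture(0)   # camera feed instead of a dataset loader
first_frame = True

with torch.inference_mode():
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        frame = to_tensor(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)).cuda()

        if first_frame:
            # An initial mask tells Cutie which objects to track;
            # get_initial_mask() is a placeholder for clicks / a detector.
            init_mask = get_initial_mask(frame)        # hypothetical helper
            objects = init_mask.unique().tolist()
            objects.remove(0)                          # drop the background id
            prob = processor.step(frame, init_mask.cuda(), objects=objects)
            first_frame = False
        else:
            prob = processor.step(frame)               # propagate from memory

        # prob holds per-object probabilities; take an argmax for display.
        mask = torch.argmax(prob, dim=0).cpu().numpy().astype('uint8')
        cv2.imshow('cutie', mask * 40)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

cap.release()
```

Adding a new object mid-stream would presumably follow the same pattern as the first frame: on the frame where the object first appears, pass a mask that includes the new object id to `step()`; again, check the repo's evaluation code for the exact call.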
CuriousTank commented 6 months ago

Thank you very much!