hustvl / ViTMatte

[Information Fusion (Vol.103, Mar. '24)] Boosting Image Matting with Pretrained Plain Vision Transformers
MIT License

Added GRU to achieve video consistency #14

Open Jerry-Master opened 1 year ago

Jerry-Master commented 1 year ago

First of all, your work is amazing!! I just want to make it clear that I absolutely love this result together with Matte Anything. However, the main problem for real applications of this type of model is temporal inconsistency: since the model is applied image-wise, it is impossible to achieve temporal consistency for videos. This pull request is an attempt to include the features that made RobustVideoMatting temporally consistent, so that you can easily retrain and check whether this solves the temporal inconsistency problem.

The main change is the addition of convolutional GRUs in the detail capture module. To make it possible to reuse already trained models, I add the ConvGRU layers the way ControlNet does: they are initialized at zero and attached through residual connections. This way, a hidden state can be shared across frames and the model can achieve temporal consistency. That alone is not enough, though, so I have also added a loss function that explicitly guides the model toward temporal consistency.
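
Roughly, the zero-initialized residual ConvGRU block and the extra loss look like this. This is a simplified PyTorch sketch, not the exact code in the PR; names like `ConvGRUCell`, `ResidualConvGRU`, and `temporal_consistency_loss` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell (RobustVideoMatting-style)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # Reset and update gates computed jointly from [x, h].
        self.ih = nn.Conv2d(channels * 2, channels * 2, kernel_size, padding=padding)
        # Candidate hidden state from [x, r * h].
        self.hh = nn.Conv2d(channels * 2, channels, kernel_size, padding=padding)

    def forward(self, x, h):
        r, z = self.ih(torch.cat([x, h], dim=1)).chunk(2, dim=1)
        r, z = torch.sigmoid(r), torch.sigmoid(z)
        c = torch.tanh(self.hh(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * c

class ResidualConvGRU(nn.Module):
    """ConvGRU added as a zero-initialized residual branch (ControlNet-style),
    so the pretrained image model is unchanged at initialization."""
    def __init__(self, channels):
        super().__init__()
        self.gru = ConvGRUCell(channels)
        self.out = nn.Conv2d(channels, channels, 1)
        nn.init.zeros_(self.out.weight)  # zero init: branch contributes nothing at the start
        nn.init.zeros_(self.out.bias)

    def forward(self, x, h=None):
        if h is None:
            h = torch.zeros_like(x)
        h = self.gru(x, h)
        return x + self.out(h), h  # residual connection + hidden state for the next frame

def temporal_consistency_loss(pred, true):
    """L2 on frame-to-frame differences of predicted vs. ground-truth alpha.
    pred, true: (B, T, 1, H, W) alpha mattes."""
    d_pred = pred[:, 1:] - pred[:, :-1]
    d_true = true[:, 1:] - true[:, :-1]
    return F.mse_loss(d_pred, d_true)
```

Because the residual branch starts at zero, loading an existing image-matting checkpoint still reproduces its original per-frame outputs before any video fine-tuning.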

All the code is more or less recycled from the RobustVideoMatting repository. To avoid breaking anything, I have duplicated the affected files and added a '_video' suffix. The code is meant to be backward compatible, except that it works with 5D tensors instead of 4D. I tried to integrate it as much as possible so that you can try this idea quickly. However, I am aware that the difficult part of managing the data is not included in this pull request: you would need to download the RobustVideoMatting dataset and train on it.
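
For context, consuming the 5D input amounts to looping over time while carrying the GRU hidden state. This is hypothetical driver code; `model_per_frame` and `gru_block` stand in for the actual modules in the PR.

```python
import torch

def run_video(model_per_frame, gru_block, frames):
    """Apply an image-wise model frame by frame, carrying the ConvGRU hidden
    state across time. frames: (B, T, C, H, W) video tensor."""
    B, T = frames.shape[:2]
    outputs, hidden = [], None
    for t in range(T):
        feat = model_per_frame(frames[:, t])    # (B, C', H', W') per-frame features
        feat, hidden = gru_block(feat, hidden)  # recurrent refinement, state carried over
        outputs.append(feat)
    return torch.stack(outputs, dim=1)          # back to (B, T, C', H', W')
```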

I will be more than glad to help with any questions or to contribute further if you give me some direction on the hardware and environment you use. I really want this model to have temporal consistency so that it can be used in real-world applications.

skyler14 commented 1 year ago

I was wondering if you went ahead and trained a model with this, or if anything further came of it?