hkchengrex / Cutie

[CVPR 2024 Highlight] Putting the Object Back Into Video Object Segmentation
https://hkchengrex.com/Cutie/
MIT License

Cutie Outputs Mask of Incorrect Object in Subsequent Frames #74

Closed qidihan closed 2 weeks ago

qidihan commented 1 month ago

Hello,

First, thank you for your amazing work on Cutie!

I am using SAM to generate a mask for an object, and then tracking it with a 640x360 camera at 10 FPS. However, starting from the second frame, Cutie outputs the mask of another, identical-looking object instead of the intended one. Could you please advise on any parameters I can adjust to correct this behavior?

Additionally, how does Cutie's tracking performance compare to DEVA? Is Cutie significantly more capable?

Thank you!


hkchengrex commented 4 weeks ago

Which script are you using? One thing worth trying is to label all three objects (so that there is some "mutual exclusion") instead of just one. In general, tracking similar-looking objects is hard, as we don't have strong positional constraints and don't really have this type of training data, but failing like this as early as the second frame is also quite rare.

Cutie should be better than DEVA in terms of just tracking.

qidihan commented 4 weeks ago

On February 27th, I cloned the code and made one specific modification: I replaced the read-from-mp4 logic with reading frames from a camera. However, upon revisiting my current directory, I could not find the exact version of the code I had previously worked on.

I am wondering if there has been an update to the repository since then that might have altered or replaced the code I was working with. Below is the code that I'm using; I changed frames_to_propagate to 500.


from omegaconf import open_dict
from hydra import compose, initialize
import torch
import numpy as np
from PIL import Image
from cutie.model.cutie import CUTIE
from cutie.inference.inference_core import InferenceCore
from cutie.inference.utils.args_utils import get_dataset_cfg
import os

import cv2
from gui.interactive_utils import image_to_torch, torch_prob_to_numpy_mask, index_numpy_to_one_hot_torch, overlay_davis

initialize(version_base='1.3.2', config_path="cutie/config", job_name="eval_config")
cfg = compose(config_name="eval_config")

with open_dict(cfg):
  cfg['weights'] = './weights/cutie-base-mega.pth'

data_cfg = get_dataset_cfg(cfg)

# Load the network weights
cutie = CUTIE(cfg).cuda().eval()
model_weights = torch.load(cfg.weights)
cutie.load_weights(model_weights)

video_name = 'examples/example2.mp4'
mask_name = 'result/resized_mask.png'
results_dir = 'result/result2'
os.makedirs(results_dir, exist_ok=True)

# sanity check: display the first-frame mask
img = Image.open(mask_name)
img.show()

mask = np.array(Image.open(mask_name))
print(np.unique(mask))
num_objects = len(np.unique(mask)) - 1

device = 'cuda'
torch.cuda.empty_cache()

processor = InferenceCore(cutie, cfg=cfg)
cap = cv2.VideoCapture(video_name)

# You can change these two numbers
frames_to_propagate = 200
visualize_every = 20

current_frame_index = 0

if __name__ == "__main__":

  with torch.inference_mode():
    with torch.cuda.amp.autocast(enabled=True):
      while cap.isOpened():
        # load frame-by-frame
        ret, frame = cap.read()
        if not ret or frame is None or current_frame_index > frames_to_propagate:
          break

        # OpenCV reads frames as BGR; convert to RGB, which is what the
        # rest of the pipeline (and the RGB2BGR conversion at save time) assumes
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

        # convert numpy array to pytorch tensor format
        frame_torch = image_to_torch(frame, device=device)
        if current_frame_index == 0:
          # initialize with the mask
          mask_torch = index_numpy_to_one_hot_torch(mask, num_objects+1).to(device)
          # the background mask is not fed into the model
          prediction = processor.step(frame_torch, mask_torch[1:], idx_mask=False)
        else:
          # propagate only
          prediction = processor.step(frame_torch)

        # argmax, convert to numpy
        prediction = torch_prob_to_numpy_mask(prediction)

        if current_frame_index % visualize_every == 0:
          visualization = overlay_davis(frame, prediction)
          # build the file path for the saved image
          result_img_path = os.path.join(results_dir, f"frame_{current_frame_index:06d}.png")
          # save the image (overlay is RGB; convert back to BGR for OpenCV)
          cv2.imwrite(result_img_path, cv2.cvtColor(visualization, cv2.COLOR_RGB2BGR))
          print("saved once")

        current_frame_index += 1

hkchengrex commented 4 weeks ago

You can check the commit history. Your script might have come from the Colab notebook. Have you tried labeling all three objects instead of just one?

qidihan commented 4 weeks ago

I apologize for the confusion. I'm not quite clear on how the labeling process works. Do you mean segmenting the different objects before they are passed into Cutie, and then feeding all of those masks to it? Also, for my final project I'm aiming to track just one object at a time. I was wondering whether it's possible to adjust the parameters of the propagation or detection algorithm to focus more on the tracked object, making the model less sensitive to other objects. Could you provide some guidance on this, if there are any possibilities?

hkchengrex commented 4 weeks ago

It would help to label other objects as it provides some "mutual exclusion" information to the model. I am not sure how you are obtaining the input masks, but I am indeed suggesting tracking multiple objects at once.

There is no parameter that directly controls the sensitivity as far as I know. The model has been trained to minimize confusion, but like any other model, it can make mistakes. Providing masks for the other objects is one of the ways to help it avoid these mistakes.
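
For example, here is a minimal sketch of the first-frame initialization with multiple objects. It assumes SAM gives you one binary (H, W) numpy mask per box prompt; the sam_masks list and the mask_* names are hypothetical, while frame_torch, processor, device, and index_numpy_to_one_hot_torch are the same as in your script:

import numpy as np

# Hypothetical: one binary mask per object from SAM, e.g. from three
# separate box prompts. Give the grasp target a fixed ID (here, 1).
sam_masks = [mask_grasp_target, mask_distractor_1, mask_distractor_2]

# Combine into a single indexed mask: 0 = background, 1..N = objects.
# Where masks overlap, later objects overwrite earlier ones.
indexed_mask = np.zeros(sam_masks[0].shape, dtype=np.int64)
for obj_id, m in enumerate(sam_masks, start=1):
    indexed_mask[m > 0] = obj_id

num_objects = len(sam_masks)
mask_torch = index_numpy_to_one_hot_torch(indexed_mask, num_objects + 1).to(device)
# as in your script, the background channel is not fed to the model
prediction = processor.step(frame_torch, mask_torch[1:], idx_mask=False)

The propagation loop stays exactly the same; only the first-frame mask changes.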

qidihan commented 4 weeks ago

Thank you for your prompt response. I have been manually using the bounding-box prompt in SAM for my project, which aims to enable a robot to grasp an object that I select. As a result, the focus is on tracking only one object at a time.

I have been testing the situation we previously discussed. Currently, I am unable to reproduce the issue; it seems the problem may be occurring randomly.

I will continue to monitor it closely. Thanks for your support!!!

hkchengrex commented 3 weeks ago

I'm glad to see that it's a rare occurrence. Good luck with your project!

qidihan commented 2 weeks ago

Sorry for bothering you again. The issue has been recurring frequently. My current question concerns how the mask is applied at the initial step: is it possible to load all object masks in the first step, but then selectively output the mask of only one specific object in the final output? The reason for this request is to avoid the aforementioned errors and to streamline the process by focusing on a single object of interest.

hkchengrex commented 2 weeks ago

Hi. Sure, you can post-process the output mask to keep only one object.
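
For example, a minimal sketch that works on the integer mask returned by torch_prob_to_numpy_mask in your script, assuming you assigned the grasp target object ID 1 on the first frame (keep_one_object is a hypothetical helper; adjust target_id to whichever ID you used):

import numpy as np

def keep_one_object(mask: np.ndarray, target_id: int = 1) -> np.ndarray:
    # zero out every object except target_id in an integer-indexed mask
    out = np.zeros_like(mask)
    out[mask == target_id] = target_id
    return out

# e.g. right after: prediction = torch_prob_to_numpy_mask(prediction)
prediction = keep_one_object(prediction, target_id=1)

Cutie still tracks all the objects internally, which is what provides the mutual exclusion; only the final output is restricted to the one you care about.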