uncbiag / SimpleClick

SimpleClick: Interactive Image Segmentation with Simple Vision Transformers (ICCV 2023)
MIT License

How to use demo.py #4

Closed Zhentao-Liu closed 1 year ago

Zhentao-Liu commented 1 year ago

I downloaded your repo and the pretrained model mae_pretrain_vit_base.pth and ran demo.py. However, after I load the model, the loaded dict contains neither a 'config' key nor 'state_dict'. How can I fix this?

qinliuliuqin commented 1 year ago

Did you run the following command?

python3 demo.py --checkpoint=./weights/simpleclick_models/cocolvis_vit_huge.pth --gpu 0

It works without issue for me. Could you provide the detailed error info?
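
For anyone hitting the same error, a quick way to see whether the .pth file you are loading actually has the expected 'config' and 'state_dict' keys is to inspect it directly. A minimal sketch, assuming only that the file can be read by torch.load; replace the path with whichever checkpoint you pass to demo.py:

```python
import torch

# Load the checkpoint on CPU and look at its top-level keys.
# Replace the path with whichever .pth file you are passing to demo.py.
ckpt = torch.load("./weights/simpleclick_models/cocolvis_vit_huge.pth", map_location="cpu")

if isinstance(ckpt, dict):
    # A SimpleClick checkpoint should expose 'config' and 'state_dict' here;
    # a raw pretraining weight file will typically show different keys.
    print(list(ckpt.keys())[:10])
else:
    print(type(ckpt))
```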

ThorPham commented 1 year ago

@qinliuliuqin Can I use negative click points with this model?

qinliuliuqin commented 1 year ago

Yes, of course.

ThorPham commented 1 year ago

@qinliuliuqin Thank you. I just tested the demo; a right click adds a negative point. I have some other questions:

  1. How do you create the positive clicks, negative clicks, Prev. Mask, and ground truth for training?
  2. In the paper, the input is image + (Clicks + Prev. Mask). What is Prev. Mask? Is it a binary image of the previous prediction?
  3. What are ['NoBRS', 'RGB-BRS', 'DistMap-BRS', 'f-BRS-A', 'f-BRS-B', 'f-BRS-C']?

qinliuliuqin commented 1 year ago

Hi @ThorPham, Thanks for your questions.

  1. This line shows how to create pos and neg clicks during training (see the sketch after this list for the general idea). This line shows that we concatenate the image and the previous mask (i.e., a probability map), along with the click masks, as the network input.
  2. See here. The previous mask is a probability map produced by the model in evaluation mode.
  3. We only tested the 'NoBRS' mode. Forget about the other modes; they are inherited from RITM.
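
The repository's actual click-sampling code is in the line linked above; the following is only a rough sketch of the general RITM-style idea, where positive clicks are sampled from regions the previous prediction missed and negative clicks from regions it over-segmented. All names here are hypothetical.

```python
import numpy as np

def sample_clicks(gt_mask, prev_pred, rng=np.random.default_rng()):
    """Hypothetical sketch: pick one positive click from false-negative pixels
    and one negative click from false-positive pixels of the previous prediction."""
    false_neg = (gt_mask > 0) & (prev_pred < 0.5)    # object missed so far -> positive click
    false_pos = (gt_mask == 0) & (prev_pred >= 0.5)  # background over-segmented -> negative click

    def pick(region):
        ys, xs = np.nonzero(region)
        if len(ys) == 0:
            return None
        i = rng.integers(len(ys))
        return int(ys[i]), int(xs[i])

    return pick(false_neg), pick(false_pos)
```
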
ThorPham commented 1 year ago

@qinliuliuqin Thank you for your support. What preprocessing do you apply to the click points? Is it a binary mask, or do you create a distance map? And how do you feed it into the model? In the paper, I see you concatenate the image and the prev mask. Do you also add the click points?

qinliuliuqin commented 1 year ago

Hi @ThorPham, the clicks (i.e., coordinates) are encoded as a 2-channel binary mask, one channel for positive clicks and one for negative clicks. Each click is represented as a disk on the binary mask. We concatenate the click mask (2 channels), the previous mask (1 channel), and the RGB image (3 channels) to form a 6-channel input. Since we want to reuse the pretrained ViT, whose patch embedding layer only accepts 3-channel input, we add one more patch embedding layer and split the 6-channel input into two groups (each group has 3 channels, as shown in Fig. 1). In this way, we can turn the plain ViT backbone into an iSeg backbone with minimal changes.
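
To make the above concrete, here is a rough sketch of the disk encoding and the two-group patch embedding. It is not the repository's implementation: the function and module names are hypothetical, and fusing the two groups by element-wise summation is only one plausible choice for this sketch (check the repo for the exact fusion).

```python
import torch
import torch.nn as nn

def clicks_to_disk_mask(clicks, height, width, radius=5):
    """Rasterize (y, x) click coordinates as filled disks on a single-channel binary mask."""
    mask = torch.zeros(height, width)
    yy, xx = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
    for y, x in clicks:
        mask[(yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2] = 1.0
    return mask

class TwoGroupPatchEmbed(nn.Module):
    """Hypothetical sketch: the 6-channel input (RGB + prev mask + 2 click channels)
    is split into two 3-channel groups, each with its own patch embedding layer,
    so the pretrained ViT patch embedding (3-channel) can be reused unchanged."""
    def __init__(self, patch_size=16, embed_dim=768):
        super().__init__()
        self.embed_rgb = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)  # pretrained ViT patch embed
        self.embed_aux = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)  # new layer for prev mask + clicks

    def forward(self, x):  # x: (B, 6, H, W) = [RGB | prev mask, pos clicks, neg clicks]
        rgb, aux = x[:, :3], x[:, 3:]
        # Fusing by summation is an assumption made for this sketch.
        return self.embed_rgb(rgb) + self.embed_aux(aux)
```
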

ThorPham commented 1 year ago

@qinliuliuqin Thank you so much.