Did you run the following command?

```
python3 demo.py --checkpoint=./weights/simpleclick_models/cocolvis_vit_huge.pth --gpu 0
```

It runs without issue for me. Could you provide the detailed error info?
@qinliuliuqin Can I use negative click points with this model?
Yes, of course.
@qinliuliuqin Thank you. I just tested the demo; a right click adds a negative point. I have some other questions:

1. How do you create the positive clicks, negative clicks, Prev. Mask, and ground truth for training?
2. In the paper, the input is image + (Clicks + Prev. Mask). What is Prev. Mask? Is it a binary image of the previous prediction?
3. What are ['NoBRS', 'RGB-BRS', 'DistMap-BRS', 'f-BRS-A', 'f-BRS-B', 'f-BRS-C']?
Hi @ThorPham, Thanks for your questions.
@qinliuliuqin Thank you for your support. How are the click points preprocessed? Are they a binary mask, or do you create a distance map? And how do you feed them into the model? In the paper, I see you concatenate the image + prev mask. Do you also add the click points?
Hi @ThorPham, the clicks (i.e., coordinates) are encoded as a 2-channel binary mask: one channel for positive clicks and one for negative clicks. Each click is represented as a disk on the binary mask. We concatenate the click mask (2 channels), prev mask (1 channel), and RGB image (3 channels) to form a 6-channel input. Since we want to reuse the pretrained ViT, whose patch embedding layer only accepts 3-channel input, we add one more patch embedding layer and split the 6-channel input into two groups (each group has 3 channels, as shown in Fig. 1). In this way, we can turn the plain ViT backbone into an iSeg backbone with minimal changes.
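For anyone else reading this, here is a minimal sketch of the encoding described above, assuming PyTorch. The names (`encode_clicks`, `TwoStreamPatchEmbed`, `disk_radius`) and the summation fusion are illustrative, not the actual SimpleClick identifiers:

```python
# Sketch only: illustrates the 2-channel click disks + two patch-embedding
# streams described above; names and fusion choice are assumptions.
import torch
import torch.nn as nn

def encode_clicks(clicks, height, width, disk_radius=5):
    """Rasterize clicks as disks on a 2-channel binary mask:
    channel 0 = positive clicks, channel 1 = negative clicks.
    `clicks` is a list of (y, x, is_positive) tuples."""
    mask = torch.zeros(2, height, width)
    ys = torch.arange(height).view(-1, 1)
    xs = torch.arange(width).view(1, -1)
    for y, x, is_positive in clicks:
        disk = (ys - y) ** 2 + (xs - x) ** 2 <= disk_radius ** 2
        mask[0 if is_positive else 1][disk] = 1.0
    return mask

class TwoStreamPatchEmbed(nn.Module):
    """Two parallel patch-embedding layers, so the pretrained 3-channel
    ViT embedding can be reused for RGB while a newly added layer
    handles the click mask (2 ch) + prev mask (1 ch) group."""
    def __init__(self, patch_size=16, embed_dim=768):
        super().__init__()
        self.rgb_embed = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)
        self.aux_embed = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)

    def forward(self, image, click_mask, prev_mask):
        # Split the 6-channel input into two 3-channel groups and fuse.
        aux = torch.cat([click_mask, prev_mask], dim=1)  # (B, 3, H, W)
        return self.rgb_embed(image) + self.aux_embed(aux)
```

Summing the two embedding streams is just one simple fusion choice for the sketch; check the repo for how the two groups are actually combined.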
@qinliuliuqin Thank you so much.
I downloaded your repo and the pretrained model mae_pretrain_vit_base.pth and ran demo.py. However, after I load the model, the dict has no 'config' key and no 'state_dict' key. How can I fix this?
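For reference, a quick way to inspect what the downloaded file actually contains (assuming PyTorch; note that mae_pretrain_vit_base.pth is an MAE pre-training checkpoint, so its layout can differ from the SimpleClick demo checkpoints):

```python
import torch

# Load the checkpoint on CPU and list its top-level keys.
ckpt = torch.load('mae_pretrain_vit_base.pth', map_location='cpu')
print(type(ckpt))
print(list(ckpt.keys()))  # MAE checkpoints typically store weights under 'model'
```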