Maybe the missing library is this one: https://github.com/OFA-Sys/Chinese-CLIP? I'll install it and try again.
The dependency issues are fixed by installing Chinese-CLIP. However, when loading the provided checkpoint, I got the following errors:
Missing key(s) in state_dict: "text_projection_left.0.weight", "text_projection_left.0.bias", "text_projection_right.0.weight", "text_projection_right.0.bias", "global_feature_mapping.bias", "single_feature_mapping.weight", "single_feature_mapping.bias".
Unexpected key(s) in state_dict: "left_feature_mapping.weight", "right_feature_mapping.weight", "text_projection_left.4.weight", "text_projection_left.4.bias", "text_projection_right.4.weight", "text_projection_right.4.bias".
size mismatch for text_projection.0.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([512, 768]).
size mismatch for text_projection.0.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for text_projection.2.weight: copying a param with shape torch.Size([512, 768]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for text_projection_left.2.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for text_projection_left.2.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for text_projection_right.2.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for text_projection_right.2.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([512]).
Any thoughts on how to fix this? Thanks. (I've tried "RoBERTa-wwm-ext-large-chinese" and "RBT3-chinese" as the text model, but they produce even more errors, so the closest model is "RoBERTa-wwm-ext-base-chinese".)
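For anyone debugging similar mismatches, a minimal sketch of comparing the checkpoint's keys against the model's expected keys; the checkpoint path is a placeholder and model is assumed to be the instantiated RET-CLIP model:

import torch

# Load the released checkpoint on CPU; the path below is a placeholder.
checkpoint = torch.load("path/to/ret_clip_checkpoint.pt", map_location="cpu")
# Some checkpoints wrap the weights in a "state_dict" entry; fall back to the raw dict otherwise.
state_dict = checkpoint.get("state_dict", checkpoint)

ckpt_keys = set(state_dict.keys())
model_keys = set(model.state_dict().keys())

print("Missing from checkpoint:", sorted(model_keys - ckpt_keys))
print("Unexpected in checkpoint:", sorted(ckpt_keys - model_keys))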
We are deeply sorry for the confusion and difficulties that our negligence has caused in your reproduction process. To improve the readability of the code and open-source it for community use, we modified the original code structure along with some variable and file names, and mistakes were introduced in the process. Due to a business trip, we are unable to upload the corrected version immediately. We promise to update the code on July 16 (Pacific time) to ensure a reproducible, error-free version. Once again, we sincerely apologize for the difficulties caused and for wasting your valuable time.
For the "ModuleNotFoundError: No module named 'cn_clip_v4'" error, you can change 'cn_clip_v4' to 'RET_CLIP'.
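For example, the import in RET_CLIP/clip/utils.py (the file shown in the traceback) would then look something like this:

# Before: from cn_clip_v4.clip import _tokenizer
from RET_CLIP.clip import _tokenizer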
For the "Missing key(s) in state_dict" error, it is probably due to a mistake made while we were adjusting the content and structure of the code; please wait for the fix. We will upload the correct version on July 16 (Pacific time).
That would be great! No worries. Looking forward to your update.
@askerlee The code has been updated. We also provide an example bash file that might be used to run the model training process. Thank you for your interest in our work!
Thank you for the timely update! Now I'm able to run the model with some minor code changes.
May I know what the difference is between img_l and img_r in model.forward()? If I only have one image, should I feed it as img_l, or try twice (once as img_l, once as img_r) and then average the two results? Thanks.
BTW, I'll create a pull request with patches for some minor issues.
Feeding the input to img_l vs. img_r seems to produce drastically different predictions:
Use img_l:
(Pdb) logits
tensor([[0.1354, 0.0876, 0.1117, 0.2641, 0.0782, 0.0287, 0.0118, 0.0373, 0.2453],
        [0.1395, 0.1839, 0.0553, 0.1086, 0.0476, 0.0328, 0.0195, 0.0300, 0.3829],
        [0.1397, 0.0416, 0.0521, 0.1958, 0.1826, 0.0172, 0.0137, 0.0431, 0.3143],
        [0.1929, 0.0835, 0.0235, 0.1735, 0.0402, 0.0305, 0.0084, 0.0062, 0.4413],
        [0.1361, 0.1260, 0.0280, 0.1034, 0.0182, 0.0240, 0.0098, 0.0055, 0.5489],
        [0.1229, 0.0558, 0.1490, 0.2393, 0.1560, 0.0120, 0.0212, 0.0904, 0.1533],
        [0.0601, 0.0455, 0.0228, 0.1131, 0.0642, 0.0068, 0.0089, 0.0170, 0.6617],
        [0.0546, 0.0508, 0.0261, 0.2505, 0.0709, 0.0050, 0.0120, 0.0128, 0.5173],
        [0.1117, 0.0801, 0.1432, 0.1133, 0.0589, 0.0472, 0.0152, 0.0452, 0.3853],
        [0.1407, 0.0754, 0.1285, 0.2312, 0.1012, 0.0243, 0.0170, 0.0466, 0.2351],
        [0.0623, 0.0201, 0.0282, 0.4909, 0.0511, 0.0077, 0.0062, 0.0154, 0.3180],
        [0.0970, 0.0375, 0.0290, 0.6072, 0.0715, 0.0140, 0.0065, 0.0105, 0.1268],
        [0.0707, 0.0325, 0.0483, 0.3265, 0.1443, 0.0072, 0.0193, 0.0636, 0.2875],
        [0.0297, 0.0166, 0.0245, 0.6034, 0.0693, 0.0025, 0.0129, 0.0488, 0.1922],
        [0.0299, 0.0183, 0.0095, 0.3828, 0.0332, 0.0020, 0.0064, 0.0124, 0.5054],
        [0.1015, 0.0338, 0.0303, 0.1537, 0.1728, 0.0064, 0.0171, 0.0359, 0.4484],
Use img_r:
(Pdb) logits
tensor([[0.0058, 0.0374, 0.0628, 0.5351, 0.1127, 0.0860, 0.0074, 0.0697, 0.0832],
        [0.0057, 0.1159, 0.0441, 0.2133, 0.1130, 0.0155, 0.0174, 0.1297, 0.3454],
        [0.0048, 0.0190, 0.0627, 0.6662, 0.0209, 0.0138, 0.0047, 0.0766, 0.1314],
        [0.0070, 0.1305, 0.0440, 0.0895, 0.0292, 0.0088, 0.0200, 0.0536, 0.6175],
        [0.0056, 0.1318, 0.0440, 0.1428, 0.0456, 0.0087, 0.0135, 0.1192, 0.4887],
        [0.0032, 0.0041, 0.0209, 0.8291, 0.0282, 0.0166, 0.0017, 0.0164, 0.0799],
        [0.0076, 0.1750, 0.0365, 0.2217, 0.0953, 0.0235, 0.0266, 0.1811, 0.2326],
        [0.0154, 0.0950, 0.0361, 0.3953, 0.0865, 0.0450, 0.0151, 0.1969, 0.1147],
        [0.0085, 0.0261, 0.0608, 0.5295, 0.0441, 0.0310, 0.0085, 0.0793, 0.2122],
        [0.0035, 0.0354, 0.0525, 0.4752, 0.0865, 0.0170, 0.0039, 0.0872, 0.2387],
        [0.0204, 0.0592, 0.0554, 0.3314, 0.1388, 0.0433, 0.0028, 0.0436, 0.3051],
        [0.0251, 0.0657, 0.0445, 0.1763, 0.0892, 0.0354, 0.0079, 0.0381, 0.5177],
        [0.0076, 0.0132, 0.0475, 0.4886, 0.0342, 0.0325, 0.0079, 0.1148, 0.2538],
        [0.0088, 0.0179, 0.0763, 0.3042, 0.0276, 0.0574, 0.0148, 0.0980, 0.3950],
        [0.0313, 0.0875, 0.0225, 0.4804, 0.0827, 0.0420, 0.0179, 0.0891, 0.1465],
        [0.0072, 0.0240, 0.0334, 0.4824, 0.0756, 0.0206, 0.0061, 0.0572, 0.2935],
The label file is:
normal
healthy
macular edema
diabetic retinopathy
glaucoma
macular hole
lesion
lesion in the macula
myopia
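For context, this is roughly how I load these labels and map each row of logits to a predicted class; labels.txt is just a placeholder name for the file above, and logits is the tensor shown earlier:

# Read the nine class names, one per line.
with open("labels.txt", encoding="utf-8") as f:
    labels = [line.strip() for line in f if line.strip()]

# logits has shape [num_images, 9]; pick the highest-scoring label for each image.
pred_indices = logits.argmax(dim=1)
pred_labels = [labels[i] for i in pred_indices.tolist()]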
If you use the model for downstream-task testing, a single input image defaults to img_l, but if you feed it as img_r instead, there should theoretically be no difference, because if you look at model.encode_image() you will see that the model handles img_l and img_r in exactly the same way when only one of them is given:
def encode_image(self, img_l, img_r, mask_ratio=0):
    if img_r is None:
        if isinstance(self.visual, ModifiedResNet):
            # mask_ratio > 0 (FLIP strategy) is currently only implemented for VisualTransformer.
            vision_feature = self.visual(img_l.type(self.dtype))
            return vision_feature
        vision_feature = self.visual(img_l.type(self.dtype), mask_ratio)
        return vision_feature
    if img_l is None:
        if isinstance(self.visual, ModifiedResNet):
            # mask_ratio > 0 (FLIP strategy) is currently only implemented for VisualTransformer.
            vision_feature = self.visual(img_r.type(self.dtype))
            return vision_feature
        vision_feature = self.visual(img_r.type(self.dtype), mask_ratio)
        return vision_feature
If you want to train on your own data, unfortunately our model is designed around having images of both eyes. If you only have monocular data, you can follow the traditional CLIP training strategy directly. I hope I have understood your question; if not, feel free to provide more details.
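As a concrete illustration of the single-image case, a minimal sketch; model and a preprocessed image batch images are assumed to already exist, and the forward signature is taken to be model(img_l, img_r, text) as used above:

import torch

model.eval()
with torch.no_grad():
    # A single fundus image batch is passed as img_l; img_r and text are left as None.
    image_features = model(images, None, None)
    # L2-normalize if you want to compute cosine similarities against text features.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)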
Regarding "Feeding the input to img_l vs. img_r seems to produce drastically different predictions": for this issue, I'm guessing you need to take the image features output by the model and pass them through a simple dimensional transformation and softmax layer, such as the following.
classifier = torch.nn.Sequential(
    torch.nn.Linear(512, num_classes),
    torch.nn.Softmax(dim=1)
)
I'm guessing the parameters of the Linear layer differed between your two tests; as I mentioned before, theoretically there is no difference in how the model handles img_l and img_r.
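To make that point concrete, a rough sketch of the check I have in mind; model, images, and num_classes are assumed to exist, and the key detail is that the same classifier instance is reused for both feature sets:

import torch

classifier = torch.nn.Sequential(
    torch.nn.Linear(512, num_classes),
    torch.nn.Softmax(dim=1)
)
classifier.eval()
model.eval()

with torch.no_grad():
    feats_l = model(images, None, None)   # features when the batch is fed as img_l
    feats_r = model(None, images, None)   # features when the same batch is fed as img_r
    probs_l = classifier(feats_l)
    probs_r = classifier(feats_r)

# With identical features and a shared classifier head, the predictions should match.
print((probs_l - probs_r).abs().max())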
I'm testing using zeroshot_evaluation.py. I tried extracting image features by putting the images in img_l and then in img_r across two model(img_l, img_r, text) calls, but the two tensors seem quite different:
image_features_left = model(images, None, None)
image_features_right = model(None, images, None)
Comparing the values of image_features_left and image_features_right:
(Pdb) image_features_left.shape
torch.Size([28, 512])
(Pdb) (image_features_left!=image_features_right).any()
tensor(True, device='cuda:0')
(Pdb) (image_features_left!=image_features_right).sum()
tensor(14336, device='cuda:0')
(Pdb) (image_features_left==image_features_right).sum()
tensor(0, device='cuda:0')
(Pdb) (image_features_left - image_features_right).abs().mean()
tensor(0.3899, device='cuda:0')
This seems unreasonable. I have tested it using the RFMID dataset and there is no issue like the one you mention; you can check the images I uploaded for details. In my test results, although there is a difference between the two, this is normal, and you can see that the average difference between them is very small. If it's convenient, you can provide more details of your code (e.g., how you load the data) or specific data samples so that I can better address your issue. BTW, our paper accepted at MICCAI'24 did not test zero-shot downstream tasks, and the zeroshot_evaluation.py file comes from the original CLIP. However, you can absolutely use our pre-trained model for zero-shot testing by simply rewriting the code accordingly.
Sorry for the late reply. Yes, I've figured out that the non-reproducible behavior was caused by a misconfigured PyTorch/CUDA setup. After updating PyTorch and CUDA, the img_l and img_r features become identical. Thank you so much for your help.
One more question: text_projection_left and text_projection_right still produce different text embeddings for the same text. Currently my strategy is to take the average of text_left and text_right. What would you suggest doing to achieve the best matching with the input image? Thanks.
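For reference, this is roughly what my current averaging strategy looks like; text_features_left and text_features_right stand for the outputs of the two text projection heads for the same tokenized prompts, and the cosine-similarity matching follows the usual CLIP zero-shot recipe rather than anything RET-CLIP-specific:

import torch

# Average the two text projections, then L2-normalize everything.
text_features = (text_features_left + text_features_right) / 2
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)

# Cosine-similarity logits between every image and every label prompt.
logits = 100.0 * image_features @ text_features.t()
probs = logits.softmax(dim=-1)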
First, thanks to the authors for sharing this very useful model.
I got the following error running zeroshot_evaluation.py:
File "/home/user/RET-CLIP/./RET_CLIP/clip/utils.py", line 13, in
    from cn_clip_v4.clip import _tokenizer
ModuleNotFoundError: No module named 'cn_clip_v4'
It seems a few files are missing? Thanks.