openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image.

preprocess _transform function #335


sherlock666 commented 1 year ago

For preprocess:

Compose(
    Resize(size=224, interpolation=bicubic, max_size=None, antialias=None)
    CenterCrop(size=(224, 224))
    <function _convert_image_to_rgb at 0x7f6d51348af0>
    ToTensor()
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)
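For reference, this Compose is the preprocess callable returned by clip.load; a minimal sketch to reproduce the printout (the model name here is just an example, any name from clip.available_models() works):

```python
import clip

# "ViT-B/32" is an example; any model name from clip.available_models() works
model, preprocess = clip.load("ViT-B/32")
print(preprocess)  # prints the Compose(...) pipeline above
```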

I don't understand why the code uses CenterCrop with size (224, 224) when the image has already been resized to 224x224 by Resize(size=224, interpolation=bicubic, max_size=None, antialias=None). What is the purpose of CenterCrop here? Is it necessary?

thanks!!

shreydan commented 1 year ago

torchvision docs: Resize

> size: Desired output size. If size is a sequence like (h, w), output size will be matched to this. If size is an int, smaller edge of the image will be matched to this number. i.e., if height > width, then image will be rescaled to (size * height / width, size).
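A minimal sketch of the int vs. sequence difference (the 640x480 image size is just an arbitrary non-square example):

```python
from PIL import Image
from torchvision.transforms import Resize

img = Image.new("RGB", (640, 480))  # hypothetical non-square image, (width, height)

# int size: only the smaller edge (height here) becomes 224; aspect ratio is kept
print(Resize(224)(img).size)         # -> (298, 224)

# sequence size: both edges are forced to 224; aspect ratio is NOT preserved
print(Resize((224, 224))(img).size)  # -> (224, 224)
```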

_transform(n_px) takes in a single n_px value equal to the input_resolution of the image model. So, per the docs, Resize only matches the smaller edge of the image to n_px while preserving the aspect ratio; a non-square image is therefore still non-square at this point. The final CenterCrop is what produces the exact n_px x n_px image.
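A minimal sketch of the two transforms together (interpolation details omitted for brevity; CLIP's actual preprocess uses bicubic):

```python
from PIL import Image
from torchvision.transforms import CenterCrop, Compose, Resize

n_px = 224
transform = Compose([Resize(n_px), CenterCrop(n_px)])

img = Image.new("RGB", (640, 480))  # hypothetical non-square input
print(Resize(n_px)(img).size)       # (298, 224): still not square after Resize alone
print(transform(img).size)          # (224, 224): CenterCrop trims the longer edge
```

So CenterCrop is redundant only for inputs that are already square; for everything else it is what guarantees the fixed square resolution the model expects.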