vturrisi / solo-learn

solo-learn: a library of self-supervised methods for visual representation learning powered by Pytorch Lightning
MIT License

SimCLR for specific domain #349

Closed miguel-arrf closed 1 year ago

miguel-arrf commented 1 year ago

Hello!

I've been trying to use this library for a downstream task of segmentation.

I want to use SimCLR. I've been looking at the main_pretrain.py file and there was something that got me confused.

What's the idea behind the num_crops parameter? Shouldn't we, for each image, augment it only once, and use that in the forward pass?

My understanding is that, for each image in our dataset, a custom Dataset's __getitem__ would return a tuple with the original image and the transformed one. Why isn't this the case? :)

This is the augmentations file I have:

- mean: 52.722780748721
  std: 243.6154740715541
  rrc:
    enabled: False
    crop_min_scale: 0.08
    crop_max_scale: 1.0
  color_jitter:
    prob: 0.0
    brightness: 0.8
    contrast: 0.8
    saturation: 0.8
    hue: 0.0
  grayscale:
    prob: 0.0
  gaussian_blur:
    prob: 0.0
  solarization:
    prob: 0.0
  equalization:
    prob: 0.0
  horizontal_flip:
    prob: 1.0
  crop_size: [224,224]
  num_crops: 1

- mean: 52.722780748721
  std: 243.6154740715541
  rrc:
    enabled: False
  color_jitter:
    prob: 0.0
  grayscale:
    prob: 0.0
  gaussian_blur:
    prob: 0.0
  solarization:
    prob: 0.0
  equalization:
    prob: 0.0
  horizontal_flip:
    prob: 0.0
  crop_size: [224,224]
  num_crops: 1

I'm resizing all images to size (224, 224).

Thank you so much!

vturrisi commented 1 year ago

Hi. num_crops is a parameter that controls the number of times an augmentation pipeline is applied. This is especially useful because it allows us to define symmetric pipelines (as in the SimCLR example in the repo), asymmetric pipelines (BYOL-style), or even multi-crop (creating 2/3 pipelines and controlling their parameters separately). In the config that you shared, only a single image is generated. Also note that you will never use the original image; you will always use (at least) 2 augmented versions of the image.
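A minimal sketch of the num_crops idea (the class name here is illustrative, not necessarily solo-learn's exact API): the same stochastic pipeline is applied num_crops times to one image, producing that many independent views.

```python
class NCropAugmentation:
    """Apply one augmentation pipeline `num_crops` times to the same image.

    Illustrative sketch only; see solo-learn's data code for the real thing.
    """

    def __init__(self, transform, num_crops):
        self.transform = transform
        self.num_crops = num_crops

    def __call__(self, img):
        # each call to self.transform re-draws its random parameters,
        # so every crop is an independent view of the same source image
        return [self.transform(img) for _ in range(self.num_crops)]
```

One wrapper with num_crops = 2 gives the symmetric SimCLR setup; two wrappers with different transforms give a BYOL-style asymmetric pipeline.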

miguel-arrf commented 1 year ago

Hi, thank you for the help!

Regarding the example I gave, what would I need to change in order to use the original image?

Also, still regarding the pipelines, how do they affect the batch? In a case where we have a batch size of 128 and an asymmetric pipeline with, let's say, a first branch with num_crops = 5 and a second branch with num_crops = 10, what would the batch look like?

My understanding is that the batch will have 128 random images (if using a Random Sampler), that come from the dataloader. Then, it will be in the forward pass that the num_crops will have an impact. Is this right?

Or will this 'expand' the dataset, where for each image 5 + 10 more augmented samples will be added?

Sorry for the lack of clarity. If needed I can try to be more precise in my questions.

Thank you for the help in advance!

vturrisi commented 1 year ago

Sorry for the misunderstanding. In your config, the second image is the original one. I'm just confused by your mean and std values.

If your batch size is 128 and you have a single pipeline with 2 crops, you will have access to (X1, X2), where both contain 128 images. Note that X1[i] and X2[i] correspond to different augmentations of the same image.

Having more/asymmetric pipelines will just increase the number of X that you have (and it won't be more nested). For instance, with 3 pipelines, with 2, 3 and 1 crops, respectively, you will have X = (X1, X2, X3, X4, X5, X6), where X1 and X2 are generated by the first pipeline, X3, X4 and X5 by the second and X6 by the third.
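The batch layout described above can be sketched in plain Python (names are illustrative; in practice each entry is a tensor of shape [batch_size, 3, H, W]):

```python
# 3 pipelines producing 2, 3 and 1 crops over a batch of 128 source images
batch_size = 128
crops_per_pipeline = [2, 3, 1]

# each string stands in for one augmented view of one source image
X = []
for pipeline_id, n in enumerate(crops_per_pipeline):
    for _ in range(n):
        X.append([f"aug{pipeline_id}(img{i})" for i in range(batch_size)])

# X = (X1, ..., X6): a flat sequence of 6 batches, each with 128 views;
# X[a][j] and X[b][j] are always views of the same source image j
assert len(X) == sum(crops_per_pipeline)
assert all(len(x) == batch_size for x in X)
```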

miguel-arrf commented 1 year ago

Yeah, sorry, my config file wasn't very clear 😅 .

I now have 2 questions:

  1. Regardless of the number of crops or pipelines, if the total number of images (X1...Xn) is more than 2, how does SimCLR handle it?
  2. This might not be related to solo-learn, but, in any of your experiments, have you been able to get a loss lower than 1 for SimCLR? I seem to be stuck at 1. I've already tried different batch sizes, augmentations and hyperparameters. (screenshot of the loss curve attached)

vturrisi commented 1 year ago

No worries, actually my mistake.

1. SimCLR is one of the methods for which we support multi-crop. When computing the loss, we rely on creating a square matrix of size (batch_size * augs) that tells us which elements are positives and which are negatives. We use a set of dummy unique indexes and then leverage them to construct the matrix.
2. I don't think you can get values lower than 1. This is also not necessarily a good thing, as you could just be memorizing all the examples. I checked my old runs and they are all around ~2 at convergence.
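The dummy-index trick can be sketched as follows (plain Python for illustration; the actual loss code builds a boolean tensor): each source image gets a unique index, the indexes are repeated once per augmented view, and two views are positives exactly when their indexes match.

```python
batch_size, num_augs = 4, 3

# one dummy unique index per source image, repeated once per crop
indexes = list(range(batch_size)) * num_augs  # length batch_size * num_augs

n = len(indexes)
# pos_mask[i][j] is True when views i and j come from the same source image
# (a view is never its own positive, hence i != j)
pos_mask = [[i != j and indexes[i] == indexes[j] for j in range(n)]
            for i in range(n)]

# each view has (num_augs - 1) positives: the other views of the same image
assert all(sum(row) == num_augs - 1 for row in pos_mask)
```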

DonkeyShot21 commented 1 year ago

Theoretically you could have values lower than 1 (cross entropy lower bound is 0), especially with smaller batch size and temperature, but the training loss is not a very significant metric to predict the performance of the model. For instance, collapsed solutions might have loss=0 but random accuracy on the val set.
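A back-of-the-envelope sketch of that point, under the idealized assumption that the positive pair has cosine similarity +1 and all negatives have similarity -1 (the helper name is hypothetical):

```python
import math

def infonce_floor(batch_size, temperature):
    """InfoNCE loss for one anchor with an ideal positive (sim = +1)
    and 2 * batch_size - 2 ideal negatives (sim = -1)."""
    negatives = 2 * batch_size - 2
    # -log(e^(1/t) / (e^(1/t) + negatives * e^(-1/t)))
    return math.log(1 + negatives * math.exp(-2 / temperature))

infonce_floor(256, 0.5)  # ≈ 2.34: well above 1 even for a perfect model
infonce_floor(32, 0.1)   # ≈ 1e-7: essentially 0
```

This matches the comment above: with a smaller batch size and temperature the achievable loss approaches the cross-entropy lower bound of 0, while large batches and moderate temperatures keep it well above 1 regardless of model quality.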

vturrisi commented 1 year ago

Just to clarify: I said you cannot get values lower than 1 on the datasets that we support while still having a working model.

vturrisi commented 1 year ago

Feel free to re-open if you have any questions.