google-research / simclr

SimCLRv2 - Big Self-Supervised Models are Strong Semi-Supervised Learners
https://arxiv.org/abs/2006.10029
Apache License 2.0

Transfer learning #163

Open fjsaezm opened 2 years ago

fjsaezm commented 2 years ago

Hi there! I have successfully trained a few models using the TensorFlow version of SimCLR. Now I would like to transfer the learned encoder to a different dataset (concretely, from CIFAR10 to Imagenette) to see how the model performs.

Using --mode=train_then_eval and --train_mode=finetune, I am able to train the linear head (or, at least, some training logs are shown), but when it comes to evaluation, I get the following error:

Input to reshape is a tensor with X values, but the requested shape has Y values

What am I doing wrong?

Thank you, guys!

chentingpc commented 2 years ago

Do you know which line of code leads to this error? And have you set the image size, model size, etc. to be the same for the train run and the eval run?

fjsaezm commented 2 years ago

Thank you for your answer!

I used the same image size and ResNet architecture for both the train_then_eval run and the finetune run. I did not change anything in the code itself; I only set things through the flags passed to the model.

For reference, I trained the model using the following command:

python run.py --mode=train_then_eval --train_epochs=100 --learning_rate=1.0 --dataset=cifar10 --image_size=32 --eval_split=test --use_blur=True --use_tpu=False --train_batch_size=256 --temperature=0.25 --weight_decay=0.0001 --color_jitter_strength=0.65 --resnet_depth=50 --model_dir=models/256-0.25-0.0001-0.65-50-blur

To finetune on a different dataset, I used the command:

python run.py --mode=train_then_eval --train_mode=finetune --fine_tune_after_block=4 --zero_init_logits_layer=True --global_bn=False --optimizer=momentum --learning_rate=0.1 --weight_decay=0.0 --train_epochs=100 --train_batch_size=256 --warmup_epochs=0 --dataset=imagenette/160px-v2 --image_size=32 --eval_split=test --resnet_depth=50 --checkpoint=models/256-0.25-0.0001-0.65-50-blur/saved_model/19695 --model_dir=models/imagenette/ --use_tpu=False --eval_split=validation

The error I get is:

 File "/home/fjaviersaezm/miniconda3/envs/simclr/lib/python3.8/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/fjaviersaezm/miniconda3/envs/simclr/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run.py", line 667, in main
    perform_evaluation(model, builder, eval_steps,
  File "run.py", line 401, in perform_evaluation
    run_single_step(iterator)
  File "/home/fjaviersaezm/miniconda3/envs/simclr/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/home/fjaviersaezm/miniconda3/envs/simclr/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 956, in _call
    return self._concrete_stateful_fn._call_flat(
  File "/home/fjaviersaezm/miniconda3/envs/simclr/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/fjaviersaezm/miniconda3/envs/simclr/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 591, in call
    outputs = execute.execute(
  File "/home/fjaviersaezm/miniconda3/envs/simclr/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  Input to reshape is a tensor with 90720 values, but the requested shape has 3072
     [[node Reshape (defined at run.py:395) ]]
     [[MultiDeviceIteratorGetNextFromShard]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[OptionalHasValue_1/_4]]
  (1) Invalid argument:  Input to reshape is a tensor with 90720 values, but the requested shape has 3072
     [[node Reshape (defined at run.py:395) ]]
     [[MultiDeviceIteratorGetNextFromShard]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
0 successful operations.
0 derived errors ignored. [Op:__inference_run_single_step_146536]

Function call stack:
run_single_step -> run_single_step

Maybe I should just change the mode to pretrain (instead of train_then_eval) and then use the weights from that run? I assumed the weights saved by train_then_eval were the same as the ones saved by pretrain.

Thanks!

chentingpc commented 2 years ago

I think you hit an edge case/bug :) Since you use image_size=32, center_crop at eval time is disabled, so the code directly reshapes a raw image larger than 32x32x3 into 32x32x3, which leads to the error. To resolve it, you could simply resize the image to the target size before the reshape. This was not a problem on CIFAR10, as the train/test images all have the same 32x32 size.
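To make the mismatch concrete, here is a small sketch of the element counts behind the error message. The 160x189 raw image size is an inference from the reported number (90720 = 160 * 189 * 3), not something stated in the thread, and `num_values` is a hypothetical helper used only for illustration.

```python
# Element counts behind "Input to reshape is a tensor with 90720 values,
# but the requested shape has 3072".
# Assumption: the raw image is 160x189x3, inferred from 90720 = 160 * 189 * 3.

def num_values(height, width, channels=3):
    """Number of scalar values in an image tensor of the given shape."""
    return height * width * channels

raw = num_values(160, 189)    # raw Imagenette image (assumed size)
target = num_values(32, 32)   # what a reshape to [32, 32, 3] requires

# A reshape only succeeds when both counts are equal.
print(raw, target, raw == target)  # prints: 90720 3072 False
```

A reshape never changes the number of values, so the image has to be resized (or cropped) to 32x32 before it is reshaped.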

fjsaezm commented 2 years ago

Thank you for your answer again!

As you suggested, I forced the preprocessing to resize the image to 32x32. I printed the shape of the tensor and everything seems to be fine. I added the resize call right before the reshape:

  image = tf.image.resize(image, [32, 32], preserve_aspect_ratio=True)
  image = tf.reshape(image, [height, width, 3])
  image = tf.clip_by_value(image, 0., 1.)
  return image

However, although the numbers in the error have changed and are now closer to each other, the error persists:

Traceback (most recent call last):
  File "run.py", line 676, in <module>
    app.run(main)
  File "/home/fjaviersaezm/miniconda3/envs/simclr/lib/python3.8/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/fjaviersaezm/miniconda3/envs/simclr/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run.py", line 667, in main
    perform_evaluation(model, builder, eval_steps,
  File "run.py", line 401, in perform_evaluation
    run_single_step(iterator)
  File "/home/fjaviersaezm/miniconda3/envs/simclr/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/home/fjaviersaezm/miniconda3/envs/simclr/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/fjaviersaezm/miniconda3/envs/simclr/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3023, in __call__
    return graph_function._call_flat(
  File "/home/fjaviersaezm/miniconda3/envs/simclr/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/fjaviersaezm/miniconda3/envs/simclr/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 591, in call
    outputs = execute.execute(
  File "/home/fjaviersaezm/miniconda3/envs/simclr/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  Input to reshape is a tensor with 2592 values, but the requested shape has 3072
     [[node Reshape (defined at run.py:395) ]]
     [[MultiDeviceIteratorGetNextFromShard]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[OptionalHasValue_1/_4]]
  (1) Invalid argument:  Input to reshape is a tensor with 2592 values, but the requested shape has 3072
     [[node Reshape (defined at run.py:395) ]]
     [[MultiDeviceIteratorGetNextFromShard]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
0 successful operations.
0 derived errors ignored. [Op:__inference_run_single_step_5886]

Function call stack:
run_single_step -> run_single_step

2592 seems to be 27x32x3, so 5 rows are being dropped. Is there any other place where this could be happening?

Thank you in advance!
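One possible explanation for the 27x32 shape, sketched below, is the preserve_aspect_ratio=True argument rather than a second reshape elsewhere. The helper `aspect_preserving_size` is hypothetical and only approximates how an aspect-preserving resize is understood to pick its output size (scale both sides by the smaller factor, then round). The 160x189 input size is again inferred from the first error message, not confirmed by the thread.

```python
# Sketch: why an aspect-preserving resize to a 32x32 target may not yield 32x32.
# Assumptions: the raw image is 160x189, and the output size is obtained by
# scaling both sides with the smaller of the two scale factors and rounding.

def aspect_preserving_size(height, width, target_height, target_width):
    """Approximate output size of a resize that keeps the aspect ratio."""
    scale = min(target_height / height, target_width / width)
    return round(height * scale), round(width * scale)

new_h, new_w = aspect_preserving_size(160, 189, 32, 32)
print(new_h, new_w, new_h * new_w * 3)  # prints: 27 32 2592
```

If this is indeed the cause, calling tf.image.resize without preserve_aspect_ratio=True (it defaults to False) should return exactly 32x32, at the cost of slightly distorting non-square images, and the following reshape to [32, 32, 3] would then receive the expected 3072 values.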