google-research / simclr

SimCLRv2 - Big Self-Supervised Models are Strong Semi-Supervised Learners
https://arxiv.org/abs/2006.10029
Apache License 2.0

Load checkpoint for finetuning fails in tf2 #135

Closed YHYeooooong closed 3 years ago

YHYeooooong commented 3 years ago

Hi, thank you so much for sharing your work! I have a question about fine-tuning with the TF2 code. I have been trying to save a pretrained model with linear eval. I ran these two commands one after the other:

python run.py --train_mode=pretrain \
  --train_batch_size=16 --train_epochs=1 \
  --learning_rate=1.0 --weight_decay=1e-4 --temperature=0.5 \
  --dataset=cifar10 --image_size=224 --eval_split=test --resnet_depth=18 \
  --use_blur=False --color_jitter_strength=0.5 \
  --model_dir=./trained_model/simclr_tf2_pretrain --use_tpu=False

and

python run.py --train_mode=finetune \
  --mode=train_then_eval --fine_tune_after_block=4 --zero_init_logits_layer=True \
  --train_batch_size=16 --global_bn=False --optimizer=momentum --train_epochs=1 \
  --learning_rate=0.1 --weight_decay=0.0 --warmup_epochs=0 --dataset=cifar10 \
  --image_size=224 --eval_split=test --resnet_depth=152 \
  --checkpoint=./trained_model/simclr_tf2_pretrain \
  --model_dir=./trained_model/simclr_tf2_ft \
  --use_tpu=False --eval_batch_size=16 --lineareval_while_pretraining=False

But it gives me the following error:

Traceback (most recent call last):
  File "run.py", line 681, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run.py", line 557, in main
    model, optimizer.iterations, optimizer)
  File "run.py", line 325, in try_restore_from_checkpoint
    FLAGS.checkpoint).expect_partial()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/util.py", line 2260, in restore
    status = self.read(save_path, options=options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/util.py", line 2148, in read
    return self._saver.restore(save_path=save_path, options=options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/util.py", line 1292, in restore
    reader = py_checkpoint_reader.NewCheckpointReader(save_path)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/py_checkpoint_reader.py", line 99, in NewCheckpointReader
    error_translator(e)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/py_checkpoint_reader.py", line 44, in error_translator
    raise errors_impl.DataLossError(None, None, error_message)
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file ./trained_model/simclr_tf2_pretrain: Failed precondition: trained_model/simclr_tf2_pretrain; Is a directory: perhaps your file is in a different file format and you need to use a different restore operator?

I tried appending /checkpoint or /ckpt-6250.data-00000-of-00001 to the end of --checkpoint=./trained_model/simclr_tf2_pretrain, but I get a similar error. Could you please give me some guidance on how to solve this problem, or suggest another way to save pretrained models? Thank you for your time and kind help!

chentingpc commented 3 years ago

@saxenasaurabh

lucasliunju commented 3 years ago

I have the same issue. Have you managed to solve it?

Thank you!

free-bit commented 3 years ago

I remember seeing this error once. I assume that the following files exist under the simclr_tf2_pretrain folder:

  • checkpoint
  • ckpt-6250.data-00000-of-00001
  • ckpt-6250.index

Can you try to pass the path as --checkpoint=./trained_model/simclr_tf2_pretrain/ckpt-6250? I think this should mitigate the error.
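
For context, the traceback shows that the restore call was handed a directory. The suggested path ckpt-6250 is a checkpoint prefix: the common stem of the .index and .data-* files, not a file that exists on disk. The sketch below shows one way to derive that prefix from any of the paths tried so far. It uses only the standard library. The helper name resolve_checkpoint_prefix is hypothetical and not part of the SimCLR repo, and the ckpt-<step> naming is an assumption based on the file names in this thread.

```python
import os
import re
from typing import Optional

# Hypothetical helper, not part of the SimCLR repo. It maps a directory, a
# 'checkpoint' state file, or a shard/index file to the checkpoint prefix.
_SHARD_SUFFIX = re.compile(r"\.(index|data-\d{5}-of-\d{5})$")
_STEP_INDEX = re.compile(r"-(\d+)\.index$")


def resolve_checkpoint_prefix(path: str) -> Optional[str]:
    """Return a prefix such as '.../ckpt-6250', or None if no checkpoint is found."""
    # The 'checkpoint' state file lives next to the shards; search its folder.
    if os.path.isfile(path) and os.path.basename(path) == "checkpoint":
        path = os.path.dirname(path)

    if os.path.isdir(path):
        # Assumption: files are named '<name>-<step>.index'; take the highest step.
        candidates = []
        for name in os.listdir(path):
            match = _STEP_INDEX.search(name)
            if match:
                candidates.append((int(match.group(1)), name))
        if not candidates:
            return None
        _, newest = max(candidates)
        return os.path.join(path, newest[: -len(".index")])

    # A shard or index file was given: strip the suffix to recover the prefix.
    return _SHARD_SUFFIX.sub("", path)
```

The returned string can then be passed as the --checkpoint value. If TensorFlow is available, tf.train.latest_checkpoint(directory) returns the same kind of prefix by reading the checkpoint state file, which avoids relying on the file-naming assumption above.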

deepankarvarma commented 10 months ago

I remember seeing this error once. I assume that the following files exist under the simclr_tf2_pretrain folder:

  • checkpoint
  • ckpt-6250.data-00000-of-00001
  • ckpt-6250.index

Can you try to pass the path as --checkpoint=./trained_model/simclr_tf2_pretrain/ckpt-6250? I think this should mitigate the error.

Can you also please confirm the contents of the --model_dir directory?