google-research / simclr

SimCLRv2 - Big Self-Supervised Models are Strong Semi-Supervised Learners
https://arxiv.org/abs/2006.10029
Apache License 2.0
4.09k stars 624 forks

Cannot resume Pretraining via Checkpoint Files. #97

Closed alirezadizaji closed 3 years ago

alirezadizaji commented 3 years ago

Hi, I was pretraining SimCLRv2, and before it finished the process was killed by the Linux kernel. I wanted to resume pretraining from the checkpoint files by passing their directory to --checkpoint. However, I got the error below.

"""
Traceback (most recent call last):
  File "run.py", line 440, in <module>
    app.run(main)
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run.py", line 428, in main
    data_lib.build_input_fn(builder, True), max_steps=train_steps)
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3089, in train
    rendezvous.raise_errors()
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3084, in train
    saving_listeners=saving_listeners)
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
    self.config)
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2921, in _call_model_fn
    config)
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3179, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1700, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2043, in _call_model_fn
    return estimator_spec.as_estimator_spec()
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 393, in as_estimator_spec
    scaffold = self.scaffold_fn() if self.scaffold_fn else None
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/model.py", line 164, in scaffold_fn
    for v in tf.global_variables(FLAGS.variable_schema)})
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/tensorflow/python/ops/variables.py", line 3128, in global_variables
    return ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES, scope)
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 6377, in get_collection
    return get_default_graph().get_collection(key, scope)
  File "/home/alireza/Desktop/sharif_uni/RA/simclr/myenv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4027, in get_collection
    regex = re.compile(scope)
  File "/usr/lib/python3.6/re.py", line 233, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python3.6/re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.6/sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.6/sre_parse.py", line 855, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.6/sre_parse.py", line 416, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.6/sre_parse.py", line 616, in _parse
    source.tell() - here + len(this))
sre_constants.error: nothing to repeat at position 0
"""

How can I resolve this issue? Thanks in advance.
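For readers hitting the same traceback: the last frames show that the string in FLAGS.variable_schema is passed to re.compile as a scope pattern, and that the pattern itself is rejected. The sketch below reproduces that class of error with plain Python. The value of `bad_pattern` is an assumption chosen for illustration, since the issue does not show what the flag actually contained, and `compile_error` is a hypothetical helper, not part of the SimCLR code.

```python
import re
from typing import Optional


def compile_error(pattern: str) -> Optional[str]:
    """Return the regex compile error message, or None if the pattern is valid."""
    try:
        re.compile(pattern)
        return None
    except re.error as exc:
        return str(exc)


# Hypothetical flag value: a lookahead written without its parentheses.
# A leading "?" is a quantifier with nothing before it to repeat.
bad_pattern = "?!global_step"

# The same lookahead with parentheses is a valid pattern.
good_pattern = "(?!global_step)"

print(compile_error(bad_pattern))
print(compile_error(good_pattern))
```

Any pattern that begins with a bare quantifier such as `?`, `*`, or `+` fails in this way, so checking the flag value with re.compile before launching a run can catch the problem early.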

chentingpc commented 3 years ago

To continue training, you don't need to specify the checkpoint flag. Just set model_dir to the previous training folder. It will automatically restore from the latest checkpoint in that folder and continue training until global_step reaches the target total number of training steps.
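The resume behaviour described above can be illustrated with a small, framework-free sketch. It is not how the Estimator is implemented (TensorFlow reads a `checkpoint` state file in model_dir, for example through tf.train.latest_checkpoint). It only mimics the idea: find the newest saved step, then train for whatever remains up to the target. The file naming `model.ckpt-<step>.index` and the helper names `latest_checkpoint_step` and `remaining_steps` are assumptions made for this example.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming convention for checkpoint index files: model.ckpt-<step>.index
_CKPT_RE = re.compile(r"^model\.ckpt-(\d+)\.index$")


def latest_checkpoint_step(model_dir: str) -> Optional[int]:
    """Return the highest step found among checkpoint files, or None if there are none."""
    steps = []
    for path in Path(model_dir).glob("model.ckpt-*.index"):
        match = _CKPT_RE.match(path.name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


def remaining_steps(model_dir: str, train_steps: int) -> int:
    """Steps still to run before the global step reaches train_steps."""
    step = latest_checkpoint_step(model_dir)
    start = 0 if step is None else step
    # Never negative: a finished run has nothing left to do.
    return max(train_steps - start, 0)
```

With a folder holding checkpoints for steps 10 and 50 and a target of 100 steps, `remaining_steps` returns 50. Pointing the same call at an empty folder returns the full 100, which corresponds to starting from scratch.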

alirezadizaji commented 3 years ago

Yes, that worked. Thanks so much.