Continue training on older checkpoint

atticdweller commented 5 years ago

I'm having trouble continuing training from an earlier checkpoint. My model seems to have started collapsing two checkpoints ago, so I want to continue from that point and see whether the collapse is inevitable.

I tried making a copy of checkpoints/mycheckpoint and then replacing the latest data, index, and meta files with the files from the earlier checkpoint.

Then I ran: python train.py --load_model=checkpoints/mycheckpointcopy

I get this Python error:

2019-01-11 12:10:18.066025: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-01-11 12:10:18.414870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.62GiB
2019-01-11 12:10:18.424478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-11 12:10:19.370834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-11 12:10:19.375742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-11 12:10:19.377984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-11 12:10:19.380772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6373 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from checkpoints/4amCheckpoint\model.ckpt-79180
INFO:tensorflow:Restoring parameters from checkpoints/4amCheckpoint\model.ckpt-79180
Traceback (most recent call last):
  File "train.py", line 135, in <module>
    tf.app.run()
  File "C:\Users\Chris\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "train.py", line 131, in main
    train()
  File "train.py", line 77, in train
    step = int(meta_graph_path.split("-")[2].split(".")[0])
IndexError: list index out of range

atticdweller commented 5 years ago

Found a solution, if not ideal.

To switch the checkpoint you continue training from, go to checkpoints/mycheckpoint, open the "checkpoint" file in a text editor, and change it to read:

model_checkpoint_path: "model.ckpt-79180"
all_model_checkpoint_paths: "model.ckpt-79180"

where "79180" is the step number of the checkpoint you want to continue from.
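The same edit can be scripted rather than done by hand. This is only a rough sketch: the function name point_checkpoint_at and the "model.ckpt" prefix default are my own choices for illustration, and it assumes the .index file for the chosen step is still present in the folder.

```python
import os


def point_checkpoint_at(ckpt_dir, step, prefix="model.ckpt"):
    """Rewrite the 'checkpoint' file in ckpt_dir so it names only `step`.

    Hypothetical helper, not part of train.py.
    """
    name = "{}-{}".format(prefix, step)

    # Refuse to point at a step whose files are not actually in the folder.
    index_file = os.path.join(ckpt_dir, name + ".index")
    if not os.path.exists(index_file):
        raise FileNotFoundError("no such checkpoint file: " + index_file)

    # Same two lines as the manual edit described above.
    content = (
        'model_checkpoint_path: "{0}"\n'
        'all_model_checkpoint_paths: "{0}"\n'
    ).format(name)

    with open(os.path.join(ckpt_dir, "checkpoint"), "w") as f:
        f.write(content)
    return content
```

For the case in this thread the call would be point_checkpoint_at("checkpoints/mycheckpoint", 79180). It overwrites the "checkpoint" file, so keep a backup copy if you may want to return to the latest step.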

Good luck everybody!

houqinju2016 commented 5 years ago

I think you can solve the problem by changing the code ("train.py", line 77) to:

step = int(meta_graph_path.split("-")[-1].split(".")[0])

Note: the value of the variable meta_graph_path is "model.ckpt-79180.meta", where "79180" is the number of the checkpoint you want to continue from.
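To see why the index matters, here is a small standalone sketch. The function name step_from_meta_path and the timestamp-style folder name are hypothetical. My guess, and it is only a guess, is that the original [2] index expects exactly one hyphen in the folder name, which a folder like 4amCheckpoint does not have.

```python
def step_from_meta_path(meta_graph_path):
    """Return the step number encoded at the end of a .meta file path."""
    # Take the text after the LAST hyphen, then drop the ".meta" extension.
    return int(meta_graph_path.split("-")[-1].split(".")[0])


# Hypothetical folder name with one hyphen: three parts after split("-").
hyphenated = "checkpoints/20190111-1210/model.ckpt-79180.meta"

# Folder name from the log above (assuming ".meta" is appended): two parts.
plain = "checkpoints/4amCheckpoint\\model.ckpt-79180.meta"

for path in (hyphenated, plain):
    parts = path.split("-")
    try:
        old = int(parts[2].split(".")[0])  # original line 77 logic
    except IndexError:
        old = None  # reproduces "list index out of range"
    print(path, "| old:", old, "| new:", step_from_meta_path(path))
```

With [-1] the step is read from the end of the path, so the number of hyphens in the folder name no longer matters.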

vanhuyz / CycleGAN-TensorFlow

Continue training on older checkpoint #94