Open atticdweller opened 5 years ago
Found a solution, if not ideal.
To switch the checkpoint that training will continue from, go to checkpoints/mycheckpoint,
then open the "checkpoint" file in a text editor,
then change the file to read:
    model_checkpoint_path: "model.ckpt-79180"
    all_model_checkpoint_paths: "model.ckpt-79180"
where the "79180" is the number of the checkpoint you want to continue from.
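For anyone who would rather script this edit than do it by hand, here is a minimal sketch in plain Python. The function name point_checkpoint_at and the "model.ckpt" prefix default are just illustrative choices, not anything from train.py, and the check for a matching .index file is an extra safety step I added on the assumption that the checkpoint was saved in the usual data/index/meta layout.

```python
import os


def point_checkpoint_at(checkpoint_dir, step, prefix="model.ckpt"):
    """Rewrite the 'checkpoint' file so it names one earlier checkpoint.

    Hypothetical helper: it only edits the small text file described above
    and does not touch the data/index/meta files themselves.
    """
    name = "{}-{}".format(prefix, step)

    # Safety check (an assumption about the usual layout): refuse to point
    # at a checkpoint whose .index file is not in the directory.
    index_file = os.path.join(checkpoint_dir, name + ".index")
    if not os.path.exists(index_file):
        raise FileNotFoundError("no such checkpoint: " + index_file)

    state_path = os.path.join(checkpoint_dir, "checkpoint")
    with open(state_path, "w") as f:
        f.write('model_checkpoint_path: "{}"\n'.format(name))
        f.write('all_model_checkpoint_paths: "{}"\n'.format(name))
    return state_path
```

Calling point_checkpoint_at("checkpoints/mycheckpoint", 79180) would leave the file with the two lines shown above. It is probably worth backing up the original "checkpoint" file first, since this overwrites it.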
Good luck everybody!
I think you can solve the problem by changing the code ("train.py", line 77) like this:
    step = int(meta_graph_path.split("-")[-1].split(".")[0])
Note: the value of the variable meta_graph_path equals "model.ckpt-79180.meta", where "79180" is the number of the checkpoint you want to continue from. Indexing with [-1] takes the part after the last hyphen, so it works no matter how many hyphens appear earlier in the path.
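To illustrate why the original index fails and why taking the last piece works, here is a small standalone sketch. The function step_from_meta_path is a hypothetical helper, not something in train.py, and it uses a regular expression rather than split() so that a malformed name produces a clear error message.

```python
import re


def step_from_meta_path(meta_graph_path):
    """Return the training step encoded in a '...-<step>.meta' path.

    Hypothetical helper: it looks only at the end of the string, so hyphens
    or backslashes earlier in the path do not matter.
    """
    match = re.search(r"-(\d+)\.meta$", meta_graph_path)
    if match is None:
        raise ValueError("no step number in: " + meta_graph_path)
    return int(match.group(1))


# A path shaped like the one in the log, with ".meta" appended.
path = "checkpoints/4amCheckpoint\\model.ckpt-79180.meta"

# Splitting on "-" gives only two pieces here, so index 2 does not exist,
# which is what raises the IndexError in the traceback.
pieces = path.split("-")

# The last piece always holds the step, however many hyphens come before it.
step_via_split = int(pieces[-1].split(".")[0])
step_via_regex = step_from_meta_path(path)
```

Both approaches give the same step for this path. The split-based one-liner is the smaller change to train.py; the regex version is just a bit stricter about what it accepts.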
I'm having trouble continuing training from an earlier checkpoint. It looks like 2 checkpoints ago my model started to collapse, I want to continue from that point and see if the model collapse is inevitable.
I tried making a copy of checkpoints/mycheckpoint and then swapping out the latest data, index and meta files with the files from the earlier checkpoint.
Then I ran python train.py --load_model=checkpoints/mycheckpointcopy
and got this Python error:
2019-01-11 12:10:18.066025: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-01-11 12:10:18.414870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.62GiB
2019-01-11 12:10:18.424478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-11 12:10:19.370834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-11 12:10:19.375742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-11 12:10:19.377984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-11 12:10:19.380772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6373 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from checkpoints/4amCheckpoint\model.ckpt-79180
INFO:tensorflow:Restoring parameters from checkpoints/4amCheckpoint\model.ckpt-79180
Traceback (most recent call last):
  File "train.py", line 135, in <module>
    tf.app.run()
  File "C:\Users\Chris\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "train.py", line 131, in main
    train()
  File "train.py", line 77, in train
    step = int(meta_graph_path.split("-")[2].split(".")[0])
IndexError: list index out of range