dimitarsh1 closed this issue 5 years ago
Please ignore this error. I updated the nvidia driver and now it works fine.
Cheers, Dimitar
Actually, the error persists. Sometimes it can train a model, and sometimes it cannot.
Here is the traceback from my latest training run:
File "/home/dimitarsh1/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
DataLossError (see above for traceback): Checksum does not match: stored 2584675441 vs. calculated on the restored bytes 1023513604
[[node save/RestoreV2 (defined at expert_model.py:738) = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT64, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[{{node save/RestoreV2/_273}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_171_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
I guess this is because the latest checkpoint is corrupted (e.g., the program was terminated while the model was being saved). Can you delete the latest checkpoint and try again?
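If it helps, here is a minimal sketch of how the latest checkpoint could be moved aside instead of deleted outright, so it can still be inspected later. It uses only the Python standard library and assumes the usual TensorFlow layout: a `checkpoint` state file in the model directory plus files named `<prefix>.index`, `<prefix>.data-*` and `<prefix>.meta` in the same directory. The function name `quarantine_latest_checkpoint` and the `corrupt_backup` folder are made up for illustration and are not part of this project or of TensorFlow.

```python
import os
import re
import shutil


def quarantine_latest_checkpoint(model_dir, backup_name="corrupt_backup"):
    """Move the newest checkpoint's files aside and return its prefix (or None)."""
    state_path = os.path.join(model_dir, "checkpoint")
    if not os.path.exists(state_path):
        return None
    with open(state_path) as f:
        lines = f.read().splitlines()

    # The state file has lines such as:
    #   model_checkpoint_path: "model.ckpt-200"
    #   all_model_checkpoint_paths: "model.ckpt-100"
    latest_re = re.compile(r'^model_checkpoint_path:\s*"(.*)"\s*$')
    history_re = re.compile(r'^all_model_checkpoint_paths:\s*"(.*)"\s*$')
    latest = next((m.group(1) for m in map(latest_re.match, lines) if m), None)
    history = [m.group(1) for m in map(history_re.match, lines) if m]
    if latest is None:
        return None

    # Assumption: the checkpoint files live directly in model_dir.
    backup_dir = os.path.join(model_dir, backup_name)
    os.makedirs(backup_dir, exist_ok=True)
    base = os.path.basename(latest)
    for name in os.listdir(model_dir):
        # The trailing "." keeps "model.ckpt-100" from matching "model.ckpt-1000".
        if name.startswith(base + "."):
            shutil.move(os.path.join(model_dir, name),
                        os.path.join(backup_dir, name))

    # Point the state file at the previous checkpoint, if one is listed.
    # Any other lines in the state file (e.g. timestamps) are dropped here.
    remaining = [p for p in history if p != latest]
    if remaining:
        with open(state_path, "w") as f:
            f.write('model_checkpoint_path: "%s"\n' % remaining[-1])
            for p in remaining:
                f.write('all_model_checkpoint_paths: "%s"\n' % p)
    else:
        os.remove(state_path)
    return latest
```

Calling `quarantine_latest_checkpoint("/path/to/model_dir")` (with your own model directory) before restarting training should make the next run restore from the previous checkpoint, or start from scratch if none is left. If the error still appears after that, the older checkpoint may be damaged as well, or the cause lies elsewhere.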
Hello
Sometimes exp_train succeeds and sometimes it fails with the following error:
..... DataLossError (see above for traceback): Checksum does not match: stored 1669981049 vs. calculated on the restored bytes 3056795359
[[node save/RestoreV2 (defined at expert_model.py:738) = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT64, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
Any idea what the problem could be?
Thanks, Cheers, Dimitar