lovecambi / qebrain

machine translation and quality estimation
BSD 2-Clause "Simplified" License

Checkpoint error #2

Closed dimitarsh1 closed 5 years ago

dimitarsh1 commented 5 years ago

Hello,

Sometimes exp_train succeeds and sometimes it fails with the following error:

..... DataLossError (see above for traceback): Checksum does not match: stored 1669981049 vs. calculated on the restored bytes 3056795359 [[node save/RestoreV2 (defined at expert_model.py:738) = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT64, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Any idea what the problem could be?

Thanks, Cheers, Dimitar
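
For reference, one way to reproduce this failure outside of training is to try reading the checkpoint back directly. Below is a minimal sketch (TF 1.x; the "model_dir" path is a placeholder, not a directory from this repo) that forces every stored tensor to be read, which should surface the same DataLossError if the file is corrupted:

import tensorflow as tf

# Placeholder directory; point this at the actual training output directory.
ckpt_path = tf.train.latest_checkpoint("model_dir")

try:
    reader = tf.train.NewCheckpointReader(ckpt_path)
    # Reading every variable forces each stored tensor (and its checksum) to be verified.
    for name in reader.get_variable_to_shape_map():
        reader.get_tensor(name)
    print("checkpoint looks readable:", ckpt_path)
except tf.errors.DataLossError as e:
    print("checkpoint appears corrupted:", ckpt_path)
    print(e)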

dimitarsh1 commented 5 years ago

Please ignore this error. I updated the NVIDIA driver and now it works fine.

Cheers, Dimitar

dimitarsh1 commented 5 years ago

Actually, the error persists. Sometimes it can train a model, and sometimes it cannot.

Here is the error from my latest training run:

 File "/home/dimitarsh1/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

DataLossError (see above for traceback): Checksum does not match: stored 2584675441 vs. calculated on the restored bytes 1023513604
         [[node save/RestoreV2 (defined at expert_model.py:738)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT64, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
         [[{{node save/RestoreV2/_273}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_171_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
lovecambi commented 5 years ago

I guess this is because the latest checkpoint is corrupted (e.g., the program was terminated while the model was being saved). Can you delete the latest checkpoint and try again?
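
A minimal sketch of that workaround (TF 1.x; "model_dir" is again a placeholder): skip any checkpoint that cannot be read back cleanly and restore the newest one that can, instead of blindly restoring the very latest file:

import tensorflow as tf

def newest_readable_checkpoint(model_dir):
    """Return the newest checkpoint in model_dir that reads back without a DataLossError."""
    state = tf.train.get_checkpoint_state(model_dir)
    if state is None:
        return None
    # all_model_checkpoint_paths is ordered oldest -> newest, so walk it in reverse.
    for path in reversed(state.all_model_checkpoint_paths):
        try:
            reader = tf.train.NewCheckpointReader(path)
            for name in reader.get_variable_to_shape_map():
                reader.get_tensor(name)
            return path
        except tf.errors.DataLossError:
            continue
    return None

# Usage with an existing graph and Saver:
# saver = tf.train.Saver()
# with tf.Session() as sess:
#     path = newest_readable_checkpoint("model_dir")
#     if path is not None:
#         saver.restore(sess, path)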

lovecambi commented 5 years ago

Closing due to no activity.