tensorflow / hub

A library for transfer learning by reusing parts of TensorFlow models.
https://tensorflow.org/hub
Apache License 2.0
3.47k stars 1.66k forks source link

DataLossError: Checksum does not match: #305

Closed jhihn closed 5 years ago

jhihn commented 5 years ago

I am an admin for a Tensorflow machine, it has dual 1080 Tis. The primary user is reporting the repeated DataLossErrors. The disk this is happening on is a NVME disk with 596 Gigs free. 4.13.0-43-generic #48~16.04.1-Ubuntu SMP Thu May 17 12:56:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

It can run hours, sometimes days before this happens.

What can you recommend on how to prevent this, to log adequate information, or otherwise deal with this? There is no reason why the "corruption" should be occurring that I know of. All virus scan is disabled, there is only one active user.

2019-05-20 05:13:23.350595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10407 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2019-05-20 05:13:23.350740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10409 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
2019-05-20 05:23:23.907853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-05-20 05:23:23.907930: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-20 05:23:23.907939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1
2019-05-20 05:23:23.907944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y
2019-05-20 05:23:23.907948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N
2019-05-20 05:23:23.908218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10407 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2019-05-20 05:23:23.908353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10409 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
2019-05-20 05:33:24.052697: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-05-20 05:33:24.052778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-20 05:33:24.052787: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1
2019-05-20 05:33:24.052791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y
2019-05-20 05:33:24.052796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N
2019-05-20 05:33:24.053068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10407 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2019-05-20 05:33:24.053208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10409 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
2019-05-20 05:43:23.737224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-05-20 05:43:23.737303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-20 05:43:23.737312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1
2019-05-20 05:43:23.737317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y
2019-05-20 05:43:23.737330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N
2019-05-20 05:43:23.737606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10407 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2019-05-20 05:43:23.737743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10409 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
2019-05-20 05:43:24.488949: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Data loss: Checksum does not match: stored 4061069822 vs. calculated on the restored bytes 2733927205
/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py:112: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.908
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.692
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.773
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.729
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.750
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.792
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.729
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.807
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.782
creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=3.26s).
Accumulating evaluation results...
DONE (t=0.47s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.738
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.972
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.903
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.668
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.786
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.723
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.550
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.749
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.790
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.709
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.821
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.781
creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=3.08s).
Accumulating evaluation results...
DONE (t=0.47s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.725
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.970
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.907
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.669
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.761
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.704
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.537
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.736
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.778
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.705
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.799
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.773
creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=3.33s).
Accumulating evaluation results...
DONE (t=0.49s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.749
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.977
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.922
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.684
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.780
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.737
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.753
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.795
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.728
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.814
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.784
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.DataLossError: Checksum does not match: stored 4061069822 vs. calculated on the restored bytes 2733927205
         [[{{node save/RestoreV2}} = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
         [[{{node save/RestoreV2/_761}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_765_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "object_detection/model_main.py", line 109, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "object_detection/model_main.py", line 105, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1471, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1320, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 582, in after_run
    if self._save(run_context.session, global_step):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 607, in _save
    if l.after_save(session, step):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 517, in after_save
    self._evaluate(global_step_value)  # updates self.eval_result
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 537, in _evaluate
    self._evaluator.evaluate_and_export())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 912, in evaluate_and_export
671, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line
1156, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line
1255, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line
1240, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line
1320, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/basic_session_run_hooks.py",
 line 582, in after_run
    if self._save(run_context.session, global_step):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/basic_session_run_hooks.py",
 line 607, in _save
    if l.after_save(session, step):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 517, in after_save
    self._evaluate(global_step_value)  # updates self.eval_result
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 537, in _evaluate
    self._evaluator.evaluate_and_export())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 912, in evaluate_and_export
    hooks=self._eval_spec.hooks)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 478, in evaluate
    return _evaluate()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 467, in _evaluate
    output_dir=self.eval_dir(name))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1591, in _evaluate_run
    config=self._session_config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/evaluation.py", line 271, in _evaluate_once
    session_creator=session_creator, hooks=hooks) as session:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 566, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/session_manager.py", line 288, in prepare_session
    config=config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/session_manager.py", line 202, in _restore_checkpoint
    saver.restore(sess, checkpoint_filename_with_path)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1546, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: Checksum does not match: stored 4061069822 vs. calculated on the restored bytes 2733927205
         [[node save/RestoreV2 (defined at /home/anika/models/research/object_detection/model_lib.py:475)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
         [[{{node save/RestoreV2/_761}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_765_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'save/RestoreV2', defined at:
  File "object_detection/model_main.py", line 109, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "object_detection/model_main.py", line 105, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1471, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1320, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 582, in after_run
    if self._save(run_context.session, global_step):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 607, in _save
    if l.after_save(session, step):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 517, in after_save
    self._evaluate(global_step_value)  # updates self.eval_result
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 537, in _evaluate
    self._evaluator.evaluate_and_export())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 912, in evaluate_and_export
    hooks=self._eval_spec.hooks)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 478, in evaluate
    return _evaluate()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 460, in _evaluate
    self._evaluate_build_graph(input_fn, hooks, checkpoint_path))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1484, in _evaluate_build_graph
    self._call_model_fn_eval(input_fn, self.config))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1520, in _call_model_fn_eval
    features, labels, model_fn_lib.ModeKeys.EVAL, config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/anika/models/research/object_detection/model_lib.py", line 475, in model_fn
    save_relative_paths=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1102, in __init__
    self.build()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1114, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1151, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 789, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 459, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 406, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 862, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1466, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

DataLossError (see above for traceback): Checksum does not match: stored 4061069822 vs. calculated on the restored bytes 2733927205
         [[node save/RestoreV2 (defined at /home/anika/models/research/object_detection/model_lib.py:475)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
         [[{{node save/RestoreV2/_761}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_765_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Tensor'>):
<tf.Tensor 'report_uninitialized_variables_1/boolean_mask/GatherV2:0' shape=(?,) dtype=string>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
  File "object_detection/model_main.py", line 109, in <module>
    tf.app.run()  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))  File "object_detection/model_main.py", line 105, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    return self  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)  File "/usr/local/lib/python3.  File "object_detection/model_main.py", line 109, in <module>
    tf.app.run()  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", lin
e 125, in run
    _sys.exit(main(argv))  File "object_detection/model_main.py", line 105, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])  File "/usr/local/lib/python
3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/tr
aining.py", line 610, in run
    return self.run_local()  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/
training.py", line 711, in run_local
    saving_listeners=saving_listeners)  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python
/estimator/estimator.py", line 356, in train
    return self  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py
", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)  File "/usr/local/lib/python3.
5/dist-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
    saving_listeners)  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1471, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1167, in run
    self._sess = None  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    raise six.reraise(*original_exc_info)  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1320, in run
    run_metadata=run_metadata))  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 582, in after_run
    if self._save(run_context.session, global_step):  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 607, in _save
    if l.after_save(session, step):  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 517, in after_save
    self._evaluate(global_step_value)  # updates self.eval_result  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 537, in _evaluate
    self._evaluator.evaluate_and_export())  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 912, in evaluate_and_export
    hooks=self._eval_spec.hooks)  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 478, in evaluate
    return _evaluate()  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 467, in _evaluate
    output_dir=self.eval_dir(name))  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1591, in _evaluate_run
    config=self._session_config)  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/evaluation.py", line 271, in _evaluate_once
    session_creator=session_creator, hooks=hooks) as session:  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1121, in _create_session
    'the job. Error: %s', e)  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 566, in create_session
    init_fn=self._scaffold.init_fn)  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 219, in finalize
    return self  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 268, in get_or_default
    return op  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 199, in default_ready_for_local_init_op
    variables.global_variables())  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/tf_should_use.py", line 189, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
==================================
gowthamkpr commented 5 years ago

@jhihn This means your checkpoint file got corrupted. Delete the latest version (i.e. the one with the largest global_step number) and try again and it should work.

jhihn commented 5 years ago

That's not really an acceptable fix. It happens regularly. There is no reason for it to happen. I think there's a problem with TF, or it's checksumming. It's not a 1-off thing. It "got corrupted" -- I question if that is true.

Perhaps there is a way to validate the file? Just because checksums differ does not mean it is "corrupt"? How do I know it was not generated as corrupt to begin with? Why is this check included? Can we further isolate the corruption?

I've read the other DataLossError threads and they all get closed because of inactivity. I think there's a problem in TF that's not getting fixed.

arnoegw commented 5 years ago

@jhihn: Have you considered the possibility that there could be a problem with the disk in use?

TensorFlow checkpoint files contain checksums to guard against accidental modifications of the stored data while being written, stored, or read back from the filesystem. They are used any time a TensorFlow program anywhere reads back a model weight from a checkpoint file, which means the code to compute and validate checksums is widely tested in practice.

In any case, this has nothing to do with TensorFow Hub.