neuroailab / tfutils

Utilities for working with tensorflow
MIT License

Filter saving to DB or retrieval might be corrupt #38

Closed qbilius closed 7 years ago

qbilius commented 7 years ago

I've trained AlexNet on kanefsky, saving filters to the DB every 30k iterations. Now I'm trying to retrieve those filters and run validation at each step. For most steps this works just fine, but several steps give the following error:

INFO:tfutils:Loading checkpoint from alexnet-test.alexnet.files
INFO:tfutils:No cache file at /home/qbilius/.tfutils/localhost:31001/alexnet-test/alexnet/trainval-knf-corrected3/checkpoint-270000.tar, loading from DB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:08:00.0)
INFO:tfutils:Loading checkpoint from alexnet-test.alexnet.files
INFO:tfutils:Cache file found at /home/qbilius/.tfutils/localhost:31001/alexnet-test/alexnet/trainval-knf/checkpoint-450000, using that to load
WARNING:tfutils:repo /home/qbilius/mh17/Dropbox (MIT)/tfutils/master/.git is dirty
WARNING:tfutils:repo /home/qbilius/mh17/Dropbox (MIT)/tfutils/master/.git is dirty
WARNING:tfutils:repo /home/qbilius/mh17/Dropbox (MIT)/tfutils/master/.git is dirty
WARNING:tfutils:repo /home/qbilius/mh17/Dropbox (MIT)/tfutils/master/.git is dirty
WARNING:tfutils:repo /home/qbilius/mh17/Dropbox (MIT)/tfutils/master/.git is dirty
WARNING:tfutils:No matching checkpoint for query "{'saved_filters': True, 'exp_id': 'trainval-knf-corrected3'}"
INFO:tfutils:Loading checkpoint from alexnet-test.alexnet.files
INFO:tfutils:Cache file found at /home/qbilius/.tfutils/localhost:31001/alexnet-test/alexnet/trainval-knf-corrected3/checkpoint-270000, using that to load
INFO:tfutils:Restoring variables from record 586fb66661bb4e62b531a472 (step 270000)...
W tensorflow/core/framework/op_kernel.cc:975] Data loss: file is too short to be an sstable
[the warning above is repeated 15 more times]
Traceback (most recent call last):
  File "train_alexnet.py", line 175, in <module>
    save_params=params['save_params'])
  File "../tfutils/base.py", line 682, in test_from_params
    dbinterface.initialize()
  File "../tfutils/base.py", line 271, in initialize
    tf_saver.restore(self.sess, cache_filename)
  File "/home/qbilius/libs/miniconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1388, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/qbilius/libs/miniconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/home/qbilius/libs/miniconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/home/qbilius/libs/miniconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/home/qbilius/libs/miniconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: file is too short to be an sstable
     [[Node: save/RestoreV2_15 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_15/tensor_names, save/RestoreV2_15/shape_and_slices)]]
     [[Node: save/RestoreV2_6/_17 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_82_save/RestoreV2_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op u'save/RestoreV2_15', defined at:
  File "train_alexnet.py", line 175, in <module>
    save_params=params['save_params'])
  File "../tfutils/base.py", line 682, in test_from_params
    dbinterface.initialize()
  File "../tfutils/base.py", line 263, in initialize
    tf_saver = self.tf_saver
  File "../tfutils/base.py", line 281, in tf_saver
    self._tf_saver = tf.train.Saver(*self.tfsaver_args, **self.tfsaver_kwargs)
  File "/home/qbilius/libs/miniconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1000, in __init__
    self.build()
  File "/home/qbilius/libs/miniconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1030, in build
    restore_sequentially=self._restore_sequentially)
  File "/home/qbilius/libs/miniconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 624, in build
    restore_sequentially, reshape)
  File "/home/qbilius/libs/miniconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 361, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/home/qbilius/libs/miniconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 200, in restore_op
    [spec.tensor.dtype])[0])
  File "/home/qbilius/libs/miniconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 441, in restore_v2
    dtypes=dtypes, name=name)
  File "/home/qbilius/libs/miniconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/home/qbilius/libs/miniconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/qbilius/libs/miniconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

DataLossError (see above for traceback): file is too short to be an sstable
     [[Node: save/RestoreV2_15 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_15/tensor_names, save/RestoreV2_15/shape_and_slices)]]
     [[Node: save/RestoreV2_6/_17 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_82_save/RestoreV2_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
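
The failing restore is reading the local cache copy of checkpoint-270000 rather than the DB record directly, so one possibility is that the cached checkpoint is simply a truncated download. A minimal diagnostic sketch (not part of tfutils; it assumes the cached checkpoint is the single file at the path shown in the log above, and the 1 MB threshold is an arbitrary assumption) would be to check the cache file's size and delete it if it looks truncated, so that the next attempt falls back to loading from the DB, as the log shows tfutils doing when no cache file exists:

# Sketch only: check whether the cached checkpoint looks truncated and, if so,
# remove it so tfutils re-fetches the record from the DB on the next load.
# The path is copied from the log above; the 1 MB threshold is an assumption.
import os

cache_file = ('/home/qbilius/.tfutils/localhost:31001/'
              'alexnet-test/alexnet/trainval-knf-corrected3/checkpoint-270000')

if os.path.exists(cache_file):
    size = os.path.getsize(cache_file)
    print('cached checkpoint is %d bytes' % size)
    # A full AlexNet checkpoint should be on the order of hundreds of MB; a
    # tiny file would match the "file is too short to be an sstable" error.
    if size < 1024 * 1024:
        os.remove(cache_file)
        print('removed suspect cache file; next load should fall back to the DB')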
yamins81 commented 7 years ago

This is not reproducible, so I'm closing. It might be an intermittent issue, but there's nothing systematic we can do about it that would be worth the time required.