una-dinosauria / human-motion-prediction

Simple baselines and RNNs for predicting human motion in tensorflow. Presented at CVPR 17.
MIT License

TF Saver has issues on Windows #10

Closed by aaronsnoswell 7 years ago

aaronsnoswell commented 7 years ago

Hi there,

This is a quick post that I'll update later when I have more time.

Thanks for this paper and for sharing your code. I'm trying to replicate your results on Windows 10, and the TensorFlow Saver class that saves the model during training seems to have an issue. Either the path or the file name of the checkpoint files appears to be too long for Windows or NTFS (I haven't determined which yet). To help me debug this, can you let me know which operating system and file system you ran this code on?

Thank you, I'll share more info later.

una-dinosauria commented 7 years ago

Hi @aaronsnoswell! I've tested this code on Ubuntu 16.04 and Mac OS X. Unfortunately, we don't have Windows machines in the lab. Please do let me know if you fix the issue, and I'll be infinitely thankful if you can submit a PR for it when that happens.

aaronsnoswell commented 7 years ago

Thanks for the info, Julieta. I was able to figure out what was going on. The full error message I was seeing is below:

(tensorflow35) E:\Aaron Snoswell PhD\Jul 2017 Having A Crack At It Again\human-motion-prediction>python src/translate.py --action walking --seq_length_out 25 --iterations 10000 --test_every 10 --save_every 10
Reading training data (seq_len_in: 50, seq_len_out 25).
Reading subject 1, action walking, subaction 1
Reading subject 1, action walking, subaction 2
Reading subject 6, action walking, subaction 1
Reading subject 6, action walking, subaction 2
Reading subject 7, action walking, subaction 1
Reading subject 7, action walking, subaction 2
Reading subject 8, action walking, subaction 1
Reading subject 8, action walking, subaction 2
Reading subject 9, action walking, subaction 1
Reading subject 9, action walking, subaction 2
Reading subject 11, action walking, subaction 1
Reading subject 11, action walking, subaction 2
Reading subject 5, action walking, subaction 1
Reading subject 5, action walking, subaction 2
done reading data.
2017-08-06 11:22:32.842004: W C:\tf_jenkins\home\workspace\nightly-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-08-06 11:22:32.842120: W C:\tf_jenkins\home\workspace\nightly-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-06 11:22:32.842148: W C:\tf_jenkins\home\workspace\nightly-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-06 11:22:32.842169: W C:\tf_jenkins\home\workspace\nightly-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-06 11:22:32.842191: W C:\tf_jenkins\home\workspace\nightly-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-06 11:22:32.842212: W C:\tf_jenkins\home\workspace\nightly-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-06 11:22:32.842233: W C:\tf_jenkins\home\workspace\nightly-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-06 11:22:32.842255: W C:\tf_jenkins\home\workspace\nightly-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Creating 1 layers of 1024 units.
One hot is  True
Input size is 55
rnn_size = 1024
output_size = 55
 state_size = 1024
Creating model with fresh parameters.
Model created
step 0000; step_loss: 0.9429

milliseconds     |    80 |   160 |   320 |   400 |   560 |  1000 |
walking          | 1.678 | 1.657 | 1.630 | 1.625 | 1.630 | 1.603 |

============================
Global step:         10
Learning rate:       0.0050
Step-time (ms):     1678.4439
Train loss avg:      0.9759
--------------------------
Val loss:            1.2118
srnn loss:           1.0169
============================

Saving the model...
2017-08-06 11:23:05.191146: W C:\tf_jenkins\home\workspace\nightly-win\M\windows\PY\35\tensorflow\core\framework\op_kernel.cc:1165] Not found: Failed to create a NewWriteableFile: experiments\walking\out_25\iterations_10000\tied\sampling_based\one_hot\depth_1\size_1024\lr_0.005\not_residual_vel\checkpoint-10.data-00000-of-00001.tempstate3138973305096497355 : The system cannot find the path specified.

Traceback (most recent call last):
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\client\session.py", line 1267, in _do_call
    return fn(*args)
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\client\session.py", line 1248, in _run_fn
    status, run_metadata)
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\contextlib.py", line 66, in __exit__
    next(self.gen)
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: Failed to create a NewWriteableFile: experiments\walking\out_25\iterations_10000\tied\sampling_based\one_hot\depth_1\size_1024\lr_0.005\not_residual_vel\checkpoint-10.data-00000-of-00001.tempstate3138973305096497355 : The system cannot find the path specified.

         [[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, Variable, Variable_1, combined_tied_rnn_seq2seq/tied_rnn_seq2seq/gru_cell/candidate/bias, combined_tied_rnn_seq2seq/tied_rnn_seq2seq/gru_cell/candidate/kernel, combined_tied_rnn_seq2seq/tied_rnn_seq2seq/gru_cell/gates/bias, combined_tied_rnn_seq2seq/tied_rnn_seq2seq/gru_cell/gates/kernel, proj_b_out, proj_w_out)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/translate.py", line 700, in <module>
    tf.app.run()
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "src/translate.py", line 697, in main
    train()
  File "src/translate.py", line 479, in train
    model.saver.save(sess, os.path.normpath(os.path.join(train_dir, 'checkpoint')), global_step=current_step )
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\training\saver.py", line 1490, in save
    raise exc
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\training\saver.py", line 1474, in save
    {self.saver_def.filename_tensor_name: checkpoint_file})
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\client\session.py", line 896, in run
    run_metadata_ptr)
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\client\session.py", line 1108, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\client\session.py", line 1261, in _do_run
    options, run_metadata)
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\client\session.py", line 1280, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Failed to create a NewWriteableFile: experiments\walking\out_25\iterations_10000\tied\sampling_based\one_hot\depth_1\size_1024\lr_0.005\not_residual_vel\checkpoint-10.data-00000-of-00001.tempstate3138973305096497355 : The system cannot find the path specified.

         [[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, Variable, Variable_1, combined_tied_rnn_seq2seq/tied_rnn_seq2seq/gru_cell/candidate/bias, combined_tied_rnn_seq2seq/tied_rnn_seq2seq/gru_cell/candidate/kernel, combined_tied_rnn_seq2seq/tied_rnn_seq2seq/gru_cell/gates/bias, combined_tied_rnn_seq2seq/tied_rnn_seq2seq/gru_cell/gates/kernel, proj_b_out, proj_w_out)]]

Caused by op 'save/SaveV2', defined at:
  File "src/translate.py", line 700, in <module>
    tf.app.run()
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "src/translate.py", line 697, in main
    train()
  File "src/translate.py", line 132, in train
    model = create_model( sess, actions )
  File "src/translate.py", line 83, in create_model
    dtype=tf.float32)
  File "E:\Aaron Snoswell PhD\Jul 2017 Having A Crack At It Again\human-motion-prediction\src\seq2seq_model.py", line 381, in __init__
    self.saver = tf.train.Saver( tf.global_variables(), max_to_keep=10 )
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\training\saver.py", line 1140, in __init__
    self.build()
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\training\saver.py", line 1172, in build
    filename=self._filename)
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\training\saver.py", line 686, in build
    save_tensor = self._AddSaveOps(filename_tensor, saveables)
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\training\saver.py", line 276, in _AddSaveOps
    save = self.save_op(filename_tensor, saveables)
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\training\saver.py", line 219, in save_op
    tensors)
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\ops\gen_io_ops.py", line 766, in save_v2
    tensors=tensors, name=name)
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\framework\ops.py", line 2528, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "C:\Users\uqasnosw\AppData\Local\Continuum\Miniconda3\envs\tensorflow35\lib\site-packages\tensorflow\python\framework\ops.py", line 1203, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): Failed to create a NewWriteableFile: experiments\walking\out_25\iterations_10000\tied\sampling_based\one_hot\depth_1\size_1024\lr_0.005\not_residual_vel\checkpoint-10.data-00000-of-00001.tempstate3138973305096497355 : The system cannot find the path specified.

         [[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, Variable, Variable_1, combined_tied_rnn_seq2seq/tied_rnn_seq2seq/gru_cell/candidate/bias, combined_tied_rnn_seq2seq/tied_rnn_seq2seq/gru_cell/candidate/kernel, combined_tied_rnn_seq2seq/tied_rnn_seq2seq/gru_cell/gates/bias, combined_tied_rnn_seq2seq/tied_rnn_seq2seq/gru_cell/gates/kernel, proj_b_out, proj_w_out)]]

As you can see, TensorFlow's Saver.save method was failing to create the checkpoint files. The problem turned out to be the maximum path length accepted by the classic Windows file APIs (MAX_PATH, 260 characters including the terminating null), not a limitation of NTFS itself. For me, the code generates a path like E:\Aaron Snoswell PhD\Jul 2017 Having A Crack At It Again\human-motion-prediction\experiments\walking\out_25\iterations_10000\tied\sampling_based\one_hot\depth_1\size_1024\lr_0.005\not_residual_vel\checkpoint-10.data-00000-of-00001.tempstate3138973305096497355, which is around 260 characters. Moving the code to a very shallow project folder (e.g. E:\hmp) makes the problem go away. After applying the changes in my other pull requests, the code has run fine on Windows so far.
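For anyone who wants to check their own setup before training, here is a minimal sketch of the path-length arithmetic. It assumes the classic Win32 MAX_PATH limit of 260 characters (including the terminating null) and a temporary-file suffix of up to 30 characters, which is my guess based on the ".tempstate…" name in the log above. The function names are hypothetical and not part of the repository or of TensorFlow.

```python
import ntpath

MAX_PATH = 260  # classic Win32 limit; the count includes the terminating null

# Assumption: the Saver appends ".tempstate" plus a number of up to 20 digits,
# as seen in the error message above.
TEMP_SUFFIX_ALLOWANCE = 30


def checkpoint_path_length(project_dir, train_dir, step):
    """Estimate the longest path written while saving one checkpoint."""
    data_file = "checkpoint-{}.data-00000-of-00001".format(step)
    full_path = ntpath.join(project_dir, train_dir, data_file)
    return len(full_path) + TEMP_SUFFIX_ALLOWANCE


def fits_in_max_path(project_dir, train_dir, step):
    """True if the estimated path leaves room for the terminating null."""
    return checkpoint_path_length(project_dir, train_dir, step) < MAX_PATH


if __name__ == "__main__":
    train_dir = (r"experiments\walking\out_25\iterations_10000\tied"
                 r"\sampling_based\one_hot\depth_1\size_1024\lr_0.005"
                 r"\not_residual_vel")
    deep = (r"E:\Aaron Snoswell PhD\Jul 2017 Having A Crack At It Again"
            r"\human-motion-prediction")
    for project_dir in (deep, r"E:\hmp"):
        print(project_dir, fits_in_max_path(project_dir, train_dir, 10))
```

The sketch uses ntpath rather than os.path so that it gives the same answer on any operating system. It is only an estimate, since the exact temporary file name is chosen by TensorFlow at save time.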

There are workarounds in C/C++ for using longer path names on Windows. This is an issue with the TensorFlow core library, and I'll raise it there.
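One such workaround is the extended-length path prefix (\\?\), which the Unicode Win32 file functions accept for absolute paths longer than MAX_PATH. The sketch below only shows how such a path is formed. The helper name is hypothetical, and nothing in this thread confirms that TensorFlow's Saver accepts prefixed paths, so treat it as an illustration of the convention rather than a fix.

```python
import ntpath

EXTENDED_PREFIX = "\\\\?\\"  # the four characters \\?\


def to_extended_length_path(path):
    """Return an absolute Windows path in extended-length form."""
    if path.startswith(EXTENDED_PREFIX):
        return path  # already prefixed
    if not ntpath.isabs(path):
        # The prefix disables relative-path resolution, so the input
        # must already be absolute.
        raise ValueError("path must be absolute: {!r}".format(path))
    # Prefixed paths are passed through almost verbatim, so normalise
    # forward slashes and ".." segments first.
    path = ntpath.normpath(path)
    if path.startswith("\\\\"):
        # UNC form: \\server\share\... becomes \\?\UNC\server\share\...
        return EXTENDED_PREFIX + "UNC\\" + path[2:]
    return EXTENDED_PREFIX + path


if __name__ == "__main__":
    print(to_extended_length_path("E:/hmp/experiments/walking"))
```

Newer Windows 10 builds also offer an opt-in long-path setting, but it requires the application itself to declare support, so it is not something a script can switch on by itself.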

imnishantg commented 7 years ago

Just FYI, I faced exactly the same issue. As mentioned above, I was able to solve it by running the code from a shallow folder. Code at: https://gist.github.com/imnishantg/5067dd7c1572e0891595bf05c3d2caf0

System info: Windows 10 (64-bit), TensorFlow version 1.2.1

Has this issue been resolved in later versions of TensorFlow?

Thanks, Nishant

quarterpastsix commented 6 years ago

I just downloaded TensorFlow and have the same issue.