Issue with saving training onto model checkpoint

ImNotOssy commented 6 months ago

After training the model for hours on my own data, it seems to break because it can't save the training into a checkpoint file that doesn't exist. I was using Google Colab for training.

restoring model from checkpoints/model-800
INFO:tensorflow:Restoring parameters from checkpoints/model-800
Restoring parameters from checkpoints/model-800
2024-05-23 11:58:55.097859: W tensorflow/core/framework/op_kernel.cc:1202] OP_REQUIRES failed at save_restore_tensor.cc:170 : Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for checkpoints/model-800
Traceback (most recent call last):
  File "/usr/local/envs/py364/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/usr/local/envs/py364/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/usr/local/envs/py364/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for checkpoints/model-800
  [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
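The "Failed to find any matching files" message means the restore step found nothing on disk under the prefix checkpoints/model-800. A quick way to confirm that before starting a run is sketched below. It uses only the Python standard library and relies on the fact that a V2-format TensorFlow checkpoint is stored as a `<prefix>.index` file plus `<prefix>.data-*` shards. The helper names `checkpoint_files` and `describe_checkpoint` are made up for this illustration and are not part of the project.

```python
import glob
import os


def checkpoint_files(prefix):
    """Return the files on disk that belong to a checkpoint prefix."""
    # A checkpoint saved under "checkpoints/model-800" is stored as
    # "model-800.index" plus one or more "model-800.data-*" shards,
    # so every file of interest starts with the prefix followed by a dot.
    return sorted(glob.glob(glob.escape(prefix) + ".*"))


def describe_checkpoint(prefix):
    """Explain whether a restore from `prefix` has anything to read."""
    directory = os.path.dirname(prefix) or "."
    if not os.path.isdir(directory):
        return "directory %r does not exist" % directory
    files = checkpoint_files(prefix)
    if not files:
        return "no files match %r; nothing was saved under that name" % prefix
    return "found: " + ", ".join(os.path.basename(f) for f in files)


if __name__ == "__main__":
    # Prefix taken from the log above.
    print(describe_checkpoint("checkpoints/model-800"))
```

If this reports that nothing matches, the problem is on the saving side (no checkpoint was written) and not in the restore call itself.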

ImNotOssy commented 6 months ago

Fixed it. I migrated from Colab to a Google Cloud VM instance after running out of compute units and realizing I had $300 in credits to spend. I went all out and got myself a nice VM with blazing training speeds. I got it to work after realizing that the minimum step count required before saving was far higher than my total number of training steps, so no checkpoint was ever written.

Training now works, but I feel my dataset is far too small for the model to learn anything. I tried running demo.py, but I get a shape mismatch error: the message says the parameter shape is [1, 20, 2] and the flat indices being accessed are [0, 20], which is out of bounds. I'm not sure whether this is caused by too little training data or by something else. No one else seems interested in this project but me, so it's really hard to keep advancing.
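Both points above come down to simple arithmetic, sketched here in plain Python. The first function assumes the trainer only allows a save once the step counter exceeds a minimum threshold; that rule and the names `num_training_steps` and `min_steps_to_checkpoint` are assumptions for illustration, so check the actual training script for the real condition. The second function only shows why [0, 20] cannot index a parameter of shape [1, 20, 2]; it does not diagnose what in demo.py produces that index.

```python
def can_ever_checkpoint(num_training_steps, min_steps_to_checkpoint):
    """True if at least one training step is allowed to write a checkpoint.

    Assumed rule (hypothetical): steps run from 0 to num_training_steps - 1
    and a save is only permitted when step > min_steps_to_checkpoint.
    """
    last_step = num_training_steps - 1
    return last_step > min_steps_to_checkpoint


def index_in_bounds(shape, index):
    """True if `index` addresses a valid position in a tensor of `shape`.

    Each coordinate i must satisfy 0 <= i < size of its dimension.
    """
    if len(index) > len(shape):
        return False
    return all(0 <= i < size for i, size in zip(index, shape))


if __name__ == "__main__":
    # Illustrative numbers only: a short run with a high save threshold.
    print(can_ever_checkpoint(num_training_steps=1000,
                              min_steps_to_checkpoint=2000))

    # Shape and index from the error message: the second dimension has
    # size 20, so its valid indices are 0..19 and 20 is one past the end.
    print(index_in_bounds([1, 20, 2], [0, 20]))
    print(index_in_bounds([1, 20, 2], [0, 19]))
```

An index equal to the dimension size often points to an off-by-one between two sizes (for example, a sequence one element longer than the tensor it indexes), but that is a guess and not something the error message confirms.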

monickverma commented 5 months ago

Hey, I'm not able to run the project since the links in the README.md are down. What do I do?

ImNotOssy commented 5 months ago

I don't understand. What links?

letsgocodego commented 3 months ago

I'm sure this question will come off as excessively elementary, but, how did you use your own data? Did you just use some data from the IAM On-Line Handwriting Database? Did you devise a method that enables you to create a file that looks like this: https://fki.tic.heia-fr.ch/static/iamondb/strokesz.xml but with your own writing sample?

Thank you!

ImNotOssy commented 3 months ago

I used this and it worked great: https://github.com/acmattson3/handwriting-data

npulsipher4 commented 2 months ago

@ImNotOssy I'm getting the same error, but I'm running it locally in Docker. Training got to about 2200 steps and then stopped. Do you think I just need more data?

sjvasquez / handwriting-synthesis

Issue with saving training onto model checkpoint #84