sjvasquez / handwriting-synthesis

Handwriting Synthesis with RNNs ✏️

Issue with saving training onto model checkpoint #84

Open ImNotOssy opened 1 month ago

ImNotOssy commented 1 month ago

After training the model for hours on my own data, it seems to break since it can't save the training into a file that doesn't exist. I was using Google Colab for training. Log output:

restoring model from checkpoints/model-800
INFO:tensorflow:Restoring parameters from checkpoints/model-800
Restoring parameters from checkpoints/model-800
2024-05-23 11:58:55.097859: W tensorflow/core/framework/op_kernel.cc:1202] OP_REQUIRES failed at save_restore_tensor.cc:170 : Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for checkpoints/model-800
Traceback (most recent call last):
  File "/usr/local/envs/py364/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/usr/local/envs/py364/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/usr/local/envs/py364/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for checkpoints/model-800
  [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
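For anyone hitting the same `NotFoundError`: the message means no files on disk match the prefix `checkpoints/model-800`. A checkpoint prefix is not itself a file; the saver writes files that start with it, such as `<prefix>.index`. The snippet below is only a rough sketch for checking that before restoring. The helper names `checkpoint_files` and `checkpoint_exists` are made up for illustration and are not part of this repository or of TensorFlow.

```python
import glob
import os


def checkpoint_files(prefix):
    """Return the files on disk whose names start with '<prefix>.'.

    Hypothetical helper, not part of the repo or of TensorFlow.
    """
    # glob.escape protects against special characters in the directory name.
    return sorted(glob.glob(glob.escape(prefix) + ".*"))


def checkpoint_exists(prefix):
    """True if an index file was written for this checkpoint prefix."""
    return os.path.isfile(prefix + ".index")


if __name__ == "__main__":
    prefix = "checkpoints/model-800"  # prefix taken from the log above
    if checkpoint_exists(prefix):
        print("found:", checkpoint_files(prefix))
    else:
        print("no checkpoint files match", prefix)
```

If this reports no matching files, the restore step cannot succeed, and the thing to look into is why training never wrote a checkpoint.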

ImNotOssy commented 1 month ago

Fixed it. I migrated from Colab to a Google Cloud VM instance after running out of compute units and realizing I had $300 in credits to spend. I went all out and got myself a nice VM with blazing training speeds. I was able to get it to work after realizing that the minimum step at which checkpoints are saved was much higher than my total number of training steps, so no checkpoint was ever written.

It works now, but I feel my dataset is far too small for the model to learn anything. I tried to run it using demo.py, but I get a shape mismatch error. The error message says the parameter shape is [1, 20, 2] and the flat indices being accessed are [0, 20], which is out of bounds. I'm not sure whether this comes from training on too little data or from something else. No one else seems interested in this project but me, which makes it really hard to keep advancing.
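To illustrate the checkpoint fix described in this comment, here is a minimal sketch of how a minimum-step gate can prevent any checkpoint from being written. It is a simplified stand-in, not the repository's actual training loop: the function `checkpoint_steps`, its parameter names, and the numbers used are all assumptions chosen for illustration.

```python
def checkpoint_steps(num_training_steps, min_steps_to_checkpoint, save_every):
    """List the steps at which a checkpoint would be written.

    Assumed (hypothetical) rule: save every `save_every` steps, but only
    once the step count has reached `min_steps_to_checkpoint`.
    """
    return [
        step
        for step in range(1, num_training_steps + 1)
        if step >= min_steps_to_checkpoint and step % save_every == 0
    ]


# Training ends before the minimum is reached, so nothing is ever saved.
print(checkpoint_steps(num_training_steps=800,
                       min_steps_to_checkpoint=2000,
                       save_every=100))  # []

# With the minimum below the total step count, checkpoints are written.
print(checkpoint_steps(num_training_steps=800,
                       min_steps_to_checkpoint=500,
                       save_every=100))  # [500, 600, 700, 800]
```

With a short training run, it is worth checking that whatever minimum the training script uses is lower than the total number of steps.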

monickverma commented 3 weeks ago

Hey, I'm not able to run the project since the links in the README.md are down. What do I do?

ImNotOssy commented 3 weeks ago

> Hey, I'm not able to run the project since the links in the README.md are down. What do I do?

I don't understand. What links?