otuva / handwriting-synthesis

Handwriting Synthesis with RNNs ✏️ (Migrated to TensorFlow v2)

Training Failure trying to Restore Model #15

Closed bryandam closed 11 months ago

bryandam commented 11 months ago

So I've been trying to train my own model in the possibly vain hope of fixing the Q, X, and Z characters. I figure I've got an RTX 3080, so why not give it a shot? I've bought a Beam system off eBay in case I need to flesh out the dataset, but I figured step one was to go through and make sure I could successfully do a training run that replicates the current behavior.

At first I figured I should start from a clean slate by clearing out the model/checkpoint folder. That went fine until around step 1520:

[[step     1480]]     [[train 5.898s]]     loss: 3.8656888        [[val 1.8396s]]     loss: 3.86674941
[[step     1500]]     [[train 5.9107s]]     loss: 3.86648588       [[val 1.8238s]]     loss: 3.8679687
[[step     1520]]     [[train 5.9829s]]     loss: 3.86527902       [[val 1.8333s]]     loss: 3.86707894
restoring model parameters from None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bdam/handwriting-synthesis/handwriting_synthesis/training/train.py", line 32, in train
    nn.fit()
  File "/home/bdam/handwriting-synthesis/handwriting_synthesis/tf/BaseModel.py", line 280, in fit
    self.restore(best_validation_tstep)
  File "/home/bdam/handwriting-synthesis/handwriting_synthesis/tf/BaseModel.py", line 356, in restore
    saver.restore(self.session, model_path)
  File "/home/bdam/miniconda3/envs/tf/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 1406, in restore
    raise ValueError("Can't load save_path when it is None.")
ValueError: Can't load save_path when it is None.

It struck me as odd that you'd need the existing model in order to train a new one, which seems like a catch-22, but okay: the instructions didn't say to clear that directory, so I restored the files and tried again. However, I was met with a very similar error:

[[step     1500]]     [[train 6.8473s]]     loss: 3.78012583       [[val 1.8517s]]     loss: 3.78005823
[[step     1520]]     [[train 6.8115s]]     loss: 3.77934692       [[val 1.8688s]]     loss: 3.78114267
[[step     1540]]     [[train 6.9358s]]     loss: 3.77841039       [[val 1.8917s]]     loss: 3.7784446
restoring model from model/checkpoint/model-20
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bdam/handwriting-synthesis/handwriting_synthesis/training/train.py", line 32, in train
    nn.fit()
  File "/home/bdam/handwriting-synthesis/handwriting_synthesis/tf/BaseModel.py", line 280, in fit
    self.restore(best_validation_tstep)
  File "/home/bdam/handwriting-synthesis/handwriting_synthesis/tf/BaseModel.py", line 362, in restore
    saver.restore(self.session, model_path)
  File "/home/bdam/miniconda3/envs/tf/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 1410, in restore
    raise ValueError("The passed save_path is not a valid checkpoint: " +
ValueError: The passed save_path is not a valid checkpoint: model/checkpoint/model-20

This second time it acts as if BaseModel.restore() received a step (step 20), since it builds a checkpoint path called 'model-20', but that checkpoint doesn't exist.

As near as I can tell, then, BaseModel.restore() is being called before BaseModel.save() has ever run. Any thoughts @otuva?
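
For what it's worth, both failures fit that reading: with an empty model/checkpoint directory, tf.train.latest_checkpoint() returns None (hence "restoring model parameters from None"), and with the old files put back, restore() builds a model-<step> prefix for step 20 whose data files were never written. Below is a rough sketch of the kind of guard I'd expect before saver.restore() is called; safe_restore and checkpoint_dir are illustrative names of my own, not the fork's actual code:

```python
import os
import tensorflow.compat.v1 as tf  # the repo appears to use the TF1-style Saver API

def safe_restore(session, saver, checkpoint_dir, step=None):
    """Restore only when a usable checkpoint actually exists (illustrative helper)."""
    if step is not None:
        # Saver writes model-<step>.index / .data-* files, so check for them
        # instead of assuming the prefix is valid.
        model_path = os.path.join(checkpoint_dir, 'model-{}'.format(step))
        if not tf.train.checkpoint_exists(model_path):
            print('no checkpoint for step {}, skipping restore'.format(step))
            return False
    else:
        # Empty directory -> latest_checkpoint() returns None.
        model_path = tf.train.latest_checkpoint(checkpoint_dir)
        if model_path is None:
            print('no checkpoints in {}, skipping restore'.format(checkpoint_dir))
            return False
    saver.restore(session, model_path)
    return True
```

The important part is simply checking for None / missing files before calling saver.restore(), or making sure save() has run at least once first.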

Lastly, any guidance on real-world training performance? It supposedly took a couple of days on a Tesla K80, which my RTX 3080 should, in theory, beat by a fair margin. However, based on how long those 1,540 steps took, I'm looking at 8+ days here. I'm running it under WSL on Windows 11 with Ubuntu 22.04.2 LTS, and the GPU is being used:

2023-09-05 08:07:20.616918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 7335 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:01:00.0, compute capability: 8.6

I wonder if WSL introduces some kind of bottleneck and whether I'll need to dual-boot to get better performance. Does anyone have experience to share here?
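
In case it helps anyone comparing numbers, here's the quick sanity check I'd run to confirm the GPU is actually doing the work under WSL; these are standard TF2 calls, nothing specific to this repo:

```python
import time
import tensorflow as tf

# Confirm TensorFlow sees the GPU at all.
print(tf.config.list_physical_devices('GPU'))

# Time a large matmul as a rough throughput check; a 3080 should finish
# this in a few milliseconds once warmed up.
with tf.device('/GPU:0'):
    a = tf.random.normal([4096, 4096])
    b = tf.random.normal([4096, 4096])
    _ = tf.matmul(a, b).numpy()   # warm-up / kernel compilation
    start = time.time()
    c = tf.matmul(a, b)
    c.numpy()                     # force execution to finish
    print('matmul took {:.4f}s'.format(time.time() - start))
```

If the matmul is fast but each training step is still slow, the bottleneck is more likely data loading than WSL's GPU passthrough.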

bryandam commented 11 months ago

OK, I figured out what was going on with the checkpoint restore and have opened a PR to fix it: #16

I also think I understand the performance question, and I don't believe there's an actual performance issue. The train() method passes 100,000 in for num_training_steps, and on my RTX 3080 I was averaging around 8.5s per step. However, judging by the checkpoint files, the model included in the repo apparently took only 17,900 steps. In other words, it shouldn't take 100k steps to train the model.
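
For anyone who wants to double-check that step count, it's recoverable from the checkpoint state with a standard TensorFlow call. The model/checkpoint path below matches this repo's layout; the model-17900 filename in the comment is only what I'd expect from the model-<step> naming seen above, not something I've verified:

```python
import tensorflow as tf

# Read the checkpoint state file that tf.train.Saver maintains alongside
# the model-<step> files; it records the most recently saved checkpoint.
state = tf.train.get_checkpoint_state('model/checkpoint')
if state is None:
    print('no checkpoint state found')
else:
    print(state.model_checkpoint_path)            # e.g. model/checkpoint/model-17900
    print(list(state.all_model_checkpoint_paths))
```

If the model constructor exposes the same num_training_steps argument that train() is passing 100,000 into, lowering it to around 20k should make a from-scratch run far more tractable.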