otuva / handwriting-synthesis

Handwriting Synthesis with RNNs ✏️ (Migrated to TensorFlow v2)

Training Failure trying to Restore Model #15

Closed bryandam closed 11 months ago

bryandam commented 11 months ago

So I've been trying to train my own model in the possibly vain hope of fixing the Q, X, and Z characters. I figure I've got an RTX 3080, so why not give it a shot? I've bought a Beam system off eBay in case I need to flesh out the dataset, but I figured step one was to go through and make sure I could successfully do a training run that replicates the current behavior.

At first I figured I should start from a clean slate by clearing out the model/checkpoint folder. That went fine until around step 1520:

[[step     1480]]     [[train 5.898s]]     loss: 3.8656888        [[val 1.8396s]]     loss: 3.86674941
[[step     1500]]     [[train 5.9107s]]     loss: 3.86648588       [[val 1.8238s]]     loss: 3.8679687
[[step     1520]]     [[train 5.9829s]]     loss: 3.86527902       [[val 1.8333s]]     loss: 3.86707894
restoring model parameters from None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bdam/handwriting-synthesis/handwriting_synthesis/training/train.py", line 32, in train
    nn.fit()
  File "/home/bdam/handwriting-synthesis/handwriting_synthesis/tf/BaseModel.py", line 280, in fit
    self.restore(best_validation_tstep)
  File "/home/bdam/handwriting-synthesis/handwriting_synthesis/tf/BaseModel.py", line 356, in restore
    saver.restore(self.session, model_path)
  File "/home/bdam/miniconda3/envs/tf/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 1406, in restore
    raise ValueError("Can't load save_path when it is None.")
ValueError: Can't load save_path when it is None.

It struck me as odd that you'd need the existing model in order to train a new one, which seems like a catch-22, but okay: the instructions didn't say to clear that directory, so I restored the files and tried again. However, I was met with a very similar error:

[[step     1500]]     [[train 6.8473s]]     loss: 3.78012583       [[val 1.8517s]]     loss: 3.78005823
[[step     1520]]     [[train 6.8115s]]     loss: 3.77934692       [[val 1.8688s]]     loss: 3.78114267
[[step     1540]]     [[train 6.9358s]]     loss: 3.77841039       [[val 1.8917s]]     loss: 3.7784446
restoring model from model/checkpoint/model-20
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bdam/handwriting-synthesis/handwriting_synthesis/training/train.py", line 32, in train
    nn.fit()
  File "/home/bdam/handwriting-synthesis/handwriting_synthesis/tf/BaseModel.py", line 280, in fit
    self.restore(best_validation_tstep)
  File "/home/bdam/handwriting-synthesis/handwriting_synthesis/tf/BaseModel.py", line 362, in restore
    saver.restore(self.session, model_path)
  File "/home/bdam/miniconda3/envs/tf/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 1410, in restore
    raise ValueError("The passed save_path is not a valid checkpoint: " +
ValueError: The passed save_path is not a valid checkpoint: model/checkpoint/model-20

This second time it acts as if BaseModel.restore() received a step (step 20), since it builds a checkpoint path called 'model-20', but that checkpoint doesn't exist.

As near as I can tell, then, BaseModel.restore() is being called before BaseModel.save() has ever run. Any thoughts @otuva?
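
For what it's worth, both failures fit that reading: with an empty model/checkpoint directory, tf.train.latest_checkpoint() returns None (hence "restoring model parameters from None"), and with the old files put back, restore() builds a model-<step> prefix for step 20 whose data files were never written. Below is a rough sketch of the kind of guard I'd expect before saver.restore() is called; safe_restore and checkpoint_dir are illustrative names of my own, not the fork's actual code:

```python
import os
import tensorflow.compat.v1 as tf  # the repo appears to use the TF1-style Saver API

def safe_restore(session, saver, checkpoint_dir, step=None):
    """Restore only when a usable checkpoint actually exists (illustrative helper)."""
    if step is not None:
        # Saver writes model-<step>.index / .data-* files, so check for them
        # instead of assuming the prefix is valid.
        model_path = os.path.join(checkpoint_dir, 'model-{}'.format(step))
        if not tf.train.checkpoint_exists(model_path):
            print('no checkpoint for step {}, skipping restore'.format(step))
            return False
    else:
        # Empty directory -> latest_checkpoint() returns None.
        model_path = tf.train.latest_checkpoint(checkpoint_dir)
        if model_path is None:
            print('no checkpoints in {}, skipping restore'.format(checkpoint_dir))
            return False
    saver.restore(session, model_path)
    return True
```

The important part is simply checking for None / missing files before calling saver.restore(), or making sure save() has run at least once first.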

Lastly, any guidance on real-world training performance? It supposedly took a couple of days on a Tesla K80, which my RTX 3080 should, in theory, beat by a fair margin. However, based on how long those 1,540 steps took, I'm looking at 8+ days here. I'm running it under WSL on Windows 11 with Ubuntu 22.04.2 LTS, and the GPU is being used:

2023-09-05 08:07:20.616918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 7335 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:01:00.0, compute capability: 8.6

I wonder if WSL introduces some kind of bottleneck and whether I'll need to dual-boot to get better performance. Does anyone have experience to share here?
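
In case it helps anyone comparing numbers, here's the quick sanity check I'd run to confirm the GPU is actually doing the work under WSL; these are standard TF2 calls, nothing specific to this repo:

```python
import time
import tensorflow as tf

# Confirm TensorFlow sees the GPU at all.
print(tf.config.list_physical_devices('GPU'))

# Time a large matmul as a rough throughput check; a 3080 should finish
# this in a few milliseconds once warmed up.
with tf.device('/GPU:0'):
    a = tf.random.normal([4096, 4096])
    b = tf.random.normal([4096, 4096])
    _ = tf.matmul(a, b).numpy()   # warm-up / kernel compilation
    start = time.time()
    c = tf.matmul(a, b)
    c.numpy()                     # force execution to finish
    print('matmul took {:.4f}s'.format(time.time() - start))
```

If the matmul is fast but each training step is still slow, the bottleneck is more likely data loading than WSL's GPU passthrough.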

bryandam commented 11 months ago

OK, I figured out what was going on with the checkpoint restore and have opened a PR to fix it: #16

I also think I understand the performance question, and I don't believe there's an actual performance issue. The train() method passes 100,000 in for num_training_steps, and on my RTX 3080 I was averaging around 8.5s per step. However, judging by the checkpoint files, the model included in the repo apparently took only 17,900 steps. In other words, it shouldn't take 100k steps to train the model.
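
For anyone who wants to double-check that step count, it's recoverable from the checkpoint state with a standard TensorFlow call. The model/checkpoint path below matches this repo's layout; the model-17900 filename in the comment is only what I'd expect from the model-<step> naming seen above, not something I've verified:

```python
import tensorflow as tf

# Read the checkpoint state file that tf.train.Saver maintains alongside
# the model-<step> files; it records the most recently saved checkpoint.
state = tf.train.get_checkpoint_state('model/checkpoint')
if state is None:
    print('no checkpoint state found')
else:
    print(state.model_checkpoint_path)            # e.g. model/checkpoint/model-17900
    print(list(state.all_model_checkpoint_paths))
```

If the model constructor exposes the same num_training_steps argument that train() is passing 100,000 into, lowering it to around 20k should make a from-scratch run far more tractable.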