Model Seemingly Refuses to Learn?

bryandam commented 1 year ago

I've been trying for the better part of a week, but I can't get the model to train properly, and I'm starting to doubt that this code, as is, produced the model that Sean Vasquez included. My last run maxed out at 9100 steps, but at no point did either the training or the validation loss trend downward:

I got a checkpoint at step 7580, but its output is pure garbage unless you're really into abstract art:

Anyone have any luck getting this thing to train? I'm going to try playing with the training parameters a bit to see if some magic starts happening above 10k steps, but based on the data above it's just not trending at all, so I'm not confident that twice as many steps would make a difference. Does anyone know if we can extract the training parameters from Sean's included model? Am I just being impatient, or is this a hit-or-miss kind of thing where I just have to keep trying until I get a lucky run?
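One way to put a number on "not trending" is to fit a straight line to the logged losses and look at the slope. The following is a minimal sketch, not anything from this repo: the function names, the `min_drop_per_1k` threshold, and the example numbers are all hypothetical.

```python
def loss_slope(steps, losses):
    """Least-squares slope of loss against step (loss units per step)."""
    n = len(steps)
    if n < 2 or n != len(losses):
        raise ValueError("need at least two (step, loss) pairs of equal length")
    mean_s = sum(steps) / n
    mean_l = sum(losses) / n
    cov = sum((s - mean_s) * (l - mean_l) for s, l in zip(steps, losses))
    var = sum((s - mean_s) ** 2 for s in steps)
    if var == 0:
        raise ValueError("steps must not all be identical")
    return cov / var


def is_trending_down(steps, losses, min_drop_per_1k=0.01):
    """True if the fitted loss falls by at least min_drop_per_1k per 1000 steps.

    The threshold is an arbitrary illustrative default, not a tuned value.
    """
    return loss_slope(steps, losses) * 1000 <= -min_drop_per_1k


# Hypothetical loss values, for illustration only.
flat_run = is_trending_down([0, 20, 40, 60], [3.80, 3.81, 3.80, 3.81])
falling_run = is_trending_down([0, 20, 40, 60], [3.8, 3.7, 3.6, 3.5])
```

A run that stays flat over thousands of steps under a check like this is more likely broken than unlucky, though a slope alone can't say why.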

bryandam commented 1 year ago

Ok, I think myth confirmed here. I've tried a handful of similar TensorFlow 2 forks and they all seem to break the actual model learning. They can use existing models fine, but they can't train new ones the way the original does. My best guess is that the TensorFlow 1 compat layer is the culprit here; the code runs without errors but doesn't actually learn.

For example, here's a log snippet from running the original TensorFlow 1.6 version where you've got sub-zero losses in under 300 steps:

[[09/10/2023 07:23:30 PM]] [[step        0]]     [[train 15.376s]]     loss: 3.81181288       [[val 3.458s]]     loss: 3.84443927
[[09/10/2023 07:32:04 PM]] [[step       20]]     [[train 21.4882s]]     loss: 3.79103438       [[val 3.8715s]]     loss: 3.79873023       
[[09/10/2023 07:42:21 PM]] [[step       40]]     [[train 23.9828s]]     loss: 3.68241311       [[val 4.0553s]]     loss: 3.68290586       
[[09/10/2023 07:53:32 PM]] [[step       60]]     [[train 25.7245s]]     loss: 3.55204702       [[val 4.1307s]]     loss: 3.55411469       
[[09/10/2023 08:06:14 PM]] [[step       80]]     [[train 27.6338s]]     loss: 3.43016049       [[val 4.2573s]]     loss: 3.42746561       
[[09/10/2023 08:18:51 PM]] [[step      100]]     [[train 28.9486s]]     loss: 3.30007685       [[val 4.264s]]     loss: 3.29826805       
[[09/10/2023 08:32:30 PM]] [[step      120]]     [[train 31.8791s]]     loss: 3.07375289       [[val 4.3883s]]     loss: 3.06993355       
[[09/10/2023 08:44:51 PM]] [[step      140]]     [[train 33.0885s]]     loss: 2.81778019       [[val 4.4168s]]     loss: 2.8149467        
[[09/10/2023 08:58:01 PM]] [[step      160]]     [[train 34.2266s]]     loss: 2.47351355       [[val 4.4585s]]     loss: 2.46932713       
[[09/10/2023 09:11:50 PM]] [[step      180]]     [[train 34.9249s]]     loss: 1.96372962       [[val 4.4278s]]     loss: 1.96595877       
[[09/10/2023 09:26:20 PM]] [[step      200]]     [[train 36.0051s]]     loss: 1.38146758       [[val 4.4881s]]     loss: 1.38987864       
[[09/10/2023 09:40:13 PM]] [[step      220]]     [[train 36.0657s]]     loss: 0.76426629       [[val 4.5571s]]     loss: 0.76506983       
[[09/10/2023 09:54:29 PM]] [[step      240]]     [[train 37.0014s]]     loss: 0.17377056       [[val 4.7792s]]     loss: 0.1789092        
[[09/10/2023 10:08:58 PM]] [[step      260]]     [[train 37.5971s]]     loss: -0.28618238      [[val 4.9728s]]     loss: -0.27517708
.....
[[09/13/2023 01:46:07 AM]] [[step     4040]]     [[train 45.3519s]]     loss: -2.31338911      [[val 5.3861s]]     loss: -2.28292812      
[[09/13/2023 02:03:35 AM]] [[step     4060]]     [[train 45.788s]]     loss: -2.31954284      [[val 5.3617s]]     loss: -2.28234389      
[[09/13/2023 02:20:33 AM]] [[step     4080]]     [[train 45.9516s]]     loss: -2.32836516      [[val 5.4471s]]     loss: -2.28387913      
[[09/13/2023 02:37:10 AM]] [[step     4100]]     [[train 46.1078s]]     loss: -2.36174889      [[val 5.4438s]]     loss: -2.31387565      
[[09/13/2023 02:37:10 AM]] saving model to checkpoints\model
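To compare a log like this against a run from one of the TF2 forks, it may help to pull the step and loss columns out programmatically. This is a rough sketch that assumes the line layout shown above; the regex is inferred from these log lines rather than taken from the project's logging code, and `LossRecord` and `parse_losses` are hypothetical names.

```python
import re
from typing import Iterable, List, NamedTuple

# Pattern inferred from the log lines in this thread (an assumption,
# not the project's documented format).
LOG_LINE = re.compile(
    r"\[\[step\s+(?P<step>\d+)\]\]\s+"
    r"\[\[train\s+[\d.]+s\]\]\s+loss:\s+(?P<train>-?\d+(?:\.\d+)?)\s+"
    r"\[\[val\s+[\d.]+s\]\]\s+loss:\s+(?P<val>-?\d+(?:\.\d+)?)"
)


class LossRecord(NamedTuple):
    step: int
    train_loss: float
    val_loss: float


def parse_losses(lines: Iterable[str]) -> List[LossRecord]:
    """Extract (step, train loss, val loss) from matching lines; skip the rest."""
    records = []
    for line in lines:
        m = LOG_LINE.search(line)
        if m:
            records.append(
                LossRecord(int(m["step"]), float(m["train"]), float(m["val"]))
            )
    return records


# Three lines copied from the log above; the last one has no losses and is skipped.
sample = [
    r"[[09/10/2023 07:23:30 PM]] [[step        0]]     [[train 15.376s]]     loss: 3.81181288       [[val 3.458s]]     loss: 3.84443927",
    r"[[09/10/2023 10:08:58 PM]] [[step      260]]     [[train 37.5971s]]     loss: -0.28618238      [[val 4.9728s]]     loss: -0.27517708",
    r"[[09/13/2023 02:37:10 AM]] saving model to checkpoints\model",
]
records = parse_losses(sample)
```

Feeding the full log file through `parse_losses` gives a list that can be plotted or diffed against another run's losses step by step.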

MathisM51 commented 7 months ago

Hello,

I am also trying to train the model, and I'm running into the same issue as you with TF2. Have you found a viable solution?

Thank you.

bryandam commented 7 months ago

No, this repo works insofar as it can use the model that was built with the original repo, but it will not build one itself. Updating TensorFlow from v1 to v2 is not trivial, and while the code compiles and runs, the changes made broke the model learning process.

I was primarily interested because I was trying to train my own model using a 3080. I was able to achieve that using the original project, though it took a whole bunch of doing that I may ... or may not ... have notes on.

MathisM51 commented 6 months ago

Thank you for your reply. How did you get TensorFlow v1 working, given that it needs CUDA 9 to train on a graphics card and CUDA 9 is not compatible with your RTX 3080?

otuva / handwriting-synthesis

Model Seemingly Refuses to Learn? #17