bryandam opened 1 year ago
Ok, I think myth confirmed here. I've tried a handful of similar TensorFlow 2 forks and they all seem to break the actual model learning. They can use the models fine, just not build them like the original will. My best guess is that the TensorFlow 1 compat stuff is the culprit here; it compiles and runs but doesn't actually work.
For example, here's a log snippet from running the original TensorFlow 1.6 version where you've got sub-zero losses in under 300 steps:
[[09/10/2023 07:23:30 PM]] [[step 0]] [[train 15.376s]] loss: 3.81181288 [[val 3.458s]] loss: 3.84443927
[[09/10/2023 07:32:04 PM]] [[step 20]] [[train 21.4882s]] loss: 3.79103438 [[val 3.8715s]] loss: 3.79873023
[[09/10/2023 07:42:21 PM]] [[step 40]] [[train 23.9828s]] loss: 3.68241311 [[val 4.0553s]] loss: 3.68290586
[[09/10/2023 07:53:32 PM]] [[step 60]] [[train 25.7245s]] loss: 3.55204702 [[val 4.1307s]] loss: 3.55411469
[[09/10/2023 08:06:14 PM]] [[step 80]] [[train 27.6338s]] loss: 3.43016049 [[val 4.2573s]] loss: 3.42746561
[[09/10/2023 08:18:51 PM]] [[step 100]] [[train 28.9486s]] loss: 3.30007685 [[val 4.264s]] loss: 3.29826805
[[09/10/2023 08:32:30 PM]] [[step 120]] [[train 31.8791s]] loss: 3.07375289 [[val 4.3883s]] loss: 3.06993355
[[09/10/2023 08:44:51 PM]] [[step 140]] [[train 33.0885s]] loss: 2.81778019 [[val 4.4168s]] loss: 2.8149467
[[09/10/2023 08:58:01 PM]] [[step 160]] [[train 34.2266s]] loss: 2.47351355 [[val 4.4585s]] loss: 2.46932713
[[09/10/2023 09:11:50 PM]] [[step 180]] [[train 34.9249s]] loss: 1.96372962 [[val 4.4278s]] loss: 1.96595877
[[09/10/2023 09:26:20 PM]] [[step 200]] [[train 36.0051s]] loss: 1.38146758 [[val 4.4881s]] loss: 1.38987864
[[09/10/2023 09:40:13 PM]] [[step 220]] [[train 36.0657s]] loss: 0.76426629 [[val 4.5571s]] loss: 0.76506983
[[09/10/2023 09:54:29 PM]] [[step 240]] [[train 37.0014s]] loss: 0.17377056 [[val 4.7792s]] loss: 0.1789092
[[09/10/2023 10:08:58 PM]] [[step 260]] [[train 37.5971s]] loss: -0.28618238 [[val 4.9728s]] loss: -0.27517708
.....
[[09/13/2023 01:46:07 AM]] [[step 4040]] [[train 45.3519s]] loss: -2.31338911 [[val 5.3861s]] loss: -2.28292812
[[09/13/2023 02:03:35 AM]] [[step 4060]] [[train 45.788s]] loss: -2.31954284 [[val 5.3617s]] loss: -2.28234389
[[09/13/2023 02:20:33 AM]] [[step 4080]] [[train 45.9516s]] loss: -2.32836516 [[val 5.4471s]] loss: -2.28387913
[[09/13/2023 02:37:10 AM]] [[step 4100]] [[train 46.1078s]] loss: -2.36174889 [[val 5.4438s]] loss: -2.31387565
[[09/13/2023 02:37:10 AM]] saving model to checkpoints\model
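As an aside, one way to narrow down where a TF2 fork goes wrong is a toy smoke test that has nothing to do with this repo's model. Below is a minimal sketch (assuming `tensorflow.compat.v1` on a TF2 install; none of it comes from this repo). If this trivial regression converges but the ported model doesn't, the compat layer itself can probably optimize fine and the problem is more likely in the port of the model.

```python
import numpy as np
import tensorflow.compat.v1 as tf  # TF2 install running in v1-compat mode

tf.disable_v2_behavior()

# Trivial linear regression on a fixed batch: if optimization works at all
# under the compat layer, this loss should drop close to zero.
x = tf.placeholder(tf.float32, [None, 4])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.zeros([4, 1]))
b = tf.Variable(tf.zeros([1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))
train_op = tf.train.AdamOptimizer(1e-2).minimize(loss)

xs = np.random.randn(256, 4).astype(np.float32)
ys = xs @ np.array([[1.0], [-2.0], [0.5], [3.0]], dtype=np.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        _, current = sess.run([train_op, loss], {x: xs, y: ys})
    print("final loss:", current)  # should end up near zero
```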
Hello,
I am also trying to train a model and I'm encountering the same issue as you with TF2. Have you found a viable solution?
Thank you.
No, this repo works insofar as it can use the model that was built with the original repo, but it will not build one itself. Updating TensorFlow from v1 to v2 is not trivial, and while the port compiles and runs, the changes broke the model training process.
I was primarily interested because I was trying to train my own model on a 3080. I was able to achieve that using the original project, though it took a whole bunch of doing that I may ... or may not ... have notes on.
Thank you for your reply. How did you succeed with TensorFlow v1, when you need CUDA 9 to train on a graphics card and CUDA 9 is not compatible with your RTX 3080?
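One route that exists for exactly this mismatch, offered only as an assumption about the kind of setup involved and not as what was actually used above, is NVIDIA's maintained TensorFlow 1.15 builds (the nvidia-tensorflow wheels and the NGC tf1 containers), which are compiled against newer CUDA releases and run on Ampere cards like the RTX 3080, unlike the stock TF 1.x wheels that target CUDA 9/10. A minimal check that such a build actually sees the card might look like:

```python
import tensorflow as tf  # assumes a TF 1.15 build with Ampere support (e.g. nvidia-tensorflow), not stock TF 1.x

# Is the RTX 3080 visible at all?
print("GPU available:", tf.test.is_gpu_available(cuda_only=True))
print("GPU device:   ", tf.test.gpu_device_name())

# Force a small op onto the GPU; a CUDA/driver mismatch shows up here
# rather than hours into a training run.
with tf.device("/gpu:0"):
    a = tf.random_normal([1024, 1024])
    b = tf.matmul(a, a)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(b)
```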
I've been trying for the better part of a week, but I can't get the model to train properly, and I'm starting to get skeptical that this code, as is, produced the model that Sean Vasquez included. My last run maxed out at 9100 steps, but at no point did the training or validation loss trend downward:
I got a checkpoint at 7580, but its output is pure garbage unless you're really into abstract art:
Anyone have any luck getting this thing to train? I'm going to try playing with the training parameters a bit and see if some magic starts happening above 10k steps, but based on the data above it's just not trending at all, so I'm not confident that twice as many steps would make the difference. Does anyone know if we can extract the training parameters from Sean's included model? Or am I just being impatient? Is this a hit-or-miss kind of thing where I just have to keep trying until I get a lucky run?
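On extracting training parameters from the included model: a checkpoint stores variable values, not the training script's settings, so hyperparameters like the learning rate or batch size generally aren't recoverable unless they happened to be saved as variables. What can be read back are the variable names and shapes, which at least pin down architecture choices (LSTM sizes, number of mixture components, and so on). A minimal sketch, with a hypothetical checkpoint prefix path:

```python
import tensorflow as tf  # works with TF 1.x or with tf.compat.v1 semantics on TF 2.x

# Hypothetical prefix: point this at the checkpoint shipped with the repo,
# or at one written to checkpoints\model during your own training run.
ckpt_prefix = "checkpoints/model"

# Variable names and shapes reveal the architecture the checkpoint was trained with.
for name, shape in tf.train.list_variables(ckpt_prefix):
    print(name, shape)

# Individual tensors can be pulled out for a closer look.
reader = tf.train.load_checkpoint(ckpt_prefix)
# value = reader.get_tensor("some/variable/name")  # use a name from the listing above
```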