@ripjohnbrown1859, could you please provide the steps you followed to install TensorFlow 2.16, along with the CUDA, cuDNN, and Bazel versions and the environment you are using? That will help us analyze the issue effectively. Thank you!
I managed to fix this specific problem by processing the data on the CPU under `with tf.device('/CPU:0'):` and training under the strategy scope (see the sketch below). Now I have a problem where every other epoch reports a bunch of rendezvous errors, skips a bunch of data, and gives a validation accuracy of 1 with an increasingly high loss. My epochs are also reporting accuracy greater than 1. Should I open a new issue?
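For context, a minimal sketch of the workaround described above, assuming `MirroredStrategy` and with hypothetical `load_arrays()` / `build_model()` helpers standing in for the actual pipeline:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Do the heavy tensor conversion/preprocessing on the CPU so it does not
# allocate the full training set in GPU memory.
with tf.device('/CPU:0'):
    x_train, y_train = load_arrays()  # hypothetical data loader
    train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    train_ds = train_ds.batch(64).prefetch(tf.data.AUTOTUNE)

# Build and compile the model under the distribution strategy.
with strategy.scope():
    model = build_model()  # hypothetical model factory
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

model.fit(train_ds, epochs=10)
```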
@ripjohnbrown1859, glad the GPU issue was resolved. As for the validation loss: if it increases while the training loss keeps dropping, the model is overfitting. Set the number of epochs high enough, watch the error rates, and terminate training once the validation loss stops improving; as long as it keeps dropping, training should continue until the model converges at some epoch. A well-behaved run should converge to a low val_loss.
Also, please take a look at this reference: https://discuss.tensorflow.org/t/why-does-my-validation-loss-increase-but-validation-accuracy-perfectly-matches-training-accuracy/4283
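A minimal sketch of stopping on a rising validation loss with the standard `tf.keras.callbacks.EarlyStopping`; the `model`, `train_ds`, and `val_ds` names are placeholders for your own objects:

```python
import tensorflow as tf

# Stop once val_loss has not improved for 5 consecutive epochs and
# restore the weights from the best epoch seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True,
)

# Set epochs high; the callback ends training before overfitting sets in.
model.fit(train_ds,
          validation_data=val_ds,
          epochs=200,
          callbacks=[early_stop])
```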
This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.
This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
binary
TensorFlow version
tf 2.16.1
Custom code
Yes
OS platform and distribution
WSL2 Ubuntu 22.04
Mobile device
No response
Python version
3.10
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
cuDNN 8.6.0
GPU model and memory
2x NVIDIA Titan X (Maxwell) in SLI, 12 GB each
Current behavior?
I have 2 Titan X (Maxwell) GPUs and am trying to run a CNN on my machine in WSL2. However, when I try to run it I get the attached error, which seems to indicate that TensorFlow is not using any GPU memory before throwing an error: it runs out of memory while converting the training input tensor.
Standalone code to reproduce the issue
Relevant log output