Closed by TrentaIcedCoffee 3 months ago
16 x 1024 x 50K x 4 bytes (float32) gives ~3 GB, so I assume it's reasonable once temporary allocations are considered? Ah well. I guess buying more GPUs is what we do for now.
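For reference, a quick back-of-envelope check of that arithmetic (the shapes are assumed from the numbers above):

```python
# Size of a single float32 tensor of shape (batch=16, seq_len=1024, vocab=50K).
batch, seq_len, vocab, bytes_per_float32 = 16, 1024, 50_000, 4
tensor_bytes = batch * seq_len * vocab * bytes_per_float32
print(f"{tensor_bytes / 1024**3:.2f} GiB")  # ~3.05 GiB
```

So a single logits tensor at those shapes is about 3 GiB before any gradients or temporaries.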
I believe most implementations will inflate the labels via a one-hot encoding before computing the loss against the logits. So that's ~3 GB each for `x` and the one-hot `y`, before any temporary allocations. So yes, this could very well be reasonable.
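To see that inflation concretely, here is a small NumPy sketch (toy shapes, not the actual Keras implementation) comparing the integer labels against their one-hot expansion:

```python
import numpy as np

# Toy shapes; the vocab axis is what dominates memory.
batch, seq_len, vocab = 2, 4, 50_000
y_sparse = np.random.randint(0, vocab, size=(batch, seq_len), dtype=np.int64)
# One-hot expansion: same shape as the logits, one float32 per class.
y_onehot = (y_sparse[..., None] == np.arange(vocab)).astype(np.float32)
print(y_sparse.nbytes, y_onehot.nbytes)  # 64 vs 1,600,000 bytes
```

The one-hot labels take vocab-times more floats than the integer labels, which is why the labels can end up as large as the logits themselves.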
Also, be aware that by default TensorFlow can squat on all GPU memory, which can make profiling via, say, some nvidia tooling difficult: https://www.tensorflow.org/guide/gpu
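If that gets in the way, a minimal sketch (assuming TensorFlow 2.x and the documented `tf.config` API) of opting in to on-demand allocation instead:

```python
# Sketch, assuming TensorFlow 2.x: enable memory growth so TensorFlow
# allocates GPU memory as needed instead of grabbing it all at startup.
try:
    import tensorflow as tf
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    configured = len(gpus)
except ImportError:
    configured = 0  # TensorFlow not installed in this environment
```

Note this must run before any GPUs are initialized, per the guide linked above.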
As far as I can tell, on the TensorFlow backend we just delegate to a tf op here. So if there is any bug here, we should probably open an issue on the TensorFlow GitHub.
I will close this for now, but if you are seeing evidence that GPU usage is much higher for Keras in particular than for `tf.nn.sparse_softmax_cross_entropy_with_logits`, please go ahead and reopen!
Why? And what loss_fn would you suggest when working with the large number of categories that is typical in a language model?
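For what it's worth, the usual answer for a large vocabulary is a sparse loss such as `tf.nn.sparse_softmax_cross_entropy_with_logits` (or Keras's `sparse_categorical_crossentropy`), which indexes the log-probabilities by integer label instead of materializing a one-hot tensor. A rough NumPy sketch of the idea (not the actual TF kernel):

```python
import numpy as np

def sparse_softmax_xent(logits, labels):
    """Cross entropy from integer labels, without one-hot encoding."""
    # Numerically stable log-softmax over the last (vocab) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Gather the log-prob of the true class by index -- this is the step
    # that avoids materializing a (batch, vocab) one-hot tensor.
    return -np.take_along_axis(log_probs, labels[..., None], axis=-1).squeeze(-1)

logits = np.zeros((2, 5))   # uniform logits over 5 classes
labels = np.array([1, 3])
loss = sparse_softmax_xent(logits, labels)
print(loss)  # each element == log(5), about 1.609
```

The savings are on the label side only; the logits tensor itself is still batch x seq x vocab either way.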