keras-team / tf-keras

The TensorFlow-specific implementation of the Keras API, which was the default Keras from 2019 to 2023.

tf.keras.losses on RaggedTensors crash during gradient computation on a GPU #638

Open foxik opened 2 years ago

foxik commented 2 years ago

System information.

Describe the problem.

When a loss (tf.losses.SparseCategoricalCrossentropy, tf.losses.CategoricalCrossentropy, tf.losses.BinaryCrossentropy, or tf.losses.MeanSquaredError) is used on ragged tensors, the gradient computation on a GPU crashes with

Node: 'Adam/gradients/zeros_like_2'
2 root error(s) found.
  (0) INTERNAL:  No unary variant unary_op function found for op ZEROS_LIKE Variant type_name: RaggedTensorVariant for device type: GPU
     [[{{node Adam/gradients/zeros_like_2}}]]
     [[binary_crossentropy/map/while/loop_body_control/_124/_67]]
  (1) INTERNAL:  No unary variant unary_op function found for op ZEROS_LIKE Variant type_name: RaggedTensorVariant for device type: GPU
     [[{{node Adam/gradients/zeros_like_2}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_16690]

Describe the current behavior.

The code crashes on a GPU. It does not crash on a CPU and it does not crash when tf.functions are executed eagerly.

Describe the expected behavior.

The code should not crash.

Standalone code to reproduce the issue.

A simple Colab reproducing the error is here: https://colab.research.google.com/drive/1OELAhvpQHhaz3sOYabf4SdBqKlQCjNjs?usp=sharing
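
For quick reference, below is a minimal sketch of the kind of setup that triggers the crash; the Colab above is the authoritative reproduction, and the toy data and model here are made up for illustration.

import tensorflow as tf

# Toy ragged data: variable-length token ids with per-token class labels.
words = tf.ragged.constant([[1, 2, 3], [4, 5]])
tags = tf.ragged.constant([[0, 1, 0], [1, 1]])

inputs = tf.keras.layers.Input(shape=[None], dtype=tf.int32, ragged=True)
hidden = tf.keras.layers.Embedding(input_dim=100, output_dim=16)(inputs)
outputs = tf.keras.layers.Dense(2, activation="softmax")(hidden)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy())

# On a GPU this crashes during the gradient computation with the
# ZEROS_LIKE / RaggedTensorVariant error above; on a CPU, or with
# model.compile(..., run_eagerly=True), the same code trains fine.
model.fit(words, tags, epochs=1)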

Source code / logs.

The problem is somehow connected to the use of the ragged map here: https://github.com/keras-team/keras/blob/2db5acf3e3c5904b014cb409d3c514bef44f9640/keras/losses.py#L1408 . My guess is that a TensorArray of ragged tensors is created and some operation for manipulating it on the GPU is missing.

Note that metrics with ragged tensors work fine, but they take a different approach: instead of a ragged map, they use flat_values, see https://github.com/keras-team/keras/blob/2db5acf3e3c5904b014cb409d3c514bef44f9640/keras/utils/metrics_utils.py#L800 .

Possible courses of action

  1. the ragged map might be fixed on the TensorFlow side
  2. we might avoid using the ragged map and use .flat_values instead, similarly to what the metrics do

Personally I prefer option 2, because the problem at hand can then be fixed by a "simple" solution; a rough sketch of what that could look like is below.
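
To make option 2 concrete, here is a rough sketch of a loss computed on flat_values instead of via a ragged map; the names and data are illustrative, and this is not the actual Keras implementation.

import tensorflow as tf

def ragged_loss_via_flat_values(loss_fn, y_true, y_pred):
    # Illustrative only: assumes y_true and y_pred have the same ragged
    # structure, so their flat values line up element by element.
    return loss_fn(y_true.flat_values, y_pred.flat_values)

# Per-token sparse categorical cross-entropy on ragged data.
y_true = tf.ragged.constant([[0, 1, 0], [1, 1]])
y_pred = tf.ragged.constant(
    [[[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]], [[0.1, 0.9], [0.4, 0.6]]],
    ragged_rank=1)

loss = ragged_loss_via_flat_values(
    tf.keras.losses.SparseCategoricalCrossentropy(), y_true, y_pred)

One thing to double-check with this approach: reducing over flat_values weights every element equally, whereas a per-row map followed by a reduction can weight examples differently, so the two formulations need not produce identical values.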

foxik commented 2 years ago

Adding @pedro-r-marques who wrote the code.

foxik commented 2 years ago

On second thought, I opened an issue in the TensorFlow repository, https://github.com/tensorflow/tensorflow/issues/55475, to discuss the problem with tf.map_fn on RaggedTensors -- RaggedTensors are supported according to the documentation, so this is in fact a bug.

However, I think we could still discuss whether it would make sense to use .flat_values instead (maybe I am mistaken and it cannot be done easily, but I have implemented various models with RaggedTensors, and using .flat_values worked for me for loss computation in all of them).

tilakrayal commented 2 years ago

@gadagashwini, I was able to reproduce the issue in tf v2.7, v2.8 and nightly. Please find the gist of it here.

divyashreepathihalli commented 2 years ago

@foxik I see a similar issue here: https://github.com/tensorflow/tensorflow/issues/46635. According to @JXRiver: "According to @edloper 'Basically, RaggedTensorVariant objects should never be copied to GPU, because we can't do anything useful with them there. But Placer isn't currently smart enough to figure that out (it just sees a Variant tensor, and doesn't know what kind of value it contains).' We have a project going on right now that hopefully will fix the issue."

foxik commented 2 years ago

@divyashreepathihalli Thanks for pointing it out -- I have closed my report in the TensorFlow repository as a duplicate of it.

This also means we need to go with action 2 and not use map_fn on RaggedTensors in the loss calculations. I will see if I can come up with a fix.

kkm000 commented 1 year ago

I've just run into this issue in hosted Colab with its default GPU runtime. I developed and trained a baby model on CPU at home in a Google container image, switched to GPU in Colab to train a larger one, and kaboom. The model is built entirely around ragged tensors to avoid carrying and debugging masks in a mix of out-of-the-box and custom layers. The affected loss is losses.CategoricalCrossentropy; I temporarily commented out all regularization losses.

A question, if you don't mind: does this problem affect only losses/gradients? I will write my own loss anyway; the one-hot xent is just a temporary stand-in for what I'm ultimately going to achieve. I can do dense tensors with masks in the sample weights only: that's a small piece of code compared to the whole caboodle. I do use Keras niceties though, like early stopping, LR scheduling and, most helpful, ReduceLROnPlateau, and am unsure whether I can use them in a custom training loop, should it come to that as a workaround. I'd be very grateful for any advice! :-)
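
For the dense-tensors-with-masks fallback mentioned above, a possible sketch (untested here, names illustrative) is to pad the ragged targets and pass the mask as sample weights:

import tensorflow as tf

# Hypothetical ragged per-token targets.
y_true_ragged = tf.ragged.constant([[0, 1, 0], [1, 1]])

# Zero-padded dense targets plus a 0/1 weight marking the real tokens.
y_true_dense = y_true_ragged.to_tensor()
sample_weight = tf.sequence_mask(y_true_ragged.row_lengths(), dtype=tf.float32)

# model.fit(x_dense, y_true_dense, sample_weight=sample_weight, ...)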

In the unlikely case it makes any difference, the loss sits in the middle of the model (autoencoder-style with teacher-forcing time delay). I've added it to the functional model, and am tracing it with an explicit call, like

caxent_loss = losses.CategoricalCrossentropy(name='temp_caxent_loss_ly')(
                  y_pred=teach_1h_pred_t,
                  y_true=teach_1h_true_t)
train_time_model = keras.Model(...)
train_time_model.add_loss(caxent_loss)
train_time_model.compile(
  optimizer=keras.optimizers.Adam(...), ...)
train_time_model.fit(...)

Both backtraces point to the same place, likely the exact same one, so I'm copying the last few frames of only one:

File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1222, in run_step
      outputs = model.train_step(data)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1027, in train_step
      self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    File "/usr/local/lib/python3.8/dist-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 526, in minimize
      grads_and_vars = self.compute_gradients(loss, var_list, tape)
    File "/usr/local/lib/python3.8/dist-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 259, in compute_gradients
      grads = tape.gradient(loss, var_list)
Node: 'zeros_like_2'
2 root error(s) found.
  (0) INTERNAL:  No unary variant unary_op function found for op ZEROS_LIKE Variant type_name: RaggedTensorVariant for device type: GPU
     [[{{node zeros_like_2}}]]
     [[Func/train_time_model/tf.keras.metrics.categorical_crossentropy/map/while/body/_21/input/_357/_210]]
  (1) INTERNAL:  No unary variant unary_op function found for op ZEROS_LIKE Variant type_name: RaggedTensorVariant for device type: GPU
     [[{{node zeros_like_2}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_9351]

A boring and probably unhelpful version list that I'm always printing anyway:

------------------  ----------------------------------------------------
TF Version........  2.11.0
Keras version.....  2.11.0
Physical devices..  ['/physical_device:CPU:0', '/physical_device:GPU:0']
TF execute mode...  EAGER
matplotlib........  3.2.2
numpy.............  1.21.6
IPython kernel....  7.9.0
Jupyter client....  6.1.12
Debian version....  bullseye/sid
Linux version.....  5.10.147+ #1 SMP Sat Dec 10 16:00:40 UTC 2022
------------------  ----------------------------------------------------

kkm000 commented 1 year ago

@divyashreepathihalli, this issue has possibly been marked as a duplicate incorrectly in the Keras context. @foxik's issue in the TF repo was a duplicate of another one there, but the fix, according to your own quotation (https://github.com/keras-team/tf-keras/issues/638), looks easier to do on the Keras side. A full-blown, thorough handling of RaggedTensorVariant on the GPU by TF seems like quite substantial work, and looks unlikely to come soon.

madsjk816 commented 1 year ago

Any progress on this issue?