keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Kernel crash in optimizer.apply_gradients for complex-valued gradients #20581

Open jhoydis opened 1 day ago

jhoydis commented 1 day ago

TensorFlow is able to correctly compute gradients for complex-valued variables. However, the Keras 3 optimizers do not seem to be able to apply complex-valued gradients correctly. This worked with Keras 2.

Here is a code snippet that works in TF 2.15 but leads to a kernel crash with Keras 3.7 and TF 2.18. The crash is caused by the function optimizer.apply_gradients.

import tensorflow as tf

# Complex-valued variable
x = tf.Variable(tf.complex(3., 2.), trainable=True)
optimizer = tf.keras.optimizers.SGD()
with tf.GradientTape() as tape:
    loss = tf.abs(x)**2
grads = tape.gradient(loss, tape.watched_variables())
optimizer.apply_gradients(zip(grads, tape.watched_variables()))
print(x)

# Real-valued variable equivalent
x_r = tf.Variable(3., trainable=True)
x_i = tf.Variable(2., trainable=True)
optimizer = tf.keras.optimizers.SGD()
with tf.GradientTape() as tape:
    x = tf.complex(x_r, x_i)
    loss = tf.abs(x)**2
grads = tape.gradient(loss, tape.watched_variables())
optimizer.apply_gradients(zip(grads, tape.watched_variables()))
print(tf.complex(x_r, x_i))
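
For reference, the update that the real-valued equivalent above should produce can be worked out by hand. The following is a minimal sketch in plain Python, assuming the Keras SGD default learning rate of 0.01 and no momentum; the helper name `sgd_step` is hypothetical and not part of Keras or TensorFlow.

```python
# Hand-computed SGD step for loss = |x|^2 with x = x_r + i*x_i.
# Assumption: learning rate 0.01 (the Keras SGD default), no momentum.

def sgd_step(x_r, x_i, learning_rate=0.01):
    """Hypothetical helper: one plain SGD step on loss = x_r**2 + x_i**2."""
    grad_r = 2.0 * x_r  # d(loss)/d(x_r)
    grad_i = 2.0 * x_i  # d(loss)/d(x_i)
    return x_r - learning_rate * grad_r, x_i - learning_rate * grad_i

new_r, new_i = sgd_step(3.0, 2.0)
updated = complex(new_r, new_i)
print(updated)  # close to 2.94+1.96j, up to floating-point rounding
```

If the complex-valued variable is updated consistently with its real-valued equivalent, the first snippet would be expected to print roughly the same value.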

mehtamansi29 commented 1 day ago

Hi @jhoydis -

Thanks for reporting the issue. I am not able to reproduce a kernel crash with complex-valued gradients using Keras 3.7 and TF 2.18. I have attached a gist for your reference.

jhoydis commented 13 hours ago

Hi @mehtamansi29,

Thanks for looking into this so rapidly.

When I run this code on a GPU (using Colab with a T4 GPU instance), the kernel crashes.

mehtamansi29 commented 4 hours ago

Hi @jhoydis -

I was also able to reproduce this issue on a GPU (using Colab with a T4 GPU instance). Judging from the logs of the crashed runtime, it seems that TensorFlow is overriding a memory allocation setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set.

The TensorFlow build might also be missing some CPU optimization flags.

I0000 00:00:1733326291.211008 13282 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13949 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
2024-12-04 15:31:31.210156: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

We will dig into the issue and update here.