Open csteel45 opened 3 years ago
Update: I downgraded the Amazon Linux instance to CUDA 10.2 and TensorFlow 2.3 and still have the same problem. I had seen in the release notes that CUDA 10.2 fixed an issue with very large batches and mixed precision, but it does not fix this.
@sanatmpa1,
Can you please take a look at the tested build configurations, which show the corresponding CUDA version for each version of TensorFlow. Also, please share fully reproducible standalone code to expedite the troubleshooting process. Thanks!
Tested with the following configuration on Windows 10 and AWS Linux (RHEL7) with hardware as described above:
- TensorFlow 2.6.0
- Python 3.8.5
- cuDNN 8.1
- CUDA 11.2
- pip 20.2.4
Full source not available, relevant source:

def create_generator(self):
    self.logger.debug('called')
    inputs = tf.keras.Input(shape=(self._noise_dim,))
    dense = Dense(128 * self.dim * self.dim,
                  kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.02))
    x = dense(inputs)
    x = BatchNormalization(momentum=0.8)(x)
    x = LeakyReLU(0.2)(x)
    x = Reshape((self.dim, self.dim, 128))(x)
    x = UpSampling2D(size=(self.upSamp, self.upSamp))(x)
    x = Conv2D(64, kernel_size=(5, 5), padding='same')(x)
    x = BatchNormalization(momentum=0.8)(x)
    x = LeakyReLU(0.2)(x)
    x = UpSampling2D(size=(self.upSamp, self.upSamp))(x)
    x = Conv2D(1, kernel_size=(5, 5), padding='same')(x)
    x = BatchNormalization(momentum=0.8)(x)
    x = Flatten()(x)
    outputs = Dense(self.feature_size, activation='tanh', dtype=np.float32)(x)  # All output between -1 and 1
    generator = tf.keras.Model(inputs=inputs, outputs=outputs, name='Generator')
    generator.compile(loss=self.generator_loss_function, optimizer=self.generator_optimizer)
    return generator
def create_discriminator(self):
    self.logger.debug('called')
    inputs = tf.keras.Input(shape=(self.feature_size,))
    dense = Dense(self.dim * self.dim * 32, input_shape=(self.feature_size,),
                  kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.02))
    x = dense(inputs)
    x = Reshape((self.dim, self.dim, 32))(x)
    # x = Conv2D(128, kernel_size=(5, 5), strides=(2, 2), padding='same')(x)
    # x = BatchNormalization(momentum=0.8)(x)
    # x = LeakyReLU(0.2)(x)
    # x = Dropout(rate=0.3)(x)
    x = Conv2D(64, kernel_size=(5, 5), strides=(2, 2), padding='same')(x)
    # x = BatchNormalization(momentum=0.8)(x)
    x = LeakyReLU(0.2)(x)
    x = Dropout(rate=0.25)(x)
    x = Conv2D(64, kernel_size=(5, 5), strides=(2, 2), padding='same')(x)
    # x = BatchNormalization(momentum=0.8)(x)
    x = LeakyReLU(0.2)(x)
    x = Dropout(rate=0.25)(x)
    x = Flatten()(x)
    outputs = Dense(1, activation='sigmoid', dtype=np.float32)(x)  # Original + dtype for mixed-precision
    # x = Dense(1)(x)  # https://medium.com/swlh/gan-generative-adversarial-network-3706ebfef77e
    discriminator = tf.keras.Model(inputs=inputs, outputs=outputs, name='Discriminator')
    discriminator.compile(loss=self.discriminator_loss_function,
                          optimizer=self.discriminator_optimizer, metrics=['accuracy'])
    return discriminator
def discriminator_loss(self, real_output, synthetic_output):
    # https://stackoverflow.com/questions/55936611/why-doesnt-the-discriminators-and-generators-loss-change
    cross_entropy = tf.keras.losses.BinaryCrossentropy()
    real_loss = cross_entropy(tf.ones_like(real_output), real_output)
    fake_loss = cross_entropy(tf.zeros_like(synthetic_output), synthetic_output)
    total_loss = (real_loss + fake_loss) / 2
    return total_loss

def generator_loss(self, fake_output) -> object:
    # https://stackoverflow.com/questions/55936611/why-doesnt-the-discriminators-and-generators-loss-change
    cross_entropy = tf.keras.losses.BinaryCrossentropy()
    gen_loss = cross_entropy(tf.ones_like(fake_output), fake_output)
    return gen_loss
# Notice the use of `tf.function`. This annotation causes the function to be "compiled".
@tf.function()
def _train_gen(self, real_data):
    # self.logger.info('Called')
    noise = tf.random.normal([self._batch_size, self._noise_dim])
    with tf.GradientTape() as gen_tape, tf.GradientTape() as _:
        # Generator creates synthetic _data
        synthetic_data = self._generator(noise, training=True)
        # Discriminator predicts on real _data
        real_data_pred = self._discriminator(real_data, training=False)
        # Discriminator predicts on synthetic _data
        synth_data_pred = self._discriminator(synthetic_data, training=False)
        # Calculate Generator loss based on Discriminator's predictions of synthetic _data
        gen_loss = self.generator_loss(synth_data_pred)
        # Calculate Discriminator loss based on Discriminator's predictions of real and synthetic _data
        disc_loss = self.discriminator_loss(real_data_pred, synth_data_pred)
    gradients_of_generator = gen_tape.gradient(gen_loss, self._generator.trainable_variables)
    self.generator_optimizer.apply_gradients(zip(gradients_of_generator, self._generator.trainable_variables))
    return gen_loss, disc_loss
# Notice the use of `tf.function`. This annotation causes the function to be "compiled".
@tf.function()
def _train_step(self, real_data):
    # self.logger.info('Called')
    noise = tf.random.normal([self._batch_size, self._noise_dim])
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        # Generator creates synthetic _data
        synthetic_data = self._generator(noise, training=True)
        # Discriminator predicts on real _data
        real_data_pred = self._discriminator(real_data, training=True)
        # Discriminator predicts on synthetic _data
        synth_data_pred = self._discriminator(synthetic_data, training=True)
        # Calculate Generator loss based on Discriminator's predictions of synthetic _data
        gen_loss = self.generator_loss(synth_data_pred)
        # Calculate Discriminator loss based on Discriminator's predictions of real and synthetic _data
        disc_loss = self.discriminator_loss(real_data_pred, synth_data_pred)
    gradients_of_generator = gen_tape.gradient(gen_loss, self._generator.trainable_variables)
    gradients_of_discriminator = disc_tape.gradient(disc_loss, self._discriminator.trainable_variables)
    self.generator_optimizer.apply_gradients(zip(gradients_of_generator, self._generator.trainable_variables))
    self.discriminator_optimizer.apply_gradients(zip(gradients_of_discriminator,
                                                     self._discriminator.trainable_variables))
    return gen_loss, disc_loss
def train(self, job: Job, epochs=None) -> None:
    self.job = job
    start = time.time()
    update_time = start + 60 * 5  # 5 minutes
    if epochs is None:
        epochs = job.db_job.epochs
    update_calc = lambda e: int(e / 10) if int(e / 10) > 0 else 1
    update_rate = update_calc(epochs)
    # assert not np.any(np.isnan(data.normalized.values))
    self._batch_size = self._round_batch_size(job.db_job.batch_size)
    if len(self.data.normalized) < self._batch_size:
        self._batch_size = self._round_batch_size(len(self.data.normalized))
        self.logger.info(f'Specified batch_size greater than sample size: changing to {self._batch_size}')
    self.logger.debug(f'Starting training of {epochs} epochs with batch size of {self._batch_size}')
    self.logger.info(f'data.normalized.shape[0] = {self.data.normalized.shape[0]}')
    num_batches = int(len(self.data.normalized) / self._batch_size)
    gen_count = 0
    dis_count = 0
    disc_loss = 0.0
    gen_loss = 0.0
    for epoch in range(epochs):
        for batch_num in range(num_batches):  # Train on all data in each epoch
            # indexes = randint(0, self.data.normalized.shape[0], batch_size)  # Generate non-contiguous random indexes
            if self.data.normalized.shape[0] > self._batch_size:
                index = randint(0, self.data.normalized.shape[0] - self._batch_size)  # Generate a contiguous random slice
            else:
                index = 0
            real_data = self.data.normalized.values[index: index + self._batch_size]
            # self.logger.debug(f'Batch: {batch_num*batch_size} : {(batch_num*batch_size + batch_size)}')
            gen_loss, disc_loss = self._train_step(real_data)
            # else:
            #     if balance and gen_loss > disc_loss:  # Train just gen
            #         gen_loss, disc_loss = self._train_gen(real_data)
            #         gen_count += 1
            #     else:  # Train both again
            #         gen_loss, disc_loss = self._train_step(real_data)
            #         dis_count += 1
        # Save the model periodically
        if epoch % update_rate == 0 or time.time() > update_time:
            # etime = (time.time() - start)
            update_time = time.time() + 5 * 60  # 5 minutes
            self.logger.debug(f'Epoch: {epoch} \tgen loss: {gen_loss:.7f} \tdisc loss: {disc_loss:.7f} \t'
                              f'gen count: {gen_count} \tdis_count: {dis_count}')
            self.update_status(f'training epoch {epoch}')
            # self.logger.debug(f'Epoch: {epoch}\t time: {etime}\tDLoss = {d_loss:.4f} GLoss = {g_loss:.4f}')
            # self.checkpoint.save(file_prefix=self.checkpoint_prefix)
            if epoch > 100:
                self.save()
    self.logger.debug(f'Final gen loss: {gen_loss} disc loss: {disc_loss}')
    # self.checkpoint.save(file_prefix=self.checkpoint_prefix)
    job.db_job.training_time = int(time.time() - start)  # Round to nearest second
    self.update_status(JobStatus.trained)
    self.save()  # Save the generator and discriminator models
    self._trained = True
    # self.checkpoint.save(file_prefix=self.checkpoint_prefix)
Run with batch size of 10240 or higher.
I think I found the problem. Looking through the TF source, the value passed to CHECK_GT is an int32. For my model, the tensor shape for a batch of 1024 is [1024, 250, 250, 64], which totals 4,096,000,000 elements. If I cast that to an int32, the resulting value is -198967296, which is exactly what I see in the console when the program dies:

.\tensorflow/core/util/gpu_launch_config.h:129] Check failed: work_element_count > 0 (-198967296 vs. 0)

Since modern GPUs can handle massive models (my Titan RTX has 24GB), the int arguments fed into CHECK_GT (and similar functions) need to be changed to a 64-bit type.
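For reference, here is a quick sanity check of the wraparound in plain Python (just arithmetic, not TF code; the helper function is mine):

def to_int32(n):
    # Two's-complement wraparound of an arbitrary integer into a signed 32-bit value
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

work_element_count = 1024 * 250 * 250 * 64   # elements in the [1024, 250, 250, 64] tensor: 4,096,000,000
print(to_int32(work_element_count))          # -198967296, the exact value in the failed CHECK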
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.
I would like some feedback on my suggestion. I can put together a standalone test case if necessary, but it should be easy to guess that using int32 for the parameter size checks is going to cause problems as models get larger and GPU memory continues to increase.
Adding @rohan100jain from the TF core team for this issue. From the latest message, it might be an int32 overflow issue somewhere in TF.
In the meantime, please provide a reproducible example. Thanks.
It is without question an int32 overflow problem. Please look at the function
inline GpuLaunchConfig GetGpuLaunchConfig(int work_element_count, const Eigen::GpuDevice& d)
in tensorflow/core/util/gpu_launch_config.h. It takes a plain int as its first argument. When work_element_count exceeds the maximum value of a 32-bit int, the program aborts. You can check by calling it with a work_element_count of 2200000000 or greater on a system where int is 32 bits. Here is a standalone example that reproduces the crash:
import tensorflow as tf
from tensorflow.keras.layers import Reshape, Dense, BatchNormalization, Flatten, LeakyReLU, UpSampling2D, Conv2D
from tensorflow.keras.initializers import he_uniform
import numpy as np
noise_dim = 128
dim = 10
upSamp = 5
feature_size = 328
batch_size = 1024
kernel_init = he_uniform()
inputs = tf.keras.Input(shape=(noise_dim,))
dense = Dense(128 * dim * dim, kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.02))
x = dense(inputs)
x = BatchNormalization(momentum=0.8)(x)
x = LeakyReLU(0.2)(x)
x = Reshape((dim, dim, 128))(x)
x = UpSampling2D(size=(upSamp, upSamp))(x)
x = Conv2D(64, kernel_size=(5, 5), padding='same', kernel_initializer = kernel_init)(x)
x = BatchNormalization(momentum=0.8)(x)
x = LeakyReLU(0.2)(x)
x = UpSampling2D(size=(upSamp, upSamp))(x)
x = Conv2D(1, kernel_size=(5, 5), padding='same', kernel_initializer = kernel_init)(x)
x = BatchNormalization(momentum=0.8)(x)
x = Flatten()(x)
outputs = Dense(feature_size, activation='tanh', dtype=np.float32)(x) # All output between -1 and 1
generator = tf.keras.Model(inputs=inputs, outputs=outputs, name='Generator')
noise = tf.random.normal([batch_size, noise_dim])
@tf.function()
def train_step(data):
    with tf.GradientTape() as gen_tape:
        synthetic_data = generator(noise, training=True)

data = tf.random.normal([batch_size, feature_size])
train_step(data)
Make sure you run the code above on a GPU with enough memory that you don't get an OOM exception. This code was tested on an NVidia Titan RTX with 24GB of GPU memory.
Please let me know if the sample code above reproduced the problem and if it is helpful.
@csteel45 Thanks for posting this - I'm running into the same issue. Did you happen to find a workaround? (Other than reducing the shape of the tensor in question)
I'm running into a similar issue on 80GB A100 DGX machines.
I did not find a workaround other than modifying gpu_launch_config.h and recompiling, or reducing the tensor size. I am surprised they haven't fixed this yet.
Thanks, is gpu_launch_config.h the only place this needs editing?
Just to be clear, @csteel45, you made this change to tensorflow/core/util/gpu_launch_config.h:
template <typename DeviceFunc>
GpuLaunchConfig GetGpuLaunchConfig(int64_t work_element_count,
                                   const Eigen::GpuDevice& d, DeviceFunc func,
                                   size_t dynamic_shared_memory_size,
                                   int block_size_limit) {
  CHECK_GT(work_element_count, 0);
  GpuLaunchConfig config;
Apologies, it has been a while and I have moved on from that model. I really can't remember if there were any other changes. I did get it working, so there is a way. Try that change and if it fails, check the messages and make updates elsewhere as required. This was for a POC, so I didn't track any changes or take any notes; I just hacked away until it worked. I'm not sure I even have the fixed version lying around.
Any update about this issue?
I still run into this issue.
My environment:
- CUDA 11.4
- NVIDIA A100 (80GB)
- TensorFlow 2.9.0
The model is implemented using the Keras API and uses TF Dataset to load data.
In my implementation, I used VarLenFeature and to_dense to convert the SparseTensor to a dense tensor, which might cause a memory leak. After I changed to FixedLenFeature, the issue was fixed.
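Roughly, the change was along these lines (a minimal sketch; the feature name and length are placeholders, not my actual pipeline):

import tensorflow as tf

FEATURE_LEN = 328  # placeholder fixed record length

def parse_varlen(serialized):
    # Old approach: VarLenFeature yields a SparseTensor that then has to be densified
    spec = {'features': tf.io.VarLenFeature(tf.float32)}
    parsed = tf.io.parse_single_example(serialized, spec)
    return tf.sparse.to_dense(parsed['features'])

def parse_fixedlen(serialized):
    # New approach: FixedLenFeature parses straight to a dense tensor of known shape
    spec = {'features': tf.io.FixedLenFeature([FEATURE_LEN], tf.float32)}
    parsed = tf.io.parse_single_example(serialized, spec)
    return parsed['features']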
Could you tell me how you solved this problem? I am just starting to learn this, so I don't understand it.
@SeanLee97
I ran into the same issue on an A100 80GB GPU. Is there any update on this?
@yufang67 It might be a memory issue. Try decreasing the batch size.
@SeanLee97 Thanks. Yes, decreasing the batch size runs without the issue, but then we can't fully utilize the GPU memory. I ran successfully on an A100 40GB, and when switching to an A100 80GB I increased the batch size by around 30%, so it should not be an OOM issue (if the memory issue you mentioned is OOM :) ).
Apologies for the late response. This is exactly a memory issue: a 32-bit variable is used to store work_element_count instead of a 64-bit one. This limits the number of elements that can be counted, preventing you from taking advantage of the large memory capacity of GPUs like the Titan RTX. The workaround is to use smaller batch sizes and such, but then you can't maximize your training capability.
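For a rough bound, using the tensor shape from my earlier comment (swap in your own largest per-sample activation), you can estimate the biggest batch that keeps the element count inside a signed 32-bit integer:

INT32_MAX = 2**31 - 1                 # largest element count a signed 32-bit counter can hold
per_sample_elements = 250 * 250 * 64  # largest intermediate activation per sample in my model
max_batch = INT32_MAX // per_sample_elements
print(max_batch)                      # 536; batches above this overflow the 32-bit count for that layer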
-Chris
Hi @csteel45, thanks for the response. I tried fixing the int32 definition in tensorflow/core/util/gpu_launch_config.h in TF 2.9.3. Now I don't get the work_element_count > 0 error, but I get an overflow in the gradients during backprop. Do you have any idea about this?
File "/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 471, in _get_gradients grads = tape.gradient(loss, var_list, grad_loss) Node: 'gradient_tape/rnn_transducer/prediction/lstm/while/gradients/rnn_transducer/prediction/lstm/while/lstm_cell/mul_grad/BroadcastGradientArgs' Incompatible shapes: [0,-2147450880] vs. [-2147483648,32768]
It has been a few years since I patched my version of TF but I recall that you have to look for all of the places work_element_count is used and make sure that all of the variables that it is being copied into are int64 as well. It was a lot of trial and error and took me 3 or 4 days to get it patched to the point it was working for my model.
-Chris
@csteel45 Thanks for the suggestion. I will try it. Which TF version did you patch?
Maybe 1.13 if I remember, it has been a while.
-Chris
We have a patch for TF 2.4.4, but it doesn't work properly (training diverges) on 2.9 or later versions, probably due to refactoring differences between TF 2.9 and TF 2.4.4. The issue is that TF can't utilize the increased memory capacity of newer GPUs (80 GB), which is a significant limitation; it's surprising (at least to me) that the TensorFlow team hasn't prioritized a solution for this. We can share our patch for TF 2.4.4 as a PR if it helps.
I also ran into this problem. I use TF 1.15.5 with a large batch size. Can the TF development engineers help fix it?
@bit-pku-zdf, we no longer support TensorFlow 1.x related issues; please use the latest version of TensorFlow to get the latest fixes. You can also use Keras 3 with the TensorFlow backend; for more details, refer to https://keras.io/guides/migrating_to_keras_3/
I have written a custom Keras CNN-based GAN for synthesizing tabular datasets. The code works fine when I use a reasonable batch size (generally 64 to 1024). However, users are allowed to specify a batch size, and when they use large ones, I try to handle it by catching ResourceExhaustedError and stepping the batch size down. I found that doing this eventually leads to the "Check failed" error in the post title, and I can't catch the exception; the process just dies. This occurs in the following environments:
Windows
- TensorFlow 2.6.0 (pip install)
- CUDA 11.3
- Titan RTX 24GB Founders Edition card
- Driver: 465.89
AWS Linux (RHEL7)
- TensorFlow 2.6.1 (pip install)
- CUDA 11.3 and now 11.5
- V100
- Driver: 495.29.05
Batch shape is (10240, 328), so at least 2 samples (per related post). train_step function:

@tf.function()
def _train_step(self, real_data):
    noise = tf.random.normal([self._batch_size, self._noise_dim])
Also, this happens with and without mixed precision. Any help would be greatly appreciated.
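For context, the batch-size fallback I described looks roughly like this (a simplified sketch with illustrative names, not the actual code):

import tensorflow as tf

def train_with_fallback(train_step, real_data, requested_batch_size):
    # Retry a training step with progressively smaller batches when the GPU runs out of memory.
    batch_size = requested_batch_size
    while batch_size >= 2:
        try:
            return train_step(real_data[:batch_size]), batch_size
        except tf.errors.ResourceExhaustedError:
            batch_size //= 2  # OOM is a catchable Python exception, so stepping down works here
    raise RuntimeError('Could not find a batch size that fits in GPU memory')

The problem is that the "Check failed: work_element_count > 0" abort comes from a CHECK in native code, so it kills the process before the except clause above ever runs.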