Open csteel45 opened 3 years ago
Update: I downgraded the Amazon Linux instance to CUDA 10.2 and TensorFlow 2.3 and still have the same problem. I had seen in the release notes that CUDA 10.2 fixed an issue with very large batches and mixed precision, but it does not fix this.
@sanatmpa1,
Can you please take a look at the tested build configurations, which show the corresponding CUDA version for each version of TensorFlow. Also, please share fully reproducible standalone code to expedite the troubleshooting process. Thanks!
Tested with the following configuration on Windows 10 and AWS Linux (RHEL7) with hardware as described above:
- TensorFlow 2.6.0
- Python 3.8.5
- cuDNN 8.1
- CUDA 11.2
- pip 20.2.4
Full source not available, relevant source:

def create_generator(self):
    self.logger.debug('called')
    inputs = tf.keras.Input(shape=(self._noise_dim,))
    dense = Dense(128 * self.dim * self.dim,
                  kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.02))
    x = dense(inputs)
    x = BatchNormalization(momentum=0.8)(x)
    x = LeakyReLU(0.2)(x)
    x = Reshape((self.dim, self.dim, 128))(x)
    x = UpSampling2D(size=(self.upSamp, self.upSamp))(x)
    x = Conv2D(64, kernel_size=(5, 5), padding='same')(x)
    x = BatchNormalization(momentum=0.8)(x)
    x = LeakyReLU(0.2)(x)
    x = UpSampling2D(size=(self.upSamp, self.upSamp))(x)
    x = Conv2D(1, kernel_size=(5, 5), padding='same')(x)
    x = BatchNormalization(momentum=0.8)(x)
    x = Flatten()(x)
    outputs = Dense(self.feature_size, activation='tanh', dtype=np.float32)(x)  # All output between -1 and 1
    generator = tf.keras.Model(inputs=inputs, outputs=outputs, name='Generator')
    generator.compile(loss=self.generator_loss_function, optimizer=self.generator_optimizer)
    return generator
def create_discriminator(self):
    self.logger.debug('called')
    inputs = tf.keras.Input(shape=(self.feature_size,))
    dense = Dense(self.dim * self.dim * 32, input_shape=(self.feature_size,),
                  kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.02))
    x = dense(inputs)
    x = Reshape((self.dim, self.dim, 32))(x)
    # x = Conv2D(128, kernel_size=(5, 5), strides=(2, 2), padding='same')(x)
    # x = BatchNormalization(momentum=0.8)(x)
    # x = LeakyReLU(0.2)(x)
    # x = Dropout(rate=0.3)(x)
    x = Conv2D(64, kernel_size=(5, 5), strides=(2, 2), padding='same')(x)
    # x = BatchNormalization(momentum=0.8)(x)
    x = LeakyReLU(0.2)(x)
    x = Dropout(rate=0.25)(x)
    x = Conv2D(64, kernel_size=(5, 5), strides=(2, 2), padding='same')(x)
    # x = BatchNormalization(momentum=0.8)(x)
    x = LeakyReLU(0.2)(x)
    x = Dropout(rate=0.25)(x)
    x = Flatten()(x)
    outputs = Dense(1, activation='sigmoid', dtype=np.float32)(x)  # Original + dtype for mixed-precision
    # x = Dense(1)(x)  # https://medium.com/swlh/gan-generative-adversarial-network-3706ebfef77e
    discriminator = tf.keras.Model(inputs=inputs, outputs=outputs, name='Discriminator')
    discriminator.compile(loss=self.discriminator_loss_function,
                          optimizer=self.discriminator_optimizer, metrics=['accuracy'])
    return discriminator
def discriminator_loss(self, real_output, synthetic_output):
    # https://stackoverflow.com/questions/55936611/why-doesnt-the-discriminators-and-generators-loss-change
    cross_entropy = tf.keras.losses.BinaryCrossentropy()
    real_loss = cross_entropy(tf.ones_like(real_output), real_output)
    fake_loss = cross_entropy(tf.zeros_like(synthetic_output), synthetic_output)
    total_loss = (real_loss + fake_loss) / 2
    return total_loss

def generator_loss(self, fake_output) -> object:
    # https://stackoverflow.com/questions/55936611/why-doesnt-the-discriminators-and-generators-loss-change
    cross_entropy = tf.keras.losses.BinaryCrossentropy()
    gen_loss = cross_entropy(tf.ones_like(fake_output), fake_output)
    return gen_loss
# Notice the use of `tf.function`. This annotation causes the function to be "compiled".
@tf.function()
def _train_gen(self, real_data):
    # self.logger.info('Called')
    noise = tf.random.normal([self._batch_size, self._noise_dim])
    with tf.GradientTape() as gen_tape, tf.GradientTape() as _:
        # Generator creates synthetic _data
        synthetic_data = self._generator(noise, training=True)
        # Discriminator predicts on real _data
        real_data_pred = self._discriminator(real_data, training=False)
        # Discriminator predicts on synthetic _data
        synth_data_pred = self._discriminator(synthetic_data, training=False)
        # Calculate Generator loss based on Discriminator's predictions of synthetic _data
        gen_loss = self.generator_loss(synth_data_pred)
        # Calculate Discriminator loss based on Discriminator's predictions of real and synthetic _data
        disc_loss = self.discriminator_loss(real_data_pred, synth_data_pred)
    gradients_of_generator = gen_tape.gradient(gen_loss, self._generator.trainable_variables)
    self.generator_optimizer.apply_gradients(zip(gradients_of_generator, self._generator.trainable_variables))
    return gen_loss, disc_loss
# Notice the use of `tf.function`. This annotation causes the function to be "compiled".
@tf.function()
def _train_step(self, real_data):
    # self.logger.info('Called')
    noise = tf.random.normal([self._batch_size, self._noise_dim])
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        # Generator creates synthetic _data
        synthetic_data = self._generator(noise, training=True)
        # Discriminator predicts on real _data
        real_data_pred = self._discriminator(real_data, training=True)
        # Discriminator predicts on synthetic _data
        synth_data_pred = self._discriminator(synthetic_data, training=True)
        # Calculate Generator loss based on Discriminator's predictions of synthetic _data
        gen_loss = self.generator_loss(synth_data_pred)
        # Calculate Discriminator loss based on Discriminator's predictions of real and synthetic _data
        disc_loss = self.discriminator_loss(real_data_pred, synth_data_pred)
    gradients_of_generator = gen_tape.gradient(gen_loss, self._generator.trainable_variables)
    gradients_of_discriminator = disc_tape.gradient(disc_loss, self._discriminator.trainable_variables)
    self.generator_optimizer.apply_gradients(zip(gradients_of_generator, self._generator.trainable_variables))
    self.discriminator_optimizer.apply_gradients(zip(gradients_of_discriminator,
                                                     self._discriminator.trainable_variables))
    return gen_loss, disc_loss
def train(self, job: Job, epochs=None) -> None:
    self.job = job
    start = time.time()
    update_time = start + 60 * 5  # 5 minutes
    if epochs is None:
        epochs = job.db_job.epochs
    update_calc = lambda e: int(e / 10) if int(e / 10) > 0 else 1
    update_rate = update_calc(epochs)
    # assert not np.any(np.isnan(data.normalized.values))
    self._batch_size = self._round_batch_size(job.db_job.batch_size)
    if len(self.data.normalized) < self._batch_size:
        self._batch_size = self._round_batch_size(len(self.data.normalized))
        self.logger.info(f'Specified batch_size greater than sample size: changing to {self._batch_size}')
    self.logger.debug(f'Starting training of {epochs} epochs with batch size of {self._batch_size}')
    self.logger.info(f'data.normalized.shape[0] = {self.data.normalized.shape[0]}')
    num_batches = int(len(self.data.normalized) / self._batch_size)
    gen_count = 0
    dis_count = 0
    disc_loss = 0.0
    gen_loss = 0.0
    for epoch in range(epochs):
        for batch_num in range(num_batches):  # Train on all data in each epoch
            # indexes = randint(0, self.data.normalized.shape[0], batch_size)  # Generate non-contiguous random indexes
            if self.data.normalized.shape[0] > self._batch_size:
                index = randint(0, self.data.normalized.shape[0] - self._batch_size)  # Generate a contiguous random slice
            else:
                index = 0
            real_data = self.data.normalized.values[index: index + self._batch_size]
            # self.logger.debug(f'Batch: {batch_num*batch_size} : {(batch_num*batch_size + batch_size)}')
            gen_loss, disc_loss = self._train_step(real_data)
            # else:
            #     if balance and gen_loss > disc_loss:  # Train just gen
            #         gen_loss, disc_loss = self._train_gen(real_data)
            #         gen_count += 1
            #     else:  # Train both again
            #         gen_loss, disc_loss = self._train_step(real_data)
            #         dis_count += 1
        # Save the model periodically
        if epoch % update_rate == 0 or time.time() > update_time:
            # etime = (time.time() - start)
            update_time = time.time() + 5 * 60  # 5 minutes
            self.logger.debug(f'Epoch: {epoch} \tgen loss: {gen_loss:.7f} \tdisc loss: {disc_loss:.7f} \t'
                              f'gen count: {gen_count} \tdis_count: {dis_count}')
            self.update_status(f'training epoch {epoch}')
            # self.logger.debug(f'Epoch: {epoch}\t time: {etime}\tDLoss = {d_loss:.4f} GLoss = {g_loss:.4f}')
            # self.checkpoint.save(file_prefix=self.checkpoint_prefix)
            if epoch > 100:
                self.save()
    self.logger.debug(f'Final gen loss: {gen_loss} disc loss: {disc_loss}')
    # self.checkpoint.save(file_prefix=self.checkpoint_prefix)
    job.db_job.training_time = int(time.time() - start)  # Round to nearest second
    self.update_status(JobStatus.trained)
    self.save()  # Save the generator and discriminator models
    self._trained = True
    # self.checkpoint.save(file_prefix=self.checkpoint_prefix)
Run with batch size of 10240 or higher.
I think I found the problem. Looking through the TF source, the value passed to CHECK_GT is an int32. For my model, the tensor shape for a batch of 1024 is [1024, 250, 250, 64], which totals 4,096,000,000 elements. If I cast that to an int32, the resulting value is -198967296, which is exactly what I see in the console when the program dies:

.\tensorflow/core/util/gpu_launch_config.h:129] Check failed: work_element_count > 0 (-198967296 vs. 0)

Since modern GPUs can handle massive models (my Titan RTX has 24GB), the int arguments fed into CHECK_GT (and similar functions) need to be changed to a 64-bit type.
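For reference, here is a quick sanity check of the wraparound in plain Python (just arithmetic, not TF code; the helper function is mine):

def to_int32(n):
    # Two's-complement wraparound of an arbitrary integer into a signed 32-bit value
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

work_element_count = 1024 * 250 * 250 * 64   # elements in the [1024, 250, 250, 64] tensor: 4,096,000,000
print(to_int32(work_element_count))          # -198967296, the exact value in the failed CHECK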
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.
I would like some feedback on my suggestion. I can put together a standalone test case if necessary, but it should be easy to guess that using int32 for the parameter size checks is going to cause problems as models get larger and GPU memory continues to increase.
Adding @rohan100jain from the TF core team for this issue. From the latest message, it might be an int32 overflow issue somewhere in TF.
In the meantime, please provide a reproducible example. Thanks.
It is without question an int32 overflow problem. Please look at the function
inline GpuLaunchConfig GetGpuLaunchConfig(int work_element_count, const Eigen::GpuDevice& d)
in tensorflow/core/util/gpu_launch_config.h. It takes a plain int as its first argument. When work_element_count exceeds the maximum value of a 32-bit int, the program aborts. You can check by calling it with a work_element_count of 2200000000 or greater on a system where int is 32 bits. Here is a standalone example that reproduces the crash:
import tensorflow as tf
from tensorflow.keras.layers import Reshape, Dense, BatchNormalization, Flatten, LeakyReLU, UpSampling2D, Conv2D
from tensorflow.keras.initializers import he_uniform
import numpy as np
noise_dim = 128
dim = 10
upSamp = 5
feature_size = 328
batch_size = 1024
kernel_init = he_uniform()
inputs = tf.keras.Input(shape=(noise_dim,))
dense = Dense(128 * dim * dim, kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.02))
x = dense(inputs)
x = BatchNormalization(momentum=0.8)(x)
x = LeakyReLU(0.2)(x)
x = Reshape((dim, dim, 128))(x)
x = UpSampling2D(size=(upSamp, upSamp))(x)
x = Conv2D(64, kernel_size=(5, 5), padding='same', kernel_initializer = kernel_init)(x)
x = BatchNormalization(momentum=0.8)(x)
x = LeakyReLU(0.2)(x)
x = UpSampling2D(size=(upSamp, upSamp))(x)
x = Conv2D(1, kernel_size=(5, 5), padding='same', kernel_initializer = kernel_init)(x)
x = BatchNormalization(momentum=0.8)(x)
x = Flatten()(x)
outputs = Dense(feature_size, activation='tanh', dtype=np.float32)(x) # All output between -1 and 1
generator = tf.keras.Model(inputs=inputs, outputs=outputs, name='Generator')
noise = tf.random.normal([batch_size, noise_dim])
@tf.function()
def train_step(data):
    with tf.GradientTape() as gen_tape:
        synthetic_data = generator(noise, training=True)

data = tf.random.normal([batch_size, feature_size])
train_step(data)
Make sure you run the code above on a GPU with enough memory that you don't get an OOM exception. This code was tested on an NVidia Titan RTX with 24GB of GPU memory.
Please let me know if the sample code above reproduced the problem and if it is helpful.
@csteel45 Thanks for posting this - I'm running into the same issue. Did you happen to find a workaround? (Other than reducing the shape of the tensor in question)
I'm running into a similar issue on 80GB A100 DGX machines.
I did not find a workaround other than modifying gpu_launch_config.h and recompiling, or reducing the tensor size. I am surprised they haven't fixed this yet.
Thanks, is gpu_launch_config.h the only place this needs editing?
Just to be clear, @csteel45, you made this change to tensorflow/core/util/gpu_launch_config.h:
template <typename DeviceFunc>
GpuLaunchConfig GetGpuLaunchConfig(int64_t work_element_count,
                                   const Eigen::GpuDevice& d, DeviceFunc func,
                                   size_t dynamic_shared_memory_size,
                                   int block_size_limit) {
  CHECK_GT(work_element_count, 0);
  GpuLaunchConfig config;
Apologies, it has been a while and I have moved on from that model. I really can't remember if there were any other changes. I did get it working, so there is a way. Try that change and if it fails, check the messages and make updates elsewhere as required. This was for a POC, so I didn't track any changes or take any notes; I just hacked away until it worked. I'm not sure I even have the fixed version lying around.
Any update about this issue?
I still run into this issue.
My environment:
- CUDA 11.4
- NVIDIA A100 (80GB)
- TensorFlow 2.9.0
The model is implemented using the Keras API and uses TF Dataset to load data.
In my implementation, I used VarLenFeature and to_dense to convert the SparseTensor to a dense tensor, which might cause a memory leak. After I changed to FixedLenFeature, the issue was fixed.
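Roughly, the change was along these lines (a minimal sketch; the feature name and length are placeholders, not my actual pipeline):

import tensorflow as tf

FEATURE_LEN = 328  # placeholder fixed record length

def parse_varlen(serialized):
    # Old approach: VarLenFeature yields a SparseTensor that then has to be densified
    spec = {'features': tf.io.VarLenFeature(tf.float32)}
    parsed = tf.io.parse_single_example(serialized, spec)
    return tf.sparse.to_dense(parsed['features'])

def parse_fixedlen(serialized):
    # New approach: FixedLenFeature parses straight to a dense tensor of known shape
    spec = {'features': tf.io.FixedLenFeature([FEATURE_LEN], tf.float32)}
    parsed = tf.io.parse_single_example(serialized, spec)
    return parsed['features']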
Could you tell me how you solved this problem? I am just starting to learn this, so I don't understand it.
@SeanLee97
I ran into the same issue on an A100 80GB GPU. Is there any update on this?
@yufang67 It might be a memory issue. Try decreasing the batch size.
@SeanLee97 Thanks. Yes, decreasing the batch size runs without the issue, but then we can't fully utilize the GPU memory. I ran successfully on an A100 40GB, and when switching to an A100 80GB I increased the batch size by around 30%, so it should not be an OOM issue (if the memory issue you mentioned is OOM :) ).
Apologies for the late response. This is exactly a memory issue: a 32-bit variable is used to store work_element_count instead of a 64-bit one. This limits the number of elements that can be counted, preventing you from taking advantage of the large memory capacity of GPUs like the Titan RTX. The workaround is to use smaller batch sizes and such, but then you can't maximize your training capability.
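For a rough bound, using the tensor shape from my earlier comment (swap in your own largest per-sample activation), you can estimate the biggest batch that keeps the element count inside a signed 32-bit integer:

INT32_MAX = 2**31 - 1                 # largest element count a signed 32-bit counter can hold
per_sample_elements = 250 * 250 * 64  # largest intermediate activation per sample in my model
max_batch = INT32_MAX // per_sample_elements
print(max_batch)                      # 536; batches above this overflow the 32-bit count for that layer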
-Chris
Hi @csteel45, thanks for the response. I tried fixing the int32 definition in tensorflow/core/util/gpu_launch_config.h in TF 2.9.3. Now I don't get the work_element_count > 0 error, but I get an overflow in the gradients during backprop. Do you have any idea about this?
File "/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 471, in _get_gradients grads = tape.gradient(loss, var_list, grad_loss) Node: 'gradient_tape/rnn_transducer/prediction/lstm/while/gradients/rnn_transducer/prediction/lstm/while/lstm_cell/mul_grad/BroadcastGradientArgs' Incompatible shapes: [0,-2147450880] vs. [-2147483648,32768]
It has been a few years since I patched my version of TF but I recall that you have to look for all of the places work_element_count is used and make sure that all of the variables that it is being copied into are int64 as well. It was a lot of trial and error and took me 3 or 4 days to get it patched to the point it was working for my model.
-Chris
@csteel45 Thanks for the suggestion. I will try it. Which TF version did you patch?
Maybe 1.13 if I remember, it has been a while.
-Chris
We have a patch for TF 2.4.4, but it doesn't work properly (training diverges) on 2.9 or later versions, probably due to refactoring differences between TF 2.9 and TF 2.4.4. The issue is that TF can't utilize the increased memory capacity of newer GPUs (80 GB), which is a significant limitation; it's surprising (at least to me) that the TensorFlow team hasn't prioritized a solution for this. We can share our patch for TF 2.4.4 as a PR if it helps.
I also ran into this problem. I use TF 1.15.5 with a large batch size. Can the TF development engineers help fix it?
@bit-pku-zdf, we no longer support TensorFlow 1.x related issues; please use the latest version of TensorFlow to get the latest fixes. You can also use Keras 3 with the TensorFlow backend; for more details, refer to https://keras.io/guides/migrating_to_keras_3/
I have written a custom Keras CNN-based GAN for synthesizing tabular datasets. The code works fine when I use a reasonable batch size (generally 64 to 1024). However, users are allowed to specify a batch size, and when they use large ones, I try to handle it by catching ResourceExhaustedError and stepping the batch size down. I found that doing this eventually leads to the "Check failed" error in the post title, and I can't catch the exception; the process just dies. This occurs in the following environments:
Windows
- TensorFlow 2.6.0 (pip install)
- CUDA 11.3
- Titan RTX 24GB Founders Edition card
- Driver: 465.89
AWS Linux (RHEL7)
- TensorFlow 2.6.1 (pip install)
- CUDA 11.3 and now 11.5
- V100
- Driver: 495.29.05
Batch shape is (10240, 328), so at least 2 samples (per related post). train_step function:

@tf.function()
def _train_step(self, real_data):
    noise = tf.random.normal([self._batch_size, self._noise_dim])
Also, this happens with and without mixed precision. Any help would be greatly appreciated.
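For context, the batch-size fallback I described looks roughly like this (a simplified sketch with illustrative names, not the actual code):

import tensorflow as tf

def train_with_fallback(train_step, real_data, requested_batch_size):
    # Retry a training step with progressively smaller batches when the GPU runs out of memory.
    batch_size = requested_batch_size
    while batch_size >= 2:
        try:
            return train_step(real_data[:batch_size]), batch_size
        except tf.errors.ResourceExhaustedError:
            batch_size //= 2  # OOM is a catchable Python exception, so stepping down works here
    raise RuntimeError('Could not find a batch size that fits in GPU memory')

The problem is that the "Check failed: work_element_count > 0" abort comes from a CHECK in native code, so it kills the process before the except clause above ever runs.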