tensorflow / recommenders-addons

Additional utils and helpers to extend TensorFlow when building recommendation systems, contributed and maintained by SIG Recommenders.
Apache License 2.0

ERROR: tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op_gpu.cu.cc:413: CUDA error 2: out of memory #371

Closed. VikashPeddakota999 closed this issue 10 months ago.

VikashPeddakota999 commented 11 months ago

System information

Describe the bug

The embedding table is pretty small, less than 30 MB (checked by the folder size after running it on a single GPU).

Code to reproduce the issue

Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

rhdong commented 11 months ago

Hi @VikashPeddakota999, thank you for your feedback. We will resolve the issue ASAP and get back to you. Thank you!

rhdong commented 11 months ago

Hi @VikashPeddakota999, could you try checking and setting TensorFlow's GPUOptions configuration to enable HBM memory growth? Please refer to link. Thank you!
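
For reference, a minimal sketch of enabling memory growth with the TF 2.x eager API (the equivalent for graph-mode sessions is tf.compat.v1.GPUOptions(allow_growth=True)):

import tensorflow as tf

# Let TensorFlow grow its GPU (HBM) allocation on demand instead of reserving
# almost all device memory up front, leaving headroom for allocations made
# outside TensorFlow's allocator (e.g. by custom CUDA kernels).
for gpu in tf.config.list_physical_devices('GPU'):
  tf.config.experimental.set_memory_growth(gpu, True)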

VikashPeddakota999 commented 11 months ago

Thanks @rhdong. An additional data point that might help: when training with multiple GPUs, the following works (saving on GPU 0):

if hvd.rank() == 0:
    model.layers[0].params.save_to_file_system(dirpath="emb_weights_2layers_2g", proc_size=hvd.size(), proc_rank=hvd.rank())

but when trying to save on GPU 1 (as below), it fails with the same error:

if hvd.rank() == 1:
    model.layers[0].params.save_to_file_system(dirpath="emb_weights_2layers_2g", proc_size=hvd.size(), proc_rank=hvd.rank())

[1,1]:terminate called after throwing an instance of 'nv::CudaException' [1,1]: what(): tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op_gpu.cu.cc:413: CUDA error 2: out of memory [1,1]:[27c929c21716:06036] Process received signal [1,1]:[27c929c21716:06036] Signal: Aborted (6) [1,1]:[27c929c21716:06036] Signal code: (-6) [1,1]:[27c929c21716:06036] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7ff70c23e090] [1,1]:[27c929c21716:06036] [ 1] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7ff70c23e00b] [1,1]:[27c929c21716:06036] [ 2] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7ff70c21d859] [1,1]:[27c929c21716:06036] [ 3] [1,1]:/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7ff70acb7911] [1,1]:[27c929c21716:06036] [ 4] [1,1]:/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7ff70acc338c] [1,1]:[27c929c21716:06036] [ 5] [1,1]:/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7ff70acc33f7] [1,1]:[27c929c21716:06036] [ 6] [1,1]:/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7ff70acc36a9] [1,1]:[27c929c21716:06036] [ 7] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN2nv11cuda_check_E9cudaErrorPKci+0x246)[0x7ff555b6d7e6] [1,1]:[27c929c21716:06036] [ 8] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons6lookup27CuckooHashTableOfTensorsGpuIlfE20SaveToFileSystemImplEPNS_10FileSystemEmRKSsmbRP11CUstream_st+0x477)[0x7ff555b81097] [1,1]:[27c929c21716:06036] [ 9] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons30HashTableSaveToFileSystemGpuOpIlfE7ComputeEPNS_15OpKernelContextE+0x29b)[0x7ff555b81f9b]

VikashPeddakota999 commented 11 months ago

> Hi @VikashPeddakota999, could you try checking and setting TensorFlow's GPUOptions configuration to enable HBM memory growth? Please refer to link. Thank you!

Sure @rhdong, will try this. Thanks!

rhdong commented 11 months ago

You're welcome!

VikashPeddakota999 commented 11 months ago

> You're welcome!

@rhdong I'm already using allow_growth=True. Is there any other specific config parameter I'm missing? I'm using the following configs:

physical_devices = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(physical_devices[hvd.local_rank()], 'GPU')
tf.config.experimental.set_memory_growth(physical_devices[hvd.local_rank()], True)
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"  # VERY IMPORTANT!
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"

VikashPeddakota999 commented 11 months ago

@rhdong @MoFHeka Attaching the training code I'm using below, in case you want to replicate the issue:

from typing import List, Dict, Optional, Set, Tuple, get_type_hints
import os
import shutil
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_addons as tfa
from absl import flags
from absl import app
from tensorflow_recommenders_addons import dynamic_embedding as de
import horovod.tensorflow as hvd

os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"  #VERY IMPORTANT!
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"
hvd.init()
if hvd.rank() > 0:
  os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

physical_devices = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(physical_devices[hvd.local_rank()], 'GPU')
tf.config.experimental.set_memory_growth(physical_devices[hvd.local_rank()],
                                         True)

os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit'
tf.config.experimental.set_synchronous_execution(False)

data_config_file = "../config/data.conf"

num_shards_per_host = 16 #hvd.size()*2
batch_size = 4096
shuffle_buffer_size = 10000

def get_dataset(batch_size=1):
  ds = tfds.load("movielens/1m-ratings",
                 split="train",
                 data_dir="/dataset",
                 download=False)

  features = ds.map(
      lambda x: ({
          "movie_id":
              tf.strings.to_number(x["movie_id"], tf.int64),
          "movie_genres":
              tf.cast(x["movie_genres"][0], tf.int64),
          "user_id":
              tf.strings.to_number(x["user_id"], tf.int64),
          "user_gender":
              tf.cast(x["user_gender"], tf.int64),
          "user_occupation_label":
              tf.cast(x["user_occupation_label"], tf.int64),
          "timestamp":
              tf.cast(x["timestamp"] - 880000000, tf.int64),

      }, tf.one_hot(tf.cast(x["user_rating"], tf.int64), 5)))

  shuffled = features.shuffle(1_000_000,
                             seed=2021,
                             reshuffle_each_iteration=False)
  dataset = shuffled.batch(batch_size).prefetch(tf.data.AUTOTUNE).repeat()
  return dataset 

dataset = get_dataset(batch_size)

# for i,data in enumerate(dataset):
#     print(data)
#     break

#using just userid, postid for testing placed on gpu0 and gpu1 respectively
class ChannelEmbeddingLayers(tf.keras.Model):

  def __init__(self,
               name='',
               dense_embedding_size=1,
               sparse_embedding_size=1,
               embedding_initializer=tf.keras.initializers.Zeros(),
               mpi_size=1,
               mpi_rank=0,
               is_training=True):

    self.devices = ["GPU:0", "GPU:1", "GPU:2", "GPU:3"]  # names of the physical GPU devices
    if is_training:
      de.enable_train_mode()
      if embedding_initializer is None:
        embedding_initializer = tf.keras.initializers.VarianceScaling()
    else:
      de.enable_inference_mode()
      if embedding_initializer is None:
        embedding_initializer = tf.keras.initializers.Zeros()

    super(ChannelEmbeddingLayers, self).__init__()
    # The saver parameter of kv_creator saves the K-V in the hash table into a separate KV file.
    self.kv_creator1 =  de.CuckooHashTableCreator(saver=de.FileSystemSaver(#save_path = "fs1",
                                                                           proc_size=mpi_size, proc_rank=mpi_rank))
    self.kv_creator2 =  de.CuckooHashTableCreator(saver=de.FileSystemSaver(#save_path = "fs2",
                                                                           proc_size=mpi_size, proc_rank=mpi_rank))

    self.layer_1 =  de.keras.layers.HvdAllToAllEmbedding(
        mpi_size=mpi_size,
        embedding_size=dense_embedding_size,
        key_dtype=tf.int64,
        value_dtype=tf.float32,
        initializer=embedding_initializer,
        devices=self.devices[0],
        name=name + '_layer1',
        bp_v2=True,
        init_capacity=4500000,
        kv_creator=self.kv_creator1)

    self.layer_2 =  de.keras.layers.HvdAllToAllEmbedding(
        mpi_size=mpi_size,
        embedding_size=sparse_embedding_size,
        key_dtype=tf.int64,
        value_dtype=tf.float32,
        initializer=embedding_initializer,
        devices=self.devices[1],
        name=name + '_layer2',
        init_capacity=4500000,
        bp_v2=True,
        kv_creator=self.kv_creator2)

    self.dnn3 = tf.keras.layers.Dense(
        5,
        activation='softmax',
        kernel_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1),
        bias_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1))
    self.tower1 = tf.keras.layers.Dense(
        32,
        activation='relu',
        kernel_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1),
        bias_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1))
    self.tower2 = tf.keras.layers.Dense(
        32,
        activation='relu',
        kernel_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1),
        bias_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1))

  def __call__(self, features_info, training=False):

    ### user tower
    device_1_inputs = [features_info["user_id"]]
    print("device_1_inputs: ", device_1_inputs)
    device_1_output = self.layer_1(device_1_inputs)[0, :, :]
    print("device_1_output: ", device_1_output)
    tower1_output = self.tower1(device_1_output)
    print("tower1_output: ", tower1_output)

    device_2_inputs = [features_info["movie_id"]]                   
    device_2_output = self.layer_2(device_2_inputs)[0, :, :]
    tower2_output = self.tower2(device_2_output)
    print("tower2_output: ", tower2_output)

    embeddings_concat = tf.keras.layers.Concatenate(axis=1)([tower1_output, tower2_output])
    print("embeddings_concat: ", embeddings_concat)
    x = self.dnn3(embeddings_concat)

    print("x: ", x)
    return x

embedding_size = 32
model = ChannelEmbeddingLayers("recall", embedding_size, embedding_size,
                                tf.keras.initializers.RandomNormal(0.0, 0.5),
                                hvd.size(), hvd.rank())

optimizer = tfa.optimizers.LazyAdam(1E-3)
optimizer = de.DynamicEmbeddingOptimizer(optimizer)
model.compile(optimizer=optimizer,
             loss=tf.keras.losses.BinaryCrossentropy()
               )

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir= "/data/recall_debug_tb", profile_batch = '3,12')

# horovod callback is used to broadcast the value generated by initializer of rank0.
hvd_opt_init_callback = de.keras.callbacks.DEHvdBroadcastGlobalVariablesCallback(
      root_rank=0)
callbacks_list = [hvd_opt_init_callback]
if hvd.rank() == 0:
   callbacks_list.extend([tensorboard_callback])

print("size model.layer_1.params: ", model.layer_1.params.size())
print("size model.layer_2.params: ", model.layer_2.params.size())

print("======= training with batchize =======", batch_size)
model.fit(dataset,
            callbacks=callbacks_list,
            epochs=1,
            steps_per_epoch=1000,
            verbose=1 if hvd.rank() == 0 else 0 ) #if hvd.rank() == 0 else 0

print("size model.layer_1.params: ", model.layer_1.params.size())
print("size model.layer_2.params: ", model.layer_2.params.size())

if hvd.rank() == 0:
    print("saving in gpu 0")
    model.layers[0].params.save_to_file_system(dirpath="recall_debug_gpu0",
                                         proc_size=hvd.size(),
                                         proc_rank=hvd.rank())
    print("saving finished in gpu 0")

print("=======saving layer_1 emb========")
if hvd.rank() == 1:
    print("saving in gpu 1")
    model.layers[0].params.save_to_file_system(dirpath="recall_debug_gpu1",
                                         proc_size=hvd.size(),
                                         proc_rank=hvd.rank())
    print("saving finished in gpu 1")

I'm not able to understand the difference in behaviour between GPU 0 and GPU 1. Any help in resolving this would be highly appreciated. Thanks!

MoFHeka commented 11 months ago

@VikashPeddakota999 Additionally, please try running a sync op before saving begins and after it finishes. This prevents one worker from finishing while another worker is still in the middle of saving. Examples: https://github.com/tensorflow/recommenders-addons/blob/master/tensorflow_recommenders_addons/dynamic_embedding/python/keras/callbacks.py and https://github.com/tensorflow/recommenders-addons/blob/master/tensorflow_recommenders_addons/dynamic_embedding/python/keras/models.py. You can also use the de.models.de_hvd_save_model function to save the model.
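
For illustration, a minimal sketch of bracketing the save with a barrier, assuming hvd.init() has already been called and model is the Keras model from the script above (the directory name is just an example):

import horovod.tensorflow as hvd

# Barrier: wait until every worker reaches this point before any of them
# starts writing embedding files.
hvd.join()

model.layers[0].params.save_to_file_system(
    dirpath="emb_weights_2layers_2g",  # example path
    proc_size=hvd.size(),
    proc_rank=hvd.rank())

# Barrier again so that no worker exits (tearing down its CUDA/MPI state)
# while another worker is still saving.
hvd.join()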

VikashPeddakota999 commented 11 months ago

@MoFHeka Got the same error, along with something related to Horovod init. Can we reopen the issue, please?

[1,0]:2023-11-27 17:19:28.543524: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at mpi_ops.cc:1604 : FAILED_PRECONDITION: Horovod has not been initialized; use hvd.init(). [1,1]:terminate called after throwing an instance of 'nv::CudaException' [1,1]: what(): tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op_gpu.cu.cc:413: CUDA error 2: out of memory [1,1]:[27c929c21716:87519] Process received signal [1,1]:[27c929c21716:87519] Signal: Aborted (6) [1,1]:[27c929c21716:87519] Signal code: (-6) [1,1]:[27c929c21716:87519] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f11e0cdd090] [1,1]:[27c929c21716:87519] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f11e0cdd00b] [1,1]:[27c929c21716:87519] [ 2] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f11e0cbc859] [1,1]:[27c929c21716:87519] [ 3] [1,1]:/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f11df756911] [1,1]:[27c929c21716:87519] [ 4] [1,1]:/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f11df76238c] [1,1]:[27c929c21716:87519] [ 5] [1,1]:/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7f11df7623f7] [1,1]:[27c929c21716:87519] [ 6] [1,1]:/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7f11df7626a9] [1,1]:[27c929c21716:87519] [ 7] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN2nv11cuda_check_E9cudaErrorPKci+0x246)[0x7f102a64c7e6] [1,1]:[27c929c21716:87519] [ 8] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons6lookup27CuckooHashTableOfTensorsGpuIlfE20SaveToFileSystemImplEPNS_10FileSystemEmRKSsmbRP11CUstream_st+0x477)[0x7f102a660097] [1,1]:[27c929c21716:87519] [ 9] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons30HashTableSaveToFileSystemGpuOpIlfE7ComputeEPNS_15OpKernelContextE+0x29b)[0x7f102a660f9b] [1,1]:[27c929c21716:87519] [10] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x2e0)[0x7f11af3b5f40] [1,1]:[27c929c21716:87519] [11] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow17KernelAndDeviceOp3RunEPNS_19ScopedStepContainerERKNS_15EagerKernelArgsEPSt6vectorIN4absl12lts_202103247variantIJNS_6TensorENS_11TensorShapeEEEESaISC_EEPNS_19CancellationManagerERKNS8_8optionalINS_19EagerFunctionParamsEEERKNSI_INS_17ManagedStackTraceEEEPNS_24CoordinationServiceAgentE+0x97f)[0x7f11bc3bd29f] [1,1]:[27c929c21716:87519] [12] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow18EagerKernelExecuteEPNS_12EagerContextERKN4absl12lts_2021032413InlinedVectorIPNS_12TensorHandleELm4ESaIS6_EEERKNS3_8optionalINS_19EagerFunctionParamsEEERKSt10unique_ptrINS_15KernelAndDeviceENS_4core15RefCountDeleterEEPNS_14GraphCollectorEPNS_19CancellationManagerENS3_4SpanIS6_EERKNSB_INS_17ManagedStackTraceEEE+0x2dc)[0x7f11b57cb95c] [1,1]:[27c929c21716:87519] [13] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow16AsyncExecuteNode3RunEv+0x189)[0x7f11b57cbe59] [1,1]:[27c929c21716:87519] [14] 
[1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor7RunItemESt10unique_ptrINS0_8NodeItemENS_4core15RefCountDeleterEEb+0x456)[0x7f11bc8fe0a6] [1,1]:[27c929c21716:87519] [15] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor3RunEv+0xfc)[0x7f11bc900eac] [1,1]:[27c929c21716:87519] [16] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x11e29b5)[0x7f11afbe29b5] [1,1]:[27c929c21716:87519] [17] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f11e0c7f609] [1,1]:[27c929c21716:87519] [18] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f11e0db9133] [1,1]:[27c929c21716:87519] End of error message

cmgreen210 commented 11 months ago

I think the confusion stems from this line in the keras Horovod demo, which sets

tf.config.experimental.set_synchronous_execution(False)

This (at least in my experimentation) makes lines like hvd.join() not behave as expected, i.e. they don't actually set a barrier.
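
A minimal sketch of the corresponding workaround, assuming the call was only copied over from the demo and is not otherwise needed:

import tensorflow as tf

# Either delete the set_synchronous_execution(False) line from the training
# script, or explicitly restore synchronous eager execution so that collective
# calls such as hvd.join() act as real barriers.
tf.config.experimental.set_synchronous_execution(True)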

rhdong commented 11 months ago

Hi @MoFHeka, I've noticed there might be a few remaining issues. If you have a moment, would you be able to address them?

MoFHeka commented 10 months ago

> I think the confusion stems from this line in the keras Horovod demo, which sets
>
> tf.config.experimental.set_synchronous_execution(False)
>
> This (at least in my experimentation) makes lines like hvd.join() not behave as expected, i.e. they don't actually set a barrier.

@cmgreen210 I'm not sure hvd.join() would be disabled by set_synchronous_execution. According to this link, that setting only influences SyncExecutors. The hvd.join() operator is only used to make sure there are no file conflicts and that no worker dies before saving finishes. There is no while_loop or anything else that could trigger TensorFlow's automatic concurrency, and the save/restore ops are all sync kernels, wrapped in a lot of Python code outside the TF eager graph.

I may be wrong, but I think that hvd.join() should not be run before the save is complete.

MoFHeka commented 10 months ago
OOM Error > **[1,0]:2023-11-27 17:19:28.543524: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at mpi_ops.cc:1604 : FAILED_PRECONDITION: Horovod has not been initialized; use hvd.init().** [1,1]:terminate called after throwing an instance of 'nv::CudaException' [1,1]: what(): tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op_gpu.cu.cc:413: CUDA error 2: out of memory [1,1]:[27c929c21716:87519] *** Process received signal *** [1,1]:[27c929c21716:87519] Signal: Aborted (6) [1,1]:[27c929c21716:87519] Signal code: (-6) [1,1]:[27c929c21716:87519] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f11e0cdd090] [1,1]:[27c929c21716:87519] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f11e0cdd00b] [1,1]:[27c929c21716:87519] [ 2] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f11e0cbc859] [1,1]:[27c929c21716:87519] [ 3] [1,1]:/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f11df756911] [1,1]:[27c929c21716:87519] [ 4] [1,1]:/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f11df76238c] [1,1]:[27c929c21716:87519] [ 5] [1,1]:/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7f11df7623f7] [1,1]:[27c929c21716:87519] [ 6] [1,1]:/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7f11df7626a9] [1,1]:[27c929c21716:87519] [ 7] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN2nv11cuda_check_E9cudaErrorPKci+0x246)[0x7f102a64c7e6] [1,1]:[27c929c21716:87519] [ 8] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons6lookup27CuckooHashTableOfTensorsGpuIlfE20SaveToFileSystemImplEPNS_10FileSystemEmRKSsmbRP11CUstream_st+0x477)[0x7f102a660097] [1,1]:[27c929c21716:87519] [ 9] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons30HashTableSaveToFileSystemGpuOpIlfE7ComputeEPNS_15OpKernelContextE+0x29b)[0x7f102a660f9b] [1,1]:[27c929c21716:87519] [10] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x2e0)[0x7f11af3b5f40] [1,1]:[27c929c21716:87519] [11] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow17KernelAndDeviceOp3RunEPNS_19ScopedStepContainerERKNS_15EagerKernelArgsEPSt6vectorIN4absl12lts_202103247variantIJNS_6TensorENS_11TensorShapeEEEESaISC_EEPNS_19CancellationManagerERKNS8_8optionalINS_19EagerFunctionParamsEEERKNSI_INS_17ManagedStackTraceEEEPNS_24CoordinationServiceAgentE+0x97f)[0x7f11bc3bd29f] [1,1]:[27c929c21716:87519] [12] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow18EagerKernelExecuteEPNS_12EagerContextERKN4absl12lts_2021032413InlinedVectorIPNS_12TensorHandleELm4ESaIS6_EEERKNS3_8optionalINS_19EagerFunctionParamsEEERKSt10unique_ptrINS_15KernelAndDeviceENS_4core15RefCountDeleterEEPNS_14GraphCollectorEPNS_19CancellationManagerENS3_4SpanIS6_EERKNSB_INS_17ManagedStackTraceEEE+0x2dc)[0x7f11b57cb95c] [1,1]:[27c929c21716:87519] [13] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow16AsyncExecuteNode3RunEv+0x189)[0x7f11b57cbe59] [1,1]:[27c929c21716:87519] [14] 
[1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor7RunItemESt10unique_ptrINS0_8NodeItemENS_4core15RefCountDeleterEEb+0x456)[0x7f11bc8fe0a6] [1,1]:[27c929c21716:87519] [15] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor3RunEv+0xfc)[0x7f11bc900eac] [1,1]:[27c929c21716:87519] [16] [1,1]:/usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x11e29b5)[0x7f11afbe29b5] [1,1]:[27c929c21716:87519] [17] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f11e0c7f609] [1,1]:[27c929c21716:87519] [18] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f11e0db9133] [1,1]:[27c929c21716:87519] *** End of error message ***

@VikashPeddakota999 Please also try setting a smaller buffer_size parameter on the save/restore function.
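
For example, a sketch of the rank-1 save with a smaller buffer, assuming your TFRA version exposes a buffer_size keyword on save_to_file_system (the value here is an arbitrary illustration):

if hvd.rank() == 1:
    model.layers[0].params.save_to_file_system(
        dirpath="recall_debug_gpu1",
        proc_size=hvd.size(),
        proc_rank=hvd.rank(),
        buffer_size=1024 * 1024)  # assumed kwarg; a smaller staging buffer lowers peak GPU memory during the dump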

VikashPeddakota999 commented 10 months ago

@MoFHeka It's finally fixed, after removing tf.config.experimental.set_synchronous_execution(False) as @cmgreen210 suggested and updating the NVIDIA drivers. We can close it now.

MoFHeka commented 10 months ago

@VikashPeddakota999 set_synchronous_execution? This is very strange; that option should not cause the GPU kernels to execute asynchronously. Thank you for your efforts.