VikashPeddakota999 closed this issue 10 months ago.
Hi @VikashPeddakota999, thank you for your feedback. We will look into the issue ASAP and get back to you. Thank you!
Hi @VikashPeddakota999, could you check and set TensorFlow's GPUOptions configuration to enable HBM (GPU memory) growth? Please refer to the link. Thank you!
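For readers following along, a minimal sketch of what this suggestion amounts to in TF2 (both the environment-variable and the tf.config route; adapt the device handling to your own setup, the per-rank Horovod variant appears in the training script later in this thread):

import os
import tensorflow as tf

# Ask the TF GPU allocator to grow on demand instead of reserving
# nearly all HBM up front, leaving headroom for the TFRA save kernels.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)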
Thanks @rhdong. Here is an additional data point that might be of some help when training with multiple GPUs.
The following works (saving on GPU 0):

if hvd.rank() == 0:
    model.layers[0].params.save_to_file_system(dirpath="emb_weights_2layers_2g",
                                               proc_size=hvd.size(),
                                               proc_rank=hvd.rank())

but when trying to save on GPU 1 (as below), it fails with the same error:

if hvd.rank() == 1:
    model.layers[0].params.save_to_file_system(dirpath="emb_weights_2layers_2g",
                                               proc_size=hvd.size(),
                                               proc_rank=hvd.rank())
[1,1]: terminate called after throwing an instance of 'nv::CudaException'
[1,1]:   what():  tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op_gpu.cu.cc:413: CUDA error 2: out of memory
[1,1]: [27c929c21716:06036] Process received signal
[1,1]: [27c929c21716:06036] Signal: Aborted (6)
[1,1]: [27c929c21716:06036] Signal code: (-6)
[ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)
[ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)
[ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)
[ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)
[ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)
[ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)
[ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)
[ 7] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN2nv11cuda_check_E9cudaErrorPKci+0x246)
[ 8] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons6lookup27CuckooHashTableOfTensorsGpuIlfE20SaveToFileSystemImplEPNS_10FileSystemEmRKSsmbRP11CUstream_st+0x477)
[ 9] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons30HashTableSaveToFileSystemGpuOpIlfE7ComputeEPNS_15OpKernelContextE+0x29b)
sure @rhdong will try this. Thanks
You're welcome!
@rhdong I'm already using allow_growth=True. Is there any other specific config parameter I'm missing? I'm using the following configuration:

physical_devices = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(physical_devices[hvd.local_rank()], 'GPU')
tf.config.experimental.set_memory_growth(physical_devices[hvd.local_rank()], True)
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"  # VERY IMPORTANT!
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"
@rhdong @MoFHeka Attaching the training code I'm using below, in case you want to reproduce the issue:
from typing import List, Dict, Optional, Set, Tuple, get_type_hints
import os
import shutil
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_addons as tfa
from absl import flags
from absl import app
from tensorflow_recommenders_addons import dynamic_embedding as de
import horovod.tensorflow as hvd
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true" #VERY IMPORTANT!
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"
hvd.init()
if hvd.rank() > 0:
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(physical_devices[hvd.local_rank()], 'GPU')
tf.config.experimental.set_memory_growth(physical_devices[hvd.local_rank()],
True)
os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit'
tf.config.experimental.set_synchronous_execution(False)
data_config_file = "../config/data.conf"
num_shards_per_host = 16 #hvd.size()*2
batch_size = 4096
shuffle_buffer_size = 10000
def get_dataset(batch_size=1):
ds = tfds.load("movielens/1m-ratings",
split="train",
data_dir="/dataset",
download=False)
features = ds.map(
lambda x: ({
"movie_id":
tf.strings.to_number(x["movie_id"], tf.int64),
"movie_genres":
tf.cast(x["movie_genres"][0], tf.int64),
"user_id":
tf.strings.to_number(x["user_id"], tf.int64),
"user_gender":
tf.cast(x["user_gender"], tf.int64),
"user_occupation_label":
tf.cast(x["user_occupation_label"], tf.int64),
"timestamp":
tf.cast(x["timestamp"] - 880000000, tf.int64),
}, tf.one_hot(tf.cast(x["user_rating"], tf.int64), 5)))
shuffled = features.shuffle(1_000_000,
seed=2021,
reshuffle_each_iteration=False)
dataset = shuffled.batch(batch_size).prefetch(tf.data.AUTOTUNE).repeat()
return dataset
dataset = get_dataset(batch_size)
# for i,data in enumerate(dataset):
# print(data)
# break
# using just user_id and movie_id for testing, placed on GPU 0 and GPU 1 respectively
class ChannelEmbeddingLayers(tf.keras.Model):
def __init__(self,
name='',
dense_embedding_size=1,
sparse_embedding_size=1,
embedding_initializer=tf.keras.initializers.Zeros(),
mpi_size=1,
mpi_rank=0,
is_training = True):
self.devices = ["GPU:0", "GPU:1", "GPU:2", "GPU:3"] #physical devices name
if is_training:
de.enable_train_mode()
if embedding_initializer is None:
embedding_initializer = tf.keras.initializers.VarianceScaling()
else:
de.enable_inference_mode()
if embedding_initializer is None:
embedding_initializer = tf.keras.initializers.Zeros()
super(ChannelEmbeddingLayers, self).__init__()
# The saver parameter of kv_creator saves the K-V in the hash table into a separate KV file.
self.kv_creator1 = de.CuckooHashTableCreator(saver=de.FileSystemSaver(#save_path = "fs1",
proc_size=mpi_size, proc_rank=mpi_rank))
self.kv_creator2 = de.CuckooHashTableCreator(saver=de.FileSystemSaver(#save_path = "fs2",
proc_size=mpi_size, proc_rank=mpi_rank))
self.layer_1 = de.keras.layers.HvdAllToAllEmbedding(
mpi_size=mpi_size,
embedding_size=dense_embedding_size,
key_dtype=tf.int64,
value_dtype=tf.float32,
initializer=embedding_initializer,
devices=self.devices[0],
name=name + '_layer1',
bp_v2=True,
init_capacity=4500000,
kv_creator=self.kv_creator1)
self.layer_2 = de.keras.layers.HvdAllToAllEmbedding(
mpi_size=mpi_size,
embedding_size=sparse_embedding_size,
key_dtype=tf.int64,
value_dtype=tf.float32,
initializer=embedding_initializer,
devices=self.devices[1],
name=name + '_layer2',
init_capacity=4500000,
bp_v2=True,
kv_creator=self.kv_creator2)
self.dnn3 = tf.keras.layers.Dense(
5,
activation='softmax',
kernel_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1),
bias_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1))
self.tower1 = tf.keras.layers.Dense(
32,
activation='relu',
kernel_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1),
bias_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1))
self.tower2 = tf.keras.layers.Dense(
32,
activation='relu',
kernel_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1),
bias_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1))
def __call__(self, features_info, training=False):
### user tower
device_1_inputs = [features_info["user_id"]]
print("device_1_inputs: ", device_1_inputs)
device_1_output = self.layer_1(device_1_inputs)[0, :, :]
print("device_1_output: ", device_1_output)
tower1_output = self.tower1(device_1_output)
print("tower1_output: ", tower1_output)
device_2_inputs = [features_info["movie_id"]]
device_2_output =self.layer_2(device_2_inputs)[0, :, :]
tower2_output = self.tower2(device_2_output)
print("tower2_output: ", tower2_output)
embeddings_concat = tf.keras.layers.Concatenate(axis=1)([tower1_output, tower2_output])
print("embeddings_concat: ", embeddings_concat)
x = self.dnn3(embeddings_concat)
print("x: ", x)
return x
embedding_size = 32
model = ChannelEmbeddingLayers("recall", embedding_size, embedding_size,
tf.keras.initializers.RandomNormal(0.0, 0.5),
hvd.size(), hvd.rank())
optimizer = tfa.optimizers.LazyAdam(1E-3)
optimizer = de.DynamicEmbeddingOptimizer(optimizer)
model.compile(optimizer=optimizer,
loss=tf.keras.losses.BinaryCrossentropy()
)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir= "/data/recall_debug_tb", profile_batch = '3,12')
# horovod callback is used to broadcast the value generated by initializer of rank0.
hvd_opt_init_callback = de.keras.callbacks.DEHvdBroadcastGlobalVariablesCallback(
root_rank=0)
callbacks_list = [hvd_opt_init_callback]
if hvd.rank() == 0:
callbacks_list.extend([tensorboard_callback])
print("size model.layer_1.params: ", model.layer_1.params.size())
print("size model.layer_2.params: ", model.layer_2.params.size())
print("======= training with batchize =======", batch_size)
model.fit(dataset,
callbacks=callbacks_list,
epochs=1,
steps_per_epoch=1000,
verbose=1 if hvd.rank() == 0 else 0 ) #if hvd.rank() == 0 else 0
print("size model.layer_1.params: ", model.layer_1.params.size())
print("size model.layer_2.params: ", model.layer_2.params.size())
if hvd.rank() == 0:
print("saving in gpu 0")
model.layers[0].params.save_to_file_system(dirpath="recall_debug_gpu0",
proc_size=hvd.size(),
proc_rank=hvd.rank())
print("saving finished in gpu 0")
print("=======saving layer_1 emb========")
if hvd.rank() == 1:
print("saving in gpu 1")
model.layers[0].params.save_to_file_system(dirpath="recall_debug_gpu1",
proc_size=hvd.size(),
proc_rank=hvd.rank())
print("saving finished in gpu 1")
I'm able to save the layer weights on GPU 0 and can find them in the "recall_debug_gpu0" directory.
I get the following error while saving the weights on GPU 1:
[1,1]: terminate called after throwing an instance of 'nv::CudaException'
[1,1]:   what():  tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op_gpu.cu.cc:413: CUDA error 2: out of memory
[1,1]: [27c929c21716:33496] Process received signal
[1,1]: [27c929c21716:33496] Signal: Aborted (6)
[1,1]: [27c929c21716:33496] Signal code: (-6)
[ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)
[ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)
[ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)
[ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)
[ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)
[ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)
[ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)
[ 7] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN2nv11cuda_check_E9cudaErrorPKci+0x246)
[ 8] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons6lookup27CuckooHashTableOfTensorsGpuIlfE20SaveToFileSystemImplEPNS_10FileSystemEmRKSsmbRP11CUstream_st+0x477)
[ 9] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons30HashTableSaveToFileSystemGpuOpIlfE7ComputeEPNS_15OpKernelContextE+0x29b)
[10] /usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x2e0)
[11] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow17KernelAndDeviceOp3RunEPNS_19ScopedStepContainerERKNS_15EagerKernelArgsEPSt6vectorIN4absl12lts_202103247variantIJNS_6TensorENS_11TensorShapeEEEESaISC_EEPNS_19CancellationManagerERKNS8_8optionalINS_19EagerFunctionParamsEEERKNSI_INS_17ManagedStackTraceEEEPNS_24CoordinationServiceAgentE+0x97f)
[12] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow18EagerKernelExecuteEPNS_12EagerContextERKN4absl12lts_2021032413InlinedVectorIPNS_12TensorHandleELm4ESaIS6_EEERKNS3_8optionalINS_19EagerFunctionParamsEEERKSt10unique_ptrINS_15KernelAndDeviceENS_4core15RefCountDeleterEEPNS_14GraphCollectorEPNS_19CancellationManagerENS3_4SpanIS6_EERKNSB_INS_17ManagedStackTraceEEE+0x2dc)
[13] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow16AsyncExecuteNode3RunEv+0x189)
[14] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor7RunItemESt10unique_ptrINS0_8NodeItemENS_4core15RefCountDeleterEEb+0x456)
[15] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor3RunEv+0xfc)
[16] /usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x11e29b5)
[17] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)
[18] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)
[1,1]: [27c929c21716:33496] End of error message
I'm not able to understand the difference in behaviour of GPU0 and GPU1. Any help in resolving this would be highly appreciated. Thanks!
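One way to narrow this down (a diagnostic sketch, under the assumption that TF 2.5+ is available so tf.config.experimental.get_memory_info exists) is to print each rank's device-memory usage right before the save and check whether rank 1 really is close to its HBM limit when the dump buffers get allocated:

import horovod.tensorflow as hvd
import tensorflow as tf

# Each process has one visible GPU (set_visible_devices above), so it
# shows up as 'GPU:0' inside the process regardless of hvd.local_rank().
info = tf.config.experimental.get_memory_info('GPU:0')
print(f"rank {hvd.rank()}: current={info['current'] / 2**20:.1f} MiB, "
      f"peak={info['peak'] / 2**20:.1f} MiB")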
@VikashPeddakota999 What's more, please try running a sync op before the save begins and after it finishes; this prevents one worker from finishing (and tearing down) while another worker is still in the saving process. Examples: https://github.com/tensorflow/recommenders-addons/blob/master/tensorflow_recommenders_addons/dynamic_embedding/python/keras/callbacks.py and https://github.com/tensorflow/recommenders-addons/blob/master/tensorflow_recommenders_addons/dynamic_embedding/python/keras/models.py. You can also use the de.models.de_hvd_save_model function to save the model.
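A rough sketch of that suggestion (assuming your horovod.tensorflow build exposes hvd.join(), as discussed further down; an hvd.allreduce on a dummy tensor can serve as a fallback barrier):

import horovod.tensorflow as hvd

# Fence the per-rank saves so no worker tears down (and releases its GPU
# context) while another worker is still writing its shard.
hvd.join()  # every rank has finished training before anyone starts saving

model.layers[0].params.save_to_file_system(
    dirpath=f"recall_debug_gpu{hvd.rank()}",  # per-rank directory, as in the repro script
    proc_size=hvd.size(),
    proc_rank=hvd.rank())

hvd.join()  # keep all workers alive until the slowest save has finished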
@MoFHeka I got the same error, along with something related to Horovod init. Can we reopen the issue, please?
[1,0]: 2023-11-27 17:19:28.543524: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at mpi_ops.cc:1604 : FAILED_PRECONDITION: Horovod has not been initialized; use hvd.init().
[1,1]: terminate called after throwing an instance of 'nv::CudaException'
[1,1]:   what():  tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op_gpu.cu.cc:413: CUDA error 2: out of memory
[1,1]: [27c929c21716:87519] Process received signal
[1,1]: [27c929c21716:87519] Signal: Aborted (6)
[1,1]: [27c929c21716:87519] Signal code: (-6)
[ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)
[ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)
[ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)
[ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)
[ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)
[ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)
[ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)
[ 7] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN2nv11cuda_check_E9cudaErrorPKci+0x246)
[ 8] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons6lookup27CuckooHashTableOfTensorsGpuIlfE20SaveToFileSystemImplEPNS_10FileSystemEmRKSsmbRP11CUstream_st+0x477)
[ 9] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons30HashTableSaveToFileSystemGpuOpIlfE7ComputeEPNS_15OpKernelContextE+0x29b)
[10] /usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x2e0)
[11] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow17KernelAndDeviceOp3RunEPNS_19ScopedStepContainerERKNS_15EagerKernelArgsEPSt6vectorIN4absl12lts_202103247variantIJNS_6TensorENS_11TensorShapeEEEESaISC_EEPNS_19CancellationManagerERKNS8_8optionalINS_19EagerFunctionParamsEEERKNSI_INS_17ManagedStackTraceEEEPNS_24CoordinationServiceAgentE+0x97f)
[12] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow18EagerKernelExecuteEPNS_12EagerContextERKN4absl12lts_2021032413InlinedVectorIPNS_12TensorHandleELm4ESaIS6_EEERKNS3_8optionalINS_19EagerFunctionParamsEEERKSt10unique_ptrINS_15KernelAndDeviceENS_4core15RefCountDeleterEEPNS_14GraphCollectorEPNS_19CancellationManagerENS3_4SpanIS6_EERKNSB_INS_17ManagedStackTraceEEE+0x2dc)
[13] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow16AsyncExecuteNode3RunEv+0x189)
[14] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor7RunItemESt10unique_ptrINS0_8NodeItemENS_4core15RefCountDeleterEEb+0x456)
[15] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor3RunEv+0xfc)
[16] /usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x11e29b5)
[17] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)
[18] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)
[1,1]: [27c929c21716:87519] End of error message
I think the confusion stems from this line in the Keras Horovod demo, which sets

tf.config.experimental.set_synchronous_execution(False)

This (at least in my experimentation) makes calls like hvd.join() not behave as expected, i.e. they don't actually act as a barrier.
Hi @MoFHeka, I've noticed there might be a few remaining issues. If you have a moment, would you be able to address them?
@cmgreen210 I'm not sure whether hvd.join() would be disabled by set_synchronous_execution. According to this link, that setting only influences SyncExecutors. The hvd.join() operator is only used to make sure there are no file conflicts and that no worker dies before saving finishes. There is no while_loop or anything else that could trigger TensorFlow's automatic concurrency, and the save/restore ops are all synchronous kernels, with much of the surrounding code being plain Python outside the TF eager graph.
I may be wrong, but I think that hvd.join() should not be run before the save is complete.
@VikashPeddakota999 Please try setting a smaller buffer_size parameter on the save/restore function.
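For example (the default value and units of buffer_size are not stated in this thread, so the number below is illustrative only):

# A smaller dump buffer for the GPU hash table means a smaller transient
# device allocation during save_to_file_system.
model.layers[0].params.save_to_file_system(
    dirpath="recall_debug_gpu1",
    proc_size=hvd.size(),
    proc_rank=hvd.rank(),
    buffer_size=1024 * 1024)  # illustrative; tune to your HBM headroom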
@MoFHeka It's finally fixed after removing "tf.config.experimental.set_synchronous_execution(False)", as @cmgreen210 suggested, and updating the NVIDIA drivers. We can close it now.
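In other words, the working setup simply leaves eager execution synchronous; a minimal sketch of the change (the explicit call is optional, since synchronous execution is TensorFlow's default):

import tensorflow as tf

# Drop this line, taken from the demo-derived script:
# tf.config.experimental.set_synchronous_execution(False)

# Equivalent to removing the line above: keep eager ops synchronous so
# the save has completed before the process moves on.
tf.config.experimental.set_synchronous_execution(True)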
@VikashPeddakota999 set_synchronous_execution? That is very strange; this option should not cause the GPU kernel to execute asynchronously. Thank you for your efforts.
System information
Describe the bug
model.layers[0].params.save_to_file_system(dirpath="emb_weights_2layers_2g", proc_size=hvd.size(), proc_rank=hvd.rank())
[1,1]<stderr>: what(): tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op_gpu.cu.cc:413: CUDA error 2: out of memory [1,1]<stderr>:[e96db920062a:166347] *** Process received signal *** [1,1]<stderr>:[e96db920062a:166347] Signal: Aborted (6) [1,1]<stderr>:[e96db920062a:166347] Signal code: (-6) [1,1]<stderr>:[e96db920062a:166347] [ 0] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f876e93a090] [1,1]<stderr>:[e96db920062a:166347] [ 1] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f876e93a00b] [1,1]<stderr>:[e96db920062a:166347] [ 2] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f876e919859] [1,1]<stderr>:[e96db920062a:166347] [ 3] [1,1]<stderr>:/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f876d3ad911] [1,1]<stderr>:[e96db920062a:166347] [ 4] [1,1]<stderr>:/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f876d3b938c] [1,1]<stderr>:[e96db920062a:166347] [ 5] [1,1]<stderr>:/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7f876d3b93f7] [1,1]<stderr>:[e96db920062a:166347] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7f876d3b96a9] [1,1]<stderr>:[e96db920062a:166347] [ 7] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN2nv11cuda_check_E9cudaErrorPKci+0x246)[0x7f85ad3a17e6] [1,1]<stderr>:[e96db920062a:166347] [ 8] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons6lookup27CuckooHashTableOfTensorsGpuIlfE20SaveToFileSystemImplEPNS_10FileSystemEmRKSsmbRP11CUstream_st+0x477)[0x7f85ad3b5097] [1,1]<stderr>:[e96db920062a:166347] [ 9] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons30HashTableSaveToFileSystemGpuOpIlfE7ComputeEPNS_15OpKernelContextE+0x29b)[0x7f85ad3b5f9b] [1,1]<stderr>:[e96db920062a:166347] [10] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x2e0)[0x7f873d00bf40] [1,1]<stderr>:[e96db920062a:166347] [11] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow17KernelAndDeviceOp3RunEPNS_19ScopedStepContainerERKNS_15EagerKernelArgsEPSt6vectorIN4absl12lts_202103247variantIJNS_6TensorENS_11TensorShapeEEEESaISC_EEPNS_19CancellationManagerERKNS8_8optionalINS_19EagerFunctionParamsEEERKNSI_INS_17ManagedStackTraceEEEPNS_24CoordinationServiceAgentE+0x97f)[0x7f874a01381f] [1,1]<stderr>:[e96db920062a:166347] [12] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow18EagerKernelExecuteEPNS_12EagerContextERKN4absl12lts_2021032413InlinedVectorIPNS_12TensorHandleELm4ESaIS6_EEERKNS3_8optionalINS_19EagerFunctionParamsEEERKSt10unique_ptrINS_15KernelAndDeviceENS_4core15RefCountDeleterEEPNS_14GraphCollectorEPNS_19CancellationManagerENS3_4SpanIS6_EERKNSB_INS_17ManagedStackTraceEEE+0x2dc)[0x7f8743421ecc] [1,1]<stderr>:[e96db920062a:166347] [13] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow16AsyncExecuteNode3RunEv+0x189)[0x7f87434223c9] [1,1]<stderr>:[e96db920062a:166347] [14] 
[1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor7RunItemESt10unique_ptrINS0_8NodeItemENS_4core15RefCountDeleterEEb+0x456)[0x7f874a554626] [1,1]<stderr>:[e96db920062a:166347] [15] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor3RunEv+0xfc)[0x7f874a55742c] [1,1]<stderr>:[e96db920062a:166347] [16] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x11e29b5)[0x7f873d8389b5] [1,1]<stderr>:[e96db920062a:166347] [17] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f876e8dc609] [1,1]<stderr>:[e96db920062a:166347] [18] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f876ea16133]
The embedding size is pretty small, less than 30 MB (based on the folder size after running with a single GPU).