VikashPeddakota999 closed this issue 10 months ago.
Hi @VikashPeddakota999, thank you for your feedback. We will look into the issue ASAP and get back to you. Thank you!
Hi @VikashPeddakota999, could you check and set TensorFlow's GPUOptions configuration to enable HBM (GPU memory) growth? Please refer to the link. Thank you!
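For readers following along, a minimal sketch of what this suggestion amounts to in TF2 (both the environment-variable and the tf.config route; adapt the device handling to your own setup, the per-rank Horovod variant appears in the training script later in this thread):

import os
import tensorflow as tf

# Ask the TF GPU allocator to grow on demand instead of reserving
# nearly all HBM up front, leaving headroom for the TFRA save kernels.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)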
Thanks @rhdong. Here is an additional data point that might be of some help when training with multiple GPUs.
The following works (saving on GPU 0):

if hvd.rank() == 0:
    model.layers[0].params.save_to_file_system(dirpath="emb_weights_2layers_2g",
                                               proc_size=hvd.size(),
                                               proc_rank=hvd.rank())

but when trying to save on GPU 1 (as below), it fails with the same error:

if hvd.rank() == 1:
    model.layers[0].params.save_to_file_system(dirpath="emb_weights_2layers_2g",
                                               proc_size=hvd.size(),
                                               proc_rank=hvd.rank())
[1,1]: terminate called after throwing an instance of 'nv::CudaException'
[1,1]:   what():  tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op_gpu.cu.cc:413: CUDA error 2: out of memory
[1,1]: [27c929c21716:06036] Process received signal
[1,1]: [27c929c21716:06036] Signal: Aborted (6)
[1,1]: [27c929c21716:06036] Signal code: (-6)
[ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)
[ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)
[ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)
[ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)
[ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)
[ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)
[ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)
[ 7] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN2nv11cuda_check_E9cudaErrorPKci+0x246)
[ 8] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons6lookup27CuckooHashTableOfTensorsGpuIlfE20SaveToFileSystemImplEPNS_10FileSystemEmRKSsmbRP11CUstream_st+0x477)
[ 9] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons30HashTableSaveToFileSystemGpuOpIlfE7ComputeEPNS_15OpKernelContextE+0x29b)
sure @rhdong will try this. Thanks
You're welcome!
@rhdong I'm already using allow_growth=True. Is there any other specific config parameter I'm missing? I'm using the following configuration:

physical_devices = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(physical_devices[hvd.local_rank()], 'GPU')
tf.config.experimental.set_memory_growth(physical_devices[hvd.local_rank()], True)
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"  # VERY IMPORTANT!
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"
@rhdong @MoFHeka Attaching the training code I'm using below, in case you want to reproduce the issue:
from typing import List, Dict, Optional, Set, Tuple, get_type_hints
import os
import shutil
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_addons as tfa
from absl import flags
from absl import app
from tensorflow_recommenders_addons import dynamic_embedding as de
import horovod.tensorflow as hvd
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true" #VERY IMPORTANT!
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"
hvd.init()
if hvd.rank() > 0:
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(physical_devices[hvd.local_rank()], 'GPU')
tf.config.experimental.set_memory_growth(physical_devices[hvd.local_rank()],
True)
os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit'
tf.config.experimental.set_synchronous_execution(False)
data_config_file = "../config/data.conf"
num_shards_per_host = 16 #hvd.size()*2
batch_size = 4096
shuffle_buffer_size = 10000
def get_dataset(batch_size=1):
ds = tfds.load("movielens/1m-ratings",
split="train",
data_dir="/dataset",
download=False)
features = ds.map(
lambda x: ({
"movie_id":
tf.strings.to_number(x["movie_id"], tf.int64),
"movie_genres":
tf.cast(x["movie_genres"][0], tf.int64),
"user_id":
tf.strings.to_number(x["user_id"], tf.int64),
"user_gender":
tf.cast(x["user_gender"], tf.int64),
"user_occupation_label":
tf.cast(x["user_occupation_label"], tf.int64),
"timestamp":
tf.cast(x["timestamp"] - 880000000, tf.int64),
}, tf.one_hot(tf.cast(x["user_rating"], tf.int64), 5)))
shuffled = features.shuffle(1_000_000,
seed=2021,
reshuffle_each_iteration=False)
dataset = shuffled.batch(batch_size).prefetch(tf.data.AUTOTUNE).repeat()
return dataset
dataset = get_dataset(batch_size)
# for i,data in enumerate(dataset):
# print(data)
# break
# using just user_id and movie_id for testing, placed on GPU 0 and GPU 1 respectively
class ChannelEmbeddingLayers(tf.keras.Model):
def __init__(self,
name='',
dense_embedding_size=1,
sparse_embedding_size=1,
embedding_initializer=tf.keras.initializers.Zeros(),
mpi_size=1,
mpi_rank=0,
is_training = True):
self.devices = ["GPU:0", "GPU:1", "GPU:2", "GPU:3"] #physical devices name
if is_training:
de.enable_train_mode()
if embedding_initializer is None:
embedding_initializer = tf.keras.initializers.VarianceScaling()
else:
de.enable_inference_mode()
if embedding_initializer is None:
embedding_initializer = tf.keras.initializers.Zeros()
super(ChannelEmbeddingLayers, self).__init__()
# The saver parameter of kv_creator saves the K-V in the hash table into a separate KV file.
self.kv_creator1 = de.CuckooHashTableCreator(saver=de.FileSystemSaver(#save_path = "fs1",
proc_size=mpi_size, proc_rank=mpi_rank))
self.kv_creator2 = de.CuckooHashTableCreator(saver=de.FileSystemSaver(#save_path = "fs2",
proc_size=mpi_size, proc_rank=mpi_rank))
self.layer_1 = de.keras.layers.HvdAllToAllEmbedding(
mpi_size=mpi_size,
embedding_size=dense_embedding_size,
key_dtype=tf.int64,
value_dtype=tf.float32,
initializer=embedding_initializer,
devices=self.devices[0],
name=name + '_layer1',
bp_v2=True,
init_capacity=4500000,
kv_creator=self.kv_creator1)
self.layer_2 = de.keras.layers.HvdAllToAllEmbedding(
mpi_size=mpi_size,
embedding_size=sparse_embedding_size,
key_dtype=tf.int64,
value_dtype=tf.float32,
initializer=embedding_initializer,
devices=self.devices[1],
name=name + '_layer2',
init_capacity=4500000,
bp_v2=True,
kv_creator=self.kv_creator2)
self.dnn3 = tf.keras.layers.Dense(
5,
activation='softmax',
kernel_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1),
bias_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1))
self.tower1 = tf.keras.layers.Dense(
32,
activation='relu',
kernel_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1),
bias_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1))
self.tower2 = tf.keras.layers.Dense(
32,
activation='relu',
kernel_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1),
bias_initializer=tf.keras.initializers.RandomNormal(0.0, 0.1))
def __call__(self, features_info, training=False):
### user tower
device_1_inputs = [features_info["user_id"]]
print("device_1_inputs: ", device_1_inputs)
device_1_output = self.layer_1(device_1_inputs)[0, :, :]
print("device_1_output: ", device_1_output)
tower1_output = self.tower1(device_1_output)
print("tower1_output: ", tower1_output)
device_2_inputs = [features_info["movie_id"]]
device_2_output =self.layer_2(device_2_inputs)[0, :, :]
tower2_output = self.tower2(device_2_output)
print("tower2_output: ", tower2_output)
embeddings_concat = tf.keras.layers.Concatenate(axis=1)([tower1_output, tower2_output])
print("embeddings_concat: ", embeddings_concat)
x = self.dnn3(embeddings_concat)
print("x: ", x)
return x
embedding_size = 32
model = ChannelEmbeddingLayers("recall", embedding_size, embedding_size,
tf.keras.initializers.RandomNormal(0.0, 0.5),
hvd.size(), hvd.rank())
optimizer = tfa.optimizers.LazyAdam(1E-3)
optimizer = de.DynamicEmbeddingOptimizer(optimizer)
model.compile(optimizer=optimizer,
loss=tf.keras.losses.BinaryCrossentropy()
)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir= "/data/recall_debug_tb", profile_batch = '3,12')
# horovod callback is used to broadcast the value generated by initializer of rank0.
hvd_opt_init_callback = de.keras.callbacks.DEHvdBroadcastGlobalVariablesCallback(
root_rank=0)
callbacks_list = [hvd_opt_init_callback]
if hvd.rank() == 0:
callbacks_list.extend([tensorboard_callback])
print("size model.layer_1.params: ", model.layer_1.params.size())
print("size model.layer_2.params: ", model.layer_2.params.size())
print("======= training with batchize =======", batch_size)
model.fit(dataset,
callbacks=callbacks_list,
epochs=1,
steps_per_epoch=1000,
verbose=1 if hvd.rank() == 0 else 0 ) #if hvd.rank() == 0 else 0
print("size model.layer_1.params: ", model.layer_1.params.size())
print("size model.layer_2.params: ", model.layer_2.params.size())
if hvd.rank() == 0:
print("saving in gpu 0")
model.layers[0].params.save_to_file_system(dirpath="recall_debug_gpu0",
proc_size=hvd.size(),
proc_rank=hvd.rank())
print("saving finished in gpu 0")
print("=======saving layer_1 emb========")
if hvd.rank() == 1:
print("saving in gpu 1")
model.layers[0].params.save_to_file_system(dirpath="recall_debug_gpu1",
proc_size=hvd.size(),
proc_rank=hvd.rank())
print("saving finished in gpu 1")
I'm able to save the layer weights on GPU 0 and can find them in the "recall_debug_gpu0" directory.
I get the following error while saving the weights on GPU 1:
[1,1]: terminate called after throwing an instance of 'nv::CudaException'
[1,1]:   what():  tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op_gpu.cu.cc:413: CUDA error 2: out of memory
[1,1]: [27c929c21716:33496] Process received signal
[1,1]: [27c929c21716:33496] Signal: Aborted (6)
[1,1]: [27c929c21716:33496] Signal code: (-6)
[ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)
[ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)
[ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)
[ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)
[ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)
[ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)
[ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)
[ 7] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN2nv11cuda_check_E9cudaErrorPKci+0x246)
[ 8] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons6lookup27CuckooHashTableOfTensorsGpuIlfE20SaveToFileSystemImplEPNS_10FileSystemEmRKSsmbRP11CUstream_st+0x477)
[ 9] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons30HashTableSaveToFileSystemGpuOpIlfE7ComputeEPNS_15OpKernelContextE+0x29b)
[10] /usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x2e0)
[11] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow17KernelAndDeviceOp3RunEPNS_19ScopedStepContainerERKNS_15EagerKernelArgsEPSt6vectorIN4absl12lts_202103247variantIJNS_6TensorENS_11TensorShapeEEEESaISC_EEPNS_19CancellationManagerERKNS8_8optionalINS_19EagerFunctionParamsEEERKNSI_INS_17ManagedStackTraceEEEPNS_24CoordinationServiceAgentE+0x97f)
[12] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow18EagerKernelExecuteEPNS_12EagerContextERKN4absl12lts_2021032413InlinedVectorIPNS_12TensorHandleELm4ESaIS6_EEERKNS3_8optionalINS_19EagerFunctionParamsEEERKSt10unique_ptrINS_15KernelAndDeviceENS_4core15RefCountDeleterEEPNS_14GraphCollectorEPNS_19CancellationManagerENS3_4SpanIS6_EERKNSB_INS_17ManagedStackTraceEEE+0x2dc)
[13] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow16AsyncExecuteNode3RunEv+0x189)
[14] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor7RunItemESt10unique_ptrINS0_8NodeItemENS_4core15RefCountDeleterEEb+0x456)
[15] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor3RunEv+0xfc)
[16] /usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x11e29b5)
[17] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)
[18] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)
[1,1]: [27c929c21716:33496] End of error message
I'm not able to understand the difference in behaviour of GPU0 and GPU1. Any help in resolving this would be highly appreciated. Thanks!
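One way to narrow this down (a diagnostic sketch, under the assumption that TF 2.5+ is available so tf.config.experimental.get_memory_info exists) is to print each rank's device-memory usage right before the save and check whether rank 1 really is close to its HBM limit when the dump buffers get allocated:

import horovod.tensorflow as hvd
import tensorflow as tf

# Each process has one visible GPU (set_visible_devices above), so it
# shows up as 'GPU:0' inside the process regardless of hvd.local_rank().
info = tf.config.experimental.get_memory_info('GPU:0')
print(f"rank {hvd.rank()}: current={info['current'] / 2**20:.1f} MiB, "
      f"peak={info['peak'] / 2**20:.1f} MiB")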
@VikashPeddakota999 What's more, please try running a sync op before the save begins and after it finishes; this prevents one worker from finishing (and tearing down) while another worker is still in the saving process. Examples: https://github.com/tensorflow/recommenders-addons/blob/master/tensorflow_recommenders_addons/dynamic_embedding/python/keras/callbacks.py and https://github.com/tensorflow/recommenders-addons/blob/master/tensorflow_recommenders_addons/dynamic_embedding/python/keras/models.py. You can also use the de.models.de_hvd_save_model function to save the model.
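A rough sketch of that suggestion (assuming your horovod.tensorflow build exposes hvd.join(), as discussed further down; an hvd.allreduce on a dummy tensor can serve as a fallback barrier):

import horovod.tensorflow as hvd

# Fence the per-rank saves so no worker tears down (and releases its GPU
# context) while another worker is still writing its shard.
hvd.join()  # every rank has finished training before anyone starts saving

model.layers[0].params.save_to_file_system(
    dirpath=f"recall_debug_gpu{hvd.rank()}",  # per-rank directory, as in the repro script
    proc_size=hvd.size(),
    proc_rank=hvd.rank())

hvd.join()  # keep all workers alive until the slowest save has finished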
@MoFHeka I got the same error, along with something related to Horovod init. Can we reopen the issue, please?
[1,0]: 2023-11-27 17:19:28.543524: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at mpi_ops.cc:1604 : FAILED_PRECONDITION: Horovod has not been initialized; use hvd.init().
[1,1]: terminate called after throwing an instance of 'nv::CudaException'
[1,1]:   what():  tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op_gpu.cu.cc:413: CUDA error 2: out of memory
[1,1]: [27c929c21716:87519] Process received signal
[1,1]: [27c929c21716:87519] Signal: Aborted (6)
[1,1]: [27c929c21716:87519] Signal code: (-6)
[ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)
[ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)
[ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)
[ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)
[ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)
[ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)
[ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)
[ 7] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN2nv11cuda_check_E9cudaErrorPKci+0x246)
[ 8] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons6lookup27CuckooHashTableOfTensorsGpuIlfE20SaveToFileSystemImplEPNS_10FileSystemEmRKSsmbRP11CUstream_st+0x477)
[ 9] /usr/local/lib/python3.9/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons30HashTableSaveToFileSystemGpuOpIlfE7ComputeEPNS_15OpKernelContextE+0x29b)
[10] /usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x2e0)
[11] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow17KernelAndDeviceOp3RunEPNS_19ScopedStepContainerERKNS_15EagerKernelArgsEPSt6vectorIN4absl12lts_202103247variantIJNS_6TensorENS_11TensorShapeEEEESaISC_EEPNS_19CancellationManagerERKNS8_8optionalINS_19EagerFunctionParamsEEERKNSI_INS_17ManagedStackTraceEEEPNS_24CoordinationServiceAgentE+0x97f)
[12] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow18EagerKernelExecuteEPNS_12EagerContextERKN4absl12lts_2021032413InlinedVectorIPNS_12TensorHandleELm4ESaIS6_EEERKNS3_8optionalINS_19EagerFunctionParamsEEERKSt10unique_ptrINS_15KernelAndDeviceENS_4core15RefCountDeleterEEPNS_14GraphCollectorEPNS_19CancellationManagerENS3_4SpanIS6_EERKNSB_INS_17ManagedStackTraceEEE+0x2dc)
[13] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow16AsyncExecuteNode3RunEv+0x189)
[14] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor7RunItemESt10unique_ptrINS0_8NodeItemENS_4core15RefCountDeleterEEb+0x456)
[15] /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor3RunEv+0xfc)
[16] /usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x11e29b5)
[17] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)
[18] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)
[1,1]: [27c929c21716:87519] End of error message
I think the confusion stems from this line in the Keras Horovod demo, which sets

tf.config.experimental.set_synchronous_execution(False)

This (at least in my experimentation) makes calls like hvd.join() not behave as expected, i.e. they don't actually act as a barrier.
Hi @MoFHeka, I've noticed there might be a few remaining issues. If you have a moment, would you be able to address them?
@cmgreen210 I'm not sure whether hvd.join() would be disabled by set_synchronous_execution. According to this link, that setting only influences SyncExecutors. The hvd.join() operator is only used to make sure there are no file conflicts and that no worker dies before saving finishes. There is no while_loop or anything else that could trigger TensorFlow's automatic concurrency, and the save/restore ops are all synchronous kernels, with much of the surrounding code being plain Python outside the TF eager graph.
I may be wrong, but I think that hvd.join() should not be run before the save is complete.
@VikashPeddakota999 Please try setting a smaller buffer_size parameter on the save/restore function.
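For example (the default value and units of buffer_size are not stated in this thread, so the number below is illustrative only):

# A smaller dump buffer for the GPU hash table means a smaller transient
# device allocation during save_to_file_system.
model.layers[0].params.save_to_file_system(
    dirpath="recall_debug_gpu1",
    proc_size=hvd.size(),
    proc_rank=hvd.rank(),
    buffer_size=1024 * 1024)  # illustrative; tune to your HBM headroom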
@MoFHeka It's finally fixed after removing "tf.config.experimental.set_synchronous_execution(False)", as @cmgreen210 suggested, and updating the NVIDIA drivers. We can close it now.
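In other words, the working setup simply leaves eager execution synchronous; a minimal sketch of the change (the explicit call is optional, since synchronous execution is TensorFlow's default):

import tensorflow as tf

# Drop this line, taken from the demo-derived script:
# tf.config.experimental.set_synchronous_execution(False)

# Equivalent to removing the line above: keep eager ops synchronous so
# the save has completed before the process moves on.
tf.config.experimental.set_synchronous_execution(True)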
@VikashPeddakota999 set_synchronous_execution? That is very strange; this option should not cause the GPU kernel to execute asynchronously. Thank you for your efforts.
System information
Describe the bug
model.layers[0].params.save_to_file_system(dirpath="emb_weights_2layers_2g", proc_size=hvd.size(), proc_rank=hvd.rank())
[1,1]<stderr>: what(): tensorflow_recommenders_addons/dynamic_embedding/core/kernels/cuckoo_hashtable_op_gpu.cu.cc:413: CUDA error 2: out of memory [1,1]<stderr>:[e96db920062a:166347] *** Process received signal *** [1,1]<stderr>:[e96db920062a:166347] Signal: Aborted (6) [1,1]<stderr>:[e96db920062a:166347] Signal code: (-6) [1,1]<stderr>:[e96db920062a:166347] [ 0] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f876e93a090] [1,1]<stderr>:[e96db920062a:166347] [ 1] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f876e93a00b] [1,1]<stderr>:[e96db920062a:166347] [ 2] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f876e919859] [1,1]<stderr>:[e96db920062a:166347] [ 3] [1,1]<stderr>:/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f876d3ad911] [1,1]<stderr>:[e96db920062a:166347] [ 4] [1,1]<stderr>:/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f876d3b938c] [1,1]<stderr>:[e96db920062a:166347] [ 5] [1,1]<stderr>:/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7f876d3b93f7] [1,1]<stderr>:[e96db920062a:166347] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7f876d3b96a9] [1,1]<stderr>:[e96db920062a:166347] [ 7] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN2nv11cuda_check_E9cudaErrorPKci+0x246)[0x7f85ad3a17e6] [1,1]<stderr>:[e96db920062a:166347] [ 8] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons6lookup27CuckooHashTableOfTensorsGpuIlfE20SaveToFileSystemImplEPNS_10FileSystemEmRKSsmbRP11CUstream_st+0x477)[0x7f85ad3b5097] [1,1]<stderr>:[e96db920062a:166347] [ 9] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow_recommenders_addons/dynamic_embedding/core/_cuckoo_hashtable_ops.so(_ZN10tensorflow19recommenders_addons30HashTableSaveToFileSystemGpuOpIlfE7ComputeEPNS_15OpKernelContextE+0x29b)[0x7f85ad3b5f9b] [1,1]<stderr>:[e96db920062a:166347] [10] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x2e0)[0x7f873d00bf40] [1,1]<stderr>:[e96db920062a:166347] [11] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow17KernelAndDeviceOp3RunEPNS_19ScopedStepContainerERKNS_15EagerKernelArgsEPSt6vectorIN4absl12lts_202103247variantIJNS_6TensorENS_11TensorShapeEEEESaISC_EEPNS_19CancellationManagerERKNS8_8optionalINS_19EagerFunctionParamsEEERKNSI_INS_17ManagedStackTraceEEEPNS_24CoordinationServiceAgentE+0x97f)[0x7f874a01381f] [1,1]<stderr>:[e96db920062a:166347] [12] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow18EagerKernelExecuteEPNS_12EagerContextERKN4absl12lts_2021032413InlinedVectorIPNS_12TensorHandleELm4ESaIS6_EEERKNS3_8optionalINS_19EagerFunctionParamsEEERKSt10unique_ptrINS_15KernelAndDeviceENS_4core15RefCountDeleterEEPNS_14GraphCollectorEPNS_19CancellationManagerENS3_4SpanIS6_EERKNSB_INS_17ManagedStackTraceEEE+0x2dc)[0x7f8743421ecc] [1,1]<stderr>:[e96db920062a:166347] [13] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow16AsyncExecuteNode3RunEv+0x189)[0x7f87434223c9] [1,1]<stderr>:[e96db920062a:166347] [14] 
[1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor7RunItemESt10unique_ptrINS0_8NodeItemENS_4core15RefCountDeleterEEb+0x456)[0x7f874a554626] [1,1]<stderr>:[e96db920062a:166347] [15] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor3RunEv+0xfc)[0x7f874a55742c] [1,1]<stderr>:[e96db920062a:166347] [16] [1,1]<stderr>:/usr/local/lib/python3.8/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x11e29b5)[0x7f873d8389b5] [1,1]<stderr>:[e96db920062a:166347] [17] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f876e8dc609] [1,1]<stderr>:[e96db920062a:166347] [18] [1,1]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f876ea16133]
The embedding size is pretty small, less than 30 MB (based on the folder size after running with a single GPU).