tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

tensorflow not utilizing gpu memory and stating limit is 137gb #64764

Closed ripjohnbrown1859 closed 1 week ago

ripjohnbrown1859 commented 1 month ago

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

binary

TensorFlow version

tf 2.16.1

Custom code

Yes

OS platform and distribution

WSL2 Ubuntu 22.04

Mobile device

No response

Python version

3.10

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

cuDNN 8.6.0

GPU model and memory

2x Titan X (Maxwell) in SLI

Current behavior?

I have two Titan X Maxwells and am trying to run a CNN on my machine in WSL2. However, when I try to run it I get the attached error, which seems to indicate that TensorFlow is not using any GPU memory before throwing an error. It runs out of memory while converting the training input tensor.

Standalone code to reproduce the issue

import numpy as np
import tensorflow as tf
from tensorflow.keras.utils import pad_sequences

# get_data() is the user's own loader for the PhysioNet ECG records (not shown here).
if __name__ == '__main__':
    parent_directory = 'physionet.org/files/ecg-arrhythmia/1.0.0/WFDBRecords'
    input_dataset, output_dataset = get_data(parent_directory)
    print('padding sequences')        
    output_dataset = pad_sequences(output_dataset, maxlen=64, padding = "post")
    print(len(input_dataset))
    #output_dataset = to_categorical(output_dataset, num_classes = 64)
    training_input_dataset = input_dataset[:40000]
    training_output_dataset = output_dataset[:40000]
    val_input_dataset = input_dataset[40000:]
    val_output_dataset = output_dataset[40000:]
    #dataset = process_dataset(dataset)
    print(output_dataset[1])
    print('converting training input data to tensor') 
    training_input_data = tf.convert_to_tensor(training_input_dataset,dtype=tf.float32)
    print('converting training output data to tensor')
    training_output_data = tf.convert_to_tensor(training_output_dataset,dtype=tf.float32)
    print('converting validation input data to tensor')
    val_input_data = tf.convert_to_tensor(val_input_dataset,dtype=tf.float32)
    print('converting validation output data to tensor')
    val_output_data = tf.convert_to_tensor(val_output_dataset,dtype=tf.float32)

    training_input_data = np.expand_dims(training_input_data, axis=-1)
    val_input_data = np.expand_dims(val_input_data, axis=-1)
    training_output_data = np.expand_dims(training_output_data, axis=-1)
    val_output_data = np.expand_dims(val_output_data, axis=-1)
    #training_output_data = np.expand_dims(training_output_data, axis=-1)
    #val_output_data = np.expand_dims(val_output_data, axis=-1)
    print(training_output_data.shape)
    print(training_input_data.shape)
    #training_output_data = tf.reshape(training_output_data, 32, 5000, 12)
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, (1, 1), activation='relu', input_shape=( 12, 5000, 1)), # Adjust input shape based on your data
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(128, (1, 1), activation='relu'),
        tf.keras.layers.MaxPooling2D((2, 2)),
        #tf.keras.layers.Conv2D(128, (1, 1), activation='relu'),
        #tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        #tf.keras.layers.Dense(64, activation='relu'),
        #tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(64)  # Number of classes
    ])
    print(model.summary())
    model.compile(optimizer='adam', loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True), metrics=['accuracy'])

# Train the model
    num_epochs = 10
    model.fit(training_input_data, training_output_data, batch_size=32, epochs=num_epochs, validation_data=(val_input_data, val_output_data))

    test_loss, test_acc = model.evaluate(val_input_data, val_output_data, verbose=2)
    print('\nTest accuracy:', test_acc)
    #  export NVIDIA_DIR=$(dirname $(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)")))
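
The pinned-host allocation failure in the log below is triggered by the tf.convert_to_tensor calls above. For reference, the allocator warning in that log suggests trying the async CUDA allocator; a minimal sketch of applying that hint (an untested workaround taken directly from the log message, not a confirmed fix) would be:

# Workaround named in the allocator warning below (untested): switch to the
# async CUDA allocator before TensorFlow initializes the GPU.
import os
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

import tensorflow as tf  # import only after the environment variable is set
print(tf.config.list_physical_devices("GPU"))  # sanity-check that both GPUs are visible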

Relevant log output

2024-03-29 16:39:36.771397: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1629] failed to alloc 2147483648 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-03-29 16:39:36.771454: W external/local_xla/xla/stream_executor/integrations/device_host_allocator.h:51] could not allocate pinned host memory of size: 2147483648
2024-03-29 16:39:37.984270: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1629] failed to alloc 1932735232 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-03-29 16:39:37.984317: W external/local_xla/xla/stream_executor/integrations/device_host_allocator.h:51] could not allocate pinned host memory of size: 1932735232
2024-03-29 16:39:39.200352: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1629] failed to alloc 2147483648 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-03-29 16:39:39.200399: W external/local_xla/xla/stream_executor/integrations/device_host_allocator.h:51] could not allocate pinned host memory of size: 2147483648
2024-03-29 16:39:50.425728: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1629] failed to alloc 2147483648 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-03-29 16:39:50.425788: W external/local_xla/xla/stream_executor/integrations/device_host_allocator.h:51] could not allocate pinned host memory of size: 2147483648
2024-03-29 16:39:51.809074: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1629] failed to alloc 2147483648 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-03-29 16:39:51.809138: W external/local_xla/xla/stream_executor/integrations/device_host_allocator.h:51] could not allocate pinned host memory of size: 2147483648
2024-03-29 16:39:51.809162: W external/local_tsl/tsl/framework/bfc_allocator.cc:487] Allocator (gpu_host_bfc) ran out of memory trying to allocate 1.77GiB (rounded to 1896960000)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2024-03-29 16:39:51.809173: I external/local_tsl/tsl/framework/bfc_allocator.cc:1044] BFCAllocator dump for gpu_host_bfc
2024-03-29 16:39:51.809200: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (256):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809228: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (512):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809236: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (1024):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809242: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (2048):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809247: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (4096):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809252: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (8192):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809257: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (16384):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809264: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (32768):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809287: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (65536):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809297: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (131072):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809320: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (262144):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809329: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (524288):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809352: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (1048576):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809360: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (2097152):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809365: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (4194304):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809371: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (8388608):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809376: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (16777216):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809381: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (33554432):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809386: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (67108864):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809391: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (134217728):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809397: I external/local_tsl/tsl/framework/bfc_allocator.cc:1051] Bin (268435456):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2024-03-29 16:39:51.809404: I external/local_tsl/tsl/framework/bfc_allocator.cc:1067] Bin for 1.77GiB was 256.00MiB, Chunk State: 
2024-03-29 16:39:51.809426: I external/local_tsl/tsl/framework/bfc_allocator.cc:1105]      Summary of in-use Chunks by size: 
2024-03-29 16:39:51.809436: I external/local_tsl/tsl/framework/bfc_allocator.cc:1112] Sum Total of in-use chunks: 0B
2024-03-29 16:39:51.809442: I external/local_tsl/tsl/framework/bfc_allocator.cc:1114] Total bytes in pool: 0 memory_limit_: 137438953472 available bytes: 137438953472 curr_region_allocation_bytes_: 2147483648
2024-03-29 16:39:51.809449: I external/local_tsl/tsl/framework/bfc_allocator.cc:1119] Stats: 
Limit:                    137438953472
InUse:                               0
MaxInUse:                            0
NumAllocs:                           0
MaxAllocSize:                        0
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2024-03-29 16:39:51.809472: W external/local_tsl/tsl/framework/bfc_allocator.cc:499] <allocator contains no memory>
Segmentation fault
tilakrayal commented 1 month ago

@ripjohnbrown1859, Could you please provide the steps you followed to install TensorFlow 2.16, along with the CUDA, cuDNN, and Bazel versions and the environment you are using? That will help us analyze the issue more effectively. Thank you!
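
A minimal sketch for gathering most of that environment information from the installed TensorFlow build (version, CUDA/cuDNN build info, and visible GPUs); it assumes TensorFlow is importable in the environment in question:

# Collect environment details to include in the report.
import tensorflow as tf

build = tf.sysconfig.get_build_info()
print("TensorFlow:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("CUDA / cuDNN (build):", build.get("cuda_version"), "/", build.get("cudnn_version"))
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))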

ripjohnbrown1859 commented 1 month ago

I managed to fix this specific problem by processing the data on the CPU under 'with tf.device('/CPU:0'):' and training under 'with strategy'. Now I have a problem where every other epoch reports a bunch of rendezvous errors, skips a bunch of data, and gives a validation accuracy of 1 with an increasingly high loss. My epochs are also reporting accuracy greater than 1. Should I open a new issue?
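
For reference, a rough sketch of the approach described above: the large tensor conversions are pinned to the CPU, and the model is built under a distribution strategy. MirroredStrategy is an assumption here (the comment only says 'strategy'), the placeholder arrays stand in for the ECG data returned by get_data(), and the model is a trimmed version of the Sequential model from the original script:

# Sketch of the described fix: keep the big tensor conversions on the CPU,
# then create and train the model under a distribution strategy.
import numpy as np
import tensorflow as tf

# Tiny placeholder arrays standing in for the ECG data from get_data().
training_input_dataset = np.zeros((8, 12, 5000, 1), dtype=np.float32)
training_output_dataset = np.zeros((8, 64), dtype=np.float32)

with tf.device('/CPU:0'):
    # Conversions happen in host memory instead of pinned GPU-host buffers.
    training_input_data = tf.convert_to_tensor(training_input_dataset, dtype=tf.float32)
    training_output_data = tf.convert_to_tensor(training_output_dataset, dtype=tf.float32)

strategy = tf.distribute.MirroredStrategy()  # assumed; the comment above only says "strategy"
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(12, 5000, 1)),
        tf.keras.layers.Conv2D(64, (1, 1), activation='relu'),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64),
    ])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])

model.fit(training_input_data, training_output_data, batch_size=4, epochs=1)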

tilakrayal commented 3 weeks ago

@ripjohnbrown1859, Glad the GPU issue was resolved. Regarding the validation loss: if it starts increasing while the training loss keeps dropping, that indicates overfitting. Set the number of epochs high enough, monitor the error rates, and terminate training once the validation loss stops improving; as long as it keeps dropping, training should continue until the model converges, ideally to a low val_loss.

Also, please take a look at this reference: https://discuss.tensorflow.org/t/why-does-my-validation-loss-increase-but-validation-accuracy-perfectly-matches-training-accuracy/4283
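
One concrete way to terminate training based on the error rates, as suggested above, is an EarlyStopping callback that stops once the validation loss has not improved for a few epochs (illustrative only; the patience value is an arbitrary choice):

# Stop training when val_loss stops improving and keep the best weights seen so far.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True)

# model.fit(training_input_data, training_output_data, batch_size=32, epochs=num_epochs,
#           validation_data=(val_input_data, val_output_data), callbacks=[early_stop])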

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 1 week ago

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

google-ml-butler[bot] commented 1 week ago

Are you satisfied with the resolution of your issue?