Julia90 opened this issue 2 years ago
Hey @Julia90, were you able to solve the issue? I am facing a similar one.
Hi @TahaRazzaq, I don't think I ever found a real solution. I manually reduced the batch size (or the size of the dataset sent for computation), and then it worked. I didn't look into the exact cause, but it seems to be a hardware memory limit; with this Transformer encoder it is very easy to blow up GPU memory. As a side note, many researchers working on Transformers train with mixed precision or half precision, and I suspect that is related to this issue.
I did try reducing the batch size, but that didn't really help. I'll check whether memory is exploding at some point by tweaking the dataset. Thank you!
I've encountered similar issues in the past, and I'd love to share some steps that have worked for me when dealing with CUDA and cuDNN initialization errors, especially on AWS EC2 instances with an NVIDIA A10G. Here is a set of solutions you can try to resolve these errors :)
1. Update NVIDIA Drivers and CUDA/cuDNN. First, make sure your NVIDIA drivers and your CUDA/cuDNN installations are up to date and compatible with the TensorFlow/JAX versions you're using. Here's how I usually update my NVIDIA drivers on an Ubuntu system:
# Check your current driver version
nvidia-smi
# Update to the recommended driver version
sudo apt-get update
sudo apt-get install --no-install-recommends nvidia-driver-510
# Reboot to load the new driver
sudo reboot
# After rebooting, check the updated driver version
nvidia-smi
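As a quick sanity check (a minimal sketch, assuming a TensorFlow 2.x install), you can compare the CUDA/cuDNN versions TensorFlow was compiled against with what the driver reports in nvidia-smi:
import tensorflow as tf
# Versions TensorFlow was compiled against; keys include 'cuda_version' and 'cudnn_version'
build_info = tf.sysconfig.get_build_info()
print("Built with CUDA:", build_info.get("cuda_version"))
print("Built with cuDNN:", build_info.get("cudnn_version"))
# The "CUDA Version" in the top-right of nvidia-smi is the highest version the driver
# supports; it should be >= the CUDA version TensorFlow was built with.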
2. Verify CUDA and cuDNN Setup. It's crucial that your CUDA and cuDNN paths are correctly set so that TensorFlow or JAX can locate them. I often ensure they're set like this:
# Add CUDA and cuDNN to your path
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/include
export CUDA_HOME=/usr/local/cuda
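To confirm the paths are actually picked up, I usually check that both frameworks can see the GPU (a minimal sketch, assuming TensorFlow and JAX are installed in the same environment):
import tensorflow as tf
import jax
# Both calls should list the A10G if CUDA/cuDNN are located correctly
print("TensorFlow GPUs:", tf.config.list_physical_devices('GPU'))
print("JAX devices:", jax.devices())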
3. Optimize GPU Memory Usage. GPU memory can fill up quickly, which might be part of the problem. Here's how I manage TensorFlow's memory usage to prevent it from hogging all available GPU memory:
import tensorflow as tf
# Check available GPU devices
print("Available GPU devices:", tf.config.list_physical_devices('GPU'))
# Enable memory growth for each GPU
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
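Since vit_jax.ipynb runs on JAX rather than TensorFlow, the JAX side matters here too: by default JAX preallocates most of the GPU memory up front. A minimal sketch of how to relax that (the 0.7 fraction is just an illustrative value, and the variables must be set before JAX first touches the GPU):
import os
# Disable JAX's default preallocation of ~90% of GPU memory,
# or cap it at a fraction instead (pick one; 0.7 is an arbitrary example)
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.7"
import jax  # import JAX only after setting the flags
print(jax.devices())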
4. Implement Mixed Precision Training. Using mixed precision training has been a game-changer for running larger models or batches. It can significantly reduce memory consumption:
from tensorflow.keras import mixed_precision
# Set up mixed precision (global policy API, TF 2.4+)
mixed_precision.set_global_policy('mixed_float16')
# Your model will now automatically utilize mixed precision
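For what it's worth, here's a minimal sketch of how a Keras model picks up the policy once it is set; the layer sizes are arbitrary, and keeping the final softmax in float32 follows the usual mixed-precision guidance to avoid numeric issues:
import tensorflow as tf
# Under 'mixed_float16', layers compute in float16 but keep float32 variables
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10),
    # Keep the final activation in float32 for numerical stability
    tf.keras.layers.Activation('softmax', dtype='float32'),
])
print(model.layers[0].compute_dtype, model.layers[0].dtype)  # float16 compute, float32 variables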
5. Dynamically Adjust Batch Size. If you're still hitting memory limits, you might want to dynamically adjust your batch sizes based on the available memory:
def adjust_batch_size(available_memory, model_memory_usage, base_batch_size):
    """
    Adjust the batch size based on available GPU memory and the model's memory usage.
    """
    max_batch_size = int(available_memory / model_memory_usage)
    return min(max_batch_size, base_batch_size)
# Example usage
current_batch_size = adjust_batch_size(23028, 22723, 64) # Example values
print("Adjusted Batch Size:", current_batch_size)
Hope this helps!
Virtual Machine: AWS EC2 g5.2xlarge, NVIDIA A10G GPU, 24 GB GPU memory. When running vit_jax.ipynb and calling get_accuracy(params_repl) in the Evaluation section, I got the following error:
I noticed via nvidia-smi that the GPU memory is almost exhausted:

Every 0.1s: nvidia-smi                                  ip-10-116-53-149: Thu Jul 14 03:53:53 2022

Thu Jul 14 03:53:53 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   32C    P0    69W / 300W |  22727MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1721      C   ...tu/vit/vit_env/bin/python    22723MiB |
+-----------------------------------------------------------------------------+
Here are my packages:
I've spent almost a week on this issue without finding a solution. Could anyone help?