google-research / l2p

Learning to Prompt (L2P) for Continual Learning @ CVPR22 and DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning @ ECCV22
https://arxiv.org/pdf/2112.08654.pdf
Apache License 2.0

RESOURCE_EXHAUSTED: Out of memory while trying to allocate # bytes. #20

Closed vgthengane closed 2 years ago

vgthengane commented 2 years ago

Hi, I am trying to run the model on the CIFAR-100 dataset with 4 Tesla V100 GPUs, and I am getting the following error.

2022-08-05 10:00:26.833817: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 9.00MiB (rounded to 9437184)requested by op 
2022-08-05 10:00:26.835182: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:491] *********************************************************************************x**************x***
2022-08-05 10:00:26.835281: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2130] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 9437184 bytes.
BufferAssignment OOM Debugging.

The complete run logs can be found here. Could you please help me resolve this issue?
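
For context, JAX preallocates a large fraction of GPU memory at startup by default, so an OOM like the one above can be triggered even by a small allocation request. Below is a minimal sketch of the standard JAX/XLA environment variables that control this behaviour; they are general JAX settings, not something specific to this repo, and must be set before JAX initializes the GPU backend.

import os

# Option 1: disable JAX's upfront preallocation of most GPU memory and
# allocate on demand instead (avoids the large initial reservation).
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

# Option 2 (alternative): keep preallocation but cap the fraction of
# GPU memory JAX reserves.
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.7"

import jax  # Must be imported after the environment variables are set.
print(jax.local_devices())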

===============

For your information, I was initially getting a RuntimeError: Visible devices cannot be modified after being initialized error. Hence, I added the following code snippet (adapted from https://www.tensorflow.org/guide/gpu) to main.py, which fixed that error.

"""Main file for running the example."""

import os
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

from absl import flags
import tensorflow as tf
# ... remaining imports from the original main.py ...

FLAGS = flags.FLAGS
...

def main(argv):
  del argv

  # Hide any GPUs from TensorFlow. Otherwise TF might reserve memory and make
  # it unavailable to JAX.
  # tf.config.experimental.set_visible_devices([], "GPU")

  gpus = tf.config.list_physical_devices('GPU')
  if gpus:
    # Restrict TensorFlow to only use the first GPU
    try:
      tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
      logical_gpus = tf.config.list_logical_devices('GPU')
      print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
      # Visible devices must be set before GPUs have been initialized
      print(e)

  # if gpus:
  #   # Create 2 virtual GPUs with 1GB memory each
  #   try:
  #     tf.config.set_logical_device_configuration(
  #         gpus[0],
  #         [tf.config.LogicalDeviceConfiguration(memory_limit=1024),
  #         tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
  #     logical_gpus = tf.config.list_logical_devices('GPU')
  #     print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
  #   except RuntimeError as e:
  #     # Virtual devices must be set before GPUs have been initialized
  #     print(e)

  if FLAGS.exp_id:
     ...
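
As an alternative to restricting TF to a single GPU, the same TensorFlow GPU guide describes enabling memory growth, so that TF grabs GPU memory only as needed and leaves the rest for JAX. A minimal sketch of that variant (not tested against this repo's main.py, shown only as an option):

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    # Allocate GPU memory incrementally instead of reserving (nearly)
    # all of it at startup, leaving the remainder available to JAX.
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized.
    print(e)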