google-research / bleurt

BLEURT is a metric for Natural Language Generation based on transfer learning.
https://arxiv.org/abs/2004.04696
Apache License 2.0

BLEURT consumes all available memory on checkpoint load? #54

Closed versipellis closed 4 months ago

versipellis commented 2 years ago

Not quite sure what's happening here. I'm running CUDA 11.6 and TensorFlow 2.10.0, and no matter which checkpoint I use, all available GPU memory is consumed on load. Minimal reproducible example below:

import os
from bleurt import score

# Path to the downloaded BLEURT checkpoints.
bleurtcheckpoints = os.path.join(os.getcwd(), "bleurtcktpts")
checkpoint = os.path.join(bleurtcheckpoints, "bleurt-tiny-128/")
scorer = score.BleurtScorer(checkpoint)
INFO:tensorflow:Reading checkpoint /data/visualization/vis-text/datasets/vis-text/bleurtcktpts/bleurt-tiny-128/.
INFO:tensorflow:Config file found, reading.
INFO:tensorflow:Will load checkpoint bert_custom
INFO:tensorflow:Loads full paths and checks that files exists.
INFO:tensorflow:... name:bert_custom
INFO:tensorflow:... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
INFO:tensorflow:... max_seq_length:128
INFO:tensorflow:Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
INFO:tensorflow:Loading model.

2022-10-25 21:53:54.567812: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:54.690050: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:54.691074: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:54.692770: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-25 21:53:54.693762: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:54.694475: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:54.695136: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:56.420924: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:56.422112: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:56.423287: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:56.424176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 11413 MB memory:  -> device: 0, name: NVIDIA TITAN Xp, pci bus id: 0000:00:05.0, compute capability: 6.1

INFO:tensorflow:BLEURT initialized.

nvidia-smi results after loading (right before loading, it shows only 4 MiB in use):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN Xp     On   | 00000000:00:05.0 Off |                  N/A |
| 23%   30C    P2    58W / 250W |  11697MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     16853      C   /usr/bin/python3.8              11693MiB |
+-----------------------------------------------------------------------------+
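For context, this matches TensorFlow's default allocation strategy rather than anything BLEURT-specific: by default, TensorFlow reserves nearly all memory on every visible GPU as soon as a GPU device is first used. A minimal sketch, independent of BLEURT and purely illustrative, that produces the same nvidia-smi footprint:

import tensorflow as tf

# Any first use of a GPU triggers TensorFlow's default allocator, which maps
# nearly all free memory on every visible GPU, regardless of tensor size.
with tf.device("GPU:0"):
    x = tf.random.uniform((4, 4))

print(x.device)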
vergilus commented 4 months ago

You can initialize TensorFlow's GPU devices with dynamic memory allocation (memory growth) before loading the checkpoint:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    # Allocate GPU memory on demand instead of reserving it all up front.
    tf.config.experimental.set_memory_growth(gpu, True)
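If changing code isn't convenient, setting the TF_FORCE_GPU_ALLOW_GROWTH environment variable before TensorFlow initializes the GPUs should have the same effect. A minimal sketch; the import ordering is deliberate:

import os

# Must be set before TensorFlow initializes any GPU device.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

import tensorflow as tf  # imported after the env var on purpose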

Then, you can specify a device for BLEURT loading:

# `rank` and `args.bleurt_ckpt` come from the surrounding script:
# the index of the target GPU and the path to the BLEURT checkpoint.
with tf.device(f"GPU:{rank}"):
    self.scorer = score.BleurtScorer(args.bleurt_ckpt)

Note that loading TensorFlow initializes all visible devices, so you have to enable dynamic memory allocation (memory growth) on every visible device.
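Relatedly, if you want BLEURT to touch only a single GPU, you can restrict the visible devices before any GPU is initialized, either via CUDA_VISIBLE_DEVICES or tf.config.set_visible_devices. A minimal sketch; the choice of GPU 0 is an assumption:

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Expose only one GPU to TensorFlow and enable memory growth on it.
    # This must run before any GPU device has been initialized.
    tf.config.set_visible_devices(gpus[0], "GPU")
    tf.config.experimental.set_memory_growth(gpus[0], True)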

versipellis commented 4 months ago

Closing as issue not relevant anymore - sorry and thanks!