I have upgraded tensor2tensor and tensorflow to v1.8 and v1.10 respectively, but the same problem still persists. I am wondering if anybody else has run into the same issue. This easily causes CUDA out-of-memory errors, and I don't know of a better option than allocating all 4 GPUs to tensor2tensor at a time.
Finally figured it out by updating the gpu_options in t2t-trainer.py, e.g. config.gpu_options.visible_device_list = str(my_rank). Apparently this is a better approach than using CUDA_VISIBLE_DEVICES, since the latter usually prevents CUDA IPC.
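For reference, a minimal sketch of what that change looks like. The `my_rank` variable and the exact spot where t2t-trainer.py builds its session config are assumptions here; the `tf.ConfigProto` / `gpu_options` API itself is standard TensorFlow 1.x:

```python
import tensorflow as tf

# Hypothetical worker index; in a real multi-worker setup this would come
# from the launcher (e.g. an MPI rank or a Kubernetes-assigned index).
my_rank = 0

# Expose only one physical GPU to this process and allocate memory on
# demand instead of reserving (almost) all of it up front.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(my_rank)
config.gpu_options.allow_growth = True

# The config is then passed wherever the trainer constructs its RunConfig;
# the exact hook point inside t2t-trainer.py depends on the t2t version.
run_config = tf.estimator.RunConfig(session_config=config)
```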
Description
I am running tensor2tensor in a Kubernetes container environment. I find that no matter how many GPUs I allocate to the container, tensor2tensor always uses every GPU on the node. For example, when I assign only one GPU to the container and also pass 1 to --worker_gpu in my training script, all 4 GPUs on the node are visible and the transformer model variables fill up the memory of all 4 GPUs. What is tricky is that the actual computation only happens on one GPU. I think this is most likely an issue with tensor2tensor, since the GPU limit is honored if I switch to a plain tensorflow container.
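This matches TensorFlow 1.x's default behaviour: a session maps every GPU that is visible to the process and reserves nearly all of their memory, regardless of how many devices the graph actually computes on. A quick way to confirm what the process can see from inside the container is a diagnostic sketch like the one below (plain TensorFlow, not part of t2t-trainer):

```python
from tensorflow.python.client import device_lib

# List every device this TensorFlow process can map. Inside the container
# described above this prints 4 GPU entries (and grabs memory on each of
# them), even though Kubernetes nominally allocated only one GPU to the pod.
for device in device_lib.list_local_devices():
    print(device.name, device.device_type, device.physical_device_desc)
```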
Below is the training script I submitted:

python -u t2t-trainer --model=transformer --hparams_set=$HPARAMS --problems=$PROBLEM --t2t_usr_dir=$t2t_usr_dir --data_dir=$data_dir --output_dir=$model_dir --save_checkpoints_secs 1800 --train_steps 45000 --worker_gpu=1
Below is the output of the nvidia-smi command on the GPU node while the tensor2tensor container is running:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 381.04                 Driver Version: 381.04                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 0000:03:00.0     Off |                    0 |
| N/A   54C    P0   148W / 235W |  10999MiB / 11439MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 0000:04:00.0     Off |                    0 |
| N/A   35C    P0    63W / 235W |  10953MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
| N/A   37C    P0    61W / 235W |  10953MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K40m          Off  | 0000:83:00.0     Off |                    0 |
| N/A   37C    P0    63W / 235W |  10953MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     71703    C   python                                     10982MiB   |
|    1     71703    C   python                                     10936MiB   |
|    2     71703    C   python                                     10936MiB   |
|    3     71703    C   python                                     10936MiB   |
+-----------------------------------------------------------------------------+
Environment information
tensor2tensor v1.2.9