I have upgraded tensor2tensor and tensorflow to v1.8 and v1.10 respectively, but the same problem still persists. I am wondering if anybody else has run into the same issue. This easily causes CUDA out-of-memory errors, and I don't know of a better option than allocating all 4 GPUs to tensor2tensor at a time.
Finally figured it out by updating the gpu_options in t2t-trainer.py, e.g. config.gpu_options.visible_device_list = str(my_rank). Apparently this is a better approach than using CUDA_VISIBLE_DEVICES, since the latter usually prevents CUDA IPC.
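For reference, a minimal sketch of what that change looks like. The `my_rank` variable and the exact spot where t2t-trainer.py builds its session config are assumptions here; the `tf.ConfigProto` / `gpu_options` API itself is standard TensorFlow 1.x:

```python
import tensorflow as tf

# Hypothetical worker index; in a real multi-worker setup this would come
# from the launcher (e.g. an MPI rank or a Kubernetes-assigned index).
my_rank = 0

# Expose only one physical GPU to this process and allocate memory on
# demand instead of reserving (almost) all of it up front.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(my_rank)
config.gpu_options.allow_growth = True

# The config is then passed wherever the trainer constructs its RunConfig;
# the exact hook point inside t2t-trainer.py depends on the t2t version.
run_config = tf.estimator.RunConfig(session_config=config)
```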
Description
I am running tensor2tensor in a Kubernetes container environment. I find that no matter how many GPUs I allocate to the container, tensor2tensor always uses every GPU on the node. For example, when I assign only one GPU to the container and also pass 1 to --worker_gpu in my training script, all 4 GPUs on the node are visible and the transformer model variables fill up the memory of all 4 GPUs. What is tricky is that the actual computation only happens on one GPU. I think this is most likely an issue with tensor2tensor, since the GPU limit is honored if I switch to a plain tensorflow container.
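This matches TensorFlow 1.x's default behaviour: a session maps every GPU that is visible to the process and reserves nearly all of their memory, regardless of how many devices the graph actually computes on. A quick way to confirm what the process can see from inside the container is a diagnostic sketch like the one below (plain TensorFlow, not part of t2t-trainer):

```python
from tensorflow.python.client import device_lib

# List every device this TensorFlow process can map. Inside the container
# described above this prints 4 GPU entries (and grabs memory on each of
# them), even though Kubernetes nominally allocated only one GPU to the pod.
for device in device_lib.list_local_devices():
    print(device.name, device.device_type, device.physical_device_desc)
```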
Below is the training script I submitted:

python -u t2t-trainer --model=transformer --hparams_set=$HPARAMS --problems=$PROBLEM --t2t_usr_dir=$t2t_usr_dir --data_dir=$data_dir --output_dir=$model_dir --save_checkpoints_secs 1800 --train_steps 45000 --worker_gpu=1
Below is the output of the nvidia-smi command on the GPU node while the tensor2tensor container is running:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 381.04                 Driver Version: 381.04                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 0000:03:00.0     Off |                    0 |
| N/A   54C    P0   148W / 235W |  10999MiB / 11439MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 0000:04:00.0     Off |                    0 |
| N/A   35C    P0    63W / 235W |  10953MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
| N/A   37C    P0    61W / 235W |  10953MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K40m          Off  | 0000:83:00.0     Off |                    0 |
| N/A   37C    P0    63W / 235W |  10953MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     71703    C   python                                     10982MiB   |
|    1     71703    C   python                                     10936MiB   |
|    2     71703    C   python                                     10936MiB   |
|    3     71703    C   python                                     10936MiB   |
+-----------------------------------------------------------------------------+
Environment information
tensor2tensor v1.2.9