keras-team / keras-tuner

A Hyperparameter Tuning Library for Keras
https://keras.io/keras_tuner/
Apache License 2.0

Distributed Tuning CUDA_ERROR_OUT_OF_MEMORY #329

Open · Astlaan opened this issue 4 years ago

Astlaan commented 4 years ago

Hello,

I'm trying to use Keras Tuner in a distributed tuning fashion. I did as explained in the docs, and ran the following commands:

export KERASTUNER_TUNER_ID="chief"
export KERASTUNER_ORACLE_IP="127.0.0.1"
export KERASTUNER_ORACLE_PORT="8894"
nohup python run_tuning.py & 
export KERASTUNER_TUNER_ID="tuner0"
nohup python run_tuning.py > tuner0.out &
export KERASTUNER_TUNER_ID="tuner1"
nohup python run_tuning.py > tuner1.out &

It turns out tuner0 runs fine; tuner1, however, fails with the following error (printed in tuner1.out):

2020-06-21 18:36:16.160951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:0a:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-06-21 18:36:16.162099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:0b:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-06-21 18:36:16.162992: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-21 18:36:16.167964: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-21 18:36:16.171693: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-21 18:36:16.172826: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-21 18:36:16.177418: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-21 18:36:16.179454: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-21 18:36:16.187249: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-21 18:36:16.190483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-06-21 18:36:16.190887: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-06-21 18:36:16.211675: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1997660000 Hz
2020-06-21 18:36:16.221177: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x559632bbffe0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-21 18:36:16.221235: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-21 18:36:16.367849: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 11996954624

I am using an NVIDIA Tesla K80 accelerator (which has a dual-GPU design). This is the output of nvidia-smi:

Sun Jun 21 18:46:33 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   78C    P0   105W / 149W |  11388MiB / 11441MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:0B:00.0 Off |                    0 |
| N/A   45C    P0    74W / 149W |    130MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      5555      C   python                                     10929MiB |
|    0      5622      C   python                                       445MiB |
|    1      5555      C   python                                        58MiB |
|    1      5622      C   python                                        58MiB |
+-----------------------------------------------------------------------------+

The batch size I'm using is 4096. Any ideas?

yixingfu commented 4 years ago

Did you set distribution_strategy=tf.distribute.MirroredStrategy()? It seems both tuners are trying to put all the load on GPU:0.

In general, the chief-worker model is intended for a cluster of machines rather than multiple accelerators in one machine. In your case, you may want to use data parallelism only (i.e., set distribution_strategy) and run a single tuner with a doubled batch size.
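For reference, a minimal sketch of that suggestion, assuming a build_model function and training arrays that stand in for the user's own code (the directory and project_name values are illustrative; distribution_strategy is the argument Keras Tuner's tuners accept for this):

import tensorflow as tf
import kerastuner as kt  # package name at the time of this issue

tuner = kt.RandomSearch(
    build_model,                     # placeholder for the user's model-building function
    objective="val_loss",
    max_trials=20,
    distribution_strategy=tf.distribute.MirroredStrategy(),  # replicate over GPU:0 and GPU:1
    directory="tuning_dir",          # illustrative names
    project_name="k80_search",
)

# A single process drives both GPUs; a global batch of 4096 is split into 2048 per replica.
tuner.search(x_train, y_train, batch_size=4096, validation_data=(x_val, y_val))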

Astlaan commented 4 years ago

> Did you set distribution_strategy=tf.distribute.MirroredStrategy()? It seems both tuners are trying to put all the load on GPU:0.

Indeed I did not, as my goal was not data parallelism but only tuning parallelism.

So is there no way to do distributed tuning on one machine with multiple GPUs? Maybe it could be done by manually selecting the GPU in each tuner run, for example:

export KERASTUNER_TUNER_ID="chief"
export KERASTUNER_ORACLE_IP="127.0.0.1"
export KERASTUNER_ORACLE_PORT="8894"
nohup python run_tuning.py & 
export KERASTUNER_TUNER_ID="tuner0"
nohup python run_tuning.py gpu0 > tuner0.out &
export KERASTUNER_TUNER_ID="tuner1"
nohup python run_tuning.py gpu1 > tuner1.out &

Where the GPU is passed as an argument, so that TensorFlow calls inside the script pin the process to that GPU? Is there an easier way?
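A minimal sketch of what the top of run_tuning.py could look like under this idea, assuming the gpu0/gpu1 command-line argument shown above (TF 2.x tf.config calls; the rest of the tuning script stays unchanged):

import sys
import tensorflow as tf

# Parse the "gpu0"/"gpu1" argument proposed above and hide the other GPU
# from this process before any TensorFlow device initialization happens.
gpu_index = int(sys.argv[1].replace("gpu", ""))
gpus = tf.config.list_physical_devices("GPU")
tf.config.set_visible_devices(gpus[gpu_index], "GPU")

# Optional: allocate GPU memory on demand instead of grabbing it all at start-up.
tf.config.experimental.set_memory_growth(gpus[gpu_index], True)

# ... build the tuner and call tuner.search() as usual ...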

yixingfu commented 4 years ago

Manually selecting a single GPU for each tuner should certainly work.
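An arguably simpler variant, reusing the commands from the original post, is to pin each tuner process with the standard CUDA_VISIBLE_DEVICES environment variable, so run_tuning.py itself needs no changes:

export KERASTUNER_ORACLE_IP="127.0.0.1"
export KERASTUNER_ORACLE_PORT="8894"
export KERASTUNER_TUNER_ID="chief"
nohup python run_tuning.py &
export KERASTUNER_TUNER_ID="tuner0"
CUDA_VISIBLE_DEVICES=0 nohup python run_tuning.py > tuner0.out &
export KERASTUNER_TUNER_ID="tuner1"
CUDA_VISIBLE_DEVICES=1 nohup python run_tuning.py > tuner1.out &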

I am not certain what would happen when simply adding MirroredStrategy; however, I think batch_size and learning_rate should not need to be changed. In your example, if you turn on MirroredStrategy, I think it will just run a per-GPU batch of 2048 for each tuner.
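To make the batch arithmetic concrete, a small sketch of how MirroredStrategy reports the split (numbers assume the two K80 GPUs and the global batch size of 4096 from this thread):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()           # sees both GPUs -> 2 replicas
global_batch_size = 4096
per_replica_batch = global_batch_size // strategy.num_replicas_in_sync
print(per_replica_batch)                              # 2048 with 2 replicas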