keras-team / keras-tuner

A Hyperparameter Tuning Library for Keras
https://keras.io/keras_tuner/
Apache License 2.0

Distributed Tuning CUDA_ERROR_OUT_OF_MEMORY #329

Open · Astlaan opened this issue 4 years ago

Astlaan commented 4 years ago

Hello,

I'm trying to use Keras Tuner in a distributed tuning fashion. I did as explained in the docs, and ran the following commands:

export KERASTUNER_TUNER_ID="chief"
export KERASTUNER_ORACLE_IP="127.0.0.1"
export KERASTUNER_ORACLE_PORT="8894"
nohup python run_tuning.py & 
export KERASTUNER_TUNER_ID="tuner0"
nohup python run_tuning.py > tuner0.out &
export KERASTUNER_TUNER_ID="tuner1"
nohup python run_tuning.py > tuner1.out &

It turns out tuner0 runs fine; tuner1, however, fails with the following error (printed in tuner1.out):

2020-06-21 18:36:16.160951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:0a:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-06-21 18:36:16.162099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:0b:00.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-06-21 18:36:16.162992: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-21 18:36:16.167964: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-21 18:36:16.171693: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-21 18:36:16.172826: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-21 18:36:16.177418: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-21 18:36:16.179454: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-21 18:36:16.187249: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-21 18:36:16.190483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-06-21 18:36:16.190887: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-06-21 18:36:16.211675: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1997660000 Hz
2020-06-21 18:36:16.221177: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x559632bbffe0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-21 18:36:16.221235: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-21 18:36:16.367849: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 11996954624

I am using an NVIDIA Tesla K80 accelerator (which has a dual-GPU design). This is the output of nvidia-smi:

Sun Jun 21 18:46:33 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   78C    P0   105W / 149W |  11388MiB / 11441MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:0B:00.0 Off |                    0 |
| N/A   45C    P0    74W / 149W |    130MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      5555      C   python                                     10929MiB |
|    0      5622      C   python                                       445MiB |
|    1      5555      C   python                                        58MiB |
|    1      5622      C   python                                        58MiB |
+-----------------------------------------------------------------------------+

The batch size I'm using is 4096. Any ideas?

yixingfu commented 4 years ago

Did you set distribution_strategy=tf.distribute.MirroredStrategy()? It seems both tuners are trying to put all the load on GPU:0.

In general, the chief-worker model is intended for a cluster of machines rather than multiple accelerators in one machine. In your case, you may want to use data parallelism only (i.e., set distribution_strategy) and run a single tuner with a doubled batch size.
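For reference, a minimal sketch of that suggestion, assuming a build_model function and training arrays that stand in for the user's own code (the directory and project_name values are illustrative; distribution_strategy is the argument Keras Tuner's tuners accept for this):

import tensorflow as tf
import kerastuner as kt  # package name at the time of this issue

tuner = kt.RandomSearch(
    build_model,                     # placeholder for the user's model-building function
    objective="val_loss",
    max_trials=20,
    distribution_strategy=tf.distribute.MirroredStrategy(),  # replicate over GPU:0 and GPU:1
    directory="tuning_dir",          # illustrative names
    project_name="k80_search",
)

# A single process drives both GPUs; a global batch of 4096 is split into 2048 per replica.
tuner.search(x_train, y_train, batch_size=4096, validation_data=(x_val, y_val))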

Astlaan commented 4 years ago

> Did you set distribution_strategy=tf.distribute.MirroredStrategy()? It seems both tuners are trying to put all the load on GPU:0.

Indeed I did not, as my goal was not data parallelism but only tuning parallelism.

So is there no way to do distributed tuning on one machine with multiple GPUs? Maybe it could be done by manually selecting the GPU in each tuner run, for example:

export KERASTUNER_TUNER_ID="chief"
export KERASTUNER_ORACLE_IP="127.0.0.1"
export KERASTUNER_ORACLE_PORT="8894"
nohup python run_tuning.py & 
export KERASTUNER_TUNER_ID="tuner0"
nohup python run_tuning.py gpu0 > tuner0.out &
export KERASTUNER_TUNER_ID="tuner1"
nohup python run_tuning.py gpu1 > tuner1.out &

Where the GPU is passed as an argument, so that TensorFlow calls inside the script pin the process to that GPU? Is there an easier way?
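A minimal sketch of what the top of run_tuning.py could look like under this idea, assuming the gpu0/gpu1 command-line argument shown above (TF 2.x tf.config calls; the rest of the tuning script stays unchanged):

import sys
import tensorflow as tf

# Parse the "gpu0"/"gpu1" argument proposed above and hide the other GPU
# from this process before any TensorFlow device initialization happens.
gpu_index = int(sys.argv[1].replace("gpu", ""))
gpus = tf.config.list_physical_devices("GPU")
tf.config.set_visible_devices(gpus[gpu_index], "GPU")

# Optional: allocate GPU memory on demand instead of grabbing it all at start-up.
tf.config.experimental.set_memory_growth(gpus[gpu_index], True)

# ... build the tuner and call tuner.search() as usual ...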

yixingfu commented 4 years ago

Manually selecting a single GPU for each tuner should certainly work.
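An arguably simpler variant, reusing the commands from the original post, is to pin each tuner process with the standard CUDA_VISIBLE_DEVICES environment variable, so run_tuning.py itself needs no changes:

export KERASTUNER_ORACLE_IP="127.0.0.1"
export KERASTUNER_ORACLE_PORT="8894"
export KERASTUNER_TUNER_ID="chief"
nohup python run_tuning.py &
export KERASTUNER_TUNER_ID="tuner0"
CUDA_VISIBLE_DEVICES=0 nohup python run_tuning.py > tuner0.out &
export KERASTUNER_TUNER_ID="tuner1"
CUDA_VISIBLE_DEVICES=1 nohup python run_tuning.py > tuner1.out &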

I am not certain what would happen when simply adding MirroredStrategy; however, I think batch_size and learning_rate should not need to be changed. In your example, if you turn on MirroredStrategy, I think it will just run a per-GPU batch of 2048 for each tuner.
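To make the batch arithmetic concrete, a small sketch of how MirroredStrategy reports the split (numbers assume the two K80 GPUs and the global batch size of 4096 from this thread):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()           # sees both GPUs -> 2 replicas
global_batch_size = 4096
per_replica_batch = global_batch_size // strategy.num_replicas_in_sync
print(per_replica_batch)                              # 2048 with 2 replicas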