Astlaan opened this issue 4 years ago
Did you set `distribution_strategy=tf.distribute.MirroredStrategy()`? It seems both tuners are trying to put all the load on GPU:0.

In general, the chief-worker model is intended for a cluster of machines rather than multiple accelerators in one machine. For your case you may want to use only data parallelism (i.e. set `distribution_strategy`) and run a single tuner with a doubled batch size.
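For reference, a minimal sketch of the single-tuner, data-parallel setup suggested here, assuming the current `keras_tuner` package name and a placeholder `build_model` hypermodel (none of these names come from the thread):

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    # Placeholder hypermodel: tune only the width of one hidden layer.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hp.Int("units", 32, 512, step=32), activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

tuner = kt.RandomSearch(
    build_model,
    objective="val_loss",
    max_trials=20,
    # Mirror each trial's training across all visible GPUs.
    distribution_strategy=tf.distribute.MirroredStrategy(),
    directory="tuning",
    project_name="mirrored_example",
)

# Pass the *global* batch size; MirroredStrategy splits it across the GPUs.
# tuner.search(x_train, y_train, validation_data=(x_val, y_val), batch_size=4096)
```

With this setup the batch size passed to `search()` is the global batch, which `MirroredStrategy` divides among the visible GPUs.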
> Did you set `distribution_strategy=tf.distribute.MirroredStrategy()`? It seems both tuners are trying to put all the load on GPU:0.

Indeed I did not, as my goal was not to use data parallelism but only tuning parallelism.
So is there no way to do distributed tuning on one machine with multiple GPUs? Maybe it could be done by manually selecting the GPU in each tuner run, such as:
export KERASTUNER_TUNER_ID="chief"
export KERASTUNER_ORACLE_IP="127.0.0.1"
export KERASTUNER_ORACLE_PORT="8894"
nohup python run_tuning.py &
export KERASTUNER_TUNER_ID="tuner0"
nohup python run_tuning.py gpu0 > tuner0.out &
export KERASTUNER_TUNER_ID="tuner1"
nohup python run_tuning.py gpu1 > tuner1.out &
Here the GPU is passed as an argument, so that inside the script TensorFlow commands are run to pin the process to that GPU. Is there an easier way?
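A minimal sketch of what such per-process GPU pinning could look like inside `run_tuning.py`, assuming the `gpu0`/`gpu1` argument convention from the commands above (the argument parsing is an assumption):

```python
import sys
import tensorflow as tf

# Assumed convention from the commands above: argv[1] is "gpu0", "gpu1", ...
gpu_index = int(sys.argv[1][3:]) if len(sys.argv) > 1 else 0

# Must run before any op touches the GPUs.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Expose only the requested GPU to this tuner process.
    tf.config.set_visible_devices(gpus[gpu_index], "GPU")

# ... build the tuner and call tuner.search(...) as usual from here on.
```

Setting `CUDA_VISIBLE_DEVICES=0` (or `1`) in the shell before launching each tuner would have the same effect without modifying the script.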
Manually selecting a single GPU for each tuner should certainly work.

I am not certain what would happen when simply adding the mirrored strategy; however, I think `batch_size` and `learning_rate` should not need to be changed. For your example, if you turn on the mirrored strategy, I think it will just run a per-GPU batch of 2048 for each tuner.
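A quick sketch of how the global batch splits under `MirroredStrategy`, assuming both GPUs of the K80 are visible (the numbers just mirror the 4096 batch size from the thread):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # picks up both GPUs of the K80
global_batch_size = 4096
per_replica_batch = global_batch_size // strategy.num_replicas_in_sync
print(per_replica_batch)  # 2048 when two GPUs are visible
```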
Hello,
I'm trying to use Keras Tuner for distributed tuning. I followed the docs and ran the following commands:
It turns out `tuner0` runs fine; however, `tuner1` gets the following problem (printed in `tuner1.out`):

I am using an NVIDIA Tesla K80 accelerator (which has a dual-GPU design). This is the output of `nvidia-smi`:
The batch size I'm using is 4096. Any ideas?