Keegil closed this issue 6 years ago.
Please take a look at the command line used for launching workers: "mpirun -mca btl_tcp_if_exclude docker0,lo --allow-run-as-root --hostfile $AZ_BATCHAI_MPI_HOST_FILE python $AZ_BATCHAI_INPUT_SCRIPTS/tensorflow_mnist.py". If you need an environment variable to be available to each worker, you need to provide its value explicitly in the mpirun call. But I would suggest passing the output directory via a command line argument instead (e.g. python $AZ_BATCHAI_INPUT_SCRIPTS/tensorflow_mnist.py --output=$AZ_BATCHAI_OUTPUT_MODEL).
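For the first option, Open MPI's mpirun can forward an environment variable to the workers with -x, e.g. "mpirun -x AZ_BATCHAI_OUTPUT_MODEL ...". For the suggested command-line approach, here is a minimal sketch of reading the flag inside the training script (the --output flag name matches the example above; the model file name is a placeholder):

import argparse
import os

# Parse the output directory passed as "--output=$AZ_BATCHAI_OUTPUT_MODEL".
parser = argparse.ArgumentParser()
parser.add_argument('--output', required=True,
                    help='directory on the mounted share where results are written')
args = parser.parse_args()

# 'model.h5' is a placeholder file name for the saved Keras model.
model_path = os.path.join(args.output, 'model.h5')
# ... build and train the model, then:
# model.save(model_path)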
Thanks a lot; the latter solution you suggested works very well!
I've modified the Horovod recipe to train a Keras model, and I'm having trouble saving results and models to the file share because the AZ_BATCHAI_INPUT_x and AZ_BATCHAI_OUTPUT_x variables are either not being set or aren't accessible in the Python kernel.
According to the documentation of azure.mgmt.batchai.models.JobCreateParameters: "Batch AI service sets the following environment variables for all jobs: AZ_BATCHAI_INPUT_id, AZ_BATCHAI_OUTPUT_id, AZ_BATCHAI_NUM_GPUS_PER_NODE."
And according to the documentation of azure.mgmt.batchai.models.OutputDirectory: "The name for the output directory. It will be available for the job as an environment variable under AZ_BATCHAI_OUTPUT_id."
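For reference, declaring an output directory when creating the job looks roughly like this (a sketch against the 2018-era azure.mgmt.batchai SDK; the id, paths, region, and node count are placeholders, and exact parameters vary by SDK version):

from azure.mgmt.batchai import models

# Sketch only: an OutputDirectory with id='MODEL' is documented to surface
# inside the job as the environment variable AZ_BATCHAI_OUTPUT_MODEL.
params = models.JobCreateParameters(
    location='eastus',                                        # placeholder region
    cluster=models.ResourceId(id=cluster.id),                 # assumes an existing cluster object
    node_count=2,                                             # placeholder count
    std_out_err_path_prefix='$AZ_BATCHAI_MOUNT_ROOT/externalfs',
    output_directories=[
        models.OutputDirectory(
            id='MODEL',                                       # -> AZ_BATCHAI_OUTPUT_MODEL
            path_prefix='$AZ_BATCHAI_MOUNT_ROOT/externalfs',  # mounted file share (placeholder)
            path_suffix='models')],
    custom_toolkit_settings=models.CustomToolkitSettings(
        command_line='mpirun ... python tensorflow_mnist.py '
                     '--output=$AZ_BATCHAI_OUTPUT_MODEL'))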
However:
This is the code for configuring the job: