microsoftarchive / BatchAI

Repo for publishing code samples and CLI samples for the BatchAI service
MIT License

Output directory environment variables are not being set #27

Closed Keegil closed 6 years ago

Keegil commented 6 years ago

I've modified the Horovod recipe to train a Keras model, and I'm having trouble saving results and models to the file share because the AZ_BATCHAI_INPUT_x and AZ_BATCHAI_OUTPUT_x variables are either not being set or are not accessible from the Python script.

According to the documentation of azure.mgmt.batchai.models.JobCreateParameters: "Batch AI service sets the following environment variables for all jobs: AZ_BATCHAI_INPUT_id, AZ_BATCHAI_OUTPUT_id, AZ_BATCHAI_NUM_GPUS_PER_NODE."

And according to the documentation of azure.mgmt.batchai.models.OutputDirectory: "The name for the output directory. It will be available for the job as an environment variable under AZ_BATCHAI_OUTPUT_id."

However, none of the AZ_BATCHAI_OUTPUT_x variables are visible from within the training script.
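For reference, a minimal check along these lines (the loop is just for illustration, not part of the original recipe) shows which Batch AI variables the training script actually sees:

import os

# List every Batch AI environment variable visible to this process.
# The AZ_BATCHAI_OUTPUT_* variables are expected here per the docs quoted above,
# but they do not show up when the script is launched through mpirun.
for name, value in sorted(os.environ.items()):
    if name.startswith('AZ_BATCHAI_'):
        print('{0}={1}'.format(name, value))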

This is the code for configuring the job:


parameters = models.job_create_parameters.JobCreateParameters(
    location=cfg.location,
    cluster=models.ResourceId(cluster.id),
    node_count=4,
    input_directories=[
        models.InputDirectory(id='SCRIPTS', path='$AZ_BATCHAI_MOUNT_ROOT/{0}/scripts'.format(azure_file_share))
    ],
    output_directories=[
        models.OutputDirectory(id='MODELS', path_prefix='$AZ_BATCHAI_MOUNT_ROOT/{0}'.format(azure_file_share), path_suffix='models'),
        models.OutputDirectory(id='RESULTS', path_prefix='$AZ_BATCHAI_MOUNT_ROOT/{0}'.format(azure_file_share), path_suffix='results')
    ],
    std_out_err_path_prefix="$AZ_BATCHAI_MOUNT_ROOT/{0}".format(azure_file_share),
    container_settings=models.ContainerSettings(
        models.ImageSourceRegistry(image='tensorflow/tensorflow:1.4.0-gpu-py3')),
    job_preparation=models.JobPreparation(
        command_line="apt update; apt install mpi-default-dev mpi-default-bin -y; pip install azure; pip install horovod; pip install keras; pip install h5py"),
    custom_toolkit_settings=models.CustomToolkitSettings(
        command_line='mpirun -mca btl_tcp_if_exclude docker0,lo --allow-run-as-root --hostfile $AZ_BATCHAI_MPI_HOST_FILE python $AZ_BATCHAI_INPUT_SCRIPTS/dev-DeepAttach-trainrater-dist-py3.py'))
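(For completeness, a sketch of how such parameters would then be submitted with the Batch AI management SDK of that era; the client variable, resource group and job name below are assumptions based on the Batch AI Python samples, not part of the original report.)

# 'client' is assumed to be an azure.mgmt.batchai.BatchAIManagementClient;
# the resource group and job name are illustrative.
job = client.jobs.create(cfg.resource_group, 'keras-horovod-job', parameters).result()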
AlexanderYukhanov commented 6 years ago

Please take a look at the command line used for launching the workers: "mpirun -mca btl_tcp_if_exclude docker0,lo --allow-run-as-root --hostfile $AZ_BATCHAI_MPI_HOST_FILE python $AZ_BATCHAI_INPUT_SCRIPTS/tensorflow_mnist.py". If you need an environment variable to be available to each worker, you have to pass its value explicitly in the mpirun call. But I would suggest passing the output directory via a command line argument instead (e.g. python $AZ_BATCHAI_INPUT_SCRIPTS/tensorflow_mnist.py --output=$AZ_BATCHAI_OUTPUT_MODEL).
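A sketch of the suggested approach applied to the job above (the --output argument name is illustrative; the job's command line runs in a shell where the variable is set, so each worker receives the resolved path as an ordinary argument):

# Job configuration: append the resolved output path to the worker command line.
custom_toolkit_settings=models.CustomToolkitSettings(
    command_line='mpirun -mca btl_tcp_if_exclude docker0,lo --allow-run-as-root '
                 '--hostfile $AZ_BATCHAI_MPI_HOST_FILE '
                 'python $AZ_BATCHAI_INPUT_SCRIPTS/dev-DeepAttach-trainrater-dist-py3.py '
                 '--output=$AZ_BATCHAI_OUTPUT_MODELS')

# Training script: read the directory from the argument instead of the environment.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--output', required=True,
                    help='directory to save the trained model to')
args = parser.parse_args()

# ... build and train the Keras model ...
# model.save(os.path.join(args.output, 'model.h5'))

The other option mentioned, forwarding the variable itself to the workers, would use Open MPI's -x flag (e.g. mpirun -x AZ_BATCHAI_OUTPUT_MODELS ...), which exports an environment variable from the launching process to the launched processes.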

Keegil commented 6 years ago

Thanks a lot; the latter solution you suggested works very well!