microsoftarchive / BatchAI

Repo for publishing code Samples and CLI samples for BatchAI service
MIT License

Excessive path lengths break certain frameworks (OpenMPI etc.) #36

Closed cauldnz closed 6 years ago

cauldnz commented 6 years ago

This manifests in various ways, but the most obvious is a problem when using OpenMPI with a multi-layered Docker container.

My job definition (Python) looks like this, with DSVM as the base image:

from datetime import datetime
import azure.mgmt.batchai.models as models

# cfg, cluster, input_directories and std_output_path_prefix are defined elsewhere (not shown).
job_name = datetime.utcnow().strftime("keras_%H%M%S")
parameters = models.job_create_parameters.JobCreateParameters(
    location=cfg.location,
    cluster=models.ResourceId(cluster.id),
    node_count=1,
    input_directories=input_directories,
    std_out_err_path_prefix=std_output_path_prefix,
    container_settings=models.ContainerSettings(
        models.ImageSourceRegistry(image='tensorflow/tensorflow:1.1.0-gpu')),
    # Runs once per node before the job starts: installs MPI, Horovod, Keras and h5py.
    job_preparation=models.JobPreparation(
        command_line="apt update; apt install mpi-default-dev mpi-default-bin -y; pip install horovod; pip install keras; pip install h5py"),
    custom_toolkit_settings=models.CustomToolkitSettings(
        command_line='mpirun -mca btl_tcp_if_exclude docker0,lo --allow-run-as-root --hostfile $AZ_BATCHAI_MPI_HOST_FILE python $AZ_BATCHAI_INPUT_SCRIPTS/keras-resnet-horovod.py'))

During execution I get output like the following. I have also hit other situations (the details escape me) where I had to shorten things like the job name to keep path lengths down.

Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/data/docker/overlay2/l/K5Y3PIBT3BMKA4G67SR524WSSZ:/data/docker/overlay2/l/75LBQM65UNEZVFOFSBLVEFU6HB:/data/docker/overlay2/l/6NTEL327KYYZSR57FFGAKN64HB:/data/docker/overlay2/l/XECVRKBS6A256PIZI3LJIXOLZB:/data/docker/overlay2/l/EK2JEZCX3BFRMFRDC7UWIDS2YA:/data/docker/overlay2/l/OCUI3355GNZCQQCLADMSY3PIUA:/data/docker/overlay2/l/K2PQLABPXPUOUREGKZU5ZLUY6S:/data/docker/overlay2/l/XBKCDKXUND6YCOFDBK3GPBP6FC:/data/docker/overlay2/l/4CIW4J545FX7JPK2BWS5MU4XDU:/data/docker/'
Unexpected end of /proc/mounts line `overlay2/l/5OBQJ76BJZL6RZJM7Q4LHIO3PQ:/data/docker/overlay2/l/NILD27VDLD5TOQEIPHDCKAXTBU:/data/docker/overlay2/l/5NHG52PCCZA37MLCTHI7DS5WGZ:/data/docker/overlay2/l/QKKZRI27CM2DZXC4BACU6IT2V3:/data/docker/overlay2/l/HOQAN3UEO36UKPZX2FAGAEZOUL:/data/docker/overlay2/l/3VTFWG5J5BVZQB5NSVBEZJDTOL:/data/docker/overlay2/l/FYIYVSWUM52DE4OS2Q4HNQICDG:/data/docker/overlay2/l/R4FAJ5QEEGGK5HJKBPUQN3M2N7:/data/docker/overlay2/l/KM3HKB6UJNQI6QTVOZCAQFERYP:/data/docker/overlay2/l/DBPCK2JQVEI3AFCSBP47MMB4RT:/data/docker/o'
python: can't open file '/keras-resnet-horovod.py': [Errno 2] No such file or directory

Issue is documented here for OpenMPI.

As for a suggested fix, I think the goal should be to minimize path lengths as much as possible.

Possible approaches... please add more thoughts:

AlexanderYukhanov commented 6 years ago

Hello Chris, "Unexpected end of ..." is a very confusing but harmless warning and can be safely ignored.

The problem in your case is "'/keras-resnet-horovod.py': [Errno 2] No such file or directory". It means that $AZ_BATCHAI_INPUT_SCRIPTS/keras-resnet-horovod.py expanded to /keras-resnet-horovod.py because $AZ_BATCHAI_INPUT_SCRIPTS is not defined. I suspect you have not specified an input directory with id "SCRIPTS" in your job definition, which is why BatchAI has not set up the AZ_BATCHAI_INPUT_SCRIPTS environment variable for your job.
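
To illustrate the expansion described above, here is a minimal Python sketch. It is not part of BatchAI: `shell_expand` and `missing_input_ids` are hypothetical helpers, the regex only approximates shell variable expansion, and the directory value is made up. It assumes, as described above, that each input directory id maps to an AZ_BATCHAI_INPUT_<id> variable.

```python
import re

# Matches $NAME and ${NAME}, the two common shell variable forms.
_VAR = re.compile(r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))")
_PREFIX = "AZ_BATCHAI_INPUT_"


def shell_expand(command, env):
    """Approximate shell expansion: an undefined variable becomes an empty string."""
    return _VAR.sub(lambda m: env.get(m.group(1) or m.group(2), ""), command)


def missing_input_ids(command, input_ids):
    """Return ids referenced as $AZ_BATCHAI_INPUT_<ID> that have no matching input directory."""
    names = [m.group(1) or m.group(2) for m in _VAR.finditer(command)]
    referenced = {n[len(_PREFIX):] for n in names if n.startswith(_PREFIX)}
    return sorted(referenced - set(input_ids))


command = "python $AZ_BATCHAI_INPUT_SCRIPTS/keras-resnet-horovod.py"

# Variable undefined: the script path collapses to the filesystem root.
print(shell_expand(command, {}))

# Variable defined (the directory value is invented for illustration).
print(shell_expand(command, {"AZ_BATCHAI_INPUT_SCRIPTS": "/hypothetical/scripts"}))

# Pre-flight check: "SCRIPTS" is referenced, but only "DATASET" is declared.
print(missing_input_ids(command, ["DATASET"]))
```

A check like `missing_input_ids` could be run against the job's command line before submission, so that a missing input directory id shows up as a clear message rather than a "No such file or directory" error at run time.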

Can you please provide the code you used to populate the input_directories variable?

Thanks, Alex

AlexanderYukhanov commented 6 years ago

Hi Chris, did my answer help?

cauldnz commented 6 years ago

Hi Alexander. Yes, all sorted, thanks. I did still have some issues with path lengths when I used a longer job name. I will try to repro, but I am closing this issue. Thanks a lot for your help.