microsoftarchive / BatchAI

Repo for publishing code samples and CLI samples for the Batch AI service
MIT License

Saving files #24

Closed Nimi42 closed 6 years ago

Nimi42 commented 6 years ago

1.

I tried to use the Horovod recipe. The std output works just fine but I can't seem to save the model.

What do I have to do to save the files to some output directory on the storage?

The job.json defines an output directory, but it stays empty even after a successful run.

"outputDirectories": [
      {
        "createNew": true,
        "id": "MODEL",
        "pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external",
        "pathSuffix": "Models",
        "type": "custom"
      }
],

I tried saving something to

'./MODEL/example.txt'

explicitly, but that also did not work. What am I missing?

EDIT: I checked the VM and the results are definitely there somewhere in /mnt/batch/tasks/workitems... I think. Should I save them by hand to $AZ_BATCHAI_MOUNT_ROOT, or how do I get these files into the storage?


  2. Until now I thought I had to start mpirun with the number of processes and servers I want to use. How come the mpirun command in the job.json does not specify any such options?

e.g.

"commandLine": "mpirun -mca btl_tcp_if_exclude docker0,lo --allow-run-as-root --hostfile $AZ_BATCHAI_MPI_HOST_FILE python $AZ_BATCHAI_INPUT_SCRIPTS/tensorflow_mnist.py"
llidev commented 6 years ago

Hi,

  1. The output directory setting creates a unique job output directory in your file share storage, and you can refer to it in the job via the environment variable $AZ_BATCHAI_OUTPUT_<id> (in your case, $AZ_BATCHAI_OUTPUT_MODEL).

Your training script is responsible for saving the model file to the specified destination. The Horovod recipe uses the official Horovod sample tensorflow_mnist.py, where the checkpoint is saved to:

if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

You have to modify the script to something like:

import os

if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint(
        os.path.join(os.environ['AZ_BATCHAI_OUTPUT_MODEL'], 'checkpoint-{epoch}.h5')))

Then you should be able to see your model output in your share.
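As a quick sanity check, this pattern works for any file, not just Keras checkpoints. The sketch below (a hypothetical standalone helper, not part of the recipe) writes a file through the environment variable so you can confirm it shows up in your share:

```python
import os

def write_to_output_dir(filename, text):
    # Batch AI exposes each output directory's path through an environment
    # variable named AZ_BATCHAI_OUTPUT_<id>; with "id": "MODEL" in the
    # job.json above, that becomes AZ_BATCHAI_OUTPUT_MODEL.
    output_dir = os.environ['AZ_BATCHAI_OUTPUT_MODEL']
    path = os.path.join(output_dir, filename)
    # Anything written under this path lands on the mounted file share.
    with open(path, 'w') as f:
        f.write(text)
    return path
```

Saving to a relative path like './MODEL/example.txt' only writes into the job's working directory on the node (under /mnt/batch/tasks/...), which is why those files never reach the share.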

  2. We use an MPI host file instead of specifying the number of processes:

--hostfile $AZ_BATCHAI_MPI_HOST_FILE

The file is auto-generated by Batch AI in the format of:

host1 #proc max_slot
host2 #proc max_slot
...

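In Open MPI's hostfile syntax this per-host layout is written as "hostname slots=N max_slots=M", and when no -np is given, mpirun launches one process per slot. As a rough illustration (a hypothetical helper, not part of Batch AI or Open MPI), summing the slots recovers the total process count:

```python
def total_slots(hostfile_text):
    """Sum the slot counts from an Open MPI-style host file.

    Assumes lines of the form "host slots=N max_slots=M"; a bare
    hostname counts as one slot. Illustration only: mpirun parses
    the file itself, this just shows where the process count comes from.
    """
    total = 0
    for line in hostfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blanks and comment lines
        tokens = line.split()
        slots = 1  # default when no slots= is given
        for token in tokens[1:]:
            if token.startswith('slots='):
                slots = int(token.split('=', 1)[1])
        total += slots
    return total
```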
kishanakbari21 commented 6 years ago

Hi, did you get an answer with reference to the Keras recipe? I mean, how can I get the model file as well as the TensorBoard log files using the Keras recipe?