Closed Nimi42 closed 6 years ago
Hi,
The output directory setting creates a unique job output directory in your file share storage, and you can use it in the job via an environment variable named AZ_BATCHAI_OUTPUT_<directory-id> (for example, $AZ_BATCHAI_OUTPUT_MODEL for an output directory with id MODEL).
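As a minimal sketch of that lookup (the directory id MODEL is an assumption taken from the snippet later in this answer; substitute whatever id your job.json declares, and note the helper name is hypothetical):

```python
import os

def resolve_output_dir(directory_id, env=None):
    """Return the path Batch AI exposes for the given output directory id.

    Batch AI sets one variable per output directory, named
    AZ_BATCHAI_OUTPUT_<ID>; the id "MODEL" used below is an assumption.
    """
    env = os.environ if env is None else env
    var_name = "AZ_BATCHAI_OUTPUT_" + directory_id.upper()
    try:
        return env[var_name]
    except KeyError:
        raise KeyError(var_name + " is not set; check the output directory id in job.json")

# Outside a Batch AI job the variable is absent, so demonstrate with a stub env:
stub_env = {"AZ_BATCHAI_OUTPUT_MODEL": "/afs/outputs/model"}
print(resolve_output_dir("MODEL", stub_env))  # → /afs/outputs/model
```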
Your training script is responsible for saving the model file to a specified destination. In the Horovod recipe, we use the official Horovod sample tensorflow_mnist.py, where the checkpoint is saved with:
if hvd.rank() == 0:
callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))
You have to modify the script to something like (adding import os if the script does not already import it):
if hvd.rank() == 0:
callbacks.append(keras.callbacks.ModelCheckpoint(os.path.join(os.environ['AZ_BATCHAI_OUTPUT_MODEL'], 'checkpoint-{epoch}.h5')))
Then you should be able to see your model output in your share.
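Putting the pieces together, here is a hedged sketch of the path construction (the output directory id MODEL comes from the recipe's job configuration; keras and hvd appear only in comments so the snippet stays runnable anywhere):

```python
import os

# Fall back to a local directory when the Batch AI variable is absent,
# so the script can also be smoke-tested outside the cluster.
output_dir = os.environ.get("AZ_BATCHAI_OUTPUT_MODEL", "/tmp/model-output")

# Keras substitutes {epoch} itself, so we only build the template string here.
checkpoint_template = os.path.join(output_dir, "checkpoint-{epoch}.h5")
print(checkpoint_template)

# In the training script this becomes, guarded so only rank 0 writes:
#   if hvd.rank() == 0:
#       callbacks.append(keras.callbacks.ModelCheckpoint(checkpoint_template))
```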
We use an MPI host file instead of specifying the number of processes:
--hostfile $AZ_BATCHAI_MPI_HOST_FILE
The file is auto-generated by Batch AI in the format (one host per line):
host1 #proc max_slot
host2 #proc max_slot
...
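For reference, the host file layout described above can be parsed with a short sketch; the three-column interpretation (host, process count, max slots) is an assumption based on that description, not the authoritative Batch AI format:

```python
def parse_hostfile(text):
    """Parse lines of the form '<host> <num_procs> <max_slots>' into tuples.

    Assumes the three-column layout described above; blank lines are skipped.
    """
    hosts = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        host, procs, max_slots = parts[0], int(parts[1]), int(parts[2])
        hosts.append((host, procs, max_slots))
    return hosts

sample = "host1 2 4\nhost2 2 4"
print(parse_hostfile(sample))  # → [('host1', 2, 4), ('host2', 2, 4)]
```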
Hi, did you get an answer with reference to the Keras recipe? I mean, how can I get the model file as well as the TensorBoard log files using the Keras recipe?
1.
I tried to use the Horovod recipe. The std output works just fine, but I can't seem to save the model.
What do I have to do to save the files to some output dir on the storage?
The job.json defines an output directory, but it stays empty even after a successful run.
I tried saving something to
explicitly, but that also did not work. What am I missing?
EDIT: I checked the VM and the results are definitely there somewhere in /mnt/batch/tasks/workitems... I think. Should I save them by hand to $AZ_BATCHAI_MOUNT_ROOT, or how do I get these files into the storage?