Open damienpontifex opened 6 years ago
Hi Damien, thank you for the feedback! We will figure out how to make use of TF_CONFIG for the tensorflow framework. At first glance, we can just introduce dedicated environment variables for use with TF_CONFIG.
Great, thanks for the response. It is a JSON serialised dictionary in an environment variable, but would mean distributed training would ‘just work ™️’.
Looking at Azure Batch AI environment variables it seems this is now available.
Sorry, the functionality is not released yet.
May I ask if anyone, or @damienpontifex, knows what the env variable for the master host is? I encounter "ValueError: If "cluster" is set in TF_CONFIG, it must have one "chief" node." You can reference the chief node in the cluster spec here: https://www.tensorflow.org/api_docs/python/tf/estimator/RunConfig
@wtam my understanding is you must have the task type and index set appropriately for the chief. In the page you linked this is {'cluster': cluster, 'task': {'type': 'chief', 'index': 0}}, where the cluster variable has three keys: chief, ps and worker.
Without seeing your actual code, it seems the minimum requirement is for cluster to have {'chief': ['host0:2222']}. You can have a look at the logic in RunConfig to see if there's a case in your setup that you have configured wrong.
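As a minimal sketch of the point above (host names here are hypothetical), a TF_CONFIG with a chief entry can be assembled and round-tripped through the environment variable like this; RunConfig does the same json.loads internally:

```python
import json
import os

# Hypothetical hosts for illustration; per RunConfig's validation,
# the 'chief' entry is the one that must be present in 'cluster'.
tf_config = {
    "cluster": {
        "chief": ["host0:2222"],
        "ps": ["host1:2222"],
        "worker": ["host2:2222", "host3:2222"],
    },
    "task": {"type": "chief", "index": 0},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)

# Read it back the way RunConfig would.
parsed = json.loads(os.environ["TF_CONFIG"])
print(parsed["cluster"]["chief"])
```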
@damienpontifex Thanks so much for the response. Since Batch AI only provides these env vars: $AZ_BATCHAI_PS_HOSTS, $AZ_BATCHAI_WORKER_HOSTS and $AZ_BATCHAI_TASK_INDEX, I overcame the chief node definition issue above by manually reserving the first worker host as the chief node and putting it into the cluster spec. Now I've moved a bit forward but encounter another issue from RunConfig: ValueError: worker is not a valid task_type in the cluster_spec: <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fdd35049750> Not sure where it goes wrong. My cluster is 3 nodes: 1 node reserved for the PS and chief, and another 2 nodes for workers. Appreciate any comment or suggestion to help me out.
This is the cluster spec for the failed worker: '{"cluster": {"chief": ["10.0.0.4:2223"], "worker_hosts": ["10.0.0.5:2222", "10.0.0.6:2222"], "ps_hosts": ["10.0.0.4:2222"]}, "task": {"index": "1", "type": "worker"}}'
Stupid mistake I made in the cluster spec naming: RunConfig looks for worker, but I named the key worker_hosts, and that's why I got the ValueError. For people playing around with Estimator distributed GPU on Batch AI, better to wait for official support, as the way I reserved the worker node also requires me to manually decrement $AZ_BATCHAI_TASK_INDEX in the cluster spec for the workers.
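For reference, the keys RunConfig expects are worker and ps, not worker_hosts/ps_hosts. A hedged sketch of building the spec from the Batch AI variables named above, reserving the first worker host as chief and decrementing the task index as described (the env var values below are made up for illustration):

```python
import json
import os

# Stand-in values for what Batch AI would set on a worker node.
os.environ["AZ_BATCHAI_PS_HOSTS"] = "10.0.0.4:2222"
os.environ["AZ_BATCHAI_WORKER_HOSTS"] = "10.0.0.4:2223,10.0.0.5:2222,10.0.0.6:2222"
os.environ["AZ_BATCHAI_TASK_INDEX"] = "1"

workers = os.environ["AZ_BATCHAI_WORKER_HOSTS"].split(",")

# Reserve the first worker host as chief, shift the rest down by one.
tf_config = {
    "cluster": {
        "chief": [workers[0]],
        "ps": os.environ["AZ_BATCHAI_PS_HOSTS"].split(","),
        "worker": workers[1:],  # key must be 'worker', not 'worker_hosts'
    },
    # Worker indices shift down by one because worker 0 became the chief.
    "task": {"type": "worker",
             "index": int(os.environ["AZ_BATCHAI_TASK_INDEX"]) - 1},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)
print(os.environ["TF_CONFIG"])
```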
Hi @damienpontifex, maybe you already know this, but Batch AI now automatically generates the TF_CONFIG env var when running a tensorflow job. Would you please try it out and let us know if it works for you? Thanks!
Hi @lliimsft, I'm seeing the automatically generated TF_CONFIG env var with nodeCount 1 as:
{'task': {'type': 'master', 'index': 0}, 'cluster': {'ps': [''], 'worker': ['10.0.0.4:2222']}, 'environment': 'cloud'}
which doesn't seem to work in this 1-node cluster scenario?
Getting this error when running with nodeCount=3 in the stderr-ps-0.txt log
"ValueError: If "cluster" is set in TF_CONFIG, it must have one "chief" node."
For this task the TF_CONFIG variable was:
{'cluster': {'worker': ['10.0.0.4:2223', '10.0.0.5:2222', '10.0.0.6:2222'], 'ps': ['10.0.0.4:2222']}, 'task': {'type': 'ps', 'index': 0}, 'environment': 'cloud'}
The worker logs just had "Warning: Permanently added '[10.0.0.5]:23' (ECDSA) to the list of known hosts."
I put the code I'm running here https://github.com/damienpontifex/BatchAIMnist
From the repo, I do:
sh prepare-cluster.sh
sh data-prep.sh
# Wait until data prep done
sh train.sh
Looking at the documentation, wondering whether the TF_CONFIG value should be:
On the parameter server:
{'cluster': {'chief': ['10.0.0.4:2224'], 'worker': ['10.0.0.4:2223', '10.0.0.5:2222', '10.0.0.6:2222'], 'ps': ['10.0.0.4:2222']}, 'task': {'type': 'ps', 'index': 0}, 'environment': 'cloud'}
On the chief:
{'cluster': {'chief': ['10.0.0.4:2224'], 'worker': ['10.0.0.4:2223', '10.0.0.5:2222', '10.0.0.6:2222'], 'ps': ['10.0.0.4:2222']}, 'task': {'type': 'chief', 'index': 0}, 'environment': 'cloud'}
I can't seem to find guidance on having all of chief, ps and worker on the same machine, as the docstring https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/estimator/run_config.py#L351-L376 seems to have them all on separate machines.
How can we assist to test and get this working?
@lliimsft, @AlexanderYukhanov, can we please get some update on this? :)
@damienpontifex @yangsiyu007 The TF_CONFIG environment variable offered by Batch AI is based on TensorFlow Trainer Development Considerations, where the cluster only contains ps/worker, and the task type will be master, worker, or ps. However, according to run_config.py, tensorflow now accepts more options such as "chief", which is confusing to us (not sure how it differs from "master"). We are looking into this.
Thank you @lliimsft @yangsiyu007. I also wasn't aware of the change and thank you for the continued effort to support this.
Hello guys, just wondering if Batch AI is generating the new format of TF_CONFIG now?
I don't think so - not when I tried it the week before last... @lliimsft updates?
@yangsiyu007 @awan-10 This work is still in progress. We will keep you updated in this post.
I was looking at what is currently being set and what changes are needed for RunConfig to parse it correctly. Investigations are outlined below; I will look into updating the TF_CONFIG variable on each machine through code to confirm this change works. @lliimsft could the below help in making the appropriate changes?
To verify what JSON structure worked, I set up:
os.environ['TF_CONFIG'] = TF_CONFIG_JSON_STRING
config = tf.estimator.RunConfig()
print('master => {}'.format(config.master))
print('task_id => {}'.format(config.task_id))
print('num_ps_replicas => {}'.format(config.num_ps_replicas))
print('num_worker_replicas => {}'.format(config.num_worker_replicas))
print('cluster_spec => {}'.format(config.cluster_spec))
print('task_type => {}'.format(config.task_type))
print('is_chief => {}'.format(config.is_chief))
Run with a 3-node job configured with 1 parameter server and a worker count of 3.
Currently in Batch AI we get the TF_CONFIG environment variable being:
In ps-0
{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"ps","index":0},"environment":"cloud"}
wk-0
{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":0},"environment":"cloud"}
wk-1
{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":1},"environment":"cloud"}
wk-2
{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":2},"environment":"cloud"}
With these, the python code above gave the error:
ValueError: If "cluster" is set in TF_CONFIG, it must have one "chief" node.
To get this working, we apparently need the master worker defined under chief in the cluster. As such, the 'cluster' part of the JSON object would become:
"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]}
Then the task component would be changed for whichever node is initiated from masterCommandLineArgs, which would have a task of:
"task":{"type":"chief","index":0}
The other worker nodes would have the same as before, with the index now being 0 or 1, e.g.
"task":{"type":"worker","index":1}
This sample code parses into RunConfig correctly, but I haven't tested it on a cluster with an estimator yet to see if it hooks it all up fine:
import os
import json
import tensorflow as tf

def log_config_for(runconfig_string):
    os.environ['TF_CONFIG'] = runconfig_string
    config = tf.estimator.RunConfig()
    print('master => {}'.format(config.master))
    print('task_id => {}'.format(config.task_id))
    print('num_ps_replicas => {}'.format(config.num_ps_replicas))
    print('num_worker_replicas => {}'.format(config.num_worker_replicas))
    print('cluster_spec => {}'.format(config.cluster_spec))
    print('task_type => {}'.format(config.task_type))
    print('is_chief => {}'.format(config.is_chief))
    print()

def main():
    machine_definitions = [
        # Machine expected from settings with parameterServerCommandLineArgs
        '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"ps","index":0},"environment":"cloud"}',
        # Machine expected from settings with masterCommandLineArgs
        '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"chief","index":0},"environment":"cloud"}',
        # Machine expected from settings with workerCommandLineArgs
        '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":0},"environment":"cloud"}',
        # Machine expected from settings with workerCommandLineArgs
        '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":1},"environment":"cloud"}'
    ]
    for definition in machine_definitions:
        log_config_for(definition)

if __name__ == '__main__':
    main()
I found a workaround: I was able to manipulate the TF_CONFIG environment variable to get it working, and put the code here: https://github.com/damienpontifex/batchai-tfconfig-workaround
The environment variable manipulation was:
def remap_tfconfig(is_master):
    tf_config = json.loads(os.environ['TF_CONFIG'])

    # Move the first worker into its own 'chief' entry.
    master_worker = tf_config['cluster']['worker'][0]
    tf_config['cluster']['worker'] = tf_config['cluster']['worker'][1:]
    tf_config['cluster']['chief'] = [master_worker]

    if is_master:
        tf_config['task']['type'] = 'chief'
        tf_config['task']['index'] = 0
    elif tf_config['task']['type'] == 'worker':
        tf_config['task']['index'] -= 1

    os.environ['TF_CONFIG'] = json.dumps(tf_config)
And I pass --master through to the masterCommandLineArgs, which is received by ArgumentParser via parser.add_argument('--master', action='store_true'). Then just call remap_tfconfig(args.master) after parse_args.
Hopefully this can help in getting the fix into Batch AI 😄
Tried this again today in Azure ML Workspace with 'Machine Learning Compute', following the Parameter Server setup, and got an error:
Run failed: argument of type 'ClusterSpec' is not iterable
Getting TF_CONFIG quite right still seems to be an issue.
I found this description of chief vs. master: https://cloud.google.com/ai-platform/training/docs/distributed-training-details#chief-versus-master
Based on it, master is unsupported in TF2 and should be replaced with chief.
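A small sketch of what that replacement looks like in practice: an old-style config using the master task type (addresses are made up) rewritten so that both the cluster entry and the task type say chief instead:

```python
import json

# Old-style (Trainer Development Considerations) config using 'master'.
old = {
    "cluster": {
        "master": ["10.0.0.4:2222"],
        "ps": ["10.0.0.5:2222"],
        "worker": ["10.0.0.6:2222"],
    },
    "task": {"type": "master", "index": 0},
}

# TF2-style replacement: rename 'master' to 'chief' in the cluster
# spec and in the task type; everything else is unchanged.
new = {
    "cluster": {
        **{k: v for k, v in old["cluster"].items() if k != "master"},
        "chief": old["cluster"]["master"],
    },
    "task": {
        "type": "chief" if old["task"]["type"] == "master" else old["task"]["type"],
        "index": old["task"]["index"],
    },
}
print(json.dumps(new))
```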
The TensorFlow ClusterConfig can parse worker and parameter server settings from a TF_CONFIG environment variable (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py#L64-L156). I was trying to pass it via an environment variable in the job configuration file like so:
Which is kind of fine, but falls down for a few cases:
More generally though, providing this configuration via a TF_CONFIG environment variable would significantly lower the bar to getting distributed training working in TensorFlow and Azure Batch. It would also simplify command line arg parameters: just the appropriate data directories would need to be passed, the same arguments could be used across master, worker and ps, and the tensorflowSettings property could potentially be simplified further.