A few UX issues that came up while testing the autoscaler and creating example distributed experiments.
Running ray exec commands takes a long time. Is there a way to cache any part of this to speed it up?
Typing ray exec cluster.yaml every time is long and tedious. Is it possible to shorten this command? Also, the name of the .yaml file can get long. It might be helpful to store it in an environment variable instead of referencing the filename every time.
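As a rough illustration of the environment-variable idea, here is a minimal user-side sketch in Python. It is not part of Ray: the variable name RAY_CLUSTER_CONFIG and the helper build_exec_argv are hypothetical, and the only assumption about Ray is the ray exec <config> "<command>" form already used in this issue.

```python
import os
import shlex
import subprocess

# Hypothetical variable name chosen for this sketch; Ray itself does not read it.
CONFIG_ENV_VAR = "RAY_CLUSTER_CONFIG"


def build_exec_argv(remote_args, environ=None):
    """Build the argv for `ray exec <config> "<command>"` from an env var."""
    environ = os.environ if environ is None else environ
    config = environ.get(CONFIG_ENV_VAR)
    if not config:
        raise SystemExit(
            "Set {} to the path of your cluster YAML".format(CONFIG_ENV_VAR))
    if not remote_args:
        raise SystemExit("No remote command given")
    # ray exec takes the remote command as a single string, so quote each
    # argument and join them back together.
    remote_cmd = " ".join(shlex.quote(arg) for arg in remote_args)
    return ["ray", "exec", config, remote_cmd]


def main(argv):
    """Run the remote command and return the exit status of `ray exec`."""
    return subprocess.call(build_exec_argv(argv))
```

With a small script that calls sys.exit(main(sys.argv[1:])) and is saved under a short name, the cluster file would only have to be named once per shell session (export RAY_CLUSTER_CONFIG=cluster.yaml) rather than on every command.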
When running cluster commands locally (e.g. ray exec cluster.yaml "..."), if the command fails, the additional subprocess error message makes the actual error hard to find and isn't very helpful. Is there any way to hide this? Here is an example of an error message. The output the user cares about is "KeyError: 'Sort Index ...", but it is stuck in the middle of the output and hard to find.
(base) Andrew-MacBook:cluster_configs andrewtan$ ray exec cluster.yaml 'tune ls ray_results/my_first_experiment --sort meen_accuracy'
2019-04-16 16:56:23,597 INFO updater.py:90 -- NodeUpdater: Waiting for IP of i-05e1c396a466fe421...
2019-04-16 16:56:23,597 INFO log_timer.py:21 -- NodeUpdater: i-05e1c396a466fe421: Got IP [LogTimer=274ms]
2019-04-16 16:56:23,634 INFO updater.py:268 -- NodeUpdater: Running tune ls ray_results/my_first_experiment --sort meen_accuracy on 54.213.55.153...
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/bin/tune", line 11, in <module>
sys.exit(cli())
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/tune/scripts.py", line 50, in list_trials
result_columns)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/tune/commands.py", line 195, in list_trials
sort, list(checkpoints_df)))
KeyError: 'Sort Index "meen_accuracy" not in: [\'trainable_name\', \'experiment_tag\', \'trial_id\', \'status\', \'last_update_time\', \'last_result:training_iteration\', \'last_result:mean_accuracy\']'
Shared connection to 54.213.55.153 closed.
Traceback (most recent call last):
File "/Users/andrewtan/anaconda/bin/ray", line 10, in <module>
sys.exit(main())
File "/Users/andrewtan/anaconda/lib/python3.6/site-packages/ray/scripts/scripts.py", line 774, in main
return cli()
File "/Users/andrewtan/anaconda/lib/python3.6/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/Users/andrewtan/anaconda/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/Users/andrewtan/anaconda/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/andrewtan/anaconda/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/andrewtan/anaconda/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/Users/andrewtan/anaconda/lib/python3.6/site-packages/ray/scripts/scripts.py", line 703, in exec_cmd
cluster_name, port_forward)
File "/Users/andrewtan/anaconda/lib/python3.6/site-packages/ray/autoscaler/commands.py", line 376, in exec_cluster
port_forward=port_forward)
File "/Users/andrewtan/anaconda/lib/python3.6/site-packages/ray/autoscaler/commands.py", line 415, in _exec
port_forward=port_forward)
File "/Users/andrewtan/anaconda/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 291, in ssh_cmd
stderr=redirect or sys.stderr)
File "/Users/andrewtan/anaconda/lib/python3.6/subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-tt', '-i', '/Users/andrewtan/.ssh/ray-autoscaler_us-west-2.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_sockets/%C', '-o', 'ControlPersist=5m', 'ubuntu@54.213.55.153', "bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && tune ls ray_results/my_first_experiment --sort meen_accuracy'"]' returned non-zero exit status 1.
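The second traceback above comes from subprocess.check_call raising CalledProcessError on the local machine after the remote command has already printed its own error. One way the local noise could be hidden is to catch that exception and pass the exit status through. The sketch below only illustrates the pattern; run_quietly is a hypothetical helper, not Ray's actual code.

```python
import subprocess
import sys


def run_quietly(cmd):
    """Run cmd; on failure return its exit status instead of raising."""
    try:
        subprocess.check_call(cmd)
    except subprocess.CalledProcessError as exc:
        # The child process has already written its own error output,
        # so a one-line summary is enough on the local side.
        print("Command failed with exit status {}".format(exc.returncode),
              file=sys.stderr)
        return exc.returncode
    return 0
```

With something like this around the ssh call, the KeyError from the remote tune ls command would be the last traceback in the output, followed by a single summary line.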