Closed sofglide closed 3 years ago
cc @DmitriGekhtman @AmeerHajAli autoscaler.sdk.rsync is failing on Azure cluster due to authentication reasons. Any ideas here?
@gramhagen , can you please help?
these setup commands don't seem right, but it's probably not the problem.
setup_commands: []
azure-mgmt-resource==13.0.0
manylinux2014_x86_64.whl"
I think you will need to install additional packages in your docker image or in head_setup_commands. Can you try this?
setup_commands: []
head_setup_commands:
- pip install -U azure-cli-core==2.22.0 azure-mgmt-compute==14.0.0 azure-mgmt-msi==1.0.0 azure-mgmt-network==10.2.0 azure-mgmt-resource==13.0.0
Hi, actually the setup commands look like this:
setup_commands: []
The 2 lines that follow were introduced by mistake when copying the file here. (updated in the message now)
When I attach to the head and run pip I find the packages you're suggesting. This is because these packages are installed in the docker image I'm using.
(base) ray@ray-ai-hp-tune-head-6fcbadeb0:~$ pip freeze | grep azure
azure-cli-core==2.22.0
azure-cli-telemetry==1.0.6
azure-common==1.1.27
azure-core==1.17.0
azure-identity==1.6.0
azure-mgmt-compute==14.0.0
azure-mgmt-core==1.3.0
azure-mgmt-msi==1.0.0
azure-mgmt-network==10.2.0
azure-mgmt-resource==13.0.0
azure-storage-blob==12.8.1
msrestazure==0.6.4
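The same check can be scripted so it runs inside the container without grepping by hand. This is only a sketch: the version pins are copied from the pip install line suggested above, and `check_packages` is a hypothetical helper name, not part of Ray or the Azure SDK.

```python
from importlib import metadata  # standard library since Python 3.8

# Versions taken from the pip install line suggested earlier in the thread.
REQUIRED = {
    "azure-cli-core": "2.22.0",
    "azure-mgmt-compute": "14.0.0",
    "azure-mgmt-msi": "1.0.0",
    "azure-mgmt-network": "10.2.0",
    "azure-mgmt-resource": "13.0.0",
}


def check_packages(required):
    """Return {name: (wanted, found)} for every missing or mismatched package."""
    problems = {}
    for name, wanted in required.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    for name, (wanted, found) in check_packages(REQUIRED).items():
        print(f"{name}: wanted {wanted}, found {found or 'nothing'}")
```

An empty result means every pinned package is present at the expected version, which matches what the pip freeze output above shows for this image.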
In the docker file there is:
RUN pip install --upgrade -r requirements.txt
where requirements.txt contains those packages
# This file is autogenerated by pip-compile with python 3.8
# To update, run:
#
# pip-compile
#
adal==1.2.7
# via
# azure-cli-core
# msrestazure
aiohttp==3.7.4.post0
# via
# aiohttp-cors
# ray
aiohttp-cors==0.7.0
# via ray
aioredis==1.3.1
# via
# -r requirements.in
# ray
appdirs==1.4.4
# via black
applicationinsights==0.11.10
# via azure-cli-telemetry
argcomplete==1.12.3
# via
# azure-cli-core
# knack
astroid==2.5
# via pylint
async-timeout==3.0.1
# via
# aiohttp
# aioredis
attrs==21.2.0
# via
# aiohttp
# jsonschema
# pytest
azure-cli-core==2.22.0
# via -r requirements.in
azure-cli-telemetry==1.0.6
# via azure-cli-core
azure-common==1.1.27
# via
# azure-cli-core
# azure-mgmt-compute
# azure-mgmt-msi
# azure-mgmt-network
# azure-mgmt-resource
# smart-open
azure-core==1.17.0
# via
# azure-identity
# azure-mgmt-core
# azure-storage-blob
# smart-open
azure-identity==1.6.0
# via -r requirements.in
azure-mgmt-compute==14.0.0
# via -r requirements.in
azure-mgmt-core==1.3.0
# via azure-cli-core
azure-mgmt-msi==1.0.0
# via -r requirements.in
azure-mgmt-network==10.2.0
# via -r requirements.in
azure-mgmt-resource==13.0.0
# via -r requirements.in
azure-storage-blob==12.8.1
# via
# -r requirements.in
# smart-open
bcrypt==3.2.0
# via paramiko
black==20.8b1
# via -r requirements.in
blessings==1.7
# via gpustat
cachetools==4.2.2
# via google-auth
certifi==2021.5.30
# via
# msrest
# requests
cffi==1.14.6
# via
# bcrypt
# cryptography
# pynacl
chardet==4.0.0
# via
# aiohttp
# requests
click==8.0.1
# via
# -r requirements.in
# black
# ray
cloudpickle==1.6.0
# via hyperopt
colorama==0.4.4
# via
# azure-cli-core
# knack
# ray
colorful==0.5.4
# via ray
cramjam==2.3.2
# via fastparquet
cryptography==3.3.2
# via
# adal
# azure-cli-core
# azure-identity
# azure-storage-blob
# msal
# paramiko
# pyjwt
# pyopenssl
cycler==0.10.0
# via matplotlib
dateparser==1.0.0
# via -r requirements.in
fastparquet==0.7.1
# via -r requirements.in
filelock==3.0.12
# via ray
fsspec==2021.7.0
# via fastparquet
future==0.18.2
# via hyperopt
geojson==2.5.0
# via pyowm
google-api-core==1.31.1
# via opencensus
google-auth==1.34.0
# via google-api-core
googleapis-common-protos==1.53.0
# via google-api-core
gpustat==0.6.0
# via ray
grpcio==1.39.0
# via ray
hiredis==2.0.0
# via aioredis
humanfriendly==9.2
# via azure-cli-core
hyperopt==0.2.5
# via -r requirements.in
idna==2.10
# via
# requests
# yarl
iniconfig==1.1.1
# via pytest
isodate==0.6.0
# via msrest
isort==5.9.3
# via pylint
jmespath==0.10.0
# via
# azure-cli-core
# knack
joblib==1.0.1
# via
# -r requirements.in
# scikit-learn
jsonschema==3.2.0
# via ray
kiwisolver==1.3.1
# via matplotlib
knack==0.8.2
# via azure-cli-core
lazy-object-proxy==1.6.0
# via astroid
matplotlib==3.4.2
# via -r requirements.in
mccabe==0.6.1
# via pylint
mpmath==1.2.1
# via sympy
msal==1.13.0
# via
# azure-cli-core
# azure-identity
# msal-extensions
msal-extensions==0.3.0
# via azure-identity
msgpack==1.0.2
# via ray
msrest==0.6.21
# via
# azure-cli-core
# azure-mgmt-compute
# azure-mgmt-msi
# azure-mgmt-network
# azure-mgmt-resource
# azure-storage-blob
# msrestazure
msrestazure==0.6.4
# via
# azure-cli-core
# azure-mgmt-compute
# azure-mgmt-msi
# azure-mgmt-network
# azure-mgmt-resource
multidict==5.1.0
# via
# aiohttp
# yarl
mypy==0.812
# via -r requirements.in
mypy-extensions==0.4.3
# via
# -r requirements.in
# black
# mypy
networkx==2.6.2
# via hyperopt
numpy==1.21.1
# via
# -r requirements.in
# fastparquet
# hyperopt
# matplotlib
# pandas
# paramspace
# pyarrow
# ray
# scikit-learn
# scipy
# tensorboardx
# torch
# xarray
nvidia-ml-py3==7.352.0
# via gpustat
oauthlib==3.1.1
# via requests-oauthlib
opencensus==0.7.13
# via ray
opencensus-context==0.1.2
# via opencensus
packaging==20.9
# via
# google-api-core
# pytest
pandas==1.3.1
# via
# -r requirements.in
# fastparquet
# ray
# xarray
paramiko==2.7.2
# via azure-cli-core
paramspace==2.5.8
# via -r requirements.in
pathspec==0.9.0
# via black
pillow==8.3.1
# via matplotlib
pkginfo==1.7.1
# via azure-cli-core
pluggy==0.13.1
# via pytest
portalocker==1.7.1
# via
# azure-cli-telemetry
# msal-extensions
prometheus-client==0.11.0
# via ray
protobuf==3.17.3
# via
# google-api-core
# googleapis-common-protos
# ray
# tensorboardx
psutil==5.8.0
# via
# azure-cli-core
# gpustat
py==1.10.0
# via pytest
py-spy==0.3.7
# via ray
pyarrow==5.0.0
# via -r requirements.in
pyasn1==0.4.8
# via
# pyasn1-modules
# rsa
pyasn1-modules==0.2.8
# via google-auth
pycodestyle==2.6.0
# via -r requirements.in
pycparser==2.20
# via cffi
pydantic==1.8.2
# via ray
pygments==2.9.0
# via knack
pyjwt[crypto]==1.7.1
# via
# adal
# azure-cli-core
# msal
pylint==2.6.0
# via -r requirements.in
pynacl==1.4.0
# via paramiko
pyodbc==4.0.31
# via -r requirements.in
pyopenssl==20.0.1
# via azure-cli-core
pyowm==3.2.0
# via -r requirements.in
pyparsing==2.4.7
# via
# matplotlib
# packaging
pyrsistent==0.18.0
# via jsonschema
pysocks==1.7.1
# via
# pyowm
# requests
pytest==6.2.4
# via -r requirements.in
python-dateutil==2.8.2
# via
# adal
# dateparser
# matplotlib
# pandas
pytz==2019.1
# via
# dateparser
# google-api-core
# pandas
# tzlocal
pyyaml==5.4.1
# via
# -r requirements.in
# knack
# ray
ray[default,tune]==1.5.0
# via -r requirements.in
redis==3.5.3
# via ray
regex==2021.7.6
# via
# black
# dateparser
requests[socks]==2.25.1
# via
# adal
# azure-cli-core
# azure-core
# google-api-core
# msal
# msrest
# pyowm
# ray
# requests-oauthlib
requests-oauthlib==1.3.0
# via msrest
rsa==4.7.2
# via google-auth
ruamel.yaml==0.17.10
# via paramspace
ruamel.yaml.clib==0.2.6
# via ruamel.yaml
scikit-learn==0.24.2
# via -r requirements.in
scipy==1.7.0
# via
# -r requirements.in
# hyperopt
# scikit-learn
shortuuid==1.0.1
# via -r requirements.in
six==1.16.0
# via
# azure-cli-core
# azure-core
# azure-identity
# bcrypt
# blessings
# cryptography
# cycler
# google-api-core
# google-auth
# gpustat
# grpcio
# hyperopt
# isodate
# jsonschema
# msrestazure
# protobuf
# pynacl
# pyopenssl
# python-dateutil
# thrift
smart-open[azure]==5.1.0
# via -r requirements.in
sqlparams==3.0.0
# via -r requirements.in
sympy==1.8
# via -r requirements.in
tabulate==0.8.9
# via
# knack
# ray
tensorboardx==2.4
# via ray
threadpoolctl==2.2.0
# via scikit-learn
thrift==0.13.0
# via fastparquet
toml==0.10.2
# via
# black
# pylint
# pytest
torch==1.8.1
# via
# -r requirements.in
# torchdata
torchdata==0.2.0
# via -r requirements.in
tqdm==4.61.2
# via
# -r requirements.in
# hyperopt
typed-ast==1.4.3
# via
# black
# mypy
typing-extensions==3.10.0.0
# via
# aiohttp
# black
# mypy
# pydantic
# torch
tzlocal==2.1
# via dateparser
urllib3==1.26.6
# via requests
wrapt==1.12.1
# via astroid
xarray==0.19.0
# via paramspace
yarl==1.6.3
# via aiohttp
# The following packages are considered to be unsafe in a requirements file:
# setuptools
Can it be related to ssh keys? Is the syncing done using ssh public/private key pairs?
I found that I'm not generating such a pair; I'm just copying my own public key to the cluster through file mounts, but the keys I'm specifying in the auth section do not exist.
Where in the yaml file should I generate the pair?
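To confirm whether the key files named in the auth section really are missing on a given machine, something like the following could be run there. It is a sketch that assumes the auth section uses the ssh_private_key and ssh_public_key fields of the Ray cluster yaml and has already been loaded into a dict; `missing_auth_keys` is a hypothetical helper name.

```python
import os


def missing_auth_keys(auth):
    """Return the auth entries whose key file does not exist on this machine."""
    missing = {}
    for field in ("ssh_private_key", "ssh_public_key"):
        path = auth.get(field)
        if path is None:
            continue  # field not set in the config
        # Expand "~" so paths like ~/.ssh/id_rsa are checked correctly.
        if not os.path.isfile(os.path.expanduser(path)):
            missing[field] = path
    return missing


if __name__ == "__main__":
    # Example paths only; replace with the values from your own yaml file.
    auth = {
        "ssh_user": "ubuntu",
        "ssh_private_key": "~/.ssh/id_rsa",
        "ssh_public_key": "~/.ssh/id_rsa.pub",
    }
    print(missing_auth_keys(auth))
```

This only reports which paths are absent; it does not say whether ssh keys are the cause of the sync failure.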
Ah. Got it. Right now bootstrapping is not set up to use managed identity. If you run az login on the head node prior to kicking off tune, does it still fail to get credentials?
It does seem weird that bootstrapping is even being called in this context. I know there's an option to rsync without bootstrapping, but I don't know if that can be configured within tune.
This error looks more like an inability to retrieve Azure credentials from the cli. You would have a managed identity on the head node, but this part of the code normally runs to set up a cluster, so the expectation is that you're running it from your own machine, where you have logged in through the Azure cli.
I'm going to install the azure-cli package in the docker image and run az login -i in a setup_command, then try again.
I installed azure-cli in the docker image and modified setup_commands:
- az login -i
I got this error for each trial:
2021-08-12 07:22:01,183 ERROR syncer.py:190 -- Sync execution failed.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/adal_authentication.py", line 167, in set_token
super(MSIAuthenticationWrapper, self).set_token()
File "/home/ray/anaconda3/lib/python3.8/site-packages/msrestazure/azure_active_directory.py", line 600, in set_token
token_entry = self._vm_msi.get_token(self.resource)
File "/home/ray/anaconda3/lib/python3.8/site-packages/msrestazure/azure_active_directory.py", line 649, in get_token
token_entry = self._retrieve_token_from_imds_with_retry(resource)
File "/home/ray/anaconda3/lib/python3.8/site-packages/msrestazure/azure_active_directory.py", line 704, in _retrieve_token_from_imds_with_retry
raise HTTPError(request=result.request, response=result.raw)
requests.exceptions.HTTPError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 186, in sync_down
result = self.sync_client.sync_down(self._remote_path,
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/integration/docker.py", line 102, in sync_down
rsync(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/sdk.py", line 140, in rsync
return commands.rsync(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 1070, in rsync
config = _bootstrap_config(config, no_config_cache=no_config_cache)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 315, in _bootstrap_config
resolved_config = provider_cls.bootstrap_config(config)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 309, in bootstrap_config
return bootstrap_azure(cluster_config)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 22, in bootstrap_azure
config = _configure_resource_group(config)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 39, in _configure_resource_group
_, cli_subscription_id = get_azure_cli_credentials(
File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/common/credentials.py", line 98, in get_azure_cli_credentials
cred, subscription_id, tenant_id = profile.get_login_credentials(resource=resource)
File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/_profile.py", line 605, in get_login_credentials
self._msi_creds = MsiAccountTypes.msi_auth_factory(identity_type, identity_id, resource)
File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/_profile.py", line 839, in msi_auth_factory
return MSIAuthenticationWrapper(resource=resource)
File "/home/ray/anaconda3/lib/python3.8/site-packages/msrestazure/azure_active_directory.py", line 592, in __init__
self.set_token()
File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/adal_authentication.py", line 177, in set_token
raise AzureResponseError('Failed to connect to MSI. Please make sure MSI is configured correctly.\n'
azure.cli.core.azclierror.AzureResponseError: Failed to connect to MSI. Please make sure MSI is configured correctly.
Get Token request returned http error: 400, reason: Bad Request
2021-08-12 07:22:01,184 INFO logger.py:697 -- Removed the following hyperparameter values when logging to tensorboard: {'hyperparameters/model_layers_intercept': (64, 128, 32), 'hyperparameters/model_layers_slope': (128, 256, 64)}
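The "Get Token request returned http error: 400" line comes from the token request made to the Azure instance metadata service (IMDS) on the node. A way to see the raw response, independent of Ray and azure-cli, is to query that endpoint directly from the head node. The sketch below uses the documented IMDS token URL and the required Metadata header; the helper names and the optional client_id argument (for selecting a user-assigned identity) are illustrative choices, not something taken from the thread.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"


def build_token_request(resource="https://management.azure.com/", client_id=None):
    """Build the IMDS token request; client_id selects a user-assigned identity."""
    params = {"api-version": "2018-02-01", "resource": resource}
    if client_id:
        params["client_id"] = client_id
    url = IMDS_TOKEN_URL + "?" + urllib.parse.urlencode(params)
    # IMDS rejects requests that lack the "Metadata: true" header.
    return urllib.request.Request(url, headers={"Metadata": "true"})


def probe_managed_identity(client_id=None, timeout=5):
    """Return (status, detail): (200, expires_on) on success, else the error body."""
    request = build_token_request(client_id=client_id)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = json.load(response)
            return response.status, body.get("expires_on")
    except urllib.error.HTTPError as err:
        # The error body usually states why the token request was refused.
        return err.code, err.read().decode("utf-8", errors="replace")
```

Calling probe_managed_identity() on the head node should show whether the 400 is reproducible outside of Tune. Off Azure the endpoint is unreachable, so the call raises urllib.error.URLError instead of returning a status.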
I think we've seen syncing errors of this type before and that these might be possible to correct with a slight tweak to the Tune code.
@richardliaw I think it could be possible to avoid the error in the last comment by using ray rsync's should_bootstrap=False flag in Tune's syncer code.
The alternative would be to grant the head node the permissions it needs to complete AzureNodeProvider.bootstrap_config.
I prefer to give the head node the permission. Could you be more explicit? What azure cli command should I run?
I think @gramhagen and @eisber could help with this one. The last error came from AzureNodeProvider.bootstrap_config executing on the head node.
Zooming in a bit on the relevant part of the stacktrace:
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 309, in bootstrap_config
return bootstrap_azure(cluster_config)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 22, in bootstrap_azure
config = _configure_resource_group(config)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 39, in _configure_resource_group
_, cli_subscription_id = get_azure_cli_credentials(
File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/common/credentials.py", line 98, in get_azure_cli_credentials
cred, subscription_id, tenant_id = profile.get_login_credentials(resource=resource)
File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/_profile.py", line 605, in get_login_credentials
self._msi_creds = MsiAccountTypes.msi_auth_factory(identity_type, identity_id, resource)
File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/_profile.py", line 839, in msi_auth_factory
return MSIAuthenticationWrapper(resource=resource)
File "/home/ray/anaconda3/lib/python3.8/site-packages/msrestazure/azure_active_directory.py", line 592, in __init__
self.set_token()
File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/adal_authentication.py", line 177, in set_token
raise AzureResponseError('Failed to connect to MSI. Please make sure MSI is configured correctly.\n'
azure.cli.core.azclierror.AzureResponseError: Failed to connect to MSI. Please make sure MSI is configured correctly.
Some methods in config.py are not compatible with using a managed identity, even if it's used to log in with the cli (az login -i). We can make changes to allow that, but you would also need that managed identity to have specific role assignments which allow resources to be created. This introduces some security concerns, and I think in this context it is better to avoid running bootstrap when it's not needed.
Do you need to run tune on the head node itself? Can you initialize ray on your local machine and point it to the head node to run tune?
We should definitely support the case where we run Tune on the head node.
I think avoiding bootstrap would be a good option here.
Not running tune on the head node: does that mean connecting from another machine with ray.init(address=<PUBLIC_IP>:<PORT>, _redis_password=<PASSWORD>)?
I tried that, but it wasn't able to connect. I suppose I have to expose the port?
According to the output after setting up a cluster, you should be able to do something like:
ray.init(address="ray://<public_ip_of_head>:10001")
I haven't been able to test this out though.
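As a slightly fuller sketch of that suggestion, the address can be built from the head node's IP and passed to ray.init. The port 10001 is the one quoted in the comment above; whether it is reachable from your machine depends on the cluster's network rules, which this thread does not settle. The helper names are hypothetical.

```python
def ray_client_address(head_ip, port=10001):
    """Build a Ray Client address such as ray://1.2.3.4:10001."""
    return f"ray://{head_ip}:{port}"


def connect(head_ip, port=10001):
    """Connect this process to the remote cluster through Ray Client."""
    import ray  # imported lazily so the helper above works without ray installed

    return ray.init(address=ray_client_address(head_ip, port))
```

With this approach the tuning script would run on the local machine after connect("<public_ip_of_head>"), rather than being launched from the head node.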
Hi, I had a look at the PR and reproduced it in my current installation, but I can't see how I can use it from the tune API. Is there something else supposed to be implemented, or am I missing something?
Hi, I had a look at the PR; it adds the argument should_bootstrap: bool = True to class DockerSyncClient.
I wanted to understand: as a tune user, is that enough for me to avoid the bootstrapping error? I can't see where to take advantage of this argument. Is there something else that will be modified in the API to specify should_bootstrap = False? Would that be at ray.init?
Cc @richardliaw -- is the flag accessible through Tune's public APIs?
You would be able to toggle this via the TUNE_SYNC_DISABLE_BOOTSTRAP environment variable.
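A minimal sketch of setting that variable from the tuning script itself, before Tune creates its sync client. The thread names only the variable, not the value it expects, so the "1" used here is an assumption; `disable_sync_bootstrap` is a hypothetical helper.

```python
import os


def disable_sync_bootstrap(value="1"):
    """Set the flag in this process; "1" is an assumed value, not confirmed above."""
    os.environ["TUNE_SYNC_DISABLE_BOOTSTRAP"] = value
    return os.environ["TUNE_SYNC_DISABLE_BOOTSTRAP"]


# Call this at the top of the script, before importing ray.tune and calling tune.run(...).
disable_sync_bootstrap()
```

Exporting the variable in the shell that launches the script on the head node should have the same effect, since the script inherits the shell's environment.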
What is the problem?
ray 1.5.0
When running tune on a ray cluster in Azure, I get
ERROR syncer.py:190 -- Sync execution failed
after every trial.
Reproduction (REQUIRED)
I created a ray cluster in Azure and launched a tune experiment on it by attaching the head and launching the tuning script from there.
After every trial I get the following error:
This is my cluster yaml file:
and my cpu docker image is defined with (the gpu docker image is defined the same way, just replacing ray-ml:1.5.0-py38-cpu with ray-ml:1.5.0-py38-gpu):