ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[tune] rsync fails on azure cluster #17756

Closed sofglide closed 3 years ago

sofglide commented 3 years ago

What is the problem?

ray 1.5.0

When running tune on ray cluster in Azure, I get ERROR syncer.py:190 -- Sync execution failed after every trial.

Reproduction (REQUIRED)

I created a ray cluster in Azure and launched a tune experiment on it by attaching the head and launching the tuning script from there.

After every trial I get the following error:

2021-08-11 13:08:27,356 INFO commands.py:298 -- Checking Azure environment settings                                                                                                    
2021-08-11 13:08:27,364 ERROR syncer.py:190 -- Sync execution failed.                                                                                                                  
Traceback (most recent call last):                                                                                                                                                     
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 186, in sync_down                                                                                    
    result = self.sync_client.sync_down(self._remote_path,                                                                                                                             
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/integration/docker.py", line 102, in sync_down                                                                        
    rsync(                                                                                                                                                                             
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/sdk.py", line 140, in rsync                                                                                     
    return commands.rsync(                                                                                                                                                             
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 1070, in rsync                                                                      
    config = _bootstrap_config(config, no_config_cache=no_config_cache)                                                                                                                
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 315, in _bootstrap_config                                                           
    resolved_config = provider_cls.bootstrap_config(config)                                                                                                                            
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 309, in bootstrap_config                                                
    return bootstrap_azure(cluster_config)                                                                                                                                             
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 22, in bootstrap_azure                                                         
    config = _configure_resource_group(config)                                                                                                                                         
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 37, in _configure_resource_group                                               
    resource_client = _get_client(ResourceManagementClient, config)                                                                                                                    
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 31, in _get_client                                                             
    return get_client_from_cli_profile(client_class=client_class, **kwargs)                                                                                                            
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/common/client_factory.py", line 83, in get_client_from_cli_profile                                                       
    credentials, subscription_id, tenant_id = get_azure_cli_credentials(                                                                                                               
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/common/credentials.py", line 98, in get_azure_cli_credentials                                                            
    cred, subscription_id, tenant_id = profile.get_login_credentials(resource=resource)                                                                                                
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/_profile.py", line 546, in get_login_credentials                                                                
    account = self.get_subscription(subscription_id)                                                                                                                                   
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/_profile.py", line 505, in get_subscription                                                                     
    raise CLIError(_AZ_LOGIN_MESSAGE)                                                                                                                                                  
knack.util.CLIError: Please run 'az login' to setup account.                                                                                                                           
2021-08-11 13:08:27,365 INFO logger.py:697 -- Removed the following hyperparameter values when logging to tensorboard: {'hyperparameters/model_layers_intercept': (64, 128, 32), 'hyperparameters/model_layers_slope': (128, 256, 64)}

This is my cluster yaml file:

cluster_name: my-private-cluster

max_workers: 10
target_utilization_fraction: 0.8

idle_timeout_minutes: 30

docker:
    head_image: "myregistry.azurecr.io/custom-ay-ml-cpu:latest"
    worker_image: "myregistry.azurecr.io/custom-ray-ml-gpu:latest"
    container_name: "ray_py38_1.5.0_gpu"
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

provider:
    type: azure
    # https://azure.microsoft.com/en-us/global-infrastructure/locations
    location: westeurope
    resource_group: my-group-cluster
    # set subscription id otherwise the default from az cli will be used
    subscription_id: xxxxxx-xxxxxx-xxxxxx-xxxx-xxxx  # (masked subscription id)

auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/id_rsa
    # changes to this should match what is specified in file_mounts
    ssh_public_key: ~/.ssh/id_rsa.pub

available_node_types:
    node_cpu_2:
        min_workers: 0
        max_workers: 3
        resources: {"CPU": 2}
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2s_v3
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: 21.07.12
    node_gpu_1_cpu_4:
        min_workers: 1
        max_workers: 2
        resources: { "CPU": 4, "GPU": 1 }
        node_config:
            azure_arm_parameters:
                vmSize: Standard_NC4as_T4_v3
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: 21.07.12

head_node_type: node_cpu_2

file_mounts:
    ~/.ssh/id_rsa.pub: "~/.ssh/id_rsa.pub"
    ~/my-project: "."

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"
    - ".venv"
    - ".venv/**"

rsync_filter:
    - ".gitignore"

initialization_commands:
    # enable docker setup
    - sudo usermod -aG docker $USER || true
    - sleep 10  # delay to avoid docker permission denied errors
    - az login -i && az acr login --name myregistry
    # get rid of annoying Ubuntu message
    - touch ~/.sudo_as_admin_successful

setup_commands: []

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}

My CPU Docker image is defined as follows (the GPU image is defined the same way, replacing ray-ml:1.5.0-py38-cpu with ray-ml:1.5.0-py38-gpu):

FROM rayproject/ray-ml:1.5.0-py38-cpu

COPY requirements.txt ./

RUN sudo apt-get update && sudo apt-get install -y curl gnupg lsb-core && \
    curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add - && \
    echo "deb [arch=amd64] https://packages.microsoft.com/ubuntu/$(lsb_release -sr)/prod $(lsb_release -sc) main" | \
    sudo tee /etc/apt/sources.list.d/mssql-release.list && \
    sudo apt-get update && \
    sudo ACCEPT_EULA=Y apt-get install -y msodbcsql17 mssql-tools unixodbc-dev build-essential unixodbc

RUN pip install --upgrade -r requirements.txt

amogkam commented 3 years ago

cc @DmitriGekhtman @AmeerHajAli autoscaler.sdk.rsync is failing on Azure cluster due to authentication reasons. Any ideas here?

AmeerHajAli commented 3 years ago

@gramhagen , can you please help?

gramhagen commented 3 years ago

these setup commands don't seem right, but it's probably not the problem.

setup_commands: []
azure-mgmt-resource==13.0.0
manylinux2014_x86_64.whl"

I think you will need to install additional packages in your docker image or in head_setup_commands. Can you try this?

setup_commands: []

head_setup_commands:
    - pip install -U azure-cli-core==2.22.0 azure-mgmt-compute==14.0.0 azure-mgmt-msi==1.0.0 azure-mgmt-network==10.2.0 azure-mgmt-resource==13.0.0

sofglide commented 3 years ago

Hi, actually the setup commands look like this: setup_commands: []. The 2 lines that followed were introduced by mistake when copying the file here (the message is updated now).

When I attach to the head and run pip I find the packages you're suggesting. This is because these packages are installed in the docker image I'm using.

(base) ray@ray-ai-hp-tune-head-6fcbadeb0:~$ pip freeze | grep azure
azure-cli-core==2.22.0
azure-cli-telemetry==1.0.6
azure-common==1.1.27
azure-core==1.17.0
azure-identity==1.6.0
azure-mgmt-compute==14.0.0
azure-mgmt-core==1.3.0
azure-mgmt-msi==1.0.0
azure-mgmt-network==10.2.0
azure-mgmt-resource==13.0.0
azure-storage-blob==12.8.1
msrestazure==0.6.4

The Dockerfile contains RUN pip install --upgrade -r requirements.txt, where requirements.txt lists those packages:

# This file is autogenerated by pip-compile with python 3.8
# To update, run:
#
#    pip-compile
#
adal==1.2.7
    # via
    #   azure-cli-core
    #   msrestazure
aiohttp==3.7.4.post0
    # via
    #   aiohttp-cors
    #   ray
aiohttp-cors==0.7.0
    # via ray
aioredis==1.3.1
    # via
    #   -r requirements.in
    #   ray
appdirs==1.4.4
    # via black
applicationinsights==0.11.10
    # via azure-cli-telemetry
argcomplete==1.12.3
    # via
    #   azure-cli-core
    #   knack
astroid==2.5
    # via pylint
async-timeout==3.0.1
    # via
    #   aiohttp
    #   aioredis
attrs==21.2.0
    # via
    #   aiohttp
    #   jsonschema
    #   pytest
azure-cli-core==2.22.0
    # via -r requirements.in
azure-cli-telemetry==1.0.6
    # via azure-cli-core
azure-common==1.1.27
    # via
    #   azure-cli-core
    #   azure-mgmt-compute
    #   azure-mgmt-msi
    #   azure-mgmt-network
    #   azure-mgmt-resource
    #   smart-open
azure-core==1.17.0
    # via
    #   azure-identity
    #   azure-mgmt-core
    #   azure-storage-blob
    #   smart-open
azure-identity==1.6.0
    # via -r requirements.in
azure-mgmt-compute==14.0.0
    # via -r requirements.in
azure-mgmt-core==1.3.0
    # via azure-cli-core
azure-mgmt-msi==1.0.0
    # via -r requirements.in
azure-mgmt-network==10.2.0
    # via -r requirements.in
azure-mgmt-resource==13.0.0
    # via -r requirements.in
azure-storage-blob==12.8.1
    # via
    #   -r requirements.in
    #   smart-open
bcrypt==3.2.0
    # via paramiko
black==20.8b1
    # via -r requirements.in
blessings==1.7
    # via gpustat
cachetools==4.2.2
    # via google-auth
certifi==2021.5.30
    # via
    #   msrest
    #   requests
cffi==1.14.6
    # via
    #   bcrypt
    #   cryptography
    #   pynacl
chardet==4.0.0
    # via
    #   aiohttp
    #   requests
click==8.0.1
    # via
    #   -r requirements.in
    #   black
    #   ray
cloudpickle==1.6.0
    # via hyperopt
colorama==0.4.4
    # via
    #   azure-cli-core
    #   knack
    #   ray
colorful==0.5.4
    # via ray
cramjam==2.3.2
    # via fastparquet
cryptography==3.3.2
    # via
    #   adal
    #   azure-cli-core
    #   azure-identity
    #   azure-storage-blob
    #   msal
    #   paramiko
    #   pyjwt
    #   pyopenssl
cycler==0.10.0
    # via matplotlib
dateparser==1.0.0
    # via -r requirements.in
fastparquet==0.7.1
    # via -r requirements.in
filelock==3.0.12
    # via ray
fsspec==2021.7.0
    # via fastparquet
future==0.18.2
    # via hyperopt
geojson==2.5.0
    # via pyowm
google-api-core==1.31.1
    # via opencensus
google-auth==1.34.0
    # via google-api-core
googleapis-common-protos==1.53.0
    # via google-api-core
gpustat==0.6.0
    # via ray
grpcio==1.39.0
    # via ray
hiredis==2.0.0
    # via aioredis
humanfriendly==9.2
    # via azure-cli-core
hyperopt==0.2.5
    # via -r requirements.in
idna==2.10
    # via
    #   requests
    #   yarl
iniconfig==1.1.1
    # via pytest
isodate==0.6.0
    # via msrest
isort==5.9.3
    # via pylint
jmespath==0.10.0
    # via
    #   azure-cli-core
    #   knack
joblib==1.0.1
    # via
    #   -r requirements.in
    #   scikit-learn
jsonschema==3.2.0
    # via ray
kiwisolver==1.3.1
    # via matplotlib
knack==0.8.2
    # via azure-cli-core
lazy-object-proxy==1.6.0
    # via astroid
matplotlib==3.4.2
    # via -r requirements.in
mccabe==0.6.1
    # via pylint
mpmath==1.2.1
    # via sympy
msal==1.13.0
    # via
    #   azure-cli-core
    #   azure-identity
    #   msal-extensions
msal-extensions==0.3.0
    # via azure-identity
msgpack==1.0.2
    # via ray
msrest==0.6.21
    # via
    #   azure-cli-core
    #   azure-mgmt-compute
    #   azure-mgmt-msi
    #   azure-mgmt-network
    #   azure-mgmt-resource
    #   azure-storage-blob
    #   msrestazure
msrestazure==0.6.4
    # via
    #   azure-cli-core
    #   azure-mgmt-compute
    #   azure-mgmt-msi
    #   azure-mgmt-network
    #   azure-mgmt-resource
multidict==5.1.0
    # via
    #   aiohttp
    #   yarl
mypy==0.812
    # via -r requirements.in
mypy-extensions==0.4.3
    # via
    #   -r requirements.in
    #   black
    #   mypy
networkx==2.6.2
    # via hyperopt
numpy==1.21.1
    # via
    #   -r requirements.in
    #   fastparquet
    #   hyperopt
    #   matplotlib
    #   pandas
    #   paramspace
    #   pyarrow
    #   ray
    #   scikit-learn
    #   scipy
    #   tensorboardx
    #   torch
    #   xarray
nvidia-ml-py3==7.352.0
    # via gpustat
oauthlib==3.1.1
    # via requests-oauthlib
opencensus==0.7.13
    # via ray
opencensus-context==0.1.2
    # via opencensus
packaging==20.9
    # via
    #   google-api-core
    #   pytest
pandas==1.3.1
    # via
    #   -r requirements.in
    #   fastparquet
    #   ray
    #   xarray
paramiko==2.7.2
    # via azure-cli-core
paramspace==2.5.8
    # via -r requirements.in
pathspec==0.9.0
    # via black
pillow==8.3.1
    # via matplotlib
pkginfo==1.7.1
    # via azure-cli-core
pluggy==0.13.1
    # via pytest
portalocker==1.7.1
    # via
    #   azure-cli-telemetry
    #   msal-extensions
prometheus-client==0.11.0
    # via ray
protobuf==3.17.3
    # via
    #   google-api-core
    #   googleapis-common-protos
    #   ray
    #   tensorboardx
psutil==5.8.0
    # via
    #   azure-cli-core
    #   gpustat
py==1.10.0
    # via pytest
py-spy==0.3.7
    # via ray
pyarrow==5.0.0
    # via -r requirements.in
pyasn1==0.4.8
    # via
    #   pyasn1-modules
    #   rsa
pyasn1-modules==0.2.8
    # via google-auth
pycodestyle==2.6.0
    # via -r requirements.in
pycparser==2.20
    # via cffi
pydantic==1.8.2
    # via ray
pygments==2.9.0
    # via knack
pyjwt[crypto]==1.7.1
    # via
    #   adal
    #   azure-cli-core
    #   msal
pylint==2.6.0
    # via -r requirements.in
pynacl==1.4.0
    # via paramiko
pyodbc==4.0.31
    # via -r requirements.in
pyopenssl==20.0.1
    # via azure-cli-core
pyowm==3.2.0
    # via -r requirements.in
pyparsing==2.4.7
    # via
    #   matplotlib
    #   packaging
pyrsistent==0.18.0
    # via jsonschema
pysocks==1.7.1
    # via
    #   pyowm
    #   requests
pytest==6.2.4
    # via -r requirements.in
python-dateutil==2.8.2
    # via
    #   adal
    #   dateparser
    #   matplotlib
    #   pandas
pytz==2019.1
    # via
    #   dateparser
    #   google-api-core
    #   pandas
    #   tzlocal
pyyaml==5.4.1
    # via
    #   -r requirements.in
    #   knack
    #   ray
ray[default,tune]==1.5.0
    # via -r requirements.in
redis==3.5.3
    # via ray
regex==2021.7.6
    # via
    #   black
    #   dateparser
requests[socks]==2.25.1
    # via
    #   adal
    #   azure-cli-core
    #   azure-core
    #   google-api-core
    #   msal
    #   msrest
    #   pyowm
    #   ray
    #   requests-oauthlib
requests-oauthlib==1.3.0
    # via msrest
rsa==4.7.2
    # via google-auth
ruamel.yaml==0.17.10
    # via paramspace
ruamel.yaml.clib==0.2.6
    # via ruamel.yaml
scikit-learn==0.24.2
    # via -r requirements.in
scipy==1.7.0
    # via
    #   -r requirements.in
    #   hyperopt
    #   scikit-learn
shortuuid==1.0.1
    # via -r requirements.in
six==1.16.0
    # via
    #   azure-cli-core
    #   azure-core
    #   azure-identity
    #   bcrypt
    #   blessings
    #   cryptography
    #   cycler
    #   google-api-core
    #   google-auth
    #   gpustat
    #   grpcio
    #   hyperopt
    #   isodate
    #   jsonschema
    #   msrestazure
    #   protobuf
    #   pynacl
    #   pyopenssl
    #   python-dateutil
    #   thrift
smart-open[azure]==5.1.0
    # via -r requirements.in
sqlparams==3.0.0
    # via -r requirements.in
sympy==1.8
    # via -r requirements.in
tabulate==0.8.9
    # via
    #   knack
    #   ray
tensorboardx==2.4
    # via ray
threadpoolctl==2.2.0
    # via scikit-learn
thrift==0.13.0
    # via fastparquet
toml==0.10.2
    # via
    #   black
    #   pylint
    #   pytest
torch==1.8.1
    # via
    #   -r requirements.in
    #   torchdata
torchdata==0.2.0
    # via -r requirements.in
tqdm==4.61.2
    # via
    #   -r requirements.in
    #   hyperopt
typed-ast==1.4.3
    # via
    #   black
    #   mypy
typing-extensions==3.10.0.0
    # via
    #   aiohttp
    #   black
    #   mypy
    #   pydantic
    #   torch
tzlocal==2.1
    # via dateparser
urllib3==1.26.6
    # via requests
wrapt==1.12.1
    # via astroid
xarray==0.19.0
    # via paramspace
yarl==1.6.3
    # via aiohttp

# The following packages are considered to be unsafe in a requirements file:
# setuptools

sofglide commented 3 years ago

Could it be related to SSH keys? Is the syncing done using SSH public/private key pairs? I found that I'm not generating such a pair; I'm just copying my own public key to the cluster through file mounts, but the keys I'm specifying in the auth section do not exist.

Where in the yaml file should I generate the pair?

gramhagen commented 3 years ago

Ah. Got it. Right now bootstrapping is not set up to use managed identity. If you run az login on the head node prior to kicking off tune does it still fail to get credentials? It does seem weird that bootstrapping is even being called in this context. I know there's an option to rsync without bootstrapping, but I don't know if that can be configured within tune.

gramhagen commented 3 years ago

Could it be related to SSH keys? Is the syncing done using SSH public/private key pairs? I found that I'm not generating such a pair; I'm just copying my own public key to the cluster through file mounts, but the keys I'm specifying in the auth section do not exist.

Where in the yaml file should I generate the pair?

This error looks more like an inability to retrieve Azure credentials from the CLI. You would have a managed identity on the head node, but normally this part of the code is run to set up a cluster, so the expectation is that you're doing it from your own machine, where you have logged in through the Azure CLI.
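
To check from inside the head-node container whether the Azure CLI can see a logged-in account, something like the sketch below may help. It is only a diagnostic aid: the helper name cli_has_credentials is made up for illustration, and it relies on az account show exiting with a non-zero status when no account is set up. A successful check does not guarantee that the autoscaler bootstrap code can obtain a token.

```python
import subprocess
from typing import Sequence


def cli_has_credentials(cmd: Sequence[str] = ("az", "account", "show")) -> bool:
    """Return True if `cmd` exits with status 0, False otherwise.

    The default command asks the Azure CLI for the active account.
    The helper name and default are illustrative, not part of Ray.
    """
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # The executable (for example the az CLI) is missing or not runnable.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("Azure CLI credentials available:", cli_has_credentials())
```

Running this in the same container and as the same user that launches the Tune script matters, because the CLI profile is stored per user.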

sofglide commented 3 years ago

I'm going to install the azure-cli package in the docker image and run az login -i in a setup_command, then try again.

sofglide commented 3 years ago

installed azure-cli in docker image and modified setup_commands

    - az login -i

got this error for each trial

2021-08-12 07:22:01,183 ERROR syncer.py:190 -- Sync execution failed.                                                                                                                  
Traceback (most recent call last):                                                                                                                                                     
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/adal_authentication.py", line 167, in set_token                                                                 
    super(MSIAuthenticationWrapper, self).set_token()                                                                                                                                  
  File "/home/ray/anaconda3/lib/python3.8/site-packages/msrestazure/azure_active_directory.py", line 600, in set_token                                                                 
    token_entry = self._vm_msi.get_token(self.resource)                                                                                                                                
  File "/home/ray/anaconda3/lib/python3.8/site-packages/msrestazure/azure_active_directory.py", line 649, in get_token                                                                 
    token_entry = self._retrieve_token_from_imds_with_retry(resource)                                                                                                                  
  File "/home/ray/anaconda3/lib/python3.8/site-packages/msrestazure/azure_active_directory.py", line 704, in _retrieve_token_from_imds_with_retry                                      
    raise HTTPError(request=result.request, response=result.raw)                                                                                                                       
requests.exceptions.HTTPError                                                                                                                                                          

During handling of the above exception, another exception occurred:                                                                                                                    

Traceback (most recent call last):                                                                                                                                                     
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 186, in sync_down                                                                                    
    result = self.sync_client.sync_down(self._remote_path,                                                                                                                             
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/integration/docker.py", line 102, in sync_down
    rsync(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/sdk.py", line 140, in rsync
    return commands.rsync(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 1070, in rsync
    config = _bootstrap_config(config, no_config_cache=no_config_cache)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 315, in _bootstrap_config
    resolved_config = provider_cls.bootstrap_config(config)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 309, in bootstrap_config
    return bootstrap_azure(cluster_config)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 22, in bootstrap_azure
    config = _configure_resource_group(config)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 39, in _configure_resource_group
    _, cli_subscription_id = get_azure_cli_credentials(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/common/credentials.py", line 98, in get_azure_cli_credentials
    cred, subscription_id, tenant_id = profile.get_login_credentials(resource=resource)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/_profile.py", line 605, in get_login_credentials
    self._msi_creds = MsiAccountTypes.msi_auth_factory(identity_type, identity_id, resource)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/_profile.py", line 839, in msi_auth_factory
    return MSIAuthenticationWrapper(resource=resource)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/msrestazure/azure_active_directory.py", line 592, in __init__
    self.set_token()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/adal_authentication.py", line 177, in set_token
    raise AzureResponseError('Failed to connect to MSI. Please make sure MSI is configured correctly.\n'
azure.cli.core.azclierror.AzureResponseError: Failed to connect to MSI. Please make sure MSI is configured correctly.
Get Token request returned http error: 400, reason: Bad Request
2021-08-12 07:22:01,184 INFO logger.py:697 -- Removed the following hyperparameter values when logging to tensorboard: {'hyperparameters/model_layers_intercept': (64, 128, 32), 'hyperparameters/model_layers_slope': (128, 256, 64)}

DmitriGekhtman commented 3 years ago

I think we've seen syncing errors of this type before and that these might be possible to correct with a slight tweak to the Tune code.

@richardliaw I think it could be possible to avoid the error in the last comment by using ray rsync's should_bootstrap=False flag in Tune's syncer code.

The alternative would be to grant the head node the permissions it needs to complete AzureNodeProvider.bootstrap_config
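
As a rough illustration of the first option, this is what calling the autoscaler SDK's rsync with bootstrapping turned off might look like outside of Tune. Treat it as a sketch under assumptions: the keyword names (source, target, down, ip_address, should_bootstrap) should be checked against the installed Ray version, the two helper functions are hypothetical, and the paths and IP in the usage line are placeholders.

```python
from typing import Any, Dict


def rsync_kwargs_without_bootstrap(
    source: str, target: str, ip_address: str, down: bool = True
) -> Dict[str, Any]:
    """Build keyword arguments for an rsync call that skips bootstrapping.

    Hypothetical helper: it only assembles the arguments, so it can be
    inspected without a running cluster.
    """
    return {
        "source": source,
        "target": target,
        "down": down,
        "ip_address": ip_address,
        # Skip provider bootstrap, which is the step that needs Azure credentials.
        "should_bootstrap": False,
    }


def sync_without_bootstrap(cluster_config: str, **kwargs: Any) -> None:
    """Call ray.autoscaler.sdk.rsync with the given arguments."""
    # Imported lazily so the helper above works without Ray installed.
    from ray.autoscaler import sdk

    sdk.rsync(cluster_config, **kwargs)


if __name__ == "__main__":
    # Placeholder paths and IP, shown for illustration only.
    print(rsync_kwargs_without_bootstrap("/remote/results/", "/local/results/", "10.0.0.5"))
```

On the head node the cluster config would typically be ~/ray_bootstrap_config.yaml, the file passed to ray start --autoscaling-config in the YAML above.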

sofglide commented 3 years ago

I think we've seen syncing errors of this type before and that these might be possible to correct with a slight tweak to the Tune code.

@richardliaw I think it could be possible to avoid the error in the last comment by using ray rsync's should_bootstrap=False flag in Tune's syncer code.

The alternative would be to grant the head node the permissions it needs to complete AzureNodeProvider.bootstrap_config

I prefer to give the head node the permission. Could you be more explicit? What azure cli command should I run?

DmitriGekhtman commented 3 years ago

I think @gramhagen and @eisber could help with this one. The last error came from AzureNodeProvider.bootstrap_config executing on the head node. Zooming in a bit on the relevant part of the stacktrace:

  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 309, in bootstrap_config
    return bootstrap_azure(cluster_config)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 22, in bootstrap_azure
    config = _configure_resource_group(config)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 39, in _configure_resource_group
    _, cli_subscription_id = get_azure_cli_credentials(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/common/credentials.py", line 98, in get_azure_cli_credentials
    cred, subscription_id, tenant_id = profile.get_login_credentials(resource=resource)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/_profile.py", line 605, in get_login_credentials
    self._msi_creds = MsiAccountTypes.msi_auth_factory(identity_type, identity_id, resource)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/_profile.py", line 839, in msi_auth_factory
    return MSIAuthenticationWrapper(resource=resource)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/msrestazure/azure_active_directory.py", line 592, in __init__
    self.set_token()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/adal_authentication.py", line 177, in set_token
    raise AzureResponseError('Failed to connect to MSI. Please make sure MSI is configured correctly.\n'
azure.cli.core.azclierror.AzureResponseError: Failed to connect to MSI. Please make sure MSI is configured correctly.

gramhagen commented 3 years ago

some methods in config.py are not compatible with using a managed identity, even if it's used to log in with the cli (az login -i). we can make changes to allow that, but you would also need that managed identity to have specific role assignments which allow for resources to be created. this introduces some security concerns and i think in this context it is better to avoid running bootstrap when it's not needed.

do you need to run tune on the head node itself? can you initialize ray on your local machine and point it to the headnode to run tune?

richardliaw commented 3 years ago

We should definitely support the case where we run Tune on the head node.

I think avoiding bootstrap would be a good option here.

sofglide commented 3 years ago

some methods in config.py are not compatible with using a managed identity, even if it's used to log in with the cli (az login -i). we can make changes to allow that, but you would also need that managed identity to have specific role assignments which allow for resources to be created. this introduces some security concerns and i think in this context it is better to avoid running bootstrap when it's not needed.

do you need to run tune on the head node itself? can you initialize ray on your local machine and point it to the headnode to run tune?

Not running tune on the head node, does that mean connecting from another machine with ray.init(address=<PUBLIC_IP>:<PORT>, _redis_password=<PASSWORD>) ?

I tried that, but it wasn't able to connect. I suppose I have to expose the port?

gramhagen commented 3 years ago

according to the output after setting up a cluster you should be able to do something like: ray.init(address="ray://<public_ip_of_head>:10001")

i haven't been able to test this out though.
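
A minimal sketch of that suggestion, untested against this cluster: the helper names are made up for illustration, and it assumes the Ray Client port (10001 in the address above) is reachable from the local machine and that the local Ray version matches the one on the cluster.

```python
def ray_client_address(head_ip: str, port: int = 10001) -> str:
    """Build a Ray Client address string for the given head node IP."""
    return f"ray://{head_ip}:{port}"


def connect_to_cluster(head_ip: str, port: int = 10001):
    """Connect the local process to the remote cluster through Ray Client."""
    # Imported lazily so the address helper works without Ray installed.
    import ray

    return ray.init(address=ray_client_address(head_ip, port))


if __name__ == "__main__":
    # 203.0.113.7 is a documentation-only placeholder IP.
    print(ray_client_address("203.0.113.7"))
```

After connect_to_cluster returns, the Tune script would run on the local machine while the trials execute on the cluster.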

sofglide commented 3 years ago

We should definitely support the case where we run Tune on the head node.

I think avoiding bootstrap would be a good option here.

Hi, I had a look at the PR and reproduced it in my current installation, but I can't see how to use it from the Tune API. Is something else supposed to be implemented, or am I missing something?

sofglide commented 3 years ago

We should definitely support the case where we run Tune on the head node.

I think avoiding bootstrap would be a good option here.

Hi, I had a look at the PR, it adds argument should_bootstrap: bool = True to class DockerSyncClient.

I wanted to understand: as a Tune user, is that enough for me to avoid the bootstrapping error? I can't see where to take advantage of this argument. Is there something else that will be modified in the API to specify should_bootstrap = False? Would that be at ray.init?

DmitriGekhtman commented 3 years ago

Cc @richardliaw -- is the flag accessible through Tune's public APIs?

amogkam commented 3 years ago

You would be able to toggle this via the TUNE_SYNC_DISABLE_BOOTSTRAP environment variable
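
For example, the variable could be set at the top of the tuning script before Tune is imported or started. This is a sketch under assumptions: the value "1" is assumed to be what enables the toggle, the helper name is made up, and the installed Ray version needs to include the change that reads this variable.

```python
import os


def disable_tune_sync_bootstrap(value: str = "1") -> None:
    """Set TUNE_SYNC_DISABLE_BOOTSTRAP for the current process.

    The value "1" is an assumption; check the Tune source of the installed
    Ray version for the exact value it expects.
    """
    os.environ["TUNE_SYNC_DISABLE_BOOTSTRAP"] = value


if __name__ == "__main__":
    disable_tune_sync_bootstrap()
    # Import ray / ray.tune and call tune.run(...) after this point so the
    # sync client is created with the variable already set.
    print(os.environ["TUNE_SYNC_DISABLE_BOOTSTRAP"])
```

Exporting the variable in the shell before launching the script on the head node should have the same effect.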