microsoft / ray-on-aml

Turning AML compute into Ray cluster

Unable to initialize cluster #13

Closed aforadi closed 1 year ago

aforadi commented 2 years ago

Hi @james-tn ,

Copying the issue from: with some modifications.

Thank you for this library. We are trying to use it with the example code in an interactive environment in Azure ML. The Jupyter notebook is a Python 3.8 Azure ML notebook. We are using the latest version of ray-on-aml, 0.2.1.

from azureml.core import Workspace, Run, Environment
from ray_on_aml.core import Ray_On_AML
ws = Workspace.from_config()
ray_on_aml = Ray_On_AML(ws=ws, compute_cluster='ray-test', additional_pip_packages=['lightgbm_ray', 'sklearn'], maxnode=4)
ray = ray_on_aml.getRay()

The image builds correctly on Azure ML. However, the cluster doesn't turn on. Below is what we see in the notebook:

Cancel active AML runs if any
Shutting down ray if any
Found existing cluster ray-test
Creating new Environment ray-0.2.1-5974090952704054762
Waiting for cluster to start
{'memory': 3001307136.0,
 'CPU': 2.0,
 'object_store_memory': 1500653568.0,
 'node:': 1.0}

And the following error inside the ray_on_aml experiment:

jars files are not copied, probably due to packages such as raydp is not installed

[2022-06-10T04:14:04.526543] The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
2 items cleaning up...
Cleanup took 0.11278533935546875 seconds
Traceback (most recent call last):
  File "", line 100, in <module>
  File "", line 43, in startRay
    ip = socket.gethostbyname(socket.gethostname())
socket.gaierror: [Errno -3] Temporary failure in name resolution

This error occurs with ci_is_head set to either True or False. All machines are inside the same VNET.
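The failing call in the traceback is `socket.gethostbyname(socket.gethostname())`, which raises `socket.gaierror` when the node's own hostname has no DNS or hosts-file entry. As a minimal diagnostic sketch (not part of ray-on-aml; `resolve_local_ip` is a hypothetical helper name), the same lookup can be tried on a node with a fallback that asks the OS for its outbound interface address:

```python
import socket


def resolve_local_ip() -> str:
    """Return an IPv4 address for this node, trying fallbacks (illustrative helper)."""
    try:
        # Same lookup that fails in the traceback above.
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        pass  # socket.gaierror is a subclass of OSError

    # Fallback: connect() on a UDP socket sends no packets, but it makes the
    # OS choose the local address it would route from.
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        probe.connect(("10.255.255.255", 1))
        return probe.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # last resort; other nodes cannot reach this address
    finally:
        probe.close()


if __name__ == "__main__":
    print(socket.gethostname(), resolve_local_ip())
```

If the first lookup fails but the fallback returns a VNET address, the problem is likely hostname resolution on the node rather than routing between nodes.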

We are also facing an error while running as a job. Scripts below:

import logging
from ray_on_aml.core import Ray_On_AML
from adlfs import AzureBlobFileSystem
import ray

logging.info('Initializing Ray')
ray_on_aml = Ray_On_AML()
logging.info('Getting head node')
ray = ray_on_aml.getRay()
logging.info('Retrieved head node')

if __name__ == "__main__":
    logging.info('Initializing file system')
    abfs = AzureBlobFileSystem(account_name="azureopendatastorage", container_name="isdweatherdatacontainer")

    if ray:  # in the headnode
        logging.info('Read parquet data')
        data = ray.data.read_parquet(["az://isdweatherdatacontainer/ISDWeather/year=2015/"], filesystem=abfs)
        logging.info('Read parquet data finished')
        # logic to use Ray for distributed ML training, tuning or distributed data transformation with Dask
    else:
        print("in worker node")

from azureml.core import ScriptRunConfig, Experiment, Environment
from azureml.core.runconfig import DockerConfiguration, RunConfiguration
import azure_init, submit_wait_for_completion

ENV_NAME = 'Ray_Test'
workspace, datastore, compute_cluster = azure_init(cluster_name="ray-test")
docker_config = DockerConfiguration(use_docker=True)
env = Environment.from_conda_specification(name=ENV_NAME, file_path="ray_conda_env.yml")
env.docker.base_image = ""
aml_run_config_ml = RunConfiguration(communicator='OpenMpi')

aml_run_config_ml.node_count = 4
aml_run_config_ml.target = compute_cluster
aml_run_config_ml.environment = env
aml_run_config_ml.docker = docker_config

src = ScriptRunConfig(source_directory='.', script='', run_config=aml_run_config_ml)

experiment_name = 'Ray_Test'
experiment = Experiment(workspace=workspace, name=experiment_name)
run, details = submit_wait_for_completion(src, experiment, {}, show_output=True,


channels:
  - anaconda
  - conda-forge
dependencies:
  - python=3.8.1
  - pip:
      - azureml-mlflow==1.41.0
      - ray-on-aml==0.2.1
      - protobuf==3.20.1
      - azureml-defaults==1.41.0
  - matplotlib
  - pip < 20.3
name: azureml_cfc9e96c7b0b43301a0ba4c6bd3548e5

We get the following error:

This is an MPI job. Rank:0
Script type = None
[2022-06-10T08:33:48.187712] Entering Run History Context Manager.
[2022-06-10T08:33:48.205558] Writing error with error_code ServiceError and error_hierarchy ServiceError/ImportError to hosttool error file located at /mnt/batch/tasks/workitems/7cf850e9-7ee8-40ed-847c-c5042bef5d51/job-1/ray_test_1654848910__11088471-ba23-4b28-a130-ee11d896e099/wd/runTaskLetTask_error.json
Starting the daemon thread to refresh tokens in background for process with pid = 156
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/ml-poc-workspace/azureml/ray_test_1654848910_399771ac/wd/azureml/Ray_Test_1654848910_399771ac/azureml-setup/", line 452, in <module>
    execute_with_context(cm_objects, options.invocation)
  File "/mnt/batch/tasks/shared/LS_root/jobs/ml-poc-workspace/azureml/ray_test_1654848910_399771ac/wd/azureml/Ray_Test_1654848910_399771ac/azureml-setup/", line 132, in execute_with_context
  File "/mnt/batch/tasks/shared/LS_root/jobs/ml-poc-workspace/azureml/ray_test_1654848910_399771ac/wd/azureml/Ray_Test_1654848910_399771ac/azureml-setup/", line 356, in enter_context
    result = _cm_type.__enter__(cm)
  File "/mnt/batch/tasks/shared/LS_root/jobs/ml-poc-workspace/azureml/ray_test_1654848910_399771ac/wd/azureml/Ray_Test_1654848910_399771ac/azureml-setup/", line 80, in __enter__
  File "/mnt/batch/tasks/shared/LS_root/jobs/ml-poc-workspace/azureml/ray_test_1654848910_399771ac/wd/azureml/Ray_Test_1654848910_399771ac/azureml-setup/", line 384, in __enter__
    self.history_context = get_history_context_manager(**self.history_config)
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/history/", line 167, in get_history_context_manager
    py_wd_cm = get_py_wd()
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/history/", line 304, in get_py_wd
    return PythonWorkingDirectory.get()
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/history/", line 274, in get
    from azureml._history.utils.filesystem import PythonFS
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/_history/utils/", line 8, in <module>
    from azureml._restclient.constants import RUN_ORIGIN
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/_restclient/", line 7, in <module>
    from .rest_client import RestClient
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/_restclient/", line 12, in <module>
    from msrest.service_client import ServiceClient
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/msrest/", line 28, in <module>
    from .configuration import Configuration
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/msrest/", line 38, in <module>
    from .universal_http.requests import (
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/msrest/universal_http/", line 53, in <module>
    from ..exceptions import ClientRequestError, raise_with_traceback
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/msrest/", line 31, in <module>
    from azure.core.exceptions import SerializationError, DeserializationError
ImportError: cannot import name 'SerializationError' from 'azure.core.exceptions' (/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azure/core/

Let me know if anything is wrong with our setup or if this is an issue with the library.

Thanks a lot!

aforadi commented 2 years ago

Quick update: I was able to run it as a job by removing the version restrictions on the following packages in the environment file: azureml-mlflow and azureml-defaults.

Interactive mode still remains a challenge, though.
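The ImportError in the job traceback above is raised when msrest imports `SerializationError` from `azure.core.exceptions`, which suggests (as an assumption, not something confirmed in this thread) that the pinned packages resolved to an azure-core release older than the one the installed msrest expects. A rough sketch for checking this inside the job environment, using only the standard library; the helper names are hypothetical:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def can_import_serialization_error() -> bool:
    """Repeat the import that failed inside msrest in the traceback."""
    try:
        from azure.core.exceptions import SerializationError  # noqa: F401
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Package names taken from the environment file and the traceback.
    names = ["msrest", "azure-core", "azureml-defaults", "azureml-mlflow", "protobuf"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version}")
    print("SerializationError importable:", can_import_serialization_error())
```

Running this before and after removing the version pins would show which resolved versions changed.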

james-tn commented 2 years ago

aforadi, the error in interactive mode may be related to your VNET configuration. Can you check the NSG on the subnet for any special restrictions?
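One rough way to test whether an NSG rule is blocking traffic is to attempt a TCP connection from the compute instance to a cluster node. The sketch below assumes Ray's default head port 6379 and a placeholder private IP; ray-on-aml may use different ports, and `port_open` is a hypothetical helper name:

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address: replace with the private IP of a cluster node.
    print(port_open("10.0.0.4", 6379))
```

A `False` result does not by itself prove an NSG restriction, since it is also returned when nothing is listening on that port.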

james-tn commented 2 years ago

For interactive use, you may try downgrading ray-on-aml and pinning the protobuf package version. I'll investigate the root cause.

pip install ray-on-aml==0.1.8

ray_on_aml = Ray_On_AML(ws=ws, compute_cluster="d15-v2", additional_pip_packages=['protobuf==3.20.1'], maxnode=2)

james-tn commented 1 year ago

Please use the new ray-on-aml version.