microsoft / ray-on-aml

Turning AML compute into Ray cluster

Unable to initialize cluster #13

Closed aforadi closed 1 year ago

aforadi commented 2 years ago

Hi @james-tn ,

Copying the issue from: https://github.com/james-tn/ray-on-aml/issues/24 with some modifications.

Thank you for this library. We are trying to use this library using the example code (https://github.com/microsoft/ray-on-aml/blob/master/examples/quick_start_examples.ipynb) in an interactive environment in Azure ML. The Jupyter notebook is a Python 3.8 Azure ML notebook. We are using the latest version of ray-on-aml 0.2.1

from azureml.core import Workspace, Run, Environment
from ray_on_aml.core import Ray_On_AML
ws = Workspace.from_config()
ray_on_aml = Ray_On_AML(ws=ws, compute_cluster='ray-test', additional_pip_packages=['lightgbm_ray', 'sklearn'], maxnode=4)
ray = ray_on_aml.getRay()

The image builds correctly on Azure ML. However, the cluster doesn't turn on. Below is what we see in the notebook:

Cancel active AML runs if any
Shutting down ray if any
Found existing cluster ray-test
Creating new Environment ray-0.2.1-5974090952704054762
Waiting for cluster to start
......................
{'memory': 3001307136.0,
 'CPU': 2.0,
 'object_store_memory': 1500653568.0,
 'node:10.54.42.20': 1.0}
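For what it's worth, that resource dump shows only 2.0 CPUs and a single node entry for a maxnode=4 request, which suggests only the head node joined. A small helper to summarize what actually joined (a sketch against the shapes of Ray's public ray.nodes() and ray.cluster_resources() outputs; the function name is ours, not part of ray-on-aml):

```python
def summarize_cluster(nodes, resources):
    """Summarize the outputs of ray.nodes() and ray.cluster_resources().

    `nodes` is a list of dicts each carrying an 'Alive' bool (the shape
    ray.nodes() returns); `resources` is a mapping like the dump above.
    """
    alive = [n for n in nodes if n.get("Alive")]
    return {
        "alive_nodes": len(alive),
        "total_cpus": resources.get("CPU", 0.0),
    }

# Fed the dump above, this reports one 2-CPU node, i.e. the head
# alone, for a cluster that was requested with maxnode=4.
print(summarize_cluster(
    [{"Alive": True}],
    {"memory": 3001307136.0, "CPU": 2.0,
     "object_store_memory": 1500653568.0, "node:10.54.42.20": 1.0},
))
```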

And the following error inside the ray_on_aml experiment:

jars files are not copied, probably due to packages such as raydp is not installed

[2022-06-10T04:14:04.526543] The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
2 items cleaning up...
Cleanup took 0.11278533935546875 seconds
Traceback (most recent call last):
  File "source_file.py", line 100, in <module>
    startRay(master_ip)
  File "source_file.py", line 43, in startRay
    ip = socket.gethostbyname(socket.gethostname())
socket.gaierror: [Errno -3] Temporary failure in name resolution

This error occurs with both ci_is_head=True and ci_is_head=False. All machines are inside the same VNET.
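The gaierror in the traceback above comes from socket.gethostbyname(socket.gethostname()) failing when the node's hostname is not DNS-resolvable, which can happen on VNET-injected compute. A workaround sketch for finding the local IP without DNS (not part of ray-on-aml; the UDP-connect trick only selects the outbound interface, no packets are sent):

```python
import socket

def get_node_ip():
    """Best-effort local IP lookup that does not require the
    node's hostname to be resolvable via DNS."""
    try:
        return socket.gethostbyname(socket.gethostname())
    except socket.gaierror:
        # connect() on a UDP socket transmits nothing; it merely
        # binds the socket to the interface that would route to
        # the target, whose address we then read back.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.connect(("8.8.8.8", 80))
            return s.getsockname()[0]

print(get_node_ip())
```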

We are also facing an error while running as a job. Scripts below:

ray_test.py

import logging
from ray_on_aml.core import Ray_On_AML
from adlfs import AzureBlobFileSystem
import ray

logging.info('Initializing Ray')
ray_on_aml = Ray_On_AML()
logging.info('Getting head node')
ray = ray_on_aml.getRay()
logging.info('Retrieved head node')

if __name__ == "__main__":
    logging.info('Initializing file system')
    abfs = AzureBlobFileSystem(account_name="azureopendatastorage", container_name="isdweatherdatacontainer")

    if ray:  # in the headnode
        logging.info('Read parquet data')
        data = ray.data.read_parquet(["az://isdweatherdatacontainer/ISDWeather/year=2015/"], filesystem=abfs)
        logging.info('Read parquet data finished')
        # logic to use Ray for distributed ML training, tuning or distributed data transformation with Dask

    else:
        print("in worker node")

ray_trigger.py

from azureml.core import ScriptRunConfig, Experiment, Environment
from azureml.core.runconfig import DockerConfiguration, RunConfiguration
import azure_init, submit_wait_for_completion

ENV_NAME = 'Ray_Test'
workspace, datastore, compute_cluster = azure_init(cluster_name="ray-test")
docker_config = DockerConfiguration(use_docker=True)
env = Environment.from_conda_specification(name=ENV_NAME, file_path="ray_conda_env.yml")
env.docker.base_image = "mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu18.04:20220329.v1"
aml_run_config_ml = RunConfiguration(communicator='OpenMpi')

aml_run_config_ml.node_count = 4
aml_run_config_ml.target = compute_cluster
aml_run_config_ml.environment = env
aml_run_config_ml.docker = docker_config

src = ScriptRunConfig(source_directory='.', script='ray_test.py', run_config=aml_run_config_ml)

experiment_name = 'Ray_Test'
experiment = Experiment(workspace=workspace, name=experiment_name)
run, details = submit_wait_for_completion(src, experiment, {}, show_output=True,
                                          wait_post_processing=False)

ray_conda_env.yml

channels:
  - anaconda
  - conda-forge
dependencies:
  - python=3.8.1
  - pip:
      - azureml-mlflow==1.41.0
      - ray-on-aml==0.2.1
      - protobuf==3.20.1
      - azureml-defaults==1.41.0
  - matplotlib
  - pip < 20.3
name: azureml_cfc9e96c7b0b43301a0ba4c6bd3548e5

We get the following error:

This is an MPI job. Rank:0
Script type = None
[2022-06-10T08:33:48.187712] Entering Run History Context Manager.
[2022-06-10T08:33:48.205558] Writing error with error_code ServiceError and error_hierarchy ServiceError/ImportError to hosttool error file located at /mnt/batch/tasks/workitems/7cf850e9-7ee8-40ed-847c-c5042bef5d51/job-1/ray_test_1654848910__11088471-ba23-4b28-a130-ee11d896e099/wd/runTaskLetTask_error.json
Starting the daemon thread to refresh tokens in background for process with pid = 156
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/ml-poc-workspace/azureml/ray_test_1654848910_399771ac/wd/azureml/Ray_Test_1654848910_399771ac/azureml-setup/context_manager_injector.py", line 452, in <module>
    execute_with_context(cm_objects, options.invocation)
  File "/mnt/batch/tasks/shared/LS_root/jobs/ml-poc-workspace/azureml/ray_test_1654848910_399771ac/wd/azureml/Ray_Test_1654848910_399771ac/azureml-setup/context_manager_injector.py", line 132, in execute_with_context
    stack.enter_context(wrapper)
  File "/mnt/batch/tasks/shared/LS_root/jobs/ml-poc-workspace/azureml/ray_test_1654848910_399771ac/wd/azureml/Ray_Test_1654848910_399771ac/azureml-setup/_vendor_contextlib2.py", line 356, in enter_context
    result = _cm_type.__enter__(cm)
  File "/mnt/batch/tasks/shared/LS_root/jobs/ml-poc-workspace/azureml/ray_test_1654848910_399771ac/wd/azureml/Ray_Test_1654848910_399771ac/azureml-setup/context_manager_injector.py", line 80, in __enter__
    self.context_manager.__enter__()
  File "/mnt/batch/tasks/shared/LS_root/jobs/ml-poc-workspace/azureml/ray_test_1654848910_399771ac/wd/azureml/Ray_Test_1654848910_399771ac/azureml-setup/context_managers.py", line 384, in __enter__
    self.history_context = get_history_context_manager(**self.history_config)
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/history/_tracking.py", line 167, in get_history_context_manager
    py_wd_cm = get_py_wd()
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/history/_tracking.py", line 304, in get_py_wd
    return PythonWorkingDirectory.get()
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/history/_tracking.py", line 274, in get
    from azureml._history.utils.filesystem import PythonFS
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/_history/utils/filesystem.py", line 8, in <module>
    from azureml._restclient.constants import RUN_ORIGIN
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/_restclient/__init__.py", line 7, in <module>
    from .rest_client import RestClient
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/_restclient/rest_client.py", line 12, in <module>
    from msrest.service_client import ServiceClient
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/msrest/__init__.py", line 28, in <module>
    from .configuration import Configuration
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/msrest/configuration.py", line 38, in <module>
    from .universal_http.requests import (
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/msrest/universal_http/__init__.py", line 53, in <module>
    from ..exceptions import ClientRequestError, raise_with_traceback
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/msrest/exceptions.py", line 31, in <module>
    from azure.core.exceptions import SerializationError, DeserializationError
ImportError: cannot import name 'SerializationError' from 'azure.core.exceptions' (/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azure/core/exceptions.py)
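That ImportError is a version mismatch between msrest and azure-core rather than anything in the scripts: the installed azure-core predates SerializationError, while the pinned azureml packages pull in an msrest that expects it. A quick, dependency-free way to inspect the installed versions (a sketch; the particular package trio is our assumption about what matters here):

```python
from importlib import metadata
from typing import Optional

def installed_version(pkg: str) -> Optional[str]:
    """Return the installed version of *pkg*, or None if it is absent."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None

# Print the packages implicated by the traceback above.
for pkg in ("azure-core", "msrest", "azureml-defaults"):
    print(pkg, installed_version(pkg))
```

Running this inside the failing environment should show which of the pair is out of step before rebuilding the image.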

Let me know if anything is wrong with our setup or if this is an issue with the library.

Thanks a lot!

aforadi commented 2 years ago

Quick update: I was able to run it as a job by removing the version restrictions on the packages below in the environment file:

azureml-mlflow, azureml-defaults
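Concretely, the change amounts to dropping the two version pins in ray_conda_env.yml; a sketch of the adjusted pip section, assuming everything else in the file stays as posted above:

```yaml
dependencies:
  - python=3.8.1
  - pip:
      - azureml-mlflow        # previously pinned to ==1.41.0
      - ray-on-aml==0.2.1
      - protobuf==3.20.1
      - azureml-defaults      # previously pinned to ==1.41.0
```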

Interactive mode still remains a challenge, though.

james-tn commented 2 years ago

aforadi, the error in interactive mode may have something to do with your VNET configuration. Can you check the NSG in the subnet for any special restrictions?

james-tn commented 2 years ago

For interactive use, you may try downgrading ray-on-aml and pinning the protobuf version. I'll investigate the root cause.

pip install ray-on-aml==0.1.8

ray_on_aml = Ray_On_AML(ws=ws, compute_cluster="d15-v2", additional_pip_packages=['protobuf==3.20.1'], maxnode=2)

james-tn commented 1 year ago

Please use the new ray-on-aml version.