microsoft / ray-on-aml

Turning AML compute into Ray cluster

Unable to initialize cluster #31

Open dugar-tarun opened 1 year ago

dugar-tarun commented 1 year ago

I am not able to initialize my Ray cluster using ray-on-aml version 0.2.4. I'm running a notebook in the Python 3.8 AzureML environment, using the following code:

from ray_on_aml.core import Ray_On_AML

ray_on_aml = Ray_On_AML(ws=ws, compute_cluster="CC-RayWorker-CPU-DS12-v2")

# May take 7 minutes or longer. Check the AML run under the ray_on_aml experiment for cluster status.
ray = ray_on_aml.getRay(
    ci_is_head=True,
    num_node=2,
    pip_packages=["ray[air]==2.2.0", "ray[data]==2.2.0", "torch==1.13.0", "fastparquet==2022.12.0", "azureml-mlflow==1.48.0",
                  "pyarrow==6.0.1", "dask==2022.12.0", "adlfs==2022.11.2", "fsspec==2022.11.0"],
)

While the compute instance initializes successfully, the ray_on_aml job fails in the cluster with the following error:

Cleaning up all outstanding Run operations, waiting 300.0 seconds
1 items cleaning up...
Cleanup took 0.2714250087738037 seconds
Traceback (most recent call last):
  File "source_file.py", line 175, in <module>
    startRayMaster()
  File "source_file.py", line 103, in startRayMaster
    ip = socket.gethostbyname(socket.gethostname())
socket.gaierror: [Errno -2] Name or service not known

Retrying due to transient client side error HTTPSConnectionPool(host='westus-0.in.applicationinsights.azure.com', port=443): Max retries exceeded with url: /v2.1/track (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1ee8697220>: Failed to establish a new connection: [Errno -2] Name or service not known')).
2023-02-16 13:21:17,476 INFO usage_lib.py:516 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-02-16 13:21:17,476 INFO scripts.py:702 -- Local node IP: 10.62.79.24
2023-02-16 13:21:19,380 SUCC scripts.py:739 -- --------------------
2023-02-16 13:21:19,380 SUCC scripts.py:740 -- Ray runtime started.
2023-02-16 13:21:19,380 SUCC scripts.py:741 -- --------------------
2023-02-16 13:21:19,380 INFO scripts.py:743 -- Next steps
2023-02-16 13:21:19,381 INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2023-02-16 13:21:19,381 INFO scripts.py:747 --   ray start --address='10.62.79.24:6379'
2023-02-16 13:21:19,381 INFO scripts.py:763 -- Alternatively, use the following Python code:
2023-02-16 13:21:19,381 INFO scripts.py:765 -- import ray
2023-02-16 13:21:19,381 INFO scripts.py:769 -- ray.init(address='auto')
2023-02-16 13:21:19,381 INFO scripts.py:781 -- To connect to this Ray runtime from outside of the cluster, for example to
2023-02-16 13:21:19,381 INFO scripts.py:785 -- connect to a remote cluster from your laptop directly, use the following
2023-02-16 13:21:19,381 INFO scripts.py:789 -- Python code:
2023-02-16 13:21:19,381 INFO scripts.py:791 -- import ray
2023-02-16 13:21:19,381 INFO scripts.py:792 -- ray.init(address='ray://<head_node_ip_address>:10001')
2023-02-16 13:21:19,381 INFO scripts.py:801 -- To see the status of the cluster, use
2023-02-16 13:21:19,381 INFO scripts.py:802 --   ray status
2023-02-16 13:21:19,381 INFO scripts.py:812 -- If connection fails, check your firewall settings and network configuration.
2023-02-16 13:21:19,381 INFO scripts.py:820 -- To terminate the Ray runtime, run
2023-02-16 13:21:19,381 INFO scripts.py:821 --   ray stop

I have this entire setup within a VNet, and all the compute resources have been created in the same subnet. Due to certain policies, I am forced to enable 'No Public IP' (npip) on my computes.

Could this be an issue due to my setup - npip or NSG? Or is it something to do with the library? Please help mitigate this.

Thank you

james-tn commented 1 year ago

Yeah, I think it probably failed because of the npip policy. That might have prevented the call socket.gethostbyname(socket.gethostname()) from resolving the node's hostname. We'll check on the npip scenario later. Can you try with job mode?
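One way to test this hypothesis is to run the same lookup on its own on a cluster node. The snippet below is only a diagnostic sketch, not part of ray-on-aml; the function name check_hostname_resolution is made up for illustration. It reports whether the node's own hostname resolves, which is exactly what the failing line depends on.

```python
import socket


def check_hostname_resolution():
    """Report whether this node's own hostname resolves to an IPv4 address.

    Hypothetical helper for diagnosis only; not part of ray-on-aml.
    """
    hostname = socket.gethostname()
    try:
        # Same call that fails in startRayMaster() in the traceback above.
        ip = socket.gethostbyname(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return {"hostname": hostname, "ip": None, "error": str(exc)}
    return {"hostname": hostname, "ip": ip, "error": None}


print(check_hostname_resolution())
```

If "ip" comes back as None, the hostname is not known to DNS or /etc/hosts on that node, which would point at the network setup rather than at Ray itself.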

dugar-tarun commented 1 year ago

No luck with job mode either. It errors out at the same line, socket.gethostbyname(socket.gethostname()), with the message "Name or service not known".
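Since the failure is in hostname resolution rather than in Ray itself, a possible workaround is to determine the node IP without a DNS lookup. The following is a minimal sketch under stated assumptions, not the library's actual fix: get_local_ip and the probe address are hypothetical, and it assumes the node has a routable IPv4 interface. It tries the original call first and falls back to asking the kernel which local address it would use for an outbound UDP route (connect() on a UDP socket sends no packets).

```python
import socket


def get_local_ip(probe_addr=("10.255.255.255", 1)):
    """Best-effort local IPv4 address that does not require hostname resolution.

    Hypothetical helper; probe_addr is an arbitrary address used only to
    select a route and is never actually contacted.
    """
    try:
        # Original approach: needs the hostname in DNS or /etc/hosts.
        return socket.gethostbyname(socket.gethostname())
    except OSError:  # covers socket.gaierror ("Name or service not known")
        pass

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # UDP connect() only sets the default destination; no traffic is sent.
        sock.connect(probe_addr)
        return sock.getsockname()[0]
    except OSError:
        # No usable route at all; fall back to loopback as a last resort.
        return "127.0.0.1"
    finally:
        sock.close()
```

Whether this returns the address other nodes should use (such as the 10.62.79.24 shown in the log) depends on the subnet routing, so the result should be checked against the "Local node IP" that Ray reports.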