ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray cluster up error #30226

Closed: 1tac11 closed this issue 1 year ago

1tac11 commented 1 year ago

What happened + What you expected to happen

`ray up` crashes with the following error:

Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Traceback (most recent call last):
  File "/home/****/miniconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/****/miniconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2339, in main
    return cli()
  File "/home/****/miniconda3/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/****/miniconda3/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/*****/miniconda3/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/****/miniconda3/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/****/miniconda3/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/*****/miniconda3/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 852, in wrapper
    return f(*args, **kwargs)
  File "/home/****/miniconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1184, in up
    create_or_update_cluster(
  File "/home/****/miniconda3/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 277, in create_or_update_cluster
    get_or_create_head_node(
  File "/home/****/miniconda3/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 634, in get_or_create_head_node
    provider = _provider or _get_node_provider(
  File "/home/****/miniconda3/lib/python3.8/site-packages/ray/autoscaler/_private/providers.py", line 242, in _get_node_provider
    new_provider = provider_cls(provider_config, cluster_name)
  File "/home/****/miniconda3/lib/python3.8/site-packages/ray/autoscaler/_private/local/node_provider.py", line 185, in __init__
    self.state = ClusterState(
  File "/home/****/miniconda3/lib/python3.8/site-packages/ray/autoscaler/_private/local/node_provider.py", line 36, in __init__
    with self.file_lock:
  File "/home/****/miniconda3/lib/python3.8/site-packages/filelock/_api.py", line 220, in __enter__
    self.acquire()
  File "/home/****/miniconda3/lib/python3.8/site-packages/filelock/_api.py", line 173, in acquire
    self._acquire()
  File "/home/****/miniconda3/lib/python3.8/site-packages/filelock/_unix.py", line 35, in _acquire
    fd = os.open(self._lock_file, open_mode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/cluster-lambda.lock'

Versions / Dependencies

ray 1.13.0, Python 3.8.13, conda 22.9.0, pip 22.1.2, Ubuntu 18.04

Reproduction script

I want to connect Ray to an instance started on lambdalabs.com. Starting from the vanilla example-full.yaml, I comment out the docker section, set ssh_user to ubuntu, fill in the head IP, and use the same IP for the workers, since only one IP is available so far.
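For reference, the edited config looked roughly like this (a sketch reconstructed from the description above, not the exact file; 1.2.3.4 is a placeholder for the single Lambda Labs instance IP):

```yaml
# Sketch of the modified example-full.yaml (on-prem / local node provider).
cluster_name: lambda
provider:
    type: local
    head_ip: 1.2.3.4
    # The same IP reused for the worker, since only one instance exists:
    worker_ips:
        - 1.2.3.4
auth:
    ssh_user: ubuntu
# docker: section commented out entirely
```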

Issue Severity

Low: It annoys or frustrates me.

DmitriGekhtman commented 1 year ago

That is very strange. We make the lock file directory in line 35. Then when creating the lock file in line 36, we get an error about the file not existing.

Here's the relevant code: https://github.com/ray-project/ray/blob/467b248c4fff885b478a87da8d64c19c2c363049/python/ray/autoscaler/_private/local/node_provider.py#L35-L36
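The final frames of the traceback can be reproduced in isolation, without Ray, with this minimal sketch (placeholder paths standing in for `/tmp/ray/cluster-lambda.lock`, and the same `os.open` flags filelock uses). It shows the error appears whenever the lock file's parent directory does not exist at the moment of the `os.open` call, so something may have removed `/tmp/ray` between the directory creation and the lock acquisition:

```python
import os
import tempfile

# Minimal standalone reproduction of the final traceback frame
# (filelock's _unix.py calling os.open on the lock path).
base = tempfile.mkdtemp()
missing_dir = os.path.join(base, "ray")  # intentionally never created
lock_path = os.path.join(missing_dir, "cluster-lambda.lock")

try:
    # Same flags filelock uses to create/open the lock file.
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT | os.O_TRUNC)
    os.close(fd)
except FileNotFoundError as err:
    print("reproduced:", err.errno)  # errno 2, matching the report

# Creating the parent directory first makes the same call succeed:
os.makedirs(missing_dir, exist_ok=True)
fd = os.open(lock_path, os.O_RDWR | os.O_CREAT | os.O_TRUNC)
os.close(fd)
```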

I haven't seen this happen when I've used local node provider, so I suspect we may have trouble reproducing the issue.

@1seck I would recommend looking into debugging this yourself -- try dropping a couple of breakpoints in the relevant Ray and FileLock code and see what's up.

re: cluster launcher testing, cc @scv119, we should try to cover LocalNodeProvider (the on-prem node provider) eventually.

1tac11 commented 1 year ago

It turned out to be just some wrong parameters, e.g. commenting a field out instead of setting it to an empty array, or setting the same IP address for head and worker; I don't quite recall exactly. The error messages around this could perhaps be documented better, though. Closing for now.
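For anyone hitting the same symptoms, the shape of the fix described above might look like this (a guess at the concrete YAML, since the exact parameters are not recalled; the IP is a placeholder):

```yaml
provider:
    type: local
    head_ip: 1.2.3.4
    # Use an empty list rather than commenting the field out,
    # and do not reuse the head IP as a worker IP:
    worker_ips: []
```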