ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Known issues on Windows #9114

Closed mehrdadn closed 3 years ago

mehrdadn commented 4 years ago

This page is intended to list known Ray issues on Windows in one central location.

The latest nightly wheels may have already addressed some issues since the latest official release.
Check back here for updates as issues are addressed.

You can vote by reacting 👍 on each issue that is impacting you to help us prioritize issues. 🙂

(Maintainers: Please reference this issue in other posts. That will allow their statuses to show up here.)

cristiangofiar commented 4 years ago

FileNotFoundError: [Errno 2] Dashboard build directory not found. If installing from source, please follow the additional steps required to build the dashboard(cd python/ray/dashboard/client && npm ci && npm run build): 'C:\Users\crist\anaconda3\lib\site-packages\ray\dashboard\client/build'

Help!

richardliaw commented 4 years ago

@cristiangofiar thanks for opening this issue! That should be non-fatal; we should reduce the severity of that error.

cc @mfitton maybe let's just log an "info" message rather than Error or Warning.

cristiangofiar commented 4 years ago

@richardliaw But can this failure affect the execution of the program? I need to use Ray for an integrative project in a college course. The error is also annoying. Can you help?

valentasgruzauskas commented 4 years ago

Running an experiment with HyperOptSearch and LightGBM, I receive this error message:

raise TuneError("Trials did not complete", incomplete_trials)

OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'C:\Users\User\ray_results\train_flat_price\train_flat_price_1_bagging_fraction=tune.sample_from(<function uniform.. at 0x000001C05ECACE58>),feature_fraction=_2020-08-20_18-00-270k00linp'
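The directory name in the error above contains `<` and `>` (from the repr of a function object), and Windows does not allow those characters in file or directory names. As a rough sketch of one way to guard against this, the snippet below replaces such characters in a single path component. `sanitize_windows_name` is a hypothetical helper written for illustration; it is not part of Ray or Tune, and the trial name is shortened from the one in the report.

```python
import re

# Characters Windows rejects in file and directory names.
_INVALID_WINDOWS_CHARS = re.compile(r'[<>:"/\\|?*]')


def sanitize_windows_name(name: str, replacement: str = "_") -> str:
    """Replace characters that are invalid in a Windows path component.

    Hypothetical helper; operates on a single component, not a full path.
    """
    cleaned = _INVALID_WINDOWS_CHARS.sub(replacement, name)
    # Windows also rejects names that end in a dot or a space.
    return cleaned.rstrip(". ")


# Shortened version of the trial directory name from the error message.
trial_name = "train_flat_price_1_bagging_fraction=tune.sample_from(<function uniform>)"
safe_name = sanitize_windows_name(trial_name)
print(safe_name)
```

This only illustrates why the path is rejected; the report itself was later traced to how the search space was defined.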

mfitton commented 4 years ago

@cristiangofiar As a short term fix you could disable the dashboard by using the argument --include-webui=False at the command line or include_webui=False in the call to ray.init() in your python code depending how you start it up. (Note this argument is being changed to --include-dashboard and include_dashboard respectively, but I don't know what version you're using.)

There are still other issues with the dashboard on Windows that are being fixed. Currently, even if you get the dashboard to start, it won't render anything. That said, this will not affect the running of your script.

cristiangofiar commented 4 years ago

@mfitton Thank you very much! I have fixed it thanks to you! :D

richardliaw commented 4 years ago

@valentasgruzauskas can you post a longer stacktrace?

valentasgruzauskas commented 4 years ago

The fatal error was a problem on my side: I used tune to generate the input data while also using a search algorithm. Now I define the search space with hyperopt randint, uniform, etc., and it works (at least there are no fatal errors). However, I keep receiving the following error.

2020-08-22 13:47:05,381 WARNING util.py:137 -- The experiment_checkpoint operation took 10.414000749588013 seconds to complete, which may be a performance bottleneck.
2020-08-22 13:47:05,382 ERROR trial_runner.py:375 -- Trial Runner checkpointing failed.
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\ntanalysis\lib\site-packages\ray\tune\trial_runner.py", line 373, in step
    self.checkpoint()
  File "C:\ProgramData\Anaconda3\envs\ntanalysis\lib\site-packages\ray\tune\trial_runner.py", line 302, in checkpoint
    self._local_checkpoint_dir, session_str=self._session_str)
  File "C:\ProgramData\Anaconda3\envs\ntanalysis\lib\site-packages\ray\tune\suggest\search_generator.py", line 192, in save_to_dir
    base_searcher.save_to_dir(dirpath, session_str)
  File "C:\ProgramData\Anaconda3\envs\ntanalysis\lib\site-packages\ray\tune\suggest\suggestion.py", line 210, in save_to_dir
    self.CKPT_FILE_TMPL.format(session_str)))
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\Users\User\ray_results\train_flat_price\.tmp_searcher_ckpt' -> 'C:\Users\User\ray_results\train_flat_price\searcher-state-2020-08-22_12-24-21.pkl'
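The FileExistsError above, with its 'source' -> 'destination' pair, is consistent with a rename onto an existing file: on Windows `os.rename` refuses to overwrite the destination, whereas `os.replace` overwrites it on every platform. The following is a minimal sketch of the write-to-temp-then-replace pattern. `save_checkpoint` and the file names are illustrative assumptions, and this is not necessarily how Ray handles it.

```python
import os
import pickle
import tempfile


def save_checkpoint(state: dict, directory: str, filename: str) -> str:
    """Write state to a temporary file, then move it over the final path.

    Hypothetical helper. os.replace() overwrites an existing destination on
    every platform, whereas os.rename() raises FileExistsError on Windows.
    """
    tmp_path = os.path.join(directory, ".tmp_searcher_ckpt")
    final_path = os.path.join(directory, filename)
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as ckpt_dir:
    save_checkpoint({"step": 1}, ckpt_dir, "searcher-state.pkl")
    # Saving again under the same name must not fail, even on Windows.
    path = save_checkpoint({"step": 2}, ckpt_dir, "searcher-state.pkl")
    with open(path, "rb") as f:
        restored = pickle.load(f)
print(restored)
```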

cristiangofiar commented 4 years ago

Help to connect 2 PCs pls!

(base) C:\Users\Gofiar>ray start --address='address' --redis-password='pass'
Traceback (most recent call last):
  File "c:\users\gofiar\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\gofiar\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Gofiar\anaconda3\Scripts\ray.exe\__main__.py", line 7, in <module>
  File "c:\users\gofiar\anaconda3\lib\site-packages\ray\scripts\scripts.py", line 1237, in main
    return cli()
  File "c:\users\gofiar\anaconda3\lib\site-packages\click\core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\gofiar\anaconda3\lib\site-packages\click\core.py", line 717, in main
    rv = self.invoke(ctx)
  File "c:\users\gofiar\anaconda3\lib\site-packages\click\core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\gofiar\anaconda3\lib\site-packages\click\core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\gofiar\anaconda3\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "c:\users\gofiar\anaconda3\lib\site-packages\ray\scripts\scripts.py", line 399, in start
    address, redis_address)
  File "c:\users\gofiar\anaconda3\lib\site-packages\ray\services.py", line 279, in validate_redis_address
    redis_address = address_to_ip(redis_address)
  File "c:\users\gofiar\anaconda3\lib\site-packages\ray\services.py", line 311, in address_to_ip
    ip_address = socket.gethostbyname(address_parts[0])
socket.gaierror: [Errno 11001] getaddrinfo failed

Dris101 commented 4 years ago

I had a similar issue. After ray start --head, the head node prints:

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='192.168.143.221:6379' --redis-password='password here'

However, address_to_ip(address) in services.py does not trim the quotes from the IP address, so socket.gethostbyname(address_parts[0]) throws an error. The message from ray start is misleading. Try without the quotes around the IP address.
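The Windows command prompt (cmd.exe) does not treat single quotes as quoting characters, so they reach the program as part of the argument. As a hedged illustration of the trimming described above, the sketch below strips stray quotes before splitting the address. `normalize_address` is a hypothetical helper, not Ray's implementation.

```python
def normalize_address(address: str) -> tuple:
    """Strip stray quotes from 'host:port' and split it into (host, port).

    Hypothetical helper for illustration; not Ray's implementation.
    """
    cleaned = address.strip().strip("'\"")
    host, sep, port = cleaned.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {address!r}")
    return host, int(port)


# The Windows command prompt passes the single quotes through literally,
# so the program receives the first form below.
print(normalize_address("'192.168.143.221:6379'"))
print(normalize_address("192.168.143.221:6379"))
```

Without such trimming, the quote character ends up inside the host name passed to the resolver, which matches the getaddrinfo failure reported earlier.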

talhaanwarch commented 4 years ago

Is there any solution for the "Dashboard build directory not found" FileNotFoundError reported above?

mfitton commented 4 years ago

@talhaanwarch the dashboard currently does not work on Windows. I recommend passing include_dashboard=False when calling ray.init()

TekpreXyz commented 3 years ago

OK, so rendering via the Ray dashboard doesn't work on Windows 10. Is there any other way to see the processes working?

kuangsangudu commented 3 years ago

Hello, I ran ray.init() and got an error. Could you please tell me how to solve this problem? Here is the error.

Traceback (most recent call last):
  File "", line 1, in <module>
  File "D:\conda\lib\site-packages\ray\worker.py", line 722, in init
    ray_params=ray_params)
  File "D:\conda\lib\site-packages\ray\node.py", line 216, in __init__
    self.start_head_processes()
  File "D:\conda\lib\site-packages\ray\node.py", line 767, in start_head_processes
    self.start_redis()
  File "D:\conda\lib\site-packages\ray\node.py", line 590, in start_redis
    self.get_resource_spec(),
  File "D:\conda\lib\site-packages\ray\node.py", line 314, in get_resource_spec
    is_head=self.head, node_ip_address=self.node_ip_address)
  File "D:\conda\lib\site-packages\ray\resource_spec.py", line 165, in resolve
    num_gpus = _autodetect_num_gpus()
  File "D:\conda\lib\site-packages\ray\resource_spec.py", line 252, in _autodetect_num_gpus
    lines = subprocess.check_output(cmdargs).splitlines()[1:]
  File "D:\conda\lib\subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "D:\conda\lib\subprocess.py", line 488, in run
    with Popen(*popenargs, **kwargs) as process:
  File "D:\conda\lib\subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "D:\conda\lib\subprocess.py", line 1207, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system can not find the file specified.

mehrdadn commented 3 years ago

@kuangsangudu I think you need to put C:\Windows\System32\WBEM in your PATH.
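The traceback above ends with subprocess failing to find an executable while Ray autodetects GPUs, and the suggestion is to add C:\Windows\System32\WBEM to PATH. The sketch below shows one way to do that for the current process only, assuming the missing executable lives in that directory. `ensure_on_path` is a hypothetical helper; a permanent fix would go through the Windows environment variable settings instead.

```python
import os


def ensure_on_path(path_value: str, directory: str, sep: str = os.pathsep) -> str:
    """Return path_value with directory appended unless it is already listed.

    Hypothetical helper. The comparison ignores case and trailing
    slashes, which is how Windows paths are usually compared.
    """
    def norm(p: str) -> str:
        return p.rstrip("\\/").lower()

    entries = [e for e in path_value.split(sep) if e]
    if norm(directory) not in {norm(e) for e in entries}:
        entries.append(directory)
    return sep.join(entries)


wbem = r"C:\Windows\System32\WBEM"
if os.name == "nt":
    # Only affects the current process and the children it starts,
    # so this must run before ray.init() is called.
    os.environ["PATH"] = ensure_on_path(os.environ.get("PATH", ""), wbem)
```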

iuming commented 3 years ago

Hello, I also encountered this problem. I want to run ray on two Windows systems. When I run ray start --head on one computer, the following prompt appears:

Local node IP: 192.168.195.134
2021-02-05 13:17:36,016 INFO services.py:1171 -- View the Ray dashboard at http://localhost:8265

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='192.168.195.134:6379' --redis-password='5241590000000000'

  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', _redis_password='5241590000000000')

  If connection fails, check your firewall settings and network configuration.

  To terminate the Ray runtime, run
    ray stop

Then when I ran the ray start --address=192.168.195.134:6379 --redis-password='5241590000000000' command on another computer, the following message appeared:

Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\site-packages\ray\_private\services.py", line 640, in wait_for_redis_to_start
    redis_client.client_list()
  File "c:\programdata\anaconda3\lib\site-packages\redis\client.py", line 1194, in client_list
    return self.execute_command('CLIENT LIST')
  File "c:\programdata\anaconda3\lib\site-packages\redis\client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "c:\programdata\anaconda3\lib\site-packages\redis\connection.py", line 1192, in get_connection
    connection.connect()
  File "c:\programdata\anaconda3\lib\site-packages\redis\connection.py", line 567, in connect
    self.on_connect()
  File "c:\programdata\anaconda3\lib\site-packages\redis\connection.py", line 643, in on_connect
    auth_response = self.read_response()
  File "c:\programdata\anaconda3\lib\site-packages\redis\connection.py", line 739, in read_response
    response = self._parser.read_response()
  File "c:\programdata\anaconda3\lib\site-packages\redis\connection.py", line 484, in read_response
    raise response
redis.exceptions.AuthenticationError: invalid password

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\programdata\anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\Scripts\ray.exe\__main__.py", line 7, in <module>
  File "c:\programdata\anaconda3\lib\site-packages\ray\scripts\scripts.py", line 1504, in main
    return cli()
  File "c:\programdata\anaconda3\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\programdata\anaconda3\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\programdata\anaconda3\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\programdata\anaconda3\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\programdata\anaconda3\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\programdata\anaconda3\lib\site-packages\ray\scripts\scripts.py", line 627, in start
    services.wait_for_redis_to_start(
  File "c:\programdata\anaconda3\lib\site-packages\ray\_private\services.py", line 650, in wait_for_redis_to_start
    raise RuntimeError("Unable to connect to Redis at {}:{}.".format(
RuntimeError: Unable to connect to Redis at 192.168.195.134:6379.

Have you encountered this situation? Can you help me see how to solve it?

mehrdadn commented 3 years ago

@iuming: I'm not sure, but it doesn't sound like invalid password is Windows-specific. If you've double-checked the password is correct, try posting another Issue?

iuming commented 3 years ago

@mehrdadn I copied the prompt command to make sure the password is correct. But this error message still appears!

mehrdadn commented 3 years ago

@iuming: This issue is only for Windows issues, but I don't see anything indicating yours is Windows-specific. Try opening a new Issue?

iuming commented 3 years ago

@mehrdadn Okay, I will open a new issue.

stefanbschneider commented 3 years ago

I'm also having issues with the Ray dashboard both on Windows 10 and WSL. When starting, Ray prints View the Ray dashboard at http://localhost:8265 but when visiting localhost:8265 the page load fails.

Is this the current expected behavior for Windows? And also for WSL?

Also, if I run ray status --address <head-node-ip>:6379, I get the following error:

WARNING: Logging before InitGoogleLogging() is written to STDERR
F0210 18:34:48.054060   213   213 service_based_gcs_client.cc:207] Couldn't reconnect to GCS server. The last attempted GCS server address was 131.234.28.107:35699
*** Check failure stack trace: ***
Aborted (core dumped)

stefanbschneider commented 3 years ago

Ah, I just found that running ray dashboard cluster.yaml solves my dashboard problem! I can now access the dashboard locally at http://localhost:8265/ on my Windows laptop. The cluster itself (head and worker nodes) are Linux machines though.

I'm running ray up and ray dashboard on WSL on my laptop.

iuming commented 3 years ago

@stefanbschneider On a Windows system, if you run ray start --head first and then ray dashboard cluster.yaml, can the dashboard be displayed at http://localhost:8265/? I get the following error message after running ray dashboard cluster.yaml:

Attempting to establish dashboard locally at localhost:8265 connected to remote port 8265
Error: Failed to forward dashboard from remote port 8265 to local port 8265. There are a couple possibilities:
 1. The remote port is incorrectly specified
 2. The local port 8265 is already in use.
 The exception is: [Errno 2] No such file or directory: 'cluster.yaml'

stefanbschneider commented 3 years ago

cluster.yaml is just what I called my cluster configuration file, which is based on the example here. You'll have to adjust this to the name/path of your config. Apparently, it's not cluster.yaml.

iuming commented 3 years ago

@stefanbschneider It turned out to be so, thank you!

diman82 commented 3 years ago

@mfitton Passing include_dashboard=False to ray.init(), as recommended above, doesn't work for a mini-cluster. It still tries to load the dashboard (I'm on Ray 1.1.0, Windows 10):

My code:

    def test_run_e2e_hyperparam_search_mini_cluster_ray_distributed(self):
        import ray
        from ray.cluster_utils import Cluster

        # Starts a head-node for the cluster.
        cluster = Cluster(
            initialize_head=True,
            head_node_args={
                "num_cpus": 1,
            })
        # Connect to the mini-cluster, asking for no dashboard.
        ray.init(address=cluster.address, include_dashboard=False)

And this is the error:

2021-02-11 14:07:29,597 INFO View the Ray dashboard at http://127.0.0.1:8265
2021-02-11 14:07:30,732 INFO worker.py:656 -- Connecting to existing Ray cluster at address: 10.240.194.92:6379
2021-02-11 14:07:31,152 WARNING worker.py:1034 -- The actor or task with ID df5a1a828c9685d3ffffffff01000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.
2021-02-11 14:07:40,106 WARNING worker.py:1034 -- The dashboard on node TLVCMEW001410 failed with the following error:
Traceback (most recent call last):
  File "C:\Users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\new_dashboard\dashboard.py", line 187, in <module>
    dashboard = Dashboard(
  File "C:\Users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\new_dashboard\dashboard.py", line 81, in __init__
    build_dir = setup_static_dir()
  File "C:\Users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\new_dashboard\dashboard.py", line 38, in setup_static_dir
    raise OSError(
FileNotFoundError: [Errno 2] Dashboard build directory not found. If installing from source, please follow the additional steps required to build the dashboard(cd python/ray/new_dashboard/client && npm install && npm ci && npm run build): 'C:\\Users\\dm57337\\.conda\\envs\\py38tf\\lib\\site-packages\\ray\\new_dashboard\\client\\build'

mcflem06 commented 3 years ago

Does the dashboard still not work for windows users? Can't connect to the dash on windows 10. Have tried disabling firewall etc...

weigao-123 commented 3 years ago

@iuming I had the same problem, and what @Dris101 said is correct, so try without quotes for both IP address and password. Then it works for me.

iuming commented 3 years ago

@mcflem06 Thank you very much! I am sure I have turned off the firewall.

iuming commented 3 years ago

@weigao-123 Thanks for your suggestions, I will try it.

kk-55 commented 3 years ago

Get stuck in ray.init()

Problem: Running the code example below, the process gets stuck in ray.init() and nothing else happens (no error or warning messages). What could be the problem? Under my WSL (Ubuntu 20.04) all works fine, but performance slows down and thus I prefer to run ray/RLlib under Windows.

Information:
OS: Microsoft Windows 10 Pro, version 10.0.19042 Build 19042
Python: 3.8.5 64-bit
Ray: 1.2.0

Reproduction script:

import ray

print("start")
ray.init(include_dashboard=False)
print("end")

iuming commented 3 years ago

@weigao-123 Sorry, after I removed the quotation marks, the following error occurred when I entered ray start --head on one computer and ray start --address=192.168.1.121:6379 --redis-password=5241590000000000 on another computer:

Local node IP: 192.168.1.116
Traceback (most recent call last):
  File "e:\conda\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "e:\conda\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "E:\conda\Scripts\ray.exe\__main__.py", line 7, in <module>
  File "e:\conda\lib\site-packages\ray\scripts\scripts.py", line 1519, in main
    return cli()
  File "e:\conda\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "e:\conda\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "e:\conda\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "e:\conda\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "e:\conda\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "e:\conda\lib\site-packages\ray\scripts\scripts.py", line 651, in start
    node = ray.node.Node(
  File "e:\conda\lib\site-packages\ray\node.py", line 156, in __init__
    self._init_temp(redis_client)
  File "e:\conda\lib\site-packages\ray\node.py", line 254, in _init_temp
    self._temp_dir = ray.utils.decode(temp_dir)
  File "e:\conda\lib\site-packages\ray\utils.py", line 176, in decode
    return byte_str.decode("ascii")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 9: ordinal not in range(128)

weigao-123 commented 3 years ago

@iuming I think this is probably caused by your own environment, e.g. your system language is not English, and you can easily find more information and solutions online. A quick explanation: https://github.com/odoo/odoo/issues/773
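To make the error concrete: byte 0xe5 at position 9 is consistent with a non-ASCII character directly after a nine-character prefix such as C:\Users\ in a UTF-8 encoded path. The sketch below reproduces the same failure with a made-up user name. The actual temp directory from the report is not shown in the thread, so the path here is purely an assumption.

```python
# Hypothetical temp directory under a non-ASCII user name (U+5F20 U+4E09);
# the real path from the report is not shown in the thread.
temp_dir = "C:\\Users\\" + "\u5f20\u4e09" + "\\AppData\\Local\\Temp\\ray"
raw = temp_dir.encode("utf-8")

failure = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    # The first non-ASCII byte sits right after the 9-character prefix.
    failure = (hex(raw[err.start]), err.start)
    print("ascii decode failed:", failure)

# Decoding with the encoding that produced the bytes round-trips cleanly.
print(raw.decode("utf-8") == temp_dir)
```

If this is the cause, pointing the temp directory at an ASCII-only path is one possible workaround, though that is untested here.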

kk-55 commented 3 years ago

Get stuck in ray.init()

Problem: Running the code example below, the process gets stuck in ray.init() and nothing else happens (no error or warning messages). What could be the problem? Under my WSL (Ubuntu 20.04) all works fine, but performance slows down and thus I prefer to run ray/RLlib under Windows.

Information:
OS: Microsoft Windows 10 Pro, version 10.0.19042 Build 19042
Python: 3.8.5 64-bit
Ray: 1.2.0

Reproduction script:

import ray

print("start")
ray.init(include_dashboard=False)
print("end")

Any ideas or related issues? TIA!
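When a script hangs silently like this, it can help to run it in a child process with a timeout, so the hang is detected and any partial output is captured. The following is only a sketch using the standard library. `run_with_timeout` is a hypothetical helper, and the stand-in script below merely sleeps; it does not call Ray.

```python
import subprocess
import sys


def run_with_timeout(code: str, timeout_s: float):
    """Run Python code in a child process; return (finished, stdout).

    Hypothetical helper. If the child exceeds timeout_s it is killed and
    finished is False, with whatever stdout was captured up to that point.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-u", "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return True, result.stdout
    except subprocess.TimeoutExpired as exc:
        out = exc.stdout or ""
        if isinstance(out, bytes):  # partial output may arrive as bytes
            out = out.decode(errors="replace")
        return False, out


# Stand-in for the reproduction script: prints, then blocks.
hanging = "import time; print('start'); time.sleep(30); print('end')"
finished, output = run_with_timeout(hanging, timeout_s=2.0)
print(finished, output.strip())
```

Replacing the stand-in with the real reproduction script gives a quick way to compare Ray and Python versions without a blocked terminal.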

mehrdadn commented 3 years ago

@kk-55 I'm not sure. I'd recommend finding the latest versions of Ray & Python that work, and posting them here to help the team look into it.

kk-55 commented 3 years ago

@mehrdadn Do you mean combining https://docs.ray.io/en/master/installation.html#daily-releases-nightlies and Windows Python 3.8.5 64-bit? And thereafter hoping for further help from the team?

mehrdadn commented 3 years ago

@kk-55 Yup.

kk-55 commented 3 years ago

Ray can't be initialized

Problem description: Running the repro script below, Ray can't be initialized. No error or warning occurs, and the script never terminates (it always hangs at the line ret = ray.init()). Console/debugger outputs look like this: (two screenshots, not preserved)

What can I do or what could be the problem?

System information
OS: Windows 10 Pro, version 10.0.19042 Build 19042
Ray: latest nightly wheel for Windows Python 3.8 https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp38-cp38-win_amd64.whl
Python: 3.8.5 64-bit

Repro script

import ray

print("start")
print(ray.is_initialized())
ret = ray.init()
print(ret)
print(ray.is_initialized())
ray.shutdown()
print("end")

mehrdadn commented 3 years ago

@kk-55: Sorry, what I was saying was, try to find the latest nightly wheel that does work correctly. Not one that's broken. That way the team can look at what changes might have occurred in the subsequent commit.

kk-55 commented 3 years ago

@mehrdadn Sorry, but is there a wheel that had previously worked? Or do you know which one to try first?

mehrdadn commented 3 years ago

@kk-55: I don't know about Python 3.8.5 in particular, but of course the Windows wheels have been working for some time. This user got a wheel working on Windows, for example, but I don't know what commit it was. Have you ever managed to run it successfully on any version of Python in the past? Or is this the first time you're trying to run Ray on Windows?

kk-55 commented 3 years ago

@mehrdadn Not yet, this is the first time I've tried to run Ray on Windows. I think the user who got a wheel working on Windows simply took the wheel for the latest release. I also switched to Python 3.8.7 and tried to reproduce, but ray.init() still gets stuck without any output.

mehrdadn commented 3 years ago

@kk-55: I see. I just pip installed the latest version of Ray on Python 3.8.7 and verified that it runs correctly, so something seems to be wrong on your machine. I would say try older commits until you find one that works. (Binary search might be helpful here.) Then post whatever you find as a new issue (not here).
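The binary search suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the wheel labels are hypothetical, the `works` callback is a stand-in for "install this wheel and check that ray.init() returns", and the search assumes the candidates are ordered oldest to newest with every candidate working up to some point and failing after it.

```python
from typing import Callable, Optional, Sequence


def find_last_working(candidates: Sequence[str],
                      works: Callable[[str], bool]) -> Optional[str]:
    """Return the newest candidate for which works() is True, or None.

    Assumes candidates are ordered oldest to newest and that all working
    candidates come before all broken ones.
    """
    lo, hi = 0, len(candidates) - 1
    last_good = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if works(candidates[mid]):
            last_good = candidates[mid]
            lo = mid + 1  # a newer candidate may still work
        else:
            hi = mid - 1  # the breakage happened at or before mid
    return last_good


# Hypothetical labels; in practice these would be wheel builds or commits.
wheels = ["build-01", "build-02", "build-03", "build-04", "build-05", "build-06"]
# Stand-in check that pretends everything up to build-04 works.
pretend_good = {"build-01", "build-02", "build-03", "build-04"}

result = find_last_working(wheels, lambda w: w in pretend_good)
print("last working candidate:", result)
```

With n candidates this needs about log2(n) install-and-run attempts instead of n, which matters when each attempt involves reinstalling a wheel.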

VinnyLiu0817 commented 3 years ago

Ray can't be initialized while connected to redis

System information
OS: Windows 10
Ray: 1.2.0
Python: 3.7 64-bit

Problem description: I am running a script from a GitHub project about DML, and I built a simple Redis cluster with 3 nodes and 3 slaves on my local laptop, but Ray can't be initialized when it tries to connect to the Redis cluster. I am fairly sure that Redis cluster mode is enabled.
Console/debugger outputs look like this: move error

mehrdadn commented 3 years ago

Please post a new issue if you encounter any problems on Windows. I think I'm going to close this one. This issue was mainly intended as a reference table for existing Windows issues and to discuss what should be on the table, not as a separate place to post Windows-specific issues. Thanks everyone!

jhn-nt commented 3 years ago

Runing exeriment with HyperOptSearch and LightGBM, and receive rror message.

raise TuneError("Trials did not complete", incomplete_trials)

OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'C:\Users\User\ray_results\train_flat_price\train_flat_price_1_bagging_fraction=tune.sample_from(<function uniform.. at 0x000001C05ECACE58>),feature_fraction=_2020-08-20_18-00-270k00linp'

Had the same issue on my machine: OSError: [WinError 123] La sintassi del nome del file, della directory o del volume non è corretta: "C:\Users\***\ray_results\hello\VanillaGan_47297_00000_0_batch_size=32,d_optimizer=<class 'tensorflow.python.keras.optimizer_v2.adam.Adam'>,d_rate=0.00040768,gene_2021-04-21_15-34-21"

In my case I believe it was due to restrictions on folder names on Windows. To solve it I just had to add a regex filter in the create_logdir method in ray/tune/trial (line 137) to remove restricted characters. Everything seems to work fine afterwards.
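A regex filter of this kind might look like the sketch below. This is not the exact change described above and not Ray's own code: `sanitize_dirname` is a hypothetical helper, and the sample name is a shortened, made-up variant of the trial directory names in the error messages. The character set is the one Windows rejects in file and directory names (`< > : " / \ | ? *` plus ASCII control characters).

```python
import re

# Characters Windows does not allow in a file or directory name,
# plus ASCII control characters (0x00-0x1f).
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_dirname(name: str, replacement: str = "_") -> str:
    """Replace characters that are invalid in a single Windows path component.

    Hypothetical helper for illustration; apply it to one directory name,
    not to a full path (it would also replace the path separators).
    """
    cleaned = _INVALID_CHARS.sub(replacement, name)
    # Windows also rejects names that end with a space or a dot.
    return cleaned.rstrip(" .")


# Made-up trial name resembling the ones in the error messages.
trial_name = "VanillaGan_0_batch_size=32,d_optimizer=<class 'Adam'>,d_rate=0.0004"
safe_name = sanitize_dirname(trial_name)
print(safe_name)
```

Replacing rather than deleting the offending characters keeps the names readable, at the cost that two different trial names could in principle map to the same directory name.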

bramhoven commented 3 years ago

Hi,

I have been trying to get ray working on Windows for a few days now, but I keep running into the same problem. Ray keeps hanging on init. The following error message is logged in the worker log:

[2021-05-27 23:27:35,441 E 25468 3004] core_worker.cc:390: Failed to register worker 11baac3114b0e5ec6797733be05ecfeeb3cca79520cff01f14712d28 to Raylet. Invalid: Invalid: Unknown worker

The following is logged in raylet.out:

[2021-05-27 23:34:14,449 I 22100 25172] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2021-05-27 23:34:14,661 I 22100 25172] store_runner.cc:29: Allowing the Plasma store to use up to 1.85846GB of memory.
[2021-05-27 23:34:14,662 I 22100 25172] store_runner.cc:42: Starting object store with directory C:\Users\Bram\AppData\Local\Temp and huge page support disabled
[2021-05-27 23:34:14,664 I 22100 25172] grpc_server.cc:71: ObjectManager server started, listening on port 51896.
[2021-05-27 23:34:14,666 I 22100 25172] node_manager.cc:230: Initializing NodeManager with ID 5f4f53b3891a61c81991e86f1fc3dda550d6bd43ffe8a4ef054b49c5
[2021-05-27 23:34:14,666 I 22100 25172] grpc_server.cc:71: NodeManager server started, listening on port 51898.
[2021-05-27 23:34:14,786 I 22100 25172] raylet.cc:146: Raylet of id, 5f4f53b3891a61c81991e86f1fc3dda550d6bd43ffe8a4ef054b49c5 started. Raylet consists of node_manager and object_manager. node_manager address: 192.168.0.25:51898 object_manager address: 192.168.0.25:51896 hostname: 192.168.0.25
[2021-05-27 23:34:14,787 I 22100 15128] agent_manager.cc:76: Monitor agent process with pid 24972, register timeout 30000ms.
[2021-05-27 23:34:14,792 I 22100 25172] service_based_accessor.cc:579: Received notification for node id = 5f4f53b3891a61c81991e86f1fc3dda550d6bd43ffe8a4ef054b49c5, IsAlive = 1
[2021-05-27 23:34:15,544 I 22100 25172] worker_pool.cc:289: Started worker process of 1 worker(s) with pid 18128
[2021-05-27 23:34:16,228 W 22100 25172] worker_pool.cc:418: Received a register request from an unknown worker 22252
[2021-05-27 23:34:16,230 I 22100 25172] node_manager.cc:1132: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-05-27 23:34:16,230 I 22100 25172] node_manager.cc:1146: Ignoring client disconnect because the client has already been disconnected.
[2021-05-27 23:34:26,551 W 22100 9452] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:34,612 W 22100 9452] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:44,700 W 22100 9452] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:44,788 W 22100 25172] agent_manager.cc:82: Agent process with pid 24972 has not registered, restart it.
[2021-05-27 23:34:44,789 W 22100 15128] agent_manager.cc:92: Agent process with pid 24972 exit, return value 1067
[2021-05-27 23:34:45,545 I 22100 25172] worker_pool.cc:315: Some workers of the worker process(18128) have not registered to raylet within timeout.
[2021-05-27 23:34:45,793 I 22100 18252] agent_manager.cc:76: Monitor agent process with pid 25032, register timeout 30000ms.

And this is logged in gcs_server.out:

[2021-05-27 23:34:14,163 I 7164 3876] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2021-05-27 23:34:14,165 I 7164 3876] gcs_redis_failure_detector.cc:30: Starting redis failure detector.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:44: Loading job table data.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:56: Loading node table data.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:68: Loading object table data.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:81: Loading cluster resources table data.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:108: Loading actor table data.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:94: Loading placement group table data.
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:48: Finished loading job table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:60: Finished loading node table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:73: Finished loading object table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:85: Finished loading cluster resources table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:112: Finished loading actor table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:99: Finished loading placement group table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_heartbeat_manager.cc:30: GcsHeartbeatManager start, num_heartbeats_timeout=300
[2021-05-27 23:34:14,385 I 7164 3876] grpc_server.cc:71: GcsServer server started, listening on port 51888.
[2021-05-27 23:34:14,391 I 7164 3876] gcs_server.cc:276: Gcs server address = 192.168.0.25:51888
[2021-05-27 23:34:14,392 I 7164 3876] gcs_server.cc:280: Finished setting gcs server address: 192.168.0.25:51888
[2021-05-27 23:34:14,392 I 7164 3876] gcs_server.cc:379: GcsNodeManager: {RegisterNode request count: 0, UnregisterNode request count: 0, GetAllNodeInfo request count: 0, GetInternalConfig request count: 0}
GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, KillActor request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
GcsObjectManager: {GetObjectLocations request count: 0, GetAllObjectLocations request count: 0, AddObjectLocation request count: 0, RemoveObjectLocation request count: 0, Object count: 0}
GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}
GcsPubSub:
- num channels subscribed to: 0
- total commands queued: 0
DefaultTaskInfoHandler: {AddTask request count: 0, GetTask request count: 0, AddTaskLease request count: 0, GetTaskLease request count: 0, AttemptTaskReconstruction request count: 0}
[2021-05-27 23:34:14,786 I 7164 3876] gcs_node_manager.cc:34: Registering node info, node id = 5f4f53b3891a61c81991e86f1fc3dda550d6bd43ffe8a4ef054b49c5, address = 192.168.0.25
[2021-05-27 23:34:14,786 I 7164 3876] gcs_node_manager.cc:39: Finished registering node info, node id = 5f4f53b3891a61c81991e86f1fc3dda550d6bd43ffe8a4ef054b49c5, address = 192.168.0.25
[2021-05-27 23:34:14,792 I 7164 3876] gcs_job_manager.cc:93: Getting all job info.
[2021-05-27 23:34:14,792 I 7164 3876] gcs_job_manager.cc:99: Finished getting all job info.
[2021-05-27 23:34:15,544 I 7164 3876] gcs_job_manager.cc:26: Adding job, job id = 01000000, driver pid = 21792
[2021-05-27 23:34:15,544 I 7164 3876] gcs_job_manager.cc:36: Finished adding job, job id = 01000000, driver pid = 21792
[2021-05-27 23:34:26,246 W 7164 12960] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:34,310 W 7164 12960] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:44,398 W 7164 12960] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:54,467 W 7164 12960] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:35:04,538 W 7164 12960] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:35:14,392 I 7164 3876] gcs_server.cc:379: GcsNodeManager: {RegisterNode request count: 1, UnregisterNode request count: 0, GetAllNodeInfo request count: 3, GetInternalConfig request count: 1}
GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, KillActor request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
GcsObjectManager: {GetObjectLocations request count: 0, GetAllObjectLocations request count: 0, AddObjectLocation request count: 0, RemoveObjectLocation request count: 0, Object count: 0}
GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}

I am trying to start ray using this in python:

ray.init(local_mode=True, include_dashboard=False, num_gpus=1, num_cpus=1, logging_level=logging.DEBUG)

Python version: 3.7.8
Ray version: latest Windows release

mehrdadn commented 3 years ago

Please post a new issue if you encounter any problems on Windows! Thank you!