ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Tune] Cannot run the QuickStart example code on Windows after installing Ray in conda environment, reporting FileNotFoundError #46827

Open ksnof opened 3 months ago

ksnof commented 3 months ago

### What happened + What you expected to happen

Hi, I installed Ray with pip in a conda environment:

`pip install -U "ray[default]"`
`pip install -U "ray[data,train,tune,serve]"`

After installing, I switched to PyCharm and tried to run the QuickStart example in the Python console, which raised this FileNotFoundError. Could you help me out? Thank you.

Following is the console output:

PyDev console: using IPython 8.12.0

Python 3.11.4 | packaged by Anaconda, Inc. | (main, Jul 5 2023, 13:38:37) [MSC v.1916 64 bit (AMD64)] on win32

In [2]: from ray import train, tune
   ...:
   ...:
   ...: def objective(config):  # ①
   ...:     score = config["a"] ** 2 + config["b"]
   ...:     return {"score": score}
   ...:
   ...:
   ...: search_space = {  # ②
   ...:     "a": tune.grid_search([0.001, 0.01, 0.1, 1.0]),
   ...:     "b": tune.choice([1, 2, 3]),
   ...: }
   ...:
   ...: tuner = tune.Tuner(objective, param_space=search_space)  # ③
   ...:
   ...: results = tuner.fit()
   ...: print(results.get_best_result(metric="score", mode="min").config)

2024-07-28 17:25:56,809 INFO worker.py:1781 -- Started a local Ray instance.
2024-07-28 17:26:00,078 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call ray.init(...) before Tuner(...).
╭──────────────────────────────────────────────────────────────────╮
│ Configuration for experiment     objective_2024-07-28_17-25-49   │
├──────────────────────────────────────────────────────────────────┤
│ Search algorithm                 BasicVariantGenerator           │
│ Scheduler                        FIFOScheduler                   │
│ Number of trials                 4                               │
╰──────────────────────────────────────────────────────────────────╯

View detailed results here: C:/Users/sykdr/ray_results/objective_2024-07-28_17-25-49
To visualize your results with TensorBoard, run: tensorboard --logdir C:/Users/sykdr/AppData/Local/Temp/ray/session_2024-07-28_17-25-53_839841_18676/artifacts/2024-07-28_17-26-00/objective_2024-07-28_17-25-49/driver_artifacts

Trial status: 4 PENDING
Current time: 2024-07-28 17:26:14. Total running time: 0s
Logical resource usage: 4.0/12 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:G)
╭────────────────────────────────────────────────╮
│ Trial name              status      b        a │
├────────────────────────────────────────────────┤
│ objective_b186a_00000   PENDING     3    0.001 │
│ objective_b186a_00001   PENDING     1     0.01 │
│ objective_b186a_00002   PENDING     1      0.1 │
│ objective_b186a_00003   PENDING     2        1 │
╰────────────────────────────────────────────────╯
(pid=34932)
(pid=23860)
(pid=34900)
(pid=29072)

Trial objective_b186a_00000 started with configuration:
╭──────────────────────────────────────────────╮
│ Trial objective_b186a_00000 config           │
├──────────────────────────────────────────────┤
│ a                                      0.001 │
│ b                                          3 │
╰──────────────────────────────────────────────╯

Traceback (most recent call last):
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\tensorboardX\record_writer.py", line 58, in open_file
    factory = REGISTERED_FACTORIES[prefix]
KeyError: 'C'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\IPython\core\interactiveshell.py", line 3505, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-3f24e1ff21da>", line 16, in <module>
    results = tuner.fit()
              ^^^^^^^^^^^
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\ray\tune\tuner.py", line 377, in fit
    return self._local_tuner.fit()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\ray\tune\impl\tuner_internal.py", line 476, in fit
    analysis = self._fit_internal(trainable, param_space)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\ray\tune\impl\tuner_internal.py", line 592, in _fit_internal
    analysis = run(
               ^^^^
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\ray\tune\tune.py", line 994, in run
    runner.step()
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\ray\tune\execution\tune_controller.py", line 685, in step
    if not self._actor_manager.next(timeout=0.1):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\ray\air\execution\_internal\actor_manager.py", line 221, in next
    self._actor_state_events.resolve_future(future)
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\ray\air\execution\_internal\event_manager.py", line 118, in resolve_future
    on_result(result)
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\ray\air\execution\_internal\actor_manager.py", line 380, in on_actor_start
    self._actor_start_resolved(
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\ray\air\execution\_internal\actor_manager.py", line 242, in _actor_start_resolved
    tracked_actor._on_start(tracked_actor)
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\ray\tune\execution\tune_controller.py", line 1131, in _actor_started
    self._callbacks.on_trial_start(
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\ray\tune\callback.py", line 398, in on_trial_start
    callback.on_trial_start(**info)
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\ray\tune\logger\logger.py", line 147, in on_trial_start
    self.log_trial_start(trial)
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\ray\tune\logger\tensorboardx.py", line 202, in log_trial_start
    self._trial_writer[trial] = self._summary_writer_cls(
                                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\tensorboardX\writer.py", line 300, in __init__
    self._get_file_writer()
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\tensorboardX\writer.py", line 348, in _get_file_writer
    self.file_writer = FileWriter(logdir=self.logdir,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\tensorboardX\writer.py", line 104, in __init__
    self.event_writer = EventFileWriter(
                        ^^^^^^^^^^^^^^^^
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\tensorboardX\event_file_writer.py", line 106, in __init__
    self._ev_writer = EventsWriter(os.path.join(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\tensorboardX\event_file_writer.py", line 43, in __init__
    self._py_recordio_writer = RecordWriter(self._file_name)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\tensorboardX\record_writer.py", line 182, in __init__
    self._writer = open_file(path)
                   ^^^^^^^^^^^^^^^
  File "D:\Anaconda3\envs\torch2.0_CUDA11.8\Lib\site-packages\tensorboardX\record_writer.py", line 61, in open_file
    return open(path, 'wb')
           ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/sykdr/AppData/Local/Temp/ray/session_2024-07-28_17-25-53_839841_18676/artifacts/2024-07-28_17-26-00/objective_2024-07-28_17-25-49/driver_artifacts/objective_b186a_00000_0_a=0.0010,b=3_2024-07-28_17-26-14\\events.out.tfevents.1722180380.Lenovo-Legion5Yukai'
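
A possible angle, not a confirmed diagnosis: the directory in the error message is deeply nested, so it may be worth checking its length against the legacy Windows path limit (`MAX_PATH`, 260 characters including the terminating NUL). The sketch below only measures the path copied from the traceback; the helper name `exceeds_max_path` is hypothetical, and whether this limit applies depends on the long-path settings of the machine.

```python
# Path copied from the FileNotFoundError above ("\\" is a single backslash).
failing_path = (
    "C:/Users/sykdr/AppData/Local/Temp/ray/"
    "session_2024-07-28_17-25-53_839841_18676/artifacts/2024-07-28_17-26-00/"
    "objective_2024-07-28_17-25-49/driver_artifacts/"
    "objective_b186a_00000_0_a=0.0010,b=3_2024-07-28_17-26-14\\"
    "events.out.tfevents.1722180380.Lenovo-Legion5Yukai"
)

# Legacy Win32 limit: 260 characters, counting the terminating NUL.
WINDOWS_MAX_PATH = 260


def exceeds_max_path(path: str, limit: int = WINDOWS_MAX_PATH) -> bool:
    """Hypothetical helper: True if `path` cannot fit within the legacy limit."""
    # The NUL terminator takes one slot, so a path of `limit` characters is too long.
    return len(path) >= limit


print(len(failing_path), exceeds_max_path(failing_path))
```

If the check reports `True`, trying a shorter temporary or results directory would be a cheap way to confirm or rule out this explanation.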

### Versions / Dependencies

OS: Windows 11
Ray: 2.33
Python: 3.11

### Reproduction script

`pip install -U "ray[default]"`
`pip install -U "ray[data,train,tune,serve]"`

```python
from ray import train, tune

def objective(config):  # ①
    score = config["a"] ** 2 + config["b"]
    return {"score": score}

search_space = {  # ②
    "a": tune.grid_search([0.001, 0.01, 0.1, 1.0]),
    "b": tune.choice([1, 2, 3]),
}

tuner = tune.Tuner(objective, param_space=search_space)  # ③

results = tuner.fit()
print(results.get_best_result(metric="score", mode="min").config)
```

### Issue Severity

High: It blocks me from completing my task.

karstenddwx commented 2 months ago

It seems I'm facing the same issue. The standard trial metrics files (result.json and progress.csv) are not written for every trial: most trials have them, but a few do not. The behavior is non-deterministic; the files are usually present and only rarely missing, for exactly the same training run. I use the standard callbacks.

FileNotFoundError('Could not fetch metrics for DQN_MA_SAE_model_fcnet_activation=relu,hiddens=[],n_step=6_2024-08-30_00-24-55_848afd15: both result.json and progress.csv were not found at -> trial location

OS: Red Hat 9.4
Ray: 2.34
Python: 3.9
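
To narrow down which trials are affected, a small scan of the experiment directory can list the trial folders that contain neither result.json nor progress.csv. This is a minimal sketch using only the standard library; it assumes each trial has its own subdirectory directly under the experiment directory, and the function name and demo folder names are hypothetical.

```python
import tempfile
from pathlib import Path

METRIC_FILES = ("result.json", "progress.csv")


def trials_missing_metrics(experiment_dir):
    """Return names of trial subdirectories that contain neither metrics file."""
    missing = []
    for trial_dir in sorted(Path(experiment_dir).iterdir()):
        if not trial_dir.is_dir():
            continue  # skip experiment-level files
        if not any((trial_dir / name).is_file() for name in METRIC_FILES):
            missing.append(trial_dir.name)
    return missing


# Demo on a throwaway directory with hypothetical trial names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "trial_ok").mkdir()
    (Path(tmp) / "trial_ok" / "result.json").write_text("{}")
    (Path(tmp) / "trial_missing").mkdir()
    demo_missing = trials_missing_metrics(tmp)

print(demo_missing)  # ['trial_missing']
```

Running it against the real experiment directory after several identical runs would show whether the same trials are affected each time.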