ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[tune] Can't reload a past experiment (pickling error in pyarrow?) #46740

Open wjn0 opened 1 month ago

wjn0 commented 1 month ago

What happened + What you expected to happen

I'm trying to run ExperimentAnalysis(old_experiment_directory). It's failing with the following stacktrace:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/nfs/dhi_work/walter/project/project/evaluate.py", line 93, in <module>
    main(args)
  File "/mnt/nfs/dhi_work/walter/project/project/evaluate.py", line 69, in main
    model = get_best_model(args.ray_root, device)
  File "/mnt/nfs/dhi_work/walter/project/project/evaluate.py", line 22, in get_best_model
    analysis = ExperimentAnalysis(directory)
  File "/home/username/.cache/pypoetry/virtualenvs/project-L8WgGYaP-py3.10/lib/python3.10/site-packages/ray/tune/analysis/experiment_analysis.py", line 137, in __init__
    self.trials = trials or self._load_trials()
  File "/home/username/.cache/pypoetry/virtualenvs/project-L8WgGYaP-py3.10/lib/python3.10/site-packages/ray/tune/analysis/experiment_analysis.py", line 148, in _load_trials
    trial = Trial.from_json_state(trial_json_state, stub=True)
  File "/home/username/.cache/pypoetry/virtualenvs/project-L8WgGYaP-py3.10/lib/python3.10/site-packages/ray/tune/experiment/trial.py", line 1231, in from_json_state
    state = json.loads(json_state, cls=TuneFunctionDecoder)
  File "/usr/lib/python3.10/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
  File "/home/username/.cache/pypoetry/virtualenvs/project-L8WgGYaP-py3.10/lib/python3.10/site-packages/ray/tune/utils/serialization.py", line 39, in object_hook
    return self._from_cloudpickle(obj)
  File "/home/username/.cache/pypoetry/virtualenvs/project-L8WgGYaP-py3.10/lib/python3.10/site-packages/ray/tune/utils/serialization.py", line 43, in _from_cloudpickle
    return cloudpickle.loads(hex_to_binary(obj["value"]))
AttributeError: type object 'pyarrow._fs.LocalFileSystem' has no attribute '_reconstruct'

Possibly related: apache/arrow#40342?

This was a pretty expensive experiment to run, so it would be great to be able to load it up again.

Versions / Dependencies

Reproduction script

Unfortunately, the issue seems to be with the experiment files themselves: they contain pickled objects that can no longer be unpickled. I think my best bet is a workaround, if someone has seen this issue before. I wouldn't expect new experiments to face the same issue.
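
To narrow down which stored fields are affected without unpickling anything, one option is to scan the saved trial state for hex-encoded payloads and flag those whose raw bytes mention pyarrow. The traceback shows the payloads being read from obj["value"] via hex_to_binary. This is only a sketch: the function names are hypothetical, the idea that trial state may be JSON nested inside a JSON string is an assumption, and where these strings live in the experiment directory is not shown in this issue.

```python
import json
from typing import Any, Iterator, List, Tuple


def find_hex_payloads(node: Any, path: str = "$") -> Iterator[Tuple[str, bytes]]:
    """Yield (path, raw_bytes) for every dict whose "value" is a hex string.

    Nothing is unpickled here; the bytes are only decoded from hex.
    """
    if isinstance(node, dict):
        value = node.get("value")
        if isinstance(value, str) and value:
            try:
                yield path, bytes.fromhex(value)
            except ValueError:
                pass  # "value" is an ordinary string, not a hex payload
        for key, child in node.items():
            yield from find_hex_payloads(child, f"{path}.{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from find_hex_payloads(child, f"{path}[{index}]")
    elif isinstance(node, str):
        # Assumption: some state is stored as JSON text inside a JSON string.
        if node.lstrip()[:1] in ("{", "["):
            try:
                nested = json.loads(node)
            except ValueError:
                return
            yield from find_hex_payloads(nested, f"{path}<json>")


def payloads_mentioning(json_text: str, needle: bytes = b"pyarrow") -> List[str]:
    """Return the paths of hex payloads whose bytes contain `needle`."""
    root = json.loads(json_text)
    return [path for path, raw in find_hex_payloads(root) if needle in raw]
```

Running payloads_mentioning on the text of a saved state file would list the fields that reference pyarrow, which helps judge whether they matter for the analysis you need.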

Issue Severity

Medium: It is a significant difficulty but I can work around it.

MoritzWillmann commented 1 month ago

As a workaround, I'd suggest installing pyarrow<17.
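
If it helps, the pyarrow<17 bound can be checked from the evaluation script before the experiment is loaded. This is a minimal sketch using only the standard library; the helper names are hypothetical, and it compares only the major version number.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as "16.1.0"."""
    return int(version.split(".", 1)[0])


def installed_pyarrow_version() -> Optional[str]:
    """Return the installed pyarrow version, or None if it is not installed."""
    try:
        return metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        return None


def satisfies_workaround(version: Optional[str], upper_major: int = 17) -> bool:
    """True if pyarrow is installed and its major version is below the bound."""
    return version is not None and major_version(version) < upper_major
```

Checking satisfies_workaround(installed_pyarrow_version()) before constructing ExperimentAnalysis gives an early, readable failure instead of the AttributeError shown in the traceback.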