ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Bug] [RLlib] Tensorboard w/ RLlib not plotting any data at all #23582

Open Arcadianlee opened 2 years ago

Arcadianlee commented 2 years ago

Search before asking

Ray Component

RLlib

Issue Severity

High: It blocks me from completing my task.

What happened + What you expected to happen

Hi, my issue is that the result files (e.g. result.json, progress.csv) inside the ray_results folder are all empty, meaning no training data was saved for TensorBoard during the training process. This leads to TensorBoard not plotting anything at all. I'm using RLlib's DQN trainer (the torch version) with TensorBoard (not TensorBoardX) on a Linux CentOS machine. Any ideas why this happens?

Code:

trainer = dqn.DQNTrainer(env=FdtdEnv, config=config)

num_episodes = 500
reward_threshold = 1000.0

for i_episode in range(num_episodes):
    print('\nStarting episode No.{}'.format(i_episode + 1))
    results = trainer.train()

    # if i_episode % 25 == 0:
    #     checkpoint = trainer.save()

    print(pretty_print(results))

    if results["episode_reward_mean"] >= reward_threshold:
        print('\nSolved! Episode: {}, Steps: {}, Current_state: {}, Current_score: {}\n'.format(
            i_episode, results["agent_timesteps_total"], next_state, results["episode_reward_mean"]))
        break

[Screenshots attached]

Versions / Dependencies

TensorBoard 2.6.0
Python 3.9.7
Ray 1.11.0
CentOS 7.9.2009

Reproduction script

trainer = dqn.DQNTrainer(env=FdtdEnv, config=config)

# main training loop
num_episodes = 500
tempRew = -1000
lastScore = 0
maxScore = []
reward_threshold = 1000.0

for i_episode in range(num_episodes):
    print('\nStarting episode No.{}'.format(i_episode + 1))
    results = trainer.train()

    # if i_episode % 25 == 0:
    #     checkpoint = trainer.save()

    print(pretty_print(results))

    if results["episode_reward_mean"] >= reward_threshold:
        print('\nSolved! Episode: {}, Steps: {}, Current_state: {}, Current_score: {}\n'.format(
            i_episode, results["agent_timesteps_total"], next_state, results["episode_reward_mean"]))
        break
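As a sanity check against the empty files, a minimal sketch (assuming the trainer object above; trainer.logdir is the Trainable log directory where the default logger writes result.json, progress.csv, and the TF event files, so the sizes can be compared between train() calls):

```python
import os

# Hedged sketch: print the trainer's log directory and the size of each file in
# it, to see whether result.json / progress.csv / event files are actually growing.
print(trainer.logdir)
for fname in sorted(os.listdir(trainer.logdir)):
    path = os.path.join(trainer.logdir, fname)
    if os.path.isfile(path):
        print(fname, os.path.getsize(path), "bytes")
```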

Anything else

No response

Are you willing to submit a PR?

Arcadianlee commented 2 years ago

Update: I installed the latest version of TensorBoardX, and the problem persists.

krfricke commented 2 years ago

Hi @Arcadianlee, can you try running your training using tune.run()? See e.g. https://docs.ray.io/en/latest/rllib/rllib-training.html#basic-python-api
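For illustration, a minimal sketch of that suggestion using the objects from the snippet above (FdtdEnv and config are the user's own; the "framework": "torch" setting and the stop threshold just mirror the original script):

```python
from ray import tune

# Minimal sketch, not a definitive setup: run the same DQN training through Tune,
# which handles result logging (CSV, JSON, TensorBoard event files) per trial.
analysis = tune.run(
    "DQN",
    config={**config, "env": FdtdEnv, "framework": "torch"},
    stop={"episode_reward_mean": 1000.0},
    metric="episode_reward_mean",
    mode="max",
    local_dir="~/ray_results",  # then: tensorboard --logdir ~/ray_results
)
print(analysis.best_logdir)  # trial dir containing progress.csv and the TF event files
```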

gresavage commented 10 months ago

You and I may be having similar issues.

After following @krfricke's comment, please also make sure builder.py is present in protobuf's site-packages directory. There is an issue with protobuf<3.20 where builder.py is missing, which ultimately causes the TF event files to be empty. See here for more details.
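A quick way to check, sketched under the assumption that the file in question is google/protobuf/internal/builder.py (which ships with protobuf >= 3.20):

```python
import importlib.util

import google.protobuf

# Print the installed protobuf version and whether builder.py is importable.
print(google.protobuf.__version__)
spec = importlib.util.find_spec("google.protobuf.internal.builder")
print(spec.origin if spec else "google.protobuf.internal.builder is missing")
```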

Unfortunately, TensorFlow's dependency constraints work out such that we are stuck with a problematic release of protobuf.

If you find you're still having issues after that, then I think you and I may be experiencing the same or related issues. In my case, however, I get some initial data in the TF event files and progress.csv, but after a seemingly arbitrary number of iterations all of the files under the trial directory are present but completely empty. I've attached some screenshots showing that data was being recorded in TensorBoard and in the TF event files/progress.csv up to a point, but now all the files are mysteriously empty.

I will try to get a minimal working example script attached to this thread soon. FWIW, I always use tune.run() for my experiments and have recently been doing a lot of testing with the new RL Module and Learner APIs... I cannot remember at the moment whether the issue occurs under the old ModelV2 API.

This was not an issue I experienced with Ray 2.6 or below. Please LMK if this seems similar to your issue; otherwise I will open a separate issue for the problems I'm having.

[Screenshots attached: data recorded in TensorBoard and the trial files up to a point, then empty]

gresavage commented 10 months ago

Also, here are segments of the tracebacks from Tune, all for the same experiment. This trial errored initially for other reasons, but when Tune tried to resume/continue it, these different errors resulted. It's important to note that even when one of my Tune trials doesn't error, the aforementioned files are still empty, and that these errors refer to the existence of checkpoint files and some strange behavior with importlib, so they may be a useful breadcrumb:

2023-10-19 23:38:55,946 ERROR tune_controller.py:1502 -- Trial task failed for trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000
Traceback (most recent call last):
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::VFDPPO.restore() (pid=2211586, ip=192.168.86.56, actor_id=f26abfbcb38b2bc4534378d401000000, repr=VFDPPO)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 976, in restore
    self.load_checkpoint(checkpoint_dir)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/corl/experiments/rllib_experiment.py", line 528, in load_checkpoint
    super(trainer_class, cls).load_checkpoint(checkpoint_path)  # type: ignore
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2152, in load_checkpoint
    self.__setstate__(checkpoint_data)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2595, in __setstate__
    self.workers.local_worker().set_state(state["worker"])
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1454, in set_state
    self.policy_map[pid].set_state(policy_state)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/torch_mixins.py", line 114, in set_state
    super().set_state(state)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 1091, in set_state
    super().set_state(state)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/policy.py", line 1059, in set_state
    policy_spec = PolicySpec.deserialize(state["policy_spec"])
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/policy/policy.py", line 161, in deserialize
    policy_class = get_policy_class(spec["policy_class"])
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/rllib/algorithms/registry.py", line 451, in get_policy_class
    module = importlib.import_module("ray.rllib.algorithms." + path)
TypeError: can only concatenate str (not "ABCMeta") to str

Trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000 errored after 39 iterations at 2023-10-19 23:38:55. Total running time: 26min 52s
Error file: /tmp/data/VFD/PPO-MT/LL/ray_results/LL-PPO-MT/LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000_0_2023-10-19_23-12-03/error.txt
2023-10-19 23:39:34,036 ERROR tune_controller.py:1502 -- Trial task failed for trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000
Traceback (most recent call last):
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::VFDPPO.restore() (pid=2213824, ip=192.168.86.56, actor_id=06d75ce2034489b567e20a3d01000000, repr=VFDPPO)
  File "/home/tgresavage/mambaforge/envs/vfd_env_2/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 954, in restore
    if not _exists_at_fs_path(checkpoint.filesystem, checkpoint.path):
AttributeError: 'NoneType' object has no attribute 'filesystem'

Trial LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000 errored after 39 iterations at 2023-10-19 23:39:34. Total running time: 27min 30s
Error file: /tmp/data/VFD/PPO-MT/LL/ray_results/LL-PPO-MT/LL-PPO-MT-VFDPPO_CorlMultiAgentEnv_70f3f_00000_0_2023-10-19_23-12-03/error.txt

Be aware that the error.txt file mentioned above is also empty.

Arcadianlee commented 10 months ago

I am beginning to suspect that tune.run() isn't compatible with TensorBoard.