ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib][Tune] AttributeError: 'TorchCategorical' object has no attribute 'log_prob' with PB2 #35923

Open norikazu99 opened 1 year ago

norikazu99 commented 1 year ago

What happened + What you expected to happen

The following error occurs when using PB2 with a custom environment that has a MultiDiscrete action space. The custom environment works fine when PB2 is not used.
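For context, here is a minimal sketch of the kind of environment involved. This is hypothetical: the actual Custom_env is not included in the issue, only the fact that it uses a MultiDiscrete action space.

import gymnasium as gym
import numpy as np
from gymnasium.spaces import Box, MultiDiscrete

class Custom_env(gym.Env):
    """Toy stand-in for the reporter's environment: MultiDiscrete actions."""

    def __init__(self, config=None):
        self.observation_space = Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = MultiDiscrete([3, 5])
        self._t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._t = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self._t += 1
        obs = self.observation_space.sample()
        reward = float(action[0])          # dummy reward
        terminated = self._t >= 100
        return obs, reward, terminated, False, {}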

(PPO pid=58072)   File "python\ray\_raylet.pyx", line 881, in ray._raylet.execute_task
(PPO pid=58072)   File "python\ray\_raylet.pyx", line 821, in ray._raylet.execute_task.function_executor
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\_private\function_manager.py", line 670, in actor_method_executor
(PPO pid=58072)     return method(__ray_actor, *args, **kwargs)
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\util\tracing\tracing_helper.py", line 460, in _resume_span
(PPO pid=58072)     return method(self, *_args, **_kwargs)
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 738, in __init__
(PPO pid=58072)     self._update_policy_map(policy_dict=self.policy_dict)
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\util\tracing\tracing_helper.py", line 460, in _resume_span
(PPO pid=58072)     return method(self, *_args, **_kwargs)
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 1985, in _update_policy_map
(PPO pid=58072)     self._build_policy_map(
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\util\tracing\tracing_helper.py", line 460, in _resume_span
(PPO pid=58072)     return method(self, *_args, **_kwargs)
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 2097, in _build_policy_map
(PPO pid=58072)     new_policy = create_policy_for_framework(
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\rllib\utils\policy.py", line 142, in create_policy_for_framework
(PPO pid=58072)     return policy_class(observation_space, action_space, merged_config)
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\rllib\algorithms\ppo\torch\ppo_torch_policy_rlm.py", line 82, in __init__
(PPO pid=58072)     self._initialize_loss_from_dummy_batch()
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\rllib\policy\policy.py", line 1405, in _initialize_loss_from_dummy_batch
(PPO pid=58072)     actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\rllib\policy\torch_policy_v2.py", line 522, in compute_actions_from_input_dict
(PPO pid=58072)     return self._compute_action_helper(
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\rllib\utils\threading.py", line 32, in wrapper
(PPO pid=58072)     raise e
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\rllib\utils\threading.py", line 24, in wrapper
(PPO pid=58072)     return func(self, *a, **k)
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\rllib\policy\torch_policy_v2.py", line 1110, in _compute_action_helper
(PPO pid=58072)     logp = action_dist.logp(actions)
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\rllib\models\torch\torch_distributions.py", line 324, in logp
(PPO pid=58072)     logps = torch.stack([cat.log_prob(act) for cat, act in zip(self._cats, value)])
(PPO pid=58072)   File "C:\Users\badrr\anaconda3\envs\yeah\lib\site-packages\ray\rllib\models\torch\torch_distributions.py", line 324, in <listcomp>
(PPO pid=58072)     logps = torch.stack([cat.log_prob(act) for cat, act in zip(self._cats, value)])
(PPO pid=58072) AttributeError: 'TorchCategorical' object has no attribute 'log_prob'
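Reading the last frames: torch.distributions objects expose log_prob(), but the objects iterated over in self._cats appear to be RLlib's own TorchCategorical wrappers, which (an assumption drawn from the traceback, not a verified reading of the 2.4.0 source) expose logp() instead. A tiny sketch of that naming mismatch:

import torch

# A plain torch Categorical has .log_prob() ...
dist = torch.distributions.Categorical(logits=torch.zeros(3))
print(dist.log_prob(torch.tensor(1)))      # works

# ... whereas calling .log_prob() on an object that only defines .logp()
# (as the traceback suggests for RLlib's TorchCategorical wrapper) fails.
class WrapperWithLogpOnly:                 # hypothetical stand-in, not RLlib code
    def __init__(self, dist):
        self._dist = dist

    def logp(self, value):
        return self._dist.log_prob(value)

wrapped = WrapperWithLogpOnly(dist)
wrapped.log_prob(torch.tensor(1))          # AttributeError, like in the report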

Versions / Dependencies

Python = 3.10.11
Windows 11
pytorch-cuda = 11.7
pytorch = 2.0.1
ray, ray.tune, ray.rllib, ray.air = 2.4.0

Reproduction script

import random

import ray
from ray import tune
from ray.air.integrations.wandb import WandbLoggerCallback
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune import sample_from
from ray.tune.schedulers.pb2 import PB2

ray.init()
config = PPOConfig().environment(Custom_env)  # Custom_env: the reporter's MultiDiscrete-action environment

config = config.resources(num_learner_workers=1, num_gpus_per_learner_worker=0.25, num_gpus_per_worker=0,
                          num_cpus_for_local_worker=1, num_cpus_per_learner_worker=0, num_cpus_per_worker=1,
                          placement_strategy="PACK")  # or "SPREAD"

config = config.rollouts(num_rollout_workers=1, num_envs_per_worker=1, create_env_on_local_worker=False, preprocessor_pref="rllib")

config = config.training(
    gamma=sample_from(lambda spec: random.uniform(0.9, 0.999)),
    lr=sample_from(lambda spec: random.uniform(1e-5, 1e-3)),
    _enable_learner_api=True,
).rl_module(_enable_rl_module_api=True)

config.sgd_minibatch_size = 8192 
config.train_batch_size = 8192
config.num_sgd_iter = 5
pb2_scheduler = PB2(
    time_attr='training_iteration',
    metric='episode_reward_mean',
    mode='max',
    perturbation_interval=5,
    hyperparam_bounds={
        "lr": [1e-5, 1e-3],
        "gamma": [0.9, 0.999],
    },
    quantile_fraction=0.25,
    require_attrs=True,
    synch=False,
)

tune.run("PPO", config=config, local_dir="pb2_test", scheduler=pb2_scheduler, num_samples=4,
        stop={"training_iteration":20}, callbacks=[WandbLoggerCallback(project="pb2_test", log_config=True, save_code=True)])

Issue Severity

High: It blocks me from completing my task.

Finebouche commented 1 year ago

Hi, I'm getting the same error for my custom multi-agent environment with a MultiDiscrete action space.

I tried different schedulers but got stuck on the same issue:

pbt_scheduler = PopulationBasedTraining(
    time_attr='training_iteration',
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=5,
    quantile_fraction=0.25,
    # Specifies the search space for these hyperparams
    hyperparam_mutations={
        "lambda": lambda: random.uniform(0.9, 1.0),
        "clip_param": lambda: random.uniform(0.1, 0.5),
        "lr": lambda: random.uniform(1e-5, 1e-3),  # bounds ordered low-to-high
        "train_batch_size": lambda: random.randint(1000, 60000),
    },
)

pb2_scheduler = PB2(
    time_attr='training_iteration',
    metric='episode_reward_mean',
    mode='max',
    perturbation_interval=5,
    hyperparam_bounds={
        "lr": [1e-5, 1e-3],
        "gamma": [0.9, 0.999],
    },
    quantile_fraction=0.25,
    require_attrs=True,
    synch=False,
)