ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Custom action_sampler_fn is not working for PPO. #21648

Closed. n30111 closed this issue 2 years ago.

n30111 commented 2 years ago

Ray Component

RLlib

What happened + What you expected to happen

PPO does not work when using a custom action_sampler_fn together with make_model. Policy construction fails with the traceback below:

 File "/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 587, in __init__
    self._build_policy_map(
  File "/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1551, in _build_policy_map
    self.policy_map.create_policy(name, orig_cls, obs_space, act_space,
  File "/lib/python3.8/site-packages/ray/rllib/policy/policy_map.py", line 133, in create_policy
    self[policy_id] = class_(
  File "/lib/python3.8/site-packages/ray/rllib/policy/tf_policy_template.py", line 238, in __init__
    DynamicTFPolicy.__init__(
  File "/lib/python3.8/site-packages/ray/rllib/policy/dynamic_tf_policy.py", line 376, in __init__
    self._initialize_loss_from_dummy_batch(
  File "/lib/python3.8/site-packages/ray/rllib/policy/dynamic_tf_policy.py", line 649, in _initialize_loss_from_dummy_batch
    losses = self._do_loss_init(train_batch)
  File "/lib/python3.8/site-packages/ray/rllib/policy/dynamic_tf_policy.py", line 731, in _do_loss_init
    losses = self._loss_fn(self, self.model, self.dist_class, train_batch)
  File "/python3.8/site-packages/ray/rllib/agents/ppo/ppo_tf_policy.py", line 56, in ppo_surrogate_loss
    curr_action_dist = dist_class(logits, model)
TypeError: 'NoneType' object is not callable

Versions / Dependencies

Python 3.8, ray==1.9.2

Reproduction script

from typing import Any, Optional, Tuple

import numpy as np
import ray
from gym.spaces import Box, Space
from ray.rllib.agents.ppo import ppo, ppo_tf_policy
from ray.rllib.agents.trainer_template import build_trainer
from ray.rllib.models.catalog import ModelCatalog
from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.policy.policy import Policy
from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.policy.tf_policy_template import build_tf_policy
from ray.rllib.utils.framework import try_import_tf
from ray.rllib.utils.tf_utils import zero_logps_from_actions
from ray.rllib.utils.typing import TensorType, TrainerConfigDict

tf1, tf, tfv = try_import_tf()

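# Build the policy's model with 2 * action_dim outputs (mean and log-std per
# action dimension) and attach the matching action-distribution class.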
def make_model(
    policy: Policy, obs_space: Space, action_space: Space, config: TrainerConfigDict
) -> ModelV2:
    if not isinstance(action_space, Box):
        raise ValueError("This reproduction script only supports Box action spaces.")
    num_outputs = 2 * int(np.prod(action_space.shape))
    model = ModelCatalog.get_model_v2(
        obs_space=obs_space,
        action_space=action_space,
        num_outputs=num_outputs,
        model_config=config["model"],
        framework="tf",
    )
    # Note: this assignment does not stick when a custom action_sampler_fn is
    # used; dist_class ends up None by loss-init time (see the traceback above).
    policy.dist_class, _ = ModelCatalog.get_action_dist(action_space, config["model"])
    return model

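# Custom sampler: run the model manually, build the action distribution by
# hand, and return deterministic actions with zero log-probabilities.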
def action_sampler_fn(
    policy: Policy,
    model: ModelV2,
    obs_batch: TensorType,
    explore: bool = True,
    state_batches: Optional[TensorType] = None,
    seq_lens: Optional[TensorType] = None,
    prev_action_batch: Optional[TensorType] = None,
    prev_reward_batch: Optional[TensorType] = None,
    **kwargs: Any,
) -> Tuple[TensorType, TensorType]:
    distribution_inputs, policy._state_out = policy.model(
        {
            SampleBatch.OBS: obs_batch,
            "obs_flat": obs_batch,
            "is_training": policy._get_is_training_placeholder(),
            SampleBatch.PREV_ACTIONS: prev_action_batch,
            SampleBatch.PREV_REWARDS: prev_reward_batch,
        },
        state_batches,
        seq_lens,
    )
    action_dist_class, _ = ModelCatalog.get_action_dist(
        policy.action_space, policy.config["model"]
    )
    action_dist = action_dist_class(distribution_inputs, model)
    action = action_dist.deterministic_sample()

    logp = zero_logps_from_actions(action)
    return action, logp

PPOTFPolicy = build_tf_policy(
    name="PPOTFPolicy",
    loss_fn=ppo_tf_policy.ppo_surrogate_loss,
    make_model=make_model,
    action_sampler_fn=action_sampler_fn,
    get_default_config=lambda: ray.rllib.agents.ppo.ppo.DEFAULT_CONFIG,
    postprocess_fn=ppo_tf_policy.compute_gae_for_sample_batch,
    stats_fn=ppo_tf_policy.kl_and_loss_stats,
    compute_gradients_fn=ppo_tf_policy.compute_and_clip_gradients,
    extra_action_out_fn=ppo_tf_policy.vf_preds_fetches,
    before_init=ppo_tf_policy.setup_config,
    before_loss_init=ppo_tf_policy.setup_mixins,
    mixins=[
        ppo_tf_policy.LearningRateSchedule,
        ppo_tf_policy.EntropyCoeffSchedule,
        ppo_tf_policy.KLCoeffMixin,
        ppo_tf_policy.ValueNetworkMixin,
    ],
)

DEFAULT_CONFIG = ppo.DEFAULT_CONFIG

PPOTrainer = build_trainer(
    name="PPO",
    default_config=ppo.DEFAULT_CONFIG,
    validate_config=ppo.validate_config,
    default_policy=PPOTFPolicy,
    execution_plan=ppo.execution_plan,
)

config = DEFAULT_CONFIG.copy()
trainer = PPOTrainer(config, env="LunarLanderContinuous-v2")
for i in range(250):
    trainer.train()

Anything else

No response

Are you willing to submit a PR?

sven1977 commented 2 years ago

Hey @n30111, thanks for raising this. The answer here is that if you do want to use an action_sampler_fn (in which case you take charge of action computation entirely, without the help of the policy's built-in action-distribution/sampling utilities), you have to make sure that your loss function handles the absence of an action-distribution class.

From looking at your action_sampler_fn, it seems that all you are trying to do is return a deterministic action (instead of one sampled from the distribution). You can achieve that by simply setting config.explore=False in PPO. However, if you are trying to do more complex things in your custom action_sampler_fn, you would also need to redefine your loss to handle the dist_class=None issue.
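For the purely deterministic case, something along these lines should be all you need (rough, untested sketch):

# Rough sketch (untested): drop the custom sampler entirely and disable
# exploration, so PPO's built-in action distribution is used deterministically.
from ray.rllib.agents.ppo import ppo

config = ppo.DEFAULT_CONFIG.copy()
config["explore"] = False  # deterministic actions instead of sampled ones

trainer = ppo.PPOTrainer(config, env="LunarLanderContinuous-v2")
trainer.train()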

To summarize RLlib's behavior: when a custom action_sampler_fn is provided, the policy skips setting up its usual action-distribution class, so dist_class stays None, and any loss function that builds a distribution from it (such as PPO's surrogate loss) will fail unless it is adapted accordingly.
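If you do keep the custom sampler, one rough (untested) way to make the loss cope with the missing dist_class is to rebuild it from the catalog inside a small wrapper; the wrapper name below is just illustrative:

# Rough sketch (untested): wrap the stock PPO loss and recover an
# action-distribution class, since RLlib passes dist_class=None into the loss
# when the policy was built with a custom action_sampler_fn.
from ray.rllib.agents.ppo import ppo_tf_policy
from ray.rllib.models.catalog import ModelCatalog

def ppo_surrogate_loss_with_sampler_fn(policy, model, dist_class, train_batch):
    if dist_class is None:
        dist_class, _ = ModelCatalog.get_action_dist(
            policy.action_space, policy.config["model"]
        )
    # NOTE: further changes may still be needed, e.g. making sure the sampler
    # records every field the surrogate loss reads from the train batch.
    return ppo_tf_policy.ppo_surrogate_loss(policy, model, dist_class, train_batch)

Then pass loss_fn=ppo_surrogate_loss_with_sampler_fn to build_tf_policy instead of the stock ppo_surrogate_loss.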

n30111 commented 2 years ago

Thanks @sven1977, we were trying to do additional things (not just deterministic sampling) in the custom action_sampler_fn. Since this worked for SAC, I was expecting it to work for PPO as well.

n30111 commented 2 years ago

I was able to make it work with minor changes to ppo_surrogate_loss and make_model, plus a change to the action_sampler_fn output signature in the policy class. @sven1977 Can you please look at this commit https://github.com/minds-ai/ray/commit/eba38ecc8b7e4eeeacb95d1ca93a0c72a343b5d3 and let me know whether RLlib would accept this change?

n30111 commented 2 years ago

Hi @sven1977, please let us know your thoughts on this.

gjoliver commented 2 years ago

added some comments to your commit. let's move the discussion there. thanks.