Agent returns action with np.nan when using extra_spaces.Simplex

System information

Linux Ubuntu 18.04
Ray installed with pip install ray
Ray 0.6.4
Python version 3.6.7
Exact command to reproduce: see section Source code / logs below.

Describe the problem

Context: My use case is to create custom gym.Env and solve those environments with ray agents. Ray has worked very well in my use case, where different agents have solved both continuous (gym.spaces.Box) and discrete (gym.spaces.Discrete) action spaces in custom environments.

Problem: Problems start when using the action space ray.rllib.models.extra_spaces.Simplex in the environment (see #4070). In particular, the problem is that the agent returns actions with some np.nan(s). However, I've done some diggings and logits are calculated correctly (before being mapped to action distribution with a softmax I guess), so the issue should be in the layer of logic in ray mapping logits to an action.

I want to remark that simply switching Simplex to Box in the code below allows PPO to solve the environment in a few minutes, so there must be something wrong with Simplex.

Motivation: Simplex does not seem to work properly at the moment but it would allow to solve many interesting use cases with constrained action spaces.

Question: Why is Simplex with PPO not working correctly in the self-contained and reproducible code snippet below?

Source code / logs

# Self contained reproducible example.
import ray
from ray import rllib
import gym
import numpy as np
ray.init()

class ContextualBanditSimplex(gym.Env):
    """Contextual bandit environment, optionally non-stationary."""

    def __init__(self, env_config: dict):
        """
        Parameters
        ----------
        k : int
            Number of arms.
        nr_iter_max : int
            Fixed episode length.
        std_dev : float
            Variance of the means sampled at the start of the episode.
            The higher the variance, the simpler the environment.

        References
        ----------
        https://gym.openai.com/docs/
        """
        # Inputs.
        self.k = env_config['k']
        self.nr_iter_max = env_config['nr_iter_max']
        self.std_dev = env_config['std_dev']

        # Spaces.
        # See: https://github.com/ray-project/ray/pull/4070
        self.action_space = ray.rllib.models.extra_spaces.Simplex(shape=(self.k, ))
        self.observation_space = gym.spaces.Box(-np.inf, +np.inf, (self.k, ), np.float32)

        # Reset at the start of an episode by ContextualBanditEnv.reset.
        self.nr_iter: int = None
        self.done: bool = None
        self.mean_reward: np.ndarray = None
        self.cov: np.ndarray = None
        self.state: np.ndarray = None

    def _sample_mean_reward(self):
        """Sample k numbers, representing the average daily return of each
        stock."""
        return np.random.normal(loc=self.std_dev, size=self.k)

    def _get_observation(self):
        """Sample reward from each leaver."""
        return np.random.multivariate_normal(self.mean_reward, self.cov)

    def reset(self):
        """It's your responsibility to call this method every time before the
        start of an episode."""
        self.nr_iter = 0
        self.done = False
        self.mean_reward = self._sample_mean_reward()
        self.cov = np.identity(self.k)
        self.state = self._get_observation()
        return self.state

    def step(self, action: int):
        """
        Parameters
        ----------
        action : int
            Integer presenting the i-th stock out of k in which you want to
            invest.

        Returns
        -------
        state : np.ndarray
            Next state, which consists in a sample from the return distribution
            from each stock.
        reward : float
            Return from your investment.
        done : bool
            Indicates whether the episode is ended. If so, call .reset() to
            start a new episode.
        info : dict
            An empty dict.
        """
        if self.done:
            raise ValueError("Episode is ended. Call ContextualBanditEnv.reset"
                             "to start a new episode.")
        if not self.action_space.contains(action):
            raise ValueError("Action {} does not belong to the action space {}."
                             "".format(action, self.action_space))
        reward = self._get_reward(action)
        self.nr_iter += 1
        self.done = self.nr_iter >= self.nr_iter_max
        self.state = self._get_observation()
        info = {}
        return self.state, reward, self.done, info

    def _get_reward(self, action):
        """Returns weighted average return of the portfolio."""
        if not self.action_space.contains(action):
            raise ValueError(
                "Action {} does not belong to action space {}."
                "".format(action, self.action_space)
            )
        rewards = self._get_observation()
        return (rewards * action).sum()

# Specify the custom env.
config = rllib.agents.ppo.DEFAULT_CONFIG.copy()
config['env'] = ContextualBanditSimplex
config['env_config'] = {'k': 5, 'nr_iter_max': 500, 'std_dev': 1}

# Instance the PPO agent.
agent = rllib.agents.ppo.PPOAgent(config, ContextualBanditSimplex)

# Sample a state from the environment to be fed in the policy to compute the action.
env = ContextualBanditSimplex(config['env_config'])
state = env.reset()

# (UNEXPECTED OUTPUT): Agent returns action with nans!
# It is not an exploding gradient problem because:
# 1- Training hasn't started yet.
# 2- I've decreased 'lr' just in case and the issue persists.
action = agent.compute_action(state)
print(action)
# [nan nan  0.  0.  0.]

# (UNEXPECTED OUTPUT): Action is still with the same nan issue, but logits seem are correct!
policy = agent.get_policy()
action, _, info = policy.compute_single_action(state, [])
print(action)
# [nan nan  0. nan  0.]
print(info)
# {'action_prob': nan, 'vf_preds': 0.005049658,
#  'logits': array([-0.00207247, -0.00186774,  0.00244786, -0.00415111,  0.00086402], dtype=float32)}

# (FAILS) As expected, `agent.train()` fails because agent returns actions with nans.
agent.train()
# [RayTaskError] ValueError: Action [ 0. nan  0. nan nan] does not 
# belong to the action space Simplex((10,); [1, 1, 1, 1, 1]).

@Szkered, any ideas here?

I've done more digging in the source code of ray and I've shed some light on why logits are well defined and actions probabilities are not.

The context is that the input of Dirichlet (the action distribution associated with Simplex) are logits, then passed to tf.distrib.Dirichlet in Dirichlet.__init__. So Dirichlet is parametrized by the logits of the Policy graph (I'm not sure if this is correct, probably not considering the output from the code below, but there we go).

Now, it turns out that internal parameters of tf.distrib.Dirichlet might be invalid, in which case a sample from the distribution return np.nan. See docs here) and example below.

with tf.Session() as session:
    logits = [-0.00207247, -0.00186774,  0.00244786, -0.00415111,  0.00086402]
    action_distrib = tf.distributions.Dirichlet(
        concentration=logits,
        allow_nan_stats=False,
    )
    sample = action_distrib.sample().eval()
    print(sample)

# Console:
[nan nan  0. nan  0.]

So we now know that it's the action distribution which created np.nan in the action probabilities. So the point seems to be that logits cannot directly feed to tf.distrib.Dirichlet, which is what's happening in the ray source code as of now. So now the question is: what is the right way to parametrize Dirichlet?

One possible solution could be to use the softmax of logits (see code below).

def softmax(x, axis=None):
    x = x - x.max(axis=axis, keepdims=True)
    y = np.exp(x)
    return y / y.sum(axis=axis, keepdims=True)

with tf.Session() as session:
    # Note that the concentration parameter is now the softmax logits,
    # as opposed to logits.
    action_distrib = tf.distributions.Dirichlet(
        concentration=softmax(logits),
        allow_nan_stats=False,
    )
    sample = action_distrib.sample().eval()
    print(sample)

# Console:
[2.39824214e-04 1.06406130e-05 6.64767340e-02 6.38899220e-02 8.69382879e-01]

QUESTION: could anyone confirm if providing the softmax of logits as the parametrization of Dirichlet makes sense? I'm not sure if this is mathematically sound. I would be happy to create a pull request if someone confirmed the solution. @Szkered @ericl

Based on https://en.m.wikipedia.org/wiki/Dirichlet_distribution, it might be undesirable to restrict the parameters to [0,1]. Perhaps an alternative parametrization would be to square each parameter independently?

Values of the concentration parameter above 1 prefer variates that are dense, evenly distributed distributions, i.e. all the values within a single sample are similar to each other. Values of the concentration parameter below 1 prefer sparse distributions, i.e. most of the values within a single sample will be close to 0, and the vast majority of the mass will be concentrated in a few of the values

I've tested a few possible solutions using the contextual bandit environment above.

Red line (top line): contextual bandit with discrete action space (so no Dirichlet). This is just a benchmark.
Grey and orange (2 lines in the middle): two runs using exponential of logits or -logits.
Lines at the bottom: square of logits, absolute value of logits, softmax of logits

So exp of logits might seem empirically and intuitively the way to go. Somehow, logits become all nan starting from iteration ~200k, so the action becomes constant to a vector of 1/n summing up to 1. This is probably due to me not being familiar enough with the source code to appropriately fix the bug in Dirichlet. My attempt has been to change the following line of code in Dirichlet.__init__, and I'm not aware of side effects of this change or if there is anything else that should be changed.

# Before.
self.dist = tf.distributions.Dirichlet(concentration=inputs)

# After
self.dist = tf.distributions.Dirichlet(concentration=tf.exp(inputs))

@ericl Let assume that the solution is to use the exponential of logits as concentration parameter for Dirichlet. Would you be able to provide insights about what else I should change in the source code? Thank you

@ericl I know how to fix this issue. I'm happy to send a pull request if you or someone else agrees with the solution below (and after than #4550 will be fixed).

Context: Use Simplex to describe the action space and therefore Dirichlet as action distribution. Bug: Agents return nans. Diagnosis: there are two separate issues.

The Dirichlet is parametrized by the output of the policy network (logits) which can be either positive or negative. However, Dirichlet requires all concentration parameters to be positive. Therefore, the agent returns nans whenever there is a non-positive value in the tensor of logits. This issue has been discussed in the previous posts.
When max(logits) - min(logits) >> 0, then a sampled action from the Dirichlet might contain a zero due to numerical error. However, the support of Dirichlet are positive real numbers, so 0s are not allowed. Therefore, when calculating the log probability of the sample during training, tensorflow raises an error.

Solutions:

Rather than parametrizing Dirichlet with the logits of the policy network, use the exponential of logits. If the logits are normally distributed, then their exp is distributed as a log-normal distribution. A log-normal distribution is a good representation of the concentration parameters of Dirichlet (positive, high density around 1 i.e. no priors at the beginning of training).
Use clipping (either when sampling or when calculating the log probabilities). I'm inclined to go for the latter because there is nothing wrong with sampling zeros per se (e.g Dirichlet to descrive weights of a financial portfolio where 0 means no investment in the i-th financial asset).

@FedericoFontana using exp(logits) seems like a reasonable choice to me. Do you observe stable training with this support? Clipping to some epsilon value during probability calculation sounds good too.

ray-project / ray