@Szkered, any ideas here?
I've done more digging in the source code of ray, and I've shed some light on why the logits are well defined while the action probabilities are not.
The context is that the inputs of Dirichlet (the action distribution associated with Simplex) are logits, which are passed straight to tf.distributions.Dirichlet in Dirichlet.__init__. So Dirichlet is parametrized by the logits of the policy graph (I'm not sure this is correct; probably not, considering the output of the code below, but there we go).
Now, it turns out that the concentration parameters of tf.distributions.Dirichlet can be invalid, in which case a sample from the distribution returns np.nan. See the TensorFlow docs and the example below.
import tensorflow as tf

with tf.Session() as session:
    # Raw logits from the policy graph: note that some are negative,
    # which is not a valid Dirichlet concentration parameter.
    logits = [-0.00207247, -0.00186774, 0.00244786, -0.00415111, 0.00086402]
    action_distrib = tf.distributions.Dirichlet(
        concentration=logits,
        allow_nan_stats=False,
    )
    sample = action_distrib.sample().eval()
    print(sample)
# Console:
# [nan nan  0. nan  0.]
So we now know that it is the action distribution that creates the np.nan values in the action probabilities. The point is that a Dirichlet's concentration parameters must be strictly positive (its density is proportional to prod_i x_i^(alpha_i - 1), which is only well defined for alpha_i > 0), whereas logits can take any real value. So logits cannot be fed directly to tf.distributions.Dirichlet, which is what the ray source code does as of now. The question is: what is the right way to parametrize Dirichlet?
One possible solution could be to use the softmax of logits (see code below).
import numpy as np
import tensorflow as tf

def softmax(x, axis=None):
    # Numerically stable softmax: maps arbitrary reals to positive
    # values that sum to 1.
    x = x - x.max(axis=axis, keepdims=True)
    y = np.exp(x)
    return y / y.sum(axis=axis, keepdims=True)

with tf.Session() as session:
    logits = np.array([-0.00207247, -0.00186774, 0.00244786, -0.00415111, 0.00086402])
    # Note that the concentration parameter is now the softmax of the
    # logits, as opposed to the raw logits.
    action_distrib = tf.distributions.Dirichlet(
        concentration=softmax(logits),
        allow_nan_stats=False,
    )
    sample = action_distrib.sample().eval()
    print(sample)
# Console:
# [2.39824214e-04 1.06406130e-05 6.64767340e-02 6.38899220e-02 8.69382879e-01]
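A possible caveat with the softmax parametrization (my own reasoning here, not something taken from the ray source): the concentration vector then always sums to 1, so the total concentration is pinned at 1 and every component is at most 1, meaning the policy can only ever express sparse-preferring Dirichlets. A quick check:

import numpy as np

for logits in ([0.0, 0.0, 0.0], [5.0, -3.0, 1.0], [100.0, 50.0, -20.0]):
    x = np.asarray(logits)
    conc = np.exp(x - x.max())
    conc = conc / conc.sum()
    # Components are at most 1 and always sum to 1, whatever the logits.
    print(conc, conc.sum())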
QUESTION: could anyone confirm whether providing the softmax of the logits as the parametrization of Dirichlet makes sense? I'm not sure it is mathematically sound. I would be happy to create a pull request if someone confirms the solution. @Szkered @ericl
Based on https://en.m.wikipedia.org/wiki/Dirichlet_distribution, it might be undesirable to restrict the parameters to [0,1]. Perhaps an alternative parametrization would be to square each parameter independently?
From the Wikipedia article: "Values of the concentration parameter above 1 prefer variates that are dense, evenly distributed distributions, i.e., all the values within a single sample are similar to each other. Values of the concentration parameter below 1 prefer sparse distributions, i.e., most of the values within a single sample will be close to 0, and the vast majority of the mass will be concentrated in a few of the values."
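To make the quoted behaviour concrete, here is a minimal sketch (independent of ray, using numpy's Dirichlet sampler):

import numpy as np

rng = np.random.RandomState(0)
# Concentration well above 1: dense samples, evenly spread components.
print(rng.dirichlet([10.0] * 5))
# Concentration well below 1: sparse samples, mass piled on few components.
print(rng.dirichlet([0.1] * 5))

Both regimes stay reachable if the concentration is exp(logits) or squared logits, whereas softmax(logits) caps every component at 1.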
I've tested a few possible solutions using the contextual bandit environment above.
So, empirically and intuitively, the exp of the logits seems the way to go. However, the logits somehow become all nan starting from iteration ~200k, at which point the action collapses to a constant vector of 1/n entries summing to 1. This is probably due to me not being familiar enough with the source code to fix the bug in Dirichlet properly. My attempt has been to change the following line of code in Dirichlet.__init__, and I'm not aware of side effects of this change or of anything else that should be changed.
# Before.
self.dist = tf.distributions.Dirichlet(concentration=inputs)
# After.
self.dist = tf.distributions.Dirichlet(concentration=tf.exp(inputs))
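One guess about the nan collapse after ~200k iterations (an assumption of mine, not verified against the ray internals): unbounded logits can overflow under exp, so it might help to clamp them first. A minimal variant of the line above:

# Alternative: clamp the logits before exponentiating so the
# concentration stays within a bounded, strictly positive range.
# The bounds (-10, 10) are an arbitrary choice of mine.
self.dist = tf.distributions.Dirichlet(
    concentration=tf.exp(tf.clip_by_value(inputs, -10.0, 10.0)),
)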
@ericl Let's assume that the solution is to use the exponential of the logits as the concentration parameter for Dirichlet. Would you be able to provide insight into what else I should change in the source code? Thank you
@ericl I know how to fix this issue. I'm happy to send a pull request if you or someone else agrees with the solution below (and after that #4550 will be fixed).
Context: use Simplex to describe the action space and therefore Dirichlet as the action distribution.
Bug: agents return nans.
Diagnosis: there are two separate issues.
1. The raw logits are fed directly to tf.distributions.Dirichlet as its concentration, but concentrations must be strictly positive while logits can be negative, so samples come back as nan.
2. Sampled values can sit exactly on the boundary of the simplex (components equal to 0 or 1), which makes the log-probability calculation blow up to nan/inf.
Solutions: parametrize the distribution with exp(logits) instead of the raw logits, and clip sampled values to [epsilon, 1 - epsilon] during the probability calculation (a sketch follows below).
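A minimal sketch of what I have in mind (the class name and epsilon value are my own; the real Dirichlet class in ray.rllib has more to it):

import tensorflow as tf

class DirichletSketch(object):
    """Sketch of a Dirichlet action distribution over a simplex."""

    def __init__(self, inputs, epsilon=1e-7):
        self.epsilon = epsilon
        # Fix 1: exp maps arbitrary real logits to strictly positive
        # concentration parameters.
        self.dist = tf.distributions.Dirichlet(concentration=tf.exp(inputs))

    def logp(self, x):
        # Fix 2: clip samples away from the simplex boundary before the
        # log-prob, so log(0) never appears.
        x = tf.clip_by_value(x, self.epsilon, 1.0 - self.epsilon)
        return self.dist.log_prob(x)

    def sample(self):
        return self.dist.sample()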
@FedericoFontana using exp(logits) seems like a reasonable choice to me. Do you observe stable training with this support? Clipping to some epsilon value during probability calculation sounds good too.
System information
pip install ray

Source code / logs: below.

Describe the problem
Context: My use case is to create custom gym.Env environments and solve them with ray agents. Ray has worked very well for me: different agents have solved both continuous (gym.spaces.Box) and discrete (gym.spaces.Discrete) action spaces in custom environments.
Problem: Problems start when using the action space ray.rllib.models.extra_spaces.Simplex in the environment (see #4070). In particular, the agent returns actions containing np.nan(s). However, I've done some digging and the logits are calculated correctly (before being mapped to the action distribution, with a softmax I guess), so the issue should be in the layer of logic in ray that maps logits to an action.
I want to remark that simply switching Simplex to Box in the code below allows PPO to solve the environment in a few minutes, so there must be something wrong with Simplex.
Motivation: Simplex does not seem to work properly at the moment, but it would allow solving many interesting use cases with constrained action spaces.
Question: Why is Simplex with PPO not working correctly in the self-contained and reproducible code snippet below?
Source code / logs
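(The original reproduction script is not included in this excerpt. Below is a minimal sketch of the kind of setup described: the environment, its name, and its reward are hypothetical, and the Simplex constructor arguments and trainer invocation are assumptions based on the RLlib of that era.)

import gym
import numpy as np
import ray
from ray import tune
from ray.rllib.models.extra_spaces import Simplex
from ray.tune.registry import register_env

class PortfolioBandit(gym.Env):
    """Hypothetical contextual-bandit-style env with a simplex action."""

    def __init__(self, config=None):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        # Actions are weights on a 3-simplex: non-negative, summing to 1.
        self.action_space = Simplex(shape=(3,))

    def reset(self):
        return np.random.uniform(-1.0, 1.0, size=4).astype(np.float32)

    def step(self, action):
        # Trivial stand-in objective: reward the first weight.
        reward = float(action[0])
        return self.reset(), reward, True, {}

ray.init()
register_env("portfolio_bandit", lambda config: PortfolioBandit(config))
tune.run("PPO", config={"env": "portfolio_bandit"}, stop={"training_iteration": 10})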