google-deepmind / open_spiel

OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.
Apache License 2.0

Replicating the experiments from "Neural Replicator Dynamics" paper #148

Closed sullins2 closed 3 years ago

sullins2 commented 4 years ago

I have been reading "Neural Replicator Dynamics" (NRD) and "Actor Critic Policy Optimization in Partially Observable Multiagent Environments" (ACPO) and I am trying to replicate some of the results in the NRD paper. From what I can tell, the NRD implementation here differs from the setup used in the paper, which appears to be closer to how the ACPO paper is set up (and to the kuhn_policy_gradient code here).

Correct me if I am wrong, but does the QPG policy update rule from ACPO correspond to the Softmax Policy Gradient (SPG) mentioned in NRD? If so, would that be the one to modify with the "one-line fix" from NRD? I have been attempting to incorporate this change but so far have been unsuccessful. Is the exact setup used in the NRD paper somewhere in the code base and I am just missing it, or can anyone shed light on where exactly to modify QPG so that it becomes the new update rule from NRD, i.e. one that "bypasses the gradient through the softmax"?

Thanks in advance

lanctot commented 4 years ago

Hi @sullins2,

Yes: QPG from Srinivasan et al. 2018 ("the RPG paper") corresponds to SPG in the NeuRD paper, correct. The one-line fix basically changes the optimization criterion as outlined in the paper; it can also be found in the NeuRD implementation in this code base, btw.
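To make that concrete, here is a rough sketch of the change in TF (illustrative only, not the exact code from either implementation; `adv` stands for the `B x A` advantage tensor):

# Illustrative sketch of the "one-line fix"; not code from either paper's implementation.
import tensorflow.compat.v1 as tf

def spg_policy_term(policy_logits, adv):
  # QPG/SPG: weight the softmax policy, so the gradient flows through the softmax.
  policy = tf.nn.softmax(policy_logits, axis=1)
  return -tf.reduce_sum(policy * tf.stop_gradient(adv), axis=1)

def neurd_policy_term(policy_logits, adv):
  # NeuRD: weight the raw logits instead, bypassing the softmax in the backward pass.
  return -tf.reduce_sum(policy_logits * tf.stop_gradient(adv), axis=1)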

Two important points:

  1. The NeuRD implementation in OpenSpiel (python/algorithms/neurd.py) does not have a sampled version, but we want to add one eventually (see https://github.com/deepmind/open_spiel/issues/122). We used an internal implementation for the paper, so we still have to write an OpenSpiel version.

  2. Proper use of entropy bonuses / regularization is critical to convergence in both cases. However, the type of entropy used in the two papers is different: in the RPG paper we used the usual entropy bonus common for policy gradient methods, whereas the NeuRD paper used it on the q-value head, and it is also a fundamentally different form of entropy IIUC. We are currently clarifying this in our camera-ready version, which we will post on arXiv after the deadline in a week.

I think one of us has started on a sample-based NeuRD, and there are many people who can answer this question better than I can; I will bring it up with the team and encourage them to follow up here.

lanctot commented 4 years ago

Just a quick clarification:

> Yes: QPG from Srinivasan et al. 2018 ("the RPG paper") corresponds to SPG in the NeuRD paper, correct.

That is true only with respect to the empirical results from the RPG paper (e.g. our implementation of QPG in that paper did indeed use softmax). The theory part of the RPG paper requires that the policy be defined using an L2 projection to the nearest point in the simplex.
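For reference, that L2 projection is just the standard Euclidean projection onto the probability simplex, e.g. the usual sorting-based routine sketched below (illustrative, not code from the paper):

import numpy as np

def l2_project_to_simplex(v):
  """Euclidean (L2) projection of a real vector onto the probability simplex."""
  u = np.sort(v)[::-1]                    # sort descending
  cssv = np.cumsum(u) - 1.0
  ks = np.arange(1, len(v) + 1)
  rho = np.nonzero(u * ks > cssv)[0][-1]  # last index where the condition holds
  theta = cssv[rho] / (rho + 1.0)
  return np.maximum(v - theta, 0.0)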

lanctot commented 4 years ago

Update: we have this partly done internally, but the form of entropy used in the NeuRD paper is quite different from the one in the RPG paper (explained thoroughly in a paper released today: https://arxiv.org/abs/2002.08456). We have this form of entropy implemented internally, but there are many dependencies on internal code that are hard to separate at the moment. The people working on this are busy with other responsibilities. They told me that they do intend to get it done and released, but I do not know when that will be.

I have asked the people working on this to comment further regarding the specific update. One thing I would be curious about is whether NeuRD would work well using the usual PG-style entropy. That would be useful to know and easy for you to try.
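By "usual PG-style entropy" I just mean the standard policy-entropy regularizer, roughly as sketched here (illustrative; sign conventions may differ from rl_losses.py):

import tensorflow.compat.v1 as tf

def pg_entropy_bonus(policy_logits, entropy_cost):
  # Standard policy-gradient entropy regularizer: per-state entropy of the
  # softmax policy, averaged over the batch and scaled by entropy_cost.
  policy = tf.nn.softmax(policy_logits, axis=1)
  log_policy = tf.nn.log_softmax(policy_logits, axis=1)
  entropy = -tf.reduce_sum(policy * log_policy, axis=1)
  # Subtract from the loss to encourage higher-entropy (more exploratory) policies.
  return -entropy_cost * tf.reduce_mean(entropy)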

szrlee commented 4 years ago

> Update: we have this partly done internally, but the form of entropy used in the NeuRD paper is quite different from the one in the RPG paper (explained thoroughly in a paper released today: https://arxiv.org/abs/2002.08456). We have this form of entropy implemented internally, but there are many dependencies on internal code that are hard to separate at the moment. The people working on this are busy with other responsibilities. They told me that they do intend to get it done and released, but I do not know when that will be.
>
> I have asked the people working on this to comment further regarding the specific update. One thing I would be curious about is whether NeuRD would work well using the usual PG-style entropy. That would be useful to know and easy for you to try.

Suggest citing the paper "Divergence-augmented Policy Optimization" https://papers.nips.cc/paper/8842-divergence-augmented-policy-optimization.pdf

dhennes commented 4 years ago

Hi @sullins2,

As @lanctot mentioned, we are planning to release a version of sampling-based NeuRD with function approximation, though we do not have a clear timeline for this yet due to other ongoing commitments.

However, to help you get started and to address your questions: in order to add the NeuRD correction, you probably want to follow these steps:

  1. Add support for a new loss string/class in the open_spiel.python.algorithms.policy_gradient.PolicyGradient agent (a rough sketch follows this list).
  2. Implement the NeuRD loss itself as a new loss class in open_spiel/python/algorithms/losses/rl_losses.py, using BatchQPGLoss as a starting point.
  3. Add logit thresholding as described in the paper.
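
For step 1, the wiring would look roughly like the following (illustrative sketch; the exact names and structure in policy_gradient.py may differ):

# Illustrative sketch only -- check policy_gradient.py for the actual wiring.
from open_spiel.python.algorithms.losses import rl_losses

# Hypothetical loss_str -> loss class mapping; "neurd" is the new entry that
# points at the BatchNeuRDLoss class you would add in step 2.
LOSS_STR_TO_CLASS = {
    "a2c": rl_losses.BatchA2CLoss,
    "rpg": rl_losses.BatchRPGLoss,
    "qpg": rl_losses.BatchQPGLoss,
    "rm": rl_losses.BatchRMLoss,
    "neurd": rl_losses.BatchNeuRDLoss,
}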

As @lanctot also noted, the form of entropy regularization used in NeuRD is that introduced in https://arxiv.org/abs/2002.08456, which is not yet integrated in the OpenSpiel PolicyGradient agent.

sullins2 commented 4 years ago

Thanks for the responses @lanctot @dhennes

Ignoring (for now) the distinct form of entropy regularization that NeuRD uses, mentioned above, I am wondering whether the modified BatchQPGLoss below correctly implements the NeuRD change. The BatchNeuRDLoss class is the same as BatchQPGLoss, but the compute_advantages() function has been modified (last two lines), and I added the thresholded() function from neurd.py.

# Same structure as the original BatchQPGLoss code, except the entropy bonus is
# disabled and compute_advantages() applies the NeuRD change (last two lines).
# Assumes the other helpers from rl_losses.py (compute_baseline,
# _assert_rank_and_shape_compatibility) are in scope.
class BatchNeuRDLoss(object):
  """Defines the batch NeuRD loss op."""

  def __init__(self, entropy_cost=None, name="batch_neurd_loss"):
    self._entropy_cost = entropy_cost
    self._name = name

  def loss(self, policy_logits, action_values):
    """Constructs a TF graph that computes the NeuRD loss for batches.

    Args:
      policy_logits: `B x A` tensor corresponding to policy logits.
      action_values: `B x A` tensor corresponding to Q-values.

    Returns:
      loss: A 0-D `float` tensor corresponding to the loss.
    """
    _assert_rank_and_shape_compatibility([policy_logits, action_values], 2)
    advantages = compute_advantages(policy_logits, action_values)
    _assert_rank_and_shape_compatibility([advantages], 1)
    total_adv = tf.reduce_mean(advantages, axis=0)

    total_loss = total_adv
    # Entropy bonus intentionally left out for now:
    # if self._entropy_cost:
    #   policy_entropy = tf.reduce_mean(compute_entropy(policy_logits))
    #   entropy_loss = tf.multiply(
    #       float(self._entropy_cost), policy_entropy, name="entropy_loss")
    #   total_loss = tf.add(
    #       total_loss, entropy_loss, name="total_loss_with_entropy")

    return total_loss


def compute_advantages(policy_logits, action_values, use_relu=False):
  """Computes advantages using pi and Q, with the NeuRD modification."""
  # Compute advantage.
  policy = tf.nn.softmax(policy_logits, axis=1)
  # Avoid computing gradients for action_values.
  action_values = tf.stop_gradient(action_values)

  baseline = compute_baseline(tf.stop_gradient(policy), action_values)

  advantages = action_values - tf.expand_dims(baseline, 1)

  if use_relu:
    advantages = tf.nn.relu(advantages)

  # NeuRD change: threshold the advantages, then weight the raw logits (rather
  # than the softmax policy) by the stopped advantages, so the gradient
  # bypasses the softmax.
  advantages = tf.stop_gradient(
      thresholded(policy_logits, advantages, threshold=3.0))
  policy_advantages = -tf.multiply(policy_logits, advantages)
  return tf.reduce_sum(policy_advantages, axis=1)


# Taken directly from neurd.py
def thresholded(logits, regrets, threshold=2.0):
  """Zeros out `regrets` where `logits` are too negative or too large."""
  can_decrease = tf.cast(tf.greater(logits, -threshold), tf.float32)
  can_increase = tf.cast(tf.less(logits, threshold), tf.float32)
  regrets_negative = tf.minimum(regrets, 0.0)
  regrets_positive = tf.maximum(regrets, 0.0)
  return can_decrease * regrets_negative + can_increase * regrets_positive

lanctot commented 4 years ago

Looks ok to me, but I wasn't close to the implementation. Any thoughts @dhennes @shayegano @perolat @dmorrill10 ?

dhennes commented 4 years ago

It looks like the logits are not centered around zero before you pass them to the threshold function. Note that you can re-center the logits without changing either the policy or the logit gaps. Other than that, I think you're on the right track!
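Something along these lines (illustrative sketch) before the thresholded() call:

# Re-center logits to be mean-zero over the action dimension before thresholding.
# Softmax is invariant to this shift, so the policy (and the logit gaps) are unchanged.
centered_logits = policy_logits - tf.reduce_mean(
    policy_logits, axis=1, keepdims=True)
advantages = tf.stop_gradient(
    thresholded(centered_logits, advantages, threshold=3.0))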

sullins2 commented 4 years ago

@lanctot @dhennes

Hi, is a sampling-based version of NeuRD still planned for release? Will it likely be an extension of the policy_gradient.py code (as noted elsewhere) or its own separate implementation? Would you suggest I make a pull request with the above implementation (for now it would not include the entropy regularization NeuRD actually uses, linked above)? Thanks.

lanctot commented 4 years ago

Hi @sullins2,

It definitely won't be me who does it, but the last time I spoke to @dhennes and @perolat it was still on the to-do list. I would very much like to have it in the code base, but I was not very involved on the implementation side of the NeuRD paper.

Yes, the plan would be to add it either to policy_gradient.py or as its own file that would closely resemble policy_gradient.py.

We are all quite busy with NeurIPS at the moment, but a PR is a great idea to help spark progress on this. I will contact Daniel and Julien and hopefully we can get it looked at in June!

sullins2 commented 4 years ago

@lanctot Thanks!

lanctot commented 4 years ago

Hi @sullins2, @perolat has posted a copy of the code we used for the paper experiments, see here: https://github.com/deepmind/open_spiel/issues/316#issuecomment-672074701

sullins2 commented 4 years ago

Thanks for the heads up @lanctot and thanks for posting it @perolat

lanctot commented 3 years ago

@sullins2 did you ever get a sample-based impl of NeuRD fully working? Would you be interested in contributing it via a pull request?