google-deepmind / open_spiel

OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.
Apache License 2.0

Implementation of dilated form of MMD #1165

Closed: Atan03 closed this issue 10 months ago

Atan03 commented 10 months ago

Hello, I am currently replicating the experiments of DOGDA and DOMWU as described in the paper[^1]. My goal is to modify the existing code based on the dilated version of the MMD algorithm to incorporate a specific adjustment. It seems that setting $\alpha = 0$ in MMD causes it to degenerate into MD. Therefore, I plan to design an improved version based on this observation. However, I have some uncertainties about the gradient computation step in the dilated version, especially regarding the subtraction of num_children.

I understand that the computed gradient is related to the dilated entropy of the sequence form strategy, but I am still struggling to comprehend it.

Your assistance in clarifying this matter would be highly appreciated. Despite not having complete clarity, I have followed the previously mentioned idea and completed the implementation of DOMWU (equivalent to MMD with $\alpha = 0$, negative entropy, and an optimistic operation), with the results displayed below.

Please forgive my limited familiarity with dilated form updating. If I intend to implement DOGDA (equivalent to MMD with $\alpha = 0$, L2-norm, and an optimistic operation), do I only need to replace the term involving negative entropy in MMD with L2-norm and replace the softmax with projection onto the probability simplex?

Once again, your assistance in understanding this would be greatly appreciated, even though I recognize that it is more of a theoretical question than a coding implementation issue.

[^1]: Chung-Wei Lee, Christian Kroer, Haipeng Luo. "Last-iterate Convergence in Extensive-Form Games", NeurIPS 2021.

ryan-dorazio commented 10 months ago

Hi @Atan03! I'm glad you are using my code and am happy to answer your question.

> It seems that setting $\alpha = 0$ in MMD causes it to degenerate into MD.

Exactly! MMD with $\alpha =0$ becomes mirror descent-ascent with dilated entropy.
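
Concretely, with $\alpha = 0$ the update reduces to the usual mirror-descent step over the sequence-form polytope $\mathcal{X}$ (written here for one player, with gradient $g^t$, step size $\eta$, and $B_\psi$ the Bregman divergence generated by the dilated regularizer $\psi$):

$$ x^{t+1} = \underset{x \in \mathcal{X}}{\arg\min}\ \eta\langle g^t, x\rangle + B_\psi(x, x^t) $$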

> do I only need to replace the term involving negative entropy in MMD with L2-norm and replace the softmax with projection onto the probability simplex?

It is true that you would need to change both the negative entropy and softmax terms, but you would also need to change the gradient computation, specifically the part you pointed out regarding num_children. The num_children subtraction trick is only valid for dilated entropy, not for dilation with the l2 norm.
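
For the softmax part, the natural replacement is Euclidean projection onto the probability simplex. Here is a minimal sketch of the standard sort-and-threshold projection (not taken from the OpenSpiel code; `project_to_simplex` is just an illustrative helper name):

```python
import numpy as np


def project_to_simplex(v):
  """Euclidean projection of v onto the probability simplex.

  Sort the entries, find the largest prefix that stays positive after a
  uniform shift, then clip everything else to zero.
  """
  u = np.sort(v)[::-1]                      # entries in descending order
  cumulative = np.cumsum(u)
  rho = np.nonzero(u + (1.0 - cumulative) / np.arange(1, len(v) + 1) > 0)[0][-1]
  shift = (1.0 - cumulative[rho]) / (rho + 1.0)   # makes the kept entries sum to 1
  return np.maximum(v + shift, 0.0)
```

Roughly speaking, the local update at each state then becomes a projected gradient step on the behavioral policy, instead of the softmax/logit-space update you get with entropy.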

Below I'll run through the math on why the num_children trick is specific to dilated entropy.

Say we want to compute the partial derivative $\frac{d\psi}{d (s,a)}$ of the dilated entropy $\psi$ with respect to the sequence $(s,a)$ corresponding to a state $s$ and action $a$. There are basically two types of terms we need to worry about:

  1. The dilated entropy term at state $s$
  2. All the children states of the sequence $C(s,a)$. These are all the possible states that can be reached after selecting action $a$ at state $s$.

Mathematically we have:

$$ \frac{d\psi}{d (s,a)} = \underset{(1)}{\frac{d}{d (s,a)} x_{p(s)}\psi_s\left(\frac{x_{(s, \cdot)}}{x_{p(s)}}\right)} + \underset{(2)}{\frac{d}{d (s,a)} \sum_{j \in C(s,a)} x_{(s,a)}\psi_{j}\left( \frac{x_{(j,\cdot)}}{x_{(s,a)}}\right)} $$

I've introduced some new notation; here is a breakdown: $x$ is the sequence-form strategy, with $x_{(s,a)}$ its weight on the sequence that plays action $a$ at state $s$; $p(s)$ is the parent sequence leading to state $s$, so $\pi_s = x_{(s,\cdot)}/x_{p(s)}$ is the behavioral policy at $s$ (and likewise $\pi_j = x_{(j,\cdot)}/x_{(s,a)}$ at a child state $j$); $\psi_s$ is the local dgf (negative entropy here) applied at state $s$.

The first part (1), by the chain rule, just reduces to the partial derivative of negative entropy (or whatever dgf you use) with respect to the policy at state $s$, $\frac{\partial}{\partial \pi_s(a)} \psi_s(\pi_s)$.
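
Spelling (1) out, the outer $x_{p(s)}$ cancels the $1/x_{p(s)}$ produced by the chain rule, so (assuming negative entropy as the local dgf):

$$ \frac{d}{d (s,a)}\, x_{p(s)}\psi_s\left(\frac{x_{(s,\cdot)}}{x_{p(s)}}\right) = x_{p(s)} \cdot \frac{\partial \psi_s(\pi_s)}{\partial \pi_s(a)} \cdot \frac{1}{x_{p(s)}} = \log\left(\pi_s(a)\right) + 1. $$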

The second part is more interesting for us. If we just look at one child state $j \in C(s,a)$, then by the product rule and chain rule we get:

$$ \frac{d}{d(s,a)}\, x_{(s,a)}\psi_j\left(\frac{x_{(j,\cdot)}}{x_{(s,a)}}\right) = \psi_j(\pi_j) + x_{(s,a)}\left\langle \nabla \psi_j(\pi_j),\, -\frac{x_{(j,\cdot)}}{x_{(s,a)}^2} \right\rangle = \psi_j\left(\pi_j\right) - \langle \nabla \psi_j(\pi_j), \pi_j \rangle. $$

If we plug in negative entropy as $\psi_j$ we get $\sum_a \pi_j(a)\log(\pi_j(a)) - \langle \log(\pi_j)+\mathbf{1}, \pi_j \rangle = -\sum_a \pi_j(a) = -1.$

Therefore, each child contributes a $-1$! Hence we only need to subtract the number of children, that is, the size of the set $C(s,a)$. If you were to use the squared l2 norm you would instead have to compute $\psi_j\left(\pi_j\right) - \langle \nabla \psi_j(\pi_j), \pi_j \rangle$ directly, which works out to $-\frac{1}{2}\|\pi_j\|^2$.
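
If it helps, here is a quick numerical sanity check of both closed forms (purely illustrative, not part of the OpenSpiel code): differentiate $f(x) = x\,\psi(y/x)$ by central finite differences, holding the child weights $y$ fixed, and compare with the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
pi_child = rng.random(4)
pi_child /= pi_child.sum()     # a valid child policy pi_j on the simplex
x_parent = 0.7                 # parent sequence weight x_{(s,a)}
y = x_parent * pi_child        # child sequence weights x_{(j,.)}, summing to x_parent
eps = 1e-6


def neg_entropy(p):
  return np.sum(p * np.log(p))


def half_sq_l2(p):
  return 0.5 * np.sum(p ** 2)


for psi, closed_form in [(neg_entropy, -1.0),
                         (half_sq_l2, -half_sq_l2(pi_child))]:
  def f(t, psi=psi):
    return t * psi(y / t)      # f(x) = x * psi(y / x), with y held fixed
  numerical = (f(x_parent + eps) - f(x_parent - eps)) / (2 * eps)
  print(psi.__name__, numerical, closed_form)   # the two values should agree
```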

Atan03 commented 10 months ago

Thank you for the thorough explanation! It has been immensely helpful. I now have a clear understanding of the formula. Following your guidance, I successfully implemented OGDA on a normal-form game and obtained convergent results. I am now closing this issue.