kblomdahl / dream-go

Artificial go player based on reinforcement and supervised learning

Adversarial approach to data-set balancing and training #31

Closed kblomdahl closed 6 years ago

kblomdahl commented 6 years ago

Today we clean the following attributes prior to feeding the data to the neural network: the komi, the handicap, and an excess of black or white wins. However, pre-cleaning the data is problematic during self-play, since the neural network should periodically favour black or white depending on the current tactics and strategies.

An alternative approach to the data cleaning would be to employ an adversarial network, as suggested by Gilles Louppe, Michael Kagan, and Kyle Cranmer [1] [2]. Their approach is to add adversarial networks, trained in a separate tower, that predict these attributes from the policy and value outputs, and then to minimise the adversaries' accuracy when training the main tower.

This approach would allow us to maintain more of the training data-set.
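Concretely, the alternating updates could look something like the TensorFlow sketch below. This is an illustration of the pivot approach from [1], not the actual dream-go training code: the toy network sizes, the 362-way policy encoding, the value of LAMBDA and the use of a sparse move index as the policy target are all assumptions, and the nuisance attribute stands in for whichever attribute (komi, handicap, winner colour) we want the outputs to be uninformative about.

```python
import tensorflow as tf

NUM_MOVES = 362          # 19x19 intersections + pass (assumed encoding)
NUM_NUISANCE = 2         # e.g. winner is black / white
LAMBDA = 0.01            # task-vs-debiasing trade-off (assumed value)

# Toy stand-ins for the real towers; shapes and sizes are illustrative only.
main_net = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(19 * 19 * 8,)),
    tf.keras.layers.Dense(NUM_MOVES + 1),        # policy logits + scalar value
])
adversary = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(NUM_MOVES + 1,)),
    tf.keras.layers.Dense(NUM_NUISANCE),         # predicts the nuisance attribute
])

main_opt = tf.keras.optimizers.Adam(1e-4)
adv_opt = tf.keras.optimizers.Adam(1e-4)
xent = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def train_step(features, policy_target, value_target, nuisance_target):
    # 1) Train the adversary to predict the nuisance attribute from the
    #    main network's outputs.
    with tf.GradientTape() as tape:
        outputs = main_net(features, training=False)
        adv_loss = xent(nuisance_target, adversary(outputs, training=True))
    grads = tape.gradient(adv_loss, adversary.trainable_variables)
    adv_opt.apply_gradients(zip(grads, adversary.trainable_variables))

    # 2) Train the main network on its own task while maximising the
    #    adversary's loss, so that its outputs stay uninformative about
    #    the nuisance attribute.
    with tf.GradientTape() as tape:
        outputs = main_net(features, training=True)
        policy_logits = outputs[:, :NUM_MOVES]
        value = tf.tanh(outputs[:, -1])
        task_loss = (xent(policy_target, policy_logits)
                     + tf.reduce_mean(tf.square(value - value_target)))
        adv_loss = xent(nuisance_target, adversary(outputs, training=False))
        total_loss = task_loss - LAMBDA * adv_loss
    grads = tape.gradient(total_loss, main_net.trainable_variables)
    main_opt.apply_gradients(zip(grads, main_net.trainable_variables))
    return task_loss, adv_loss
```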

[1] Gilles Louppe, Michael Kagan, Kyle Cranmer, "Learning to Pivot with Adversarial Networks", https://papers.nips.cc/paper/6699-learning-to-pivot-with-adversarial-networks
[2] https://blog.godatadriven.com/fairness-in-ml

Komi

TL;DR I don't think we can solve the komi problem with this approach.

Komi has a direct influence on the score, so we cannot tell the network not to make the value a function of the komi. We can, however, make the policy not play differently depending on the komi: given an adversarial network K(p) with one output class for every komi that occurs in the training set, we can train K(p) to predict the komi based on the policy and include the negative cross-entropy of K(p) in the main tower loss.

This approach may or may not work, since it is unclear whether the policy contains enough information to determine the komi. But if we include the board state in the input of K(p), the adversary may simply ignore the policy and learn to predict the komi from the board state instead.
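For illustration, K(p) could be as simple as the hypothetical classifier below; the set of komi values, the 362-way policy encoding and all names are assumptions, not dream-go code.

```python
import tensorflow as tf

KOMI_VALUES = [0.5, 5.5, 6.5, 7.5]   # assumed set of komi seen in the data

# K(p) sees *only* the policy distribution, not the board state, so it
# cannot shortcut around the policy when predicting the komi.
komi_adversary = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(362,)),
    tf.keras.layers.Dense(len(KOMI_VALUES)),     # one class per komi value
])

# K(p) is trained with cross-entropy against the true komi class; the main
# tower then adds the negative of that cross-entropy to its loss so that the
# policy carries as little komi information as possible.
```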

Handicap

We do not include the handicap in the examples at the moment, so we would need to extend the input data to include it. Once it is included, we can follow a similar approach as for balancing black and white wins.

Excessive Black or White wins

This is pretty straightforward: we can create an adversarial network that is a function of the value head and tries to predict whether the winner is black or white.
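Such an adversary could be as small as a binary classifier on the scalar value output; the sketch below is again hypothetical, with assumed names and sizes.

```python
import tensorflow as tf

# Binary adversary on the value head only: given a value v in [-1, 1] it
# tries to predict whether the winner was black or white.
colour_adversary = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(1),                    # logit for "black won"
])

# Its binary cross-entropy is minimised when updating the adversary and
# subtracted (scaled by a small lambda) from the main tower loss, exactly
# as in the general sketch above.
```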

kblomdahl commented 6 years ago

After using the adversarial approach against colour bias for a few networks, I am seeing odd results. On one hand the adversarial network seems to be able to predict the colour based on the value head, indicating a 2.3067% bias towards one of the two colours. But the bias weight is 5e-3, which indicates that the difference between the two colours is hair-thin.

The trained networks themselves do not seem to indicate any strong bias in either direction, and it was not a very noticeable problem beforehand, so it is hard to tell whether it is working or not.

[screenshots from 2018-06-23: 17-55-15 and 18-26-23]

kblomdahl commented 6 years ago

Just going to leave this here for potential future research into whether there would be any advantage in using robust training for Go. Intuition says no, since most adversarial attacks rely on gradient-based search, which is not applicable to binary features (as we control the feature generation).

But there may be other side-effects of adversarial training that yield internal representations that generalise better (as suggested by the article).

https://arxiv.org/pdf/1805.12152v2.pdf
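Robust training in the sense of the linked paper usually means projected gradient descent (PGD) adversarial training. Below is a minimal, hypothetical sketch of the inner PGD loop; the step count and step size mirror the settings tried in the next comment, while epsilon, the continuous input tensor and everything else are illustrative assumptions rather than dream-go code.

```python
import tensorflow as tf

def pgd_perturb(model, x, y, loss_fn, steps=4, step_size=0.002, epsilon=0.01):
    """Return an adversarially perturbed copy of `x` via k-step PGD.

    `model` maps inputs to logits and `loss_fn(y, logits)` is the training
    loss.  The perturbation is kept inside an L-infinity ball of radius
    `epsilon` around the original input (epsilon is an assumed value).
    """
    x_adv = tf.identity(x)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            tape.watch(x_adv)
            loss = loss_fn(y, model(x_adv, training=False))
        grad = tape.gradient(loss, x_adv)
        x_adv = x_adv + step_size * tf.sign(grad)            # ascend the loss
        x_adv = x + tf.clip_by_value(x_adv - x, -epsilon, epsilon)
    return tf.stop_gradient(x_adv)

# Adversarial training then minimises the ordinary loss on the perturbed
# inputs instead of the clean ones:
#
#   x_adv = pgd_perturb(model, x, y, loss_fn)
#   with tf.GradientTape() as tape:
#       loss = loss_fn(y, model(x_adv, training=True))
```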

kblomdahl commented 6 years ago

Early results from training a network using adversarial training (four-step PGD) with a step size of 0.002: the networks look about the same in policy play (within the margin of error, especially considering the 19 unknown results), but the value head of the adversarially trained network seems to be weakened:

Blitz games (1 rollout)

dg-i-128 v dg-adv (500/500 games)
unknown results: 19 3.80%
board size: 19   komi: 7.5
           wins              black          white        avg cpu
dg-i-128    250 50.00%       127 50.80%     123 49.20%      1.37
dg-adv      231 46.20%       120 48.00%     111 44.40%      1.37
                             247 49.40%     234 46.80%
$ ./tools/sgf2big.py tools/gomill-blitz.ctl.games/ | ./tools/sgf2score.py | ./tools/sgf2elo.py 
Arbitrated 19 games without a winner
dg-adv:0.5.0                    0.00
dg-i-128:0.5.0                 16.69

Tournament games (3,200 rollouts)

dg-i-128 v dg-adv (100/100 games)
unknown results: 11 11.00%
board size: 19   komi: 7.5
           wins              black         white       avg cpu
dg-i-128     58 58.00%       27 54.00%     31 62.00%     74.98
dg-adv       31 31.00%       12 24.00%     19 38.00%     79.26
                             39 39.00%     50 50.00%
$ ./tools/sgf2big.py tools/gomill-sp.ctl.games/ | ./tools/sgf2score.py | ./tools/sgf2elo.py 
Arbitrated 11 games without a winner
dg-adv:0.5.0                    0.00
dg-i-128:0.5.0                115.21

Conclusion

Overall it is unclear whether adversarial training is worth pursuing for Go, since most of the feature representations used are binary and completely controlled by the engine (which will not generate adversarial examples). I will make one more attempt with a larger step size (0.01), and we'll see how that pans out.

kblomdahl commented 6 years ago

I will make one more attempt with a larger step size (0.01), and we'll see how that pans out.

A stronger adversary seems to prevent the network from learning anything (accuracy does not rise above 1 / 361). This is not unexpected, since it has been observed that too strong an adversary can cause small models to fail to converge [1].

This would also explain why the adversarially trained network performed worse during testing, as the adversarial model is effectively smaller than the normal one. Subjectively this can also be observed in the policy-play games of the adversarial model, which plays like models that are too small to think globally do (but it has very good local shape).

policy_play_adversary.sgf.gz

[1] "There Is No Free Lunch In Adversarial Robustness (But There Are Unexpected Benefits)", https://arxiv.org/pdf/1805.12152v2.pdf


Given the conclusion above, it is unclear what the next step should be. On one hand, there does not seem to be much point to adversarial networks if they are strictly weaker and we do not suffer from the adversarial examples problem.

But if adversarial training does yield a better internal feature representation, then given a sufficiently large model it should yield a less finicky go player (one that can perform better outside of its training domain).