psiorx / Reinforcement-Learning

Simple C++ project that includes header-only implementations of Monte Carlo Tree Search (MCTS), Temporal Difference Learning, Minimax, and other agents to evaluate their performance in various two-player games.

Possible Logic Error in Backpropagation #1

Open · gjstein opened this issue 7 years ago

gjstein commented 7 years ago

You currently have the following code in the Backpropagation method:

    if(node->our_turn && score > 0) {
      node->stats.wins++;
    } else if(!node->our_turn && score < 0) {
      node->stats.wins--;
    }

However, this leaves two cases in which the reward is not properly applied, which causes problems for more sophisticated games. I've added the two missing cases, and the algorithm seems to make more reasonable decisions now (I'm testing on my Connect Four implementation):

    if(node->our_turn && score > 0) {
      node->stats.wins++;
    } else if(!node->our_turn && score < 0) {
      node->stats.wins--;
    } else if(node->our_turn && score < 0) {
      node->stats.wins--;
    } else if(!node->our_turn && score > 0) {
      node->stats.wins++;
    }

Try adding this to your Tic-Tac-Toe code; I'd be curious to see what the performance statistics look like when you do.
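
For what it's worth, the four branches above collapse to a single sign-based update, since together they add a win whenever the rollout score is positive and subtract one whenever it is negative, regardless of whose turn the node represents. A compact sketch (my own rewrite, not code from the repo), assuming the same node layout:

    // Equivalent to the four-branch version above: the turn flag no longer
    // matters, only the sign of the backpropagated score.
    if(score > 0) {
      node->stats.wins++;
    } else if(score < 0) {
      node->stats.wins--;
    }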

psiorx commented 7 years ago

@gjstein I don't think this is quite correct. I'm making the assumption that the game can only be lost due to an action of your opponent. This means that negative reward cannot be accrued on nodes associated with a state in which it is the agent's turn. It has the effect of assigning 0 reward on alternating levels of the tree after each rollout. With the above code, wins along a certain path will be cancelled out by losses along the same path (++ followed by --). This could result in a branch that has a 50% win rate actually showing up as having 0 wins for both players.
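
To make the cancellation concrete, here is a minimal standalone sketch (not the repo's code) that applies both update rules to the same sequence of rollout scores for a node on a level where it is the agent's turn:

    #include <iostream>
    #include <vector>

    int main() {
      // Ten rollouts through the same node: five wins (+1) and five losses (-1).
      std::vector<int> rollout_scores = {+1, -1, +1, -1, +1, -1, +1, -1, +1, -1};
      const bool our_turn = true;

      int wins_current = 0;    // current rule: losses never touch our-turn nodes
      int wins_suggested = 0;  // suggested rule: add the sign of every score

      for (int score : rollout_scores) {
        if (our_turn && score > 0) {
          wins_current++;
        } else if (!our_turn && score < 0) {
          wins_current--;
        }
        if (score > 0) {
          wins_suggested++;
        } else if (score < 0) {
          wins_suggested--;
        }
      }

      // A branch with a 50% win rate tallies 5 wins under the current rule but
      // nets out to 0 under the suggested rule, which is the cancellation above.
      std::cout << "current: " << wins_current
                << ", suggested: " << wins_suggested << std::endl;
    }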

This is the plot with the code as is:

[image]

With the suggested code, I get this:

[image]

It's interesting that you're seeing better decisions in Connect4. I'll need to code it up to see if I can reproduce that.

gjstein commented 7 years ago

Hmm, interesting. I saw that as well, so I rewrote some of the ways in which the score is attributed; you can see the full changes in my branch here. Play around with that code if you'd like, along with my implementation of Connect Four (which I'm pretty sure is correct...). I can message you in more detail about my motivation behind the changes tomorrow.

As for the results, my agent performs rather well against yours. Build & run the ./connectfour script; in my tests, my modifications have me winning 100% of the time. I can't quite figure out why yours is behaving oddly, but it fails to block its opponent some of the time, as in this example (your agent is o):

    - - - - - - -
    - - - - - - -
    - - x o - - -
    - - o x - x -
    - o x o o x -
    - o x x o x -
    IN_PROGRESS

    - - - - - - -
    - - - - - - -
    - - x o - - -
    - - o x - x -
    - o x o o x -
    o o x x o x -
    O_WINS

My thinking is that not backpropagating the loss as I suggest above causes your agent to greedily search for board positions where it wins, while not taking positive action to prevent the opponent from beating it.
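
To put that hypothesis in code terms, the backpropagated statistics matter because they drive child selection on later iterations. Here is a sketch of the standard UCB1 selection rule over the same (wins, visits) statistics; the struct and field names are my assumptions and may differ from the repo's implementation:

    #include <cmath>
    #include <limits>
    #include <vector>

    // Hypothetical node layout for illustration only.
    struct Node {
      int wins = 0;
      int visits = 0;
      std::vector<Node*> children;
    };

    // Standard UCB1: exploit the observed win rate plus an exploration bonus.
    // If losses are never propagated onto the agent's own levels, wins/visits
    // overstates how good a child is, so a line the opponent can refute can
    // keep looking as attractive as a line where it has no reply.
    Node* SelectChild(const Node& parent, double c = 1.414) {
      Node* best = nullptr;
      double best_score = -std::numeric_limits<double>::infinity();
      for (Node* child : parent.children) {
        if (child->visits == 0) return child;  // try unvisited children first
        double exploit = static_cast<double>(child->wins) / child->visits;
        double explore =
            c * std::sqrt(std::log(static_cast<double>(parent.visits)) / child->visits);
        if (exploit + explore > best_score) {
          best_score = exploit + explore;
          best = child;
        }
      }
      return best;
    }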

psiorx commented 7 years ago

The new set of changes looks good; I'll pull it into master. I saw that you had Minimax commented out as a Connect Four player. I'm guessing you didn't have the 190 million gigs of RAM needed for it to run to completion. :)
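
(For rough scale: with at most 7 legal moves per ply and games lasting up to 42 plies, a full Connect Four game tree has up to about 7^42 ≈ 3 × 10^35 move sequences, ignoring transpositions and games that end early, so exhaustive minimax without heavy pruning or a transposition table isn't going anywhere.)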