RobRomijnders opened this issue 7 years ago
First, a few details:
I am in favour of having a single environment that just gives you points, exactly as you would get them playing as a human. If your player wants to optimise for winning a round, we will still supply the points that allow the player to infer this itself, but I'd like to keep the game environment minimal if possible.
In the situation where the algorithm wants to optimise the final score, it can be left to the player to ignore intermediate rewards and focus only on the final reward. The final reward can still be propagated to choices made early in the game by whatever method the bot wants. You could use Q-learning, for example, but I'd prefer that this is done by the player/agent, not the environment.
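The "ignore intermediate rewards, propagate only the final one" idea can be sketched roughly as below; the function name and the discounted-return scheme are my own illustration, not part of the environment:

```python
def final_reward_returns(rewards, gamma=0.95):
    """Replace per-turn rewards with discounted returns derived from the
    final reward only, so early actions get credit for the game outcome.
    This is one possible agent-side scheme; the environment itself only
    hands out points."""
    final = rewards[-1]
    n = len(rewards)
    # The action at step t is credited with gamma^(n-1-t) * final_reward.
    return [gamma ** (n - 1 - t) * final for t in range(n)]


# Example: four turns, intermediate rewards ignored, +1 for winning the game.
returns = final_reward_returns([0, 0, 0, 1], gamma=0.5)
# → [0.125, 0.25, 0.5, 1.0]
```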
I need to think about your suggestion. I am wondering how useful `finish_round` is, because the goal is not to win rounds but to win a game. Rounds are merely there to make the distribution of drawn cards more predictable in later rounds. The `finish_game` function sounds reasonable, but I wonder if it may be better to have that be a property that is assigned by the environment. Any preference for having it be a method?
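For concreteness, a sketch of the two options being weighed, a game-over property assigned by the environment versus a `finish_game` method called on each player. All names here are hypothetical:

```python
class Player:
    def __init__(self):
        self.won_game = None
        self.margin = None

    # Option A: a method the environment invokes at the end of a game.
    def finish_game(self, won, margin):
        self.won_game = won
        self.margin = margin


class Environment:
    def __init__(self, players):
        self.players = players
        self.game_over = False  # Option B: a property assigned by the env

    def end_game(self, scores):
        self.game_over = True
        best = max(scores.values())
        for name, player in self.players.items():
            player.finish_game(won=scores[name] == best,
                               margin=scores[name] - best)


p1, p2 = Player(), Player()
env = Environment({"a": p1, "b": p2})
env.end_game({"a": 30, "b": 25})
```

With the property, the agent has to poll the environment; with the method, the environment pushes the result to the player.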
Agree on the technical details.
> we will still supply you points that allow the player to infer this themselves

How would a player infer that it won a round? It only receives its own points, doesn't it?
> the goal is not to win rounds but to win a game

Agree. But for some algorithms, the noise across an entire game can disguise the value of a certain action. You make maybe 20+ actions after your first action, so it could be hard to propagate this information all the way back (for example, with policy gradients).
Ah, yes, the player currently does not receive the opponents' points. I could add that to the observations. Would that suffice?
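A hypothetical observation layout if the opponents' points were included; the field names are illustrative, not the environment's actual API:

```python
def make_observation(hand, own_points, opponent_points):
    """Bundle what a player sees each turn. With opponent_points
    available, the agent can infer round/game standings itself."""
    return {
        "hand": hand,
        "own_points": own_points,
        "opponent_points": opponent_points,
    }


obs = make_observation(hand=["tempura", "sashimi"], own_points=12,
                       opponent_points=[15, 9])
# The agent can now work out whether it is currently in the lead.
winning = obs["own_points"] >= max(obs["opponent_points"])  # → False
```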
Ah, that would also solve issue #7 .
I must think about this a bit. I haven't seen RL being trained with opponents' points as observations before. But given our domain knowledge of the Sushi Go game, it seems reasonable.
@koaning I have two issues with the current reward system.
Some algorithms might like to optimize for points per turn [integer], some for points per round [integer], and some for winning a game [boolean]. Do we return all three variables? And if so, how? That leads to the second issue.
An algorithm might also want to optimize its final move. At the moment it gets a reward (the per-turn reward), but it doesn't get a call after the final turn, so it cannot know how good that last action was.
My suggestion to solve both problems is to include:

- a `finish_round()` function on the player, to supply a) whether it won the round and b) by how many points
- a `finish_game()` function on the player, to supply a) whether it won the game and b) by how many points
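A minimal sketch of what these callbacks could look like, assuming the environment invokes them on each player at round and game boundaries; the signatures are my own guess at the suggestion, not an agreed API:

```python
class Player:
    def __init__(self):
        self.history = []

    def finish_round(self, won, margin):
        # a) whether this player won the round, b) by how many points
        self.history.append(("round", won, margin))

    def finish_game(self, won, margin):
        # a) whether this player won the game, b) by how many points
        self.history.append(("game", won, margin))


p = Player()
p.finish_round(won=True, margin=4)
p.finish_game(won=False, margin=-2)
```

With these hooks, the player receives a signal after its last move of a round and of the game, addressing the missing final-turn reward.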