RobRomijnders opened this issue 7 years ago
First, a few details:
I am in favour of having a single environment that just gives you points, exactly as you would get them playing as a human. If your player wants to optimise for winning a round, we will still supply the points that allow the player to infer this itself, but I'd like to keep the game environment minimal if possible.
In the situation where the algorithm wants to optimise the final score, it can be left to the player to ignore intermediate rewards and focus only on the final reward. The final reward can still be propagated to choices made early in the game by whatever method the bot wants. You could use Q-learning, for example, but I'd prefer that this is done by the player/agent, not the environment.
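The "ignore intermediate rewards, propagate only the final one" idea can be sketched roughly as below; the function name and the discounted-return scheme are my own illustration, not part of the environment:

```python
def final_reward_returns(rewards, gamma=0.95):
    """Replace per-turn rewards with discounted returns derived from the
    final reward only, so early actions get credit for the game outcome.
    This is one possible agent-side scheme; the environment itself only
    hands out points."""
    final = rewards[-1]
    n = len(rewards)
    # The action at step t is credited with gamma^(n-1-t) * final_reward.
    return [gamma ** (n - 1 - t) * final for t in range(n)]


# Example: four turns, intermediate rewards ignored, +1 for winning the game.
returns = final_reward_returns([0, 0, 0, 1], gamma=0.5)
# → [0.125, 0.25, 0.5, 1.0]
```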
I need to think about your suggestion. I am wondering how useful `finish_round` is, because the goal is not to win rounds but to win a game. Rounds are merely there to make the distribution of drawn cards more predictable in later rounds. The `finish_game` function sounds reasonable, but I wonder if it may be better to have that be a property that is assigned by the environment. Any preference for having it be a method?
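For concreteness, a sketch of the two options being weighed, a game-over property assigned by the environment versus a `finish_game` method called on each player. All names here are hypothetical:

```python
class Player:
    def __init__(self):
        self.won_game = None
        self.margin = None

    # Option A: a method the environment invokes at the end of a game.
    def finish_game(self, won, margin):
        self.won_game = won
        self.margin = margin


class Environment:
    def __init__(self, players):
        self.players = players
        self.game_over = False  # Option B: a property assigned by the env

    def end_game(self, scores):
        self.game_over = True
        best = max(scores.values())
        for name, player in self.players.items():
            player.finish_game(won=scores[name] == best,
                               margin=scores[name] - best)


p1, p2 = Player(), Player()
env = Environment({"a": p1, "b": p2})
env.end_game({"a": 30, "b": 25})
```

With the property, the agent has to poll the environment; with the method, the environment pushes the result to the player.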
Agree on the technical details.
> we will still supply you points that allow the player to infer this themselves

How would a player infer that it won a round? It only receives its own points, doesn't it?
> the goal is not to win rounds but to win a game

Agree. But for some algorithms, the noise across an entire game can disguise the value of a certain action. You make maybe 20+ actions after your first action, so it could be hard to propagate this information all the way back (for example, with policy gradients).
Ah, yes, the player currently does not receive the opponents' points. I could add that to the observations. Would that suffice?
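A hypothetical observation layout if the opponents' points were included; the field names are illustrative, not the environment's actual API:

```python
def make_observation(hand, own_points, opponent_points):
    """Bundle what a player sees each turn. With opponent_points
    available, the agent can infer round/game standings itself."""
    return {
        "hand": hand,
        "own_points": own_points,
        "opponent_points": opponent_points,
    }


obs = make_observation(hand=["tempura", "sashimi"], own_points=12,
                       opponent_points=[15, 9])
# The agent can now work out whether it is currently in the lead.
winning = obs["own_points"] >= max(obs["opponent_points"])  # → False
```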
Ah, that would also solve issue #7 .
I must think about this a bit. I haven't seen RL being trained with opponents' points as observations before. But given our domain knowledge of the Sushi Go game, it seems reasonable.
@koaning I have two issues with the current reward system.
Some algorithms might like to optimize for points per turn [integer], some for points per round [integer], and some for winning a game [boolean]. Do we return all three variables? And if so, how? That leads to the second issue.
An algorithm might also want to optimize its final move. At the moment it gets a reward (the per-turn reward), but it doesn't get a call after the final turn, so it cannot know how good that last action was.
My suggestion to solve both problems is to include:

- a `finish_round()` function on the player, to supply a) whether it won the round and b) by how many points
- a `finish_game()` function on the player, to supply a) whether it won the game and b) by how many points
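A minimal sketch of what these callbacks could look like, assuming the environment invokes them on each player at round and game boundaries; the signatures are my own guess at the suggestion, not an agreed API:

```python
class Player:
    def __init__(self):
        self.history = []

    def finish_round(self, won, margin):
        # a) whether this player won the round, b) by how many points
        self.history.append(("round", won, margin))

    def finish_game(self, won, margin):
        # a) whether this player won the game, b) by how many points
        self.history.append(("game", won, margin))


p = Player()
p.finish_round(won=True, margin=4)
p.finish_game(won=False, margin=-2)
```

With these hooks, the player receives a signal after its last move of a round and of the game, addressing the missing final-turn reward.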