koaning / sushigo

An OpenAI-like environment for the Sushi Go card game.
MIT License

Include per-turn rewards and/or per-round stats and/or per-game stats #8

Open RobRomijnders opened 7 years ago

RobRomijnders commented 7 years ago

@koaning I have two issues with the current reward system.

My suggestion to solve both problems is to include a `finish_round` and a `finish_game` function.

koaning commented 7 years ago

First, a few details:

I am in favour of having a single environment that just gives you points, just as you'd get them if you were playing as a human. If your player wants to optimise for winning a round, we will still supply points that allow the player to infer this itself, but I'd like to keep the game environment minimal if possible.

In the situation where the algorithm wants to optimise the final score, it can be left to the player to ignore intermediate rewards and focus only on the final reward. The final reward can still be propagated to choices made early in the game by whatever method the bot wants. You could use Q-learning, for example, but I'd prefer that this is done by the player/agent, not the environment.
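As a rough sketch of what that agent-side bookkeeping could look like (nothing here is from the sushigo codebase; `discounted_returns` and its inputs are made up):

```python
# Agent-side sketch: rebuild whatever reward signal you want from the
# per-turn point gains the environment hands out. All names are made up.

def discounted_returns(rewards, gamma=1.0):
    """Map per-turn point gains to per-turn returns.

    With gamma=1.0, each action is credited with all points scored from
    that turn onward, i.e. effectively the final score of the game.
    """
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

# Three turns scoring 2, 0 and 5 points:
print(discounted_returns([2, 0, 5]))             # [7.0, 5.0, 5.0]
print(discounted_returns([2, 0, 5], gamma=0.9))  # [6.05, 4.5, 5.0]
```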

I need to think about your suggestion. I am wondering how useful `finish_round` is, because the goal is not to win rounds but to win the game; rounds are merely there to make the distribution of drawn cards more predictable in later rounds. The `finish_game` function sounds reasonable, but I wonder if it may be better to have that be a property that is assigned by the environment. Any preference for having it be a method?
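For concreteness, the property-vs-method trade-off might look like this (a hypothetical sketch; only the name `finish_game` comes from the discussion above):

```python
# Hypothetical sketch of the property-vs-method question; the class and
# attribute names are invented, not taken from the sushigo codebase.

class Game:
    def __init__(self, n_rounds=3):
        self.n_rounds = n_rounds
        self.current_round = 1

    @property
    def game_over(self):
        # Property: the environment exposes state, the agent just reads it.
        return self.current_round > self.n_rounds

    def finish_game(self):
        # Method: the caller explicitly asks the environment to wrap up
        # and hand back end-of-game statistics.
        return {"rounds_played": self.n_rounds}

game = Game()
if game.game_over:              # passive: just inspect state
    stats = game.finish_game()  # active: request the summary
```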

RobRomijnders commented 7 years ago

Agreed on the technical details.

> we will still supply you points that allow the player to infer this themselves

How would a player infer that it won a round? It only receives its own points, doesn't it?

> the goal is not to win rounds but to win a game

Agreed. But for some algorithms, the noise in the entire game can disguise the value of a given action. You may take 20+ actions after your first action, so it can be hard to propagate this information all the way back (for example, with policy gradients).
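One hedged illustration of the variance point: if round boundaries were visible to the agent, it could cut the return there so early actions are not credited with later rounds' noise (all names below are hypothetical, not part of the sushigo API):

```python
# Sketch of the credit-assignment argument above; `round_ids` and the
# reward list are made up for illustration.

def per_round_returns(rewards, round_ids):
    """Sum future rewards only within each action's own round,
    so an action is not credited with noise from later rounds."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        if i + 1 < len(rewards) and round_ids[i + 1] != round_ids[i]:
            running = 0.0  # a new round starts after turn i
        running += rewards[i]
        returns[i] = running
    return returns

# Two rounds of two turns each:
print(per_round_returns([1, 2, 3, 4], [0, 0, 1, 1]))  # [3.0, 2.0, 7.0, 4.0]
```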

koaning commented 7 years ago

Ah, yes: the player currently does not receive the opponents' points. I could add those to the observations. Would that suffice?
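A sketch of what such an observation could look like; the actual field and card names in sushigo may differ:

```python
# Hypothetical observation layout; the real sushigo observation format
# may differ. Card names here are just placeholders.

observation = {
    "hand": ["tempura", "sashimi", "maki-2"],  # playable cards
    "table": ["wasabi"],                       # cards already played
    "own_points": 12,
    "opponent_points": [15, 9],                # one entry per opponent
}

# With all scores visible, the agent can infer round/game wins itself:
won_round = observation["own_points"] > max(observation["opponent_points"])
print(won_round)  # False
```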

RobRomijnders commented 7 years ago

Ah, that would also solve issue #7.

I need to think about this a bit. I haven't seen RL trained with opponents' points as observations before, but given our domain knowledge of the Sushi Go game, it seems reasonable.