google-deepmind / open_spiel

OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.
Apache License 2.0
4.25k stars, 933 forks

Quoridor Movement Action IDs keep changing #1158

Closed aadharna closed 5 months ago

aadharna commented 10 months ago

So, I'm playing quoridor, and I was trying to figure out which action IDs corresponded to moving the agent. Therefore, after I placed all the walls, I went to look at the legal actions thinking these should be the movement action IDs.

However, I noticed that these IDs keep changing based on where I am on the board/the wall configuration nearby me. That seems really weird as I would expect up to always be the same action id each time.

For example, here are the available actions at this board state:

image

Meanwhile if I then move up (found with trial and error), I have a whole new set of actions to choose from that don't seem in any way related to the previous movement actions.

image

I don't see how an agent could ever learn to play this game if their movement IDs are not fixed. Is there some scheme I'm just missing?

lanctot commented 10 months ago

Hmm, I don't know Quoridor.

Could the movement IDs correspond to the different locations on the board? What move did you take from the first to the second state?

@tewalds

lanctot commented 10 months ago

Either way, the same move should always be represented by the same integer (moves should have state-independent meanings). Like, in chess, a1a2 is always represented by the same integer. All OpenSpiel games are like that. So, if that's not true for Quoridor, then it's a bug that we have to fix.

Edit: I let Timo (the original contributor) know, so hopefully we will have an answer, but he might already be off for the holidays, so I'm not sure when we'll hear back from him. I vaguely remember some changes were made to Quoridor to support the multiplayer variants; I wonder if something was changed at that point. But let's confirm that it is a bug before jumping to any conclusions.

aadharna commented 10 months ago

Like, in chess, a1a2 is always represented by the same integer. All OpenSpiel games are like that.

Gotcha. So far I had only played some of the smaller games and so hadn't come across this yet.

While I can see the logic of that design choice, it massively blows up the action space, no? Why do it this way rather than have each controllable piece select one of k moves (i.e., why not make the actions piece-centric?), like go north, go south, etc., or, for the bishop, "move along the upper-right diagonal 2 spaces", instead of having an integer per move (a1 to a2, b4 to b5)?

I suppose my deeper question is: why doesn't this choice completely break learning? Now I can't associate a given board state with a common action. I don't know what action will take me up in a given state, so how can I appropriately distribute my probability mass? With enough interactions spanning enough of the game tree, will policy gradients just figure this out? I suppose in smaller games where you can fully enumerate a Q-table this doesn't matter, but in more complex games it definitely feels like it could make learning a policy much harder.


Since that's the design pattern for all of OpenSpiel, this isn't a bug, so I'll close this issue shortly.

lanctot commented 10 months ago

I think you may have misunderstood.

Let's take a game like Tic-Tac-Toe. There are 9 unique moves, each corresponding to one spot on the 3x3 square. At any state in the game, move 0 always corresponds to the top-left square. Was that clear?
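In code, that fixed mapping might look like the following (a sketch assuming row-major ordering; the actual implementation may order cells differently):

```python
# Toy illustration: state-independent action IDs in Tic-Tac-Toe.
# Action k always refers to the same cell, regardless of board state.

def action_to_cell(action: int) -> tuple[int, int]:
    """Map an action ID (0..8) to a (row, col) on the 3x3 board."""
    return divmod(action, 3)

def cell_to_action(row: int, col: int) -> int:
    """Inverse mapping: a cell always gets the same action ID."""
    return row * 3 + col

assert action_to_cell(0) == (0, 0)   # action 0 is always top-left
assert cell_to_action(2, 2) == 8     # bottom-right is always action 8
```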

This choice is standard and is actually the better one for learning, so I think you misunderstood: OpenSpiel is not doing the harder thing here. In chess the moves might actually have the piece label in there too (not sure, I'd have to check).

lanctot commented 10 months ago

Because now I can't associate a given board state with a common action.

It's much worse if the action ids are very strongly dependent on the board. Like if moving a1 to a2 is action 763 in one state and action 14 in another, there's really no way to learn anything general about a1a2 (which a function approximator will almost surely have to do representing a Q-function or policy).

lanctot commented 10 months ago

While I can see the logic of that design choice, it massively blows up the action space, no?

As opposed to what, though? I don't see any other viable options.

It'll be larger than relative ids (like, assign indices equal to the position in the legal move list) but RL would have no hope with integers that are too strongly dependent on the state. The policy should be able to generalize its understanding of a move over many states.
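To make the contrast concrete, here is a toy sketch (with hypothetical chess-like move strings) of why indexing into the legal-move list is unstable, while a global enumeration is not:

```python
# Toy illustration: "relative" IDs (index into the legal-move list) are
# unstable across states, while a fixed global table gives each move a
# stable integer. The move strings here are hypothetical.
state_a_moves = ["a2a3", "a1a2", "b1b2"]   # legal moves in one state
state_b_moves = ["a1a2", "c1c2"]           # legal moves in another state

# Relative scheme: the same move gets a different integer per state.
assert state_a_moves.index("a1a2") == 1
assert state_b_moves.index("a1a2") == 0

# State-independent scheme: one global enumeration of all possible moves.
ALL_MOVES = sorted(set(state_a_moves) | set(state_b_moves))
MOVE_TO_ID = {m: i for i, m in enumerate(ALL_MOVES)}
# "a1a2" now has the same ID no matter which state it is legal in.
stable_id = MOVE_TO_ID["a1a2"]
```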

Why do it this way rather than for each controllable piece select one of k moves (i.e. why not make the actions piece-centric?) like go north, south, etc or for the bishop move along the right upper diagonal 2 spaces instead of having per move (going from a1 to a2, b4 to b5) integers?

Well, firstly, there are games that can't be broken down into a series of pieces and relative moves. OpenSpiel supports many more types of games than perfect information games with boards and pieces, so that wouldn't generalize. Besides, the pieces have locations so you'd need to track where the pieces are on the board anyway. The board combined with a state-independent action will identify the piece type immediately.

Each game does this independently, but most of them have moves that are understandable without state (like move 0 corresponding to the top-left piece in Tic-Tac-Toe).

lanctot commented 10 months ago

Since that's the design pattern for all of OS, then this isn't a bug. So I'll close this issue shortly.

Please don't yet because it's not resolved. I have to understand Quoridor more first.

In Quoridor: are the move actions simply { left, up, right, down } and always one space? Or can you move like a rook in chess?

If the latter, then I think actions corresponding to the position that you start and land make more sense. If the former, then there probably should just be four unique move actions.

aadharna commented 10 months ago

Let's take a game like Tic-Tac-Toe. There are 9 unique moves, each corresponding to one spot on the 3x3 square. At any state in the game, move 0 always corresponds to the top-left square. Was that clear?

Yep! That's clear.

This choice is standard and is actually the better one for learning, so I think you misunderstood that OpenSpiel is doing the harder thing. In chess the moves might actually have the piece label in there too (not sure, would have to check).

In a game like tic-tac-toe / go / etc where you just place an object down somewhere on the gameboard, this makes perfect sense.

It's much worse if the action ids are very strongly dependent on the board. Like if moving a1 to a2 is action 763 in one state and action 14 in another, there's really no way to learn anything general about a1a2 (which a function approximator will almost surely have to do representing a Q-function or policy). ... As opposed to what, though? I don't see any other viable options.

I suppose I was thinking about games with pieces that move (e.g., chess). In this case, it seems to me that you could also set up the action space as a multi-discrete space, MultiDiscrete([num_pieces, num_states]). So if I wanted to move the rightmost pawn to a2, I could present the action as the vector [rightmost_pawn_id, a2_id]. Then the structure of the actions is consistent and easily understandable. Furthermore, having discernible structure in the action space should make learning easier, since moving the pawn will always start with selecting the pawn. Alternatively, I could also see having [rightmost_pawn_id, forward, 1_square] as the action in the pawn case, or, for a bishop, something like [black_bishop_id, forward, right_diagonal, 2_squares]. In the case where you just place an object down, that resolves back to Discrete(num_states), and you're right back where you were before, where in TTT you have 9 choices on a 3x3 board.

Is this clearer?

This type of action space scheme is used in games like Micro-RTS or engines like Griddly to do things like [select mining agent 1, move, location] or [select mining agent 1, deposit materials, repository]
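The factored scheme described above can be flattened to a single integer when a framework needs a flat action space. A minimal sketch, with hypothetical sizes (not OpenSpiel's actual encoding):

```python
# Sketch of flattening a factored (piece, target-square) action into one
# integer and back. NUM_PIECES and NUM_SQUARES are hypothetical sizes.
NUM_PIECES = 16      # e.g. one side's chess pieces
NUM_SQUARES = 64     # an 8x8 board

def encode(piece_id: int, target_square: int) -> int:
    """Flatten (piece, square) into a single action ID."""
    return piece_id * NUM_SQUARES + target_square

def decode(action: int) -> tuple[int, int]:
    """Recover the (piece, square) pair from a flat action ID."""
    return divmod(action, NUM_SQUARES)

a = encode(7, 12)            # "piece 7 moves to square 12"
assert decode(a) == (7, 12)  # round-trips exactly
```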

In Quoridor: are the move actions simply { left, up, right, down }

There are two types of actions in Quoridor.

  1. place a 2-segment wall anywhere on the board (sort of like placing a stone in Go) that blocks certain movement transitions in the game (e.g., you can't go from e6 --> e7 if there is a wall separating those two squares)
  2. move in a cardinal direction one square {left, up, right, down}, as you said.

Right now these are all contained in a very large action space where there are different actions for moving from the current square to each other square. So in square b4, there are four unique legal actions for moving to the neighboring squares (assuming the path is not blocked), and then there are roughly 2 x boardsize x boardsize additional actions for placing a wall segment on the board, and those are always consistent. All of these are flattened into a single action space, and it's not clear which distinct action ID corresponds to which thing. For example, if the agent has no wall segments left to place, then there are only up to four legal actions available. But these are not {go north, go south, go east, go west}; they are {c3 --> c4, c3 --> b3, c3 --> d3, c3 --> c2} (assuming no walls block those transitions and the agent is in c3).
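The rough sizes of the two schemes can be sketched with a quick back-of-envelope calculation (assuming the standard 9x9 board; the wall-slot count is my arithmetic for standard Quoridor, not OpenSpiel's actual indexing, and the relative count ignores jump moves):

```python
# Back-of-envelope action counts for the two schemes discussed.
N = 9                                  # standard Quoridor board size
wall_actions = 2 * (N - 1) * (N - 1)   # horizontal + vertical wall slots

absolute_moves = N * N                 # one "move to (r, c)" per square
relative_moves = 4                     # {north, south, east, west} only

print(absolute_moves + wall_actions)   # 209 actions in the absolute scheme
print(relative_moves + wall_actions)   # 132 actions in the relative scheme
```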

lanctot commented 10 months ago

Ok, perfect, understood.. this clarifies a lot.

For the case of quoridor: the movement actions should be separate from the wall actions and there should indeed be only 4 of them: one for each direction. The wall actions, though, make sense the way they are currently implemented (for the same reason as the go and tic-tac-toe examples). So we should fix this. Would you volunteer to submit a PR?

For the chess examples... yes I see what you mean, and we have talked about supporting "structured actions" for these more complex games (similar to your micro RTS and Griddly examples) and just haven't gotten to it. Probably they will come in OpenSpiel 2.0 :)

I will say for chess in particular, it was implemented by Matthew Lai who has quite a bit of experience with chess (via giraffe and then AlphaZero), so I don't think it will be as hard as you might think. The rightmost_pawn_id has the issue that you force the observer to differentiate between different specific pawns, i.e if pawn 6 moves e5e6 is not the same as pawn 7 moving e5e6 which is a bit odd (and probably worse for the stronger pieces which are likely to overlap squares with the same type of piece).

tewalds commented 10 months ago

It is a little more complicated than that. Most of the time the pieces can only move in the cardinal directions, but they can also jump over the opponent, including around a corner, so there are actually 12 relative moves.

I don't have a strong preference for absolute or relative movement actions, nor much of an intuition for which will be easier to learn. One thing to note is this would be a breaking change for any existing agent. I'm not sure if there are any, but it's worth considering having the option for absolute vs relative movement actions, both for backwards compatibility reasons, and so you can try both to see which is easier to learn.

tewalds commented 10 months ago

If you do end up implementing the relative actions, please make sure they are correct for the 3/4-player variant as well. I don't remember if the board is shown in an absolute orientation or a player-relative orientation, but there is likely some complication in making sure the relative actions are correct for all players.

lanctot commented 10 months ago

@tewalds based on my reading of the rules on here: https://en.m.wikipedia.org/wiki/Quoridor

It seems that all those cases can get covered in just 4 movement actions, right? You only jump when an opponent is adjacent and (I assume) only around the corner when you are in the corner or jumping over an opponent in a corner? But in all cases they could be encoded by { move up, move down, move left, move right }... no?

tewalds commented 10 months ago

I guess then it's 8. When jumping around a corner you may get to choose which way to jump (north then east vs north then west).


aadharna commented 10 months ago

I have a 1:1 with my advisor on Tuesday and I'll chat with him showing off the game and seeing if he thinks it's appropriate for the science we're doing and then also get his opinion on if I should rewrite the game to have relative actions vs absolute actions.

aadharna commented 10 months ago

Jeff (advisor) told me to not change environments and that we can return to this once we've seen the initial results we're looking for in other domains rather than stopping science to go do a bunch of environment engineering. So, for now, I'm gonna bookmark this conversation and set it aside for a later date.

lanctot commented 10 months ago

Np, this is quite an easy change on our side so I will reopen the issue as a reminder and hopefully we can get to it at some point.

tacertain commented 6 months ago

I might try to tackle this as a learning exercise. What I think I understand is that currently the actions are IDed by the square to which the piece is going to move (i.e. moving to a4 is always the same ID regardless of where the pawn currently is). What is desired is that the moves are relative to the current pawn position, so moving one square north is always the same ID. Is that correct?

Assuming so, consider the following diagram:

X X E X X
X L A I X
H D P B F
X K C J X
X X G X X

Where P is the current pawn position, X are squares that cannot be reached in a single move, and the letters all represent potentially-valid moves. {A,B,C,D} are the basic moves that can be performed if there is nothing in the target square. If there is something in the target square, {E, F, G, H} are the "jump" moves. If there is something in the target square and there is a wall directly behind, then {I,J,K,L} are the possible "mid-air jump" moves.

As noted above, there are 12 possible squares that a pawn might be able to move to; however, they can't all ever be valid at once, leading to three possible ways (at least that I see) to encode them.

  1. Simply assign each letter a unique code
  2. Since a jump move is only valid if the normal move is blocked, assign unique codes to {(A,E), (B,F), (C,G), (D,H), I, J, K, L}
  3. Since a mid-air jump is only valid if the normal move and the jump are blocked, you could overload e.g. the (A,E) ID to also mean I if A and E are blocked - each mid-air jump means turning clockwise when it uses the overloaded ID, and then there would be one separate ID for the counter-clockwise mid-air jump.

I included the third for completeness, but I don't see any value in it over (1) or (2). Whether it's better to go with (1) or (2) seems like a matter of training efficiency, for which I have no insight.
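Encoding option 2 could be sketched as follows (the letter-to-direction mapping is my reading of the diagram above: A=N, B=E, C=S, D=W, I=NE, J=SE, K=SW, L=NW; a sketch, not an actual implementation):

```python
# Sketch of encoding option 2 above: 8 move IDs. IDs 0-3 are the cardinal
# moves, each doubling as the straight jump when the adjacent square is
# occupied; IDs 4-7 are the diagonal "mid-air" jumps.
MOVES = [
    "north_or_jump",  # A, or E when A is occupied
    "east_or_jump",   # B, or F
    "south_or_jump",  # C, or G
    "west_or_jump",   # D, or H
    "jump_ne",        # I
    "jump_se",        # J
    "jump_sw",        # K
    "jump_nw",        # L
]
assert len(MOVES) == 8   # versus 12 unique IDs under option 1
```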

Does anybody care?

tacertain commented 6 months ago

Well, it's not quite as straightforward as I had hoped. The game encodes all the walls and player positions implicitly using x,y coordinates (interleaving the wall locations and the pawn locations), which means the effects of all the actions are state-independent and can be computed directly from the action ID. It wouldn't be too hard to change and add a state-dependent decoding step, but it's more involved than maybe I'm up for unless somebody really wants the change.

lanctot commented 6 months ago

Hi @tacertain, apologies for the delay in response -- I'm en route to a conference in New Zealand right now and communication may be delayed / segmented.

I think 2 is the simplest. I like it most.

State::ApplyAction and State::LegalActions definitely both have to change, and it's non-trivial, but it would make a nice change. I can describe in text what I'd do once I can look at the code. Would you be more inclined if I did that?

In the end, it is a rather low-priority fix because the person who originally brought it up is not using Quoridor as far as I can tell, and I'm not sure how much use the game is getting. I'd like to have it fixed by the v1.5 release (mid-to-late May) so we can take this game off the "games with known problems" list, but I'd be happy to look into it myself if you don't.

(I personally prefer to actively solicit help and external interest, but no worries if it's too intricate... though you just fixed a bug in C++ DQN that probably few OpenSpiel users would have been able to... just saying! wink wink, nudge nudge, I'll say no more)

tacertain commented 6 months ago

Hi @lanctot . I hope your travel goes well - no worries on slow responses. I certainly feel capable of making the changes. I think what gives me a little pause is how to test. Can I assume that if the quoridor_test module passes, it's good? I have a lot of experience with code bases whose tests don't actually provide confidence that changes haven't introduced bugs, so I'm a little wary of that!

Second, there was a suggestion above to make the choice of absolute vs relative moves configurable. It's not at all obvious to me how to do that, given the framework. Do you have any pointers?

lanctot commented 6 months ago

Hello from 30,000 feet! (Wifi on a plane will never cease to excite/amaze me)

The tests are super tricky, especially in games, so they will almost certainly not cover all the cases. But they should catch the basic ones, and we can add a few new tests by manually executing specific lines of play that bring us to the tricky cases.

But I would not spend too much effort on it. If you do it, I promise to closely inspect it and think of every possible case. Not ideal but it's what I can offer. Keep in mind @tewalds's point about the 3/4 player variant.

I would rather have the game somewhat usable for RL than not at all. We already have an example of someone who would not use it because of this, so that's worse IMO.

Also we will mark it as lightly tested (which I assume is its current designation) so people are aware.

I think supporting both would make the logic needlessly complex and we should just change it.

Anyway, I leave it up to you! 😉

aadharna commented 6 months ago

Just to chime back in: my not using Quoridor is because Jeff (my advisor) and I decided to shelve the project in its entirety, after about 1.5 years of work that hasn't really gone anywhere exploring our questions using Connect 4.

I had looked at Quoridor because of the hypothesis that perhaps we were not getting good results because C4 lacked complexity in its game dynamics. However, after an AlphaZero-based approach to our problem (which I still owe you a PyTorch PR of, Marc) fell into the negative pathologies we were afraid could kill the project, we decided to shelve the entire thing, and I've been spinning up on some new research ideas.

Best of luck on this!

lanctot commented 6 months ago

Ah, thanks @aadharna, ok. Sorry to hear that, and I'm always happy to get more PRs when you have time.

But even if you had gone on to use Quoridor, I think the current impl could make it unnecessarily difficult (awkward, even) for RL approaches. For me that was enough to put it on the "known problems" list, because I think it should be fixable without too much effort.

But I will say I have had my own issues with Connect Four. It is quite an unbalanced game, tbh.. hard for player 2 to make any headway. Not sure if that's where you had issues but if you ever want to chat about it let me know -- it sounded pretty neat!

tacertain commented 6 months ago

I'm curious why it's not useful at all for RL right now. I get that it will be more efficient for training if the model gets to lump all "go north" actions together, but it should still work with absolute moves, right?

Also, it seems to me there might be a bigger problem with the game for learning, which is that the actions and board representation aren't relative to the player. I.e., player one wants to go "north" 9 times to win, but player two wants to go "south" nine times. From reading the Alpha* papers (and looking at chess.cc), it seems like you need everything to be relative: not just that "go north" is always the same, but that "go towards my goal" is always the same (and wall placements are always relative to your initial position, etc.). This seems like a fatal flaw, or am I missing something?

tacertain commented 6 months ago

I realize my two paragraphs above are somewhat contradictory. I mean that there are two ways in which moves can be relative. The first is that the whole board can be relative to the player number, so a1 is always the far-left corner relative to the player's starting pawn position. The second is that the pawn's movement actions can be described relative to the pawn, or absolutely using board coordinates.

I thought this fix was about changing the second (making pawn actions relative to the current pawn position), and that doesn't seem like a fatal flaw for learning, just an inefficiency. Is that correct, or am I missing something?

But the first sense of the word relative (that the board is always described relative to the player number) seems crucial, and it's not what the game currently does. I don't see how you can do self-play training in this case. Is that correct?

lanctot commented 6 months ago

It's basically theory vs. practice. Conceptually neural nets can learn any computable function. In practice, there are severe limitations.

Overloading the absolute coordinates between wall placements and move actions means you put the onus of differentiating which case you are in on the network and training algorithm. That is a lot to ask of a parameterized function of the raw inputs when you don't have to. Having them be separate means some of the weights associated with wall decisions can at least be separate from those regarding movement. Often a movement action will mean similar things even when the location of the player is different, so those state-independent assessments can be made based on the action alone. It's basically a form of built-in knowledge.

Actually your second paragraph makes this even clearer. In AZ we used player-independent observations so that players 1 and 2 had similar perspectives, and one network could understand playing as white vs. as black regardless. In OpenSpiel most observations were designed with the independent RL use case in mind (where separate players use different networks, so differences in observation spaces across players don't matter much). Some of the games have the option to show observations in an egocentric way as you describe (always relative to the current player) for this reason, but most don't.
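A minimal sketch of such an egocentric transform for the 2-player case (a hypothetical helper, not an OpenSpiel API):

```python
# Sketch: rotate the board so each player sees their own goal at the "top".
# Hypothetical helper for a square board of integer cell values.

def egocentric_view(board: list[list[int]], player: int) -> list[list[int]]:
    """Return the board from `player`'s perspective.

    Player 0 sees the board as-is; player 1 sees it rotated 180 degrees,
    so "towards my goal" looks the same for both players.
    """
    if player == 0:
        return [row[:] for row in board]
    return [row[::-1] for row in board[::-1]]

board = [[1, 0],
         [0, 2]]
assert egocentric_view(board, 1) == [[2, 0], [0, 1]]
```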

These are all standard things practitioners use to make learning with neural nets easier.

lanctot commented 6 months ago

Yeah, certainly not a fatal flaw, but an unnecessary inefficiency that we should fix, because it's relatively straightforward and most of our games don't have this issue.

tacertain commented 6 months ago

@lanctot When you get a chance, can you look at the PR and let me know what you think on the topic of default behavior (and any other feedback you have)?

tacertain commented 5 months ago

@lanctot any thoughts? I'd like to close this out.

lanctot commented 5 months ago

@lanctot any thoughts? I'd like to close this out.

Hey... I've got a lot of catching up to do since coming back and two deadlines on some work tomorrow. Unfortunately these larger PRs can take some time as I've got several other higher-priority things on my todo list. Sorry for the delay. I'll try to take a look by Thursday.

tacertain commented 5 months ago

Can you give me a thought on the default question?

I can keep the default as absolute, which means any old models are correct but the default is not ideal, or I can make the default relative and see if there's anything I can do to make it more obvious that old models won't work.

That's the main thing left for me to close out. Then the PR will be ready to review.

Thanks


lanctot commented 5 months ago

Oh sorry, I didn't realize you were blocked on me. I thought I had answered this question before: we should move entirely to the new relative format and not even support the old one.

We should not worry about backward-compatibility for the reasons that:

  1. Changes on the master branch can be backward-incompatible; people can always go back versions if they want the old functionality
  2. I don't think Quoridor is being heavily used right now. If it is being used, it's probably by people who checked out their own copy or use it on their own fork and don't update from master.
  3. We marked this as a "Known issue" because it's bad enough that it should just be fixed (replaced).

Hope this helps

lanctot commented 5 months ago

And, once done (maybe in this PR?), you should remove the game from the Known Issues list here: https://github.com/google-deepmind/open_spiel/blob/7d3a355b4927acf4d21fd76e2ea799d9f5c0bb7a/open_spiel/spiel.cc#L172

tacertain commented 5 months ago

Sounds good. You're right: on rereading the earlier comments you said to scrap absolute movements. I'll do that and update the PR.

Also, to be clear: for all players, moving "north" on the board is the same action ID. For player 1 that's towards the goal, for player 2 it's away from the goal, and for players 3/4 it's orthogonal to the goal. So a model trained as P1 will not play well as the other players. My understanding from the discussion above is that that's all well and good, but just confirming.

lanctot commented 5 months ago

So a model trained as P1 will not play well as other players. My understanding from the above discussion is that that's all well and good, but just confirming.

Confirmed. This is true for almost all games in OpenSpiel as the "independent RL" case is mostly assumed, so it's fine to continue here.

Since you might be looking into self-play with AlphaZero: there, a single network is reused, so it would work better if games supported an egocentric observation that aligned across players. This is something we'd ideally have for all games, and right now we unfortunately only have it for a small fraction of them. Hopefully we can get better support for different observation types eventually (though it may have to come in the form of external contributions -- if you look at a particular game and want to do this, I'd be happy to describe how to go about it).

tacertain commented 5 months ago

I am interested in playing around with AlphaZero. Once you have cleared through your backlog maybe we can have a chat about it.


tacertain commented 5 months ago

OK, I have the code changes done and PR #1229 submitted. Of course the playthrough tests fail because the action IDs are different. It looks like it's easy enough to regenerate the test files using action strings and get the new action IDs. I'll look into that next.

tacertain commented 5 months ago

Playthroughs have been updated. I think the PR is ready.