Invalid Actions, Mask and DQN

Cyazd commented 2 years ago

Hi everybody,

I'm using the DQN algorithm for a two-player board game. In this game, 40 moves are available, but once a move has been made, it can't be made again.

My issue relates to invalid moves. I trained my first model against an opponent that chooses its moves at random. If the model makes an invalid move, then, following the advice given by araffin here, I give a negative reward equal to the maximum score one can obtain and stop the game. I also tried other strategies, such as giving a huge negative reward, or giving a small one and letting the game continue until the model finds a valid move.

Once that was done, I trained a new model against the one obtained from the first run. Unfortunately, the training process eventually gets stuck because the opponent seems to repeat the same invalid move in a loop. This means that, despite everything I've tried, the first model I trained still predicts invalid moves.

I've thought of different solutions, but can't figure out whether they are good ones, or how to implement them:

- I feel that stopping the game if the opponent makes too many invalid moves wouldn't be a solution...
- I saw that masking invalid moves can be a solution, but I've only seen examples of it with the PPO algorithm, and I can't find the docs on how to do it.
- I also thought of taking the next-best predicted move, but model.predict only returns the best one, and I can't see how to get the second-best one.

All in all, I'm a bit stuck, and would be very grateful if somebody could help me out.

Thanking you in advance.

araffin commented 2 years ago

Hello,

First things first, I would highly recommend that you switch to stable-baselines3.

> I saw that masking invalid move can be a solution. But i've seen it only on examples with the PPO algorithm, and can't find the docs on how to do it

You should take a look at https://sb3-contrib.readthedocs.io/en/master/modules/ppo_mask.html and at the associated paper and code.

For DQN, you would need to set the q-values to -inf to mask them.
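
As a rough illustration of the masking idea above, here is a minimal sketch in plain Python. It is not taken from Stable-Baselines: the function name `masked_greedy_action` and the boolean `valid_mask` argument are hypothetical, and it assumes you already have one Q-value per action as a list of floats. Invalid actions are set to -inf so the greedy argmax can never select them.

```python
import math
from typing import Sequence


def masked_greedy_action(q_values: Sequence[float], valid_mask: Sequence[bool]) -> int:
    """Return the index of the highest-valued action among the valid ones.

    Hypothetical helper for illustration only.
    q_values:   one estimated Q-value per action.
    valid_mask: True where the action is still allowed, False otherwise.
    """
    if len(q_values) != len(valid_mask):
        raise ValueError("q_values and valid_mask must have the same length")
    if not any(valid_mask):
        raise ValueError("no valid action available")

    # Replace the Q-values of invalid actions with -inf so they can never win the argmax.
    masked = [q if ok else -math.inf for q, ok in zip(q_values, valid_mask)]

    # Greedy choice over the masked values (first index wins on ties).
    return max(range(len(masked)), key=masked.__getitem__)


# Example: action 1 has the highest Q-value but has already been played.
action = masked_greedy_action([1.0, 5.0, 3.0], [True, False, True])
```

In the board game described in this thread, the mask would have 40 entries, with False for every move that has already been made. Note that this only covers action selection; the training target would still take a max over all actions unless it is masked as well.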

> I thought also of getting the next best predicted move

For that you need a model of your environment to know what would be the next moves, this is called model-based RL (and is not the focus of Stable-Baselines).

Cyazd commented 2 years ago

Thank you very much for your quick answer.

> For DQN, you would need to set the q-values to -inf to mask them.

I have a stupid question: how can I get and set those q-values? I've honestly looked through the docs a lot, but couldn't find an answer.

(Sorry for the stupid question)

araffin commented 2 years ago

> I have a stupid question : how can I get and set those q-values ? I've honestly been looking a lot around the docs, but couldn't find an answer.

See https://github.com/DLR-RM/stable-baselines3/issues/568
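
For readers who cannot follow the link, here is a hedged sketch of one way this might look with a stable-baselines3 DQN model. It assumes the model exposes its Q-network as `model.q_net` and that `model.policy.obs_to_tensor` converts a single observation into a batched tensor; the helper name `masked_greedy_action_from_model` and the `valid_mask` argument are hypothetical and not part of the library. It does not apply to the older TensorFlow-based stable-baselines.

```python
import torch as th


def masked_greedy_action_from_model(model, obs, valid_mask) -> int:
    """Pick the best valid action according to the model's Q-network.

    Hypothetical helper for illustration only.
    model:      assumed to be a trained stable-baselines3 DQN instance.
    obs:        a single (unbatched) observation.
    valid_mask: sequence of booleans, True where the action is allowed.
    """
    # Assumption: obs_to_tensor returns (batched observation tensor, is_vectorized flag).
    obs_tensor, _ = model.policy.obs_to_tensor(obs)

    with th.no_grad():
        # Assumption: q_net returns Q-values with shape (1, n_actions).
        q_values = model.q_net(obs_tensor)

    mask = th.as_tensor(valid_mask, dtype=th.bool, device=q_values.device)
    if not bool(mask.any()):
        raise ValueError("no valid action available")

    # Set the Q-values of invalid actions to -inf, then take the greedy action.
    q_values = q_values.masked_fill(~mask, float("-inf"))
    return int(q_values.argmax(dim=1).item())
```

This would be used in place of `model.predict` when choosing moves at play time, for example for the frozen opponent; it leaves the training loop itself unchanged.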

hill-a / stable-baselines

Invalid Actions, Mask and DQN #1155