Hi Michelangelo,
nice implementation of the Q-learning algorithm.
I liked that you implemented several opponent strategies for testing your trained model, but I don't understand why you don't also use them during training!
You could reach better performance by training your model not only against the random opponent strategy but also against the stronger ones; a rough sketch of the idea is below.
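For example, you could sample a different opponent from a pool at the start of each training episode. This is just a minimal sketch under my own assumptions (the opponent functions here are placeholders for the strategies you already wrote, and I'm assuming they all share a simple "pick a move from the legal moves" interface):

```python
import random

# Placeholder opponent policies; in your project these would be the
# strategies you already implemented for evaluation (random, greedy, ...).
def random_opponent(board, legal_moves):
    return random.choice(legal_moves)

def greedy_opponent(board, legal_moves):
    # Stand-in for one of your stronger strategies.
    return legal_moves[0]

OPPONENT_POOL = [random_opponent, greedy_opponent]

def sample_opponent():
    # Draw a different opponent each training episode so the agent
    # sees both weak and strong play, not only random moves.
    return random.choice(OPPONENT_POOL)
```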
Another point is the choice of a constant epsilon value.
With a constant epsilon the agent keeps exploiting its current best action-values from the very start and only rarely takes random exploratory actions, so it can get stuck in a suboptimal policy. In my opinion a decaying epsilon is preferable: explore a lot in the early episodes and gradually shift towards exploitation as the Q-values become more reliable.
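Something like the schedule below is all it takes; this is only an illustrative sketch (the helper name, the constants, and the Q-table shape are my assumptions, not taken from your code):

```python
import random

def epsilon_greedy(q_values, legal_actions, epsilon):
    # With probability epsilon take a random legal action (exploration),
    # otherwise take the action with the highest Q-value (exploitation).
    if random.random() < epsilon:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: q_values.get(a, 0.0))

# Illustrative schedule constants.
eps_start, eps_min, eps_decay = 1.0, 0.05, 0.9995

epsilon = eps_start
for episode in range(10_000):
    # ... run one training episode, calling epsilon_greedy(...) for each move ...
    # Decay epsilon after every episode, but never below eps_min, so early
    # training explores heavily and later training mostly exploits.
    epsilon = max(eps_min, epsilon * eps_decay)
```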
Anyway good job!