Closed earthwuyang closed 2 years ago
Hello @earthwuyang
Thanks for opening the issue here. It's very possible that my solution is wrong!
Could you please explain in a bit more detail what the correct solution is and where I made a mistake?
Thank you! I might also be wrong. If the mistake is mine, I would be thankful if you could point it out.
I recalculated my solution from scratch and you're right! I'm not sure how I arrived at my solution...
Anyway, to make the calculation clear, I think one has to keep a vector with the sum of rewards for each action, $S_i$, together with a counter $n_i$. The counter tracks the number of times a particular action has been taken. The action-value is then calculated as the ratio of the sum of rewards to the number of times the action was taken: $Q_i = S_i / n_i$.
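To make the bookkeeping concrete, here is a minimal Python sketch of that calculation. The function name `sample_average_values`, the initial estimate of 0 for untried actions, and the example history are all my own assumptions for illustration. They are not taken from the exercise or from the solution in this repository.

```python
from typing import List, Sequence, Tuple


def sample_average_values(
    history: Sequence[Tuple[int, float]], n_actions: int
) -> List[List[float]]:
    """Return the action-value estimates Q after every step.

    `history` is a sequence of (action, reward) pairs in the order they occurred.
    """
    sums = [0.0] * n_actions   # S_i: sum of rewards received for action i
    counts = [0] * n_actions   # n_i: number of times action i was taken
    q = [0.0] * n_actions      # Q_i: assumed to start at 0 for untried actions
    trace = []
    for action, reward in history:
        sums[action] += reward
        counts[action] += 1
        # Sample average: Q_i = S_i / n_i
        q[action] = sums[action] / counts[action]
        trace.append(list(q))
    return trace


# Hypothetical two-armed history, made up for this example.
history = [(0, 1.0), (1, 1.0), (1, 2.0), (0, 3.0)]
trace = sample_average_values(history, n_actions=2)
for step, estimates in enumerate(trace, start=1):
    print(step, estimates)
```

Only the estimate of the action taken at a given step changes. The other estimates carry over unchanged, which is an easy thing to get wrong when doing the table by hand.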
Here's the full updated solution (it exactly matches yours):
.
I think the calculation of the rewards of each arm at each step does not follow the sample-average formula given in the book.