One way would be to let, e.g., a strong rule-based agent play, collect the respective states, actions, etc. (and the rewards that would be given), and then feed that data to the Q-learning agent.
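A rough sketch of what collecting such demonstration data could look like (names like `rule_agent.act` and `env.step` are placeholders, not the actual API of this repo):

```python
def collect_demonstrations(env, rule_agent, n_episodes=100):
    """Let a rule-based agent play and record the (s, a, r, s', done) tuples it generates."""
    transitions = []
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = rule_agent.act(state)               # "expert" move
            next_state, reward, done = env.step(action)  # assumed env interface
            transitions.append((state, action, reward, next_state, done))
            state = next_state
    return transitions
```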
Instead of live feeding, which would be more work to implement (the QHandlers need to compute the valid coords and the Samplers assume data from a live trained agent), one could simply pre-train a regression model on this data by hand (but that still needs some kind of QHandler for computing the temporal difference, SARSA targets, etc.). Or, probably simplest, one could use a normal QAgent/QHandler while the rule-based agent/human/... is actually playing; this could also serve as pretraining. I.e., one could just define another QAgent (or even only a policy modifier!!) that works normally, but whenever it picks an action it picks the one of a rule-based agent/user (maybe with a certain probability, or say for the first $n$ rounds only).
This could probably be implemented quickly and could lead to interesting results, because this way the agent/Q network would see many more high-quality states and many more "good moves". However, it would then rarely see bad moves or experience many penalties. Maybe a policy modifier could just mix in decisions of a rule-based agent with a set probability.
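A minimal sketch of such a policy modifier, assuming it gets to override the action the QAgent already picked (class and method names here are illustrative only, not the existing policy-modifier interface):

```python
import random

class RuleBasedMixinModifier:
    """With probability p (or unconditionally during the first n_rounds),
    replace the learned agent's chosen action with the one a rule-based
    agent or human would take."""

    def __init__(self, rule_agent, p=0.5, n_rounds=None):
        self.rule_agent = rule_agent
        self.p = p
        self.n_rounds = n_rounds
        self.round = 0

    def modify(self, state, chosen_action):
        self.round += 1
        force_rule = self.n_rounds is not None and self.round <= self.n_rounds
        if force_rule or random.random() < self.p:
            return self.rule_agent.act(state)  # override with the "expert" move
        return chosen_action  # keep the Q agent's own choice
```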
This could also be a way to kick-start training for the final game if training isn't converging in the beginning.
Note that training data, i.e., tuples $(s,a,r,s')$, may not only come from the agent that is actually playing but also, e.g., from the opponents. Maybe one could somehow also include the actions that the opponents took in the training data? But this would need to not conflict with the return computation etc. and may thus be tricky.
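One way this might stay compatible with return/TD computation (purely a sketch, assuming a simple replay-buffer-style storage rather than whatever the Samplers actually expect) is to keep each player's trajectory separate, so targets are only ever formed along one player's own sequence of states:

```python
def store_all_players(buffer, per_player_transitions):
    """per_player_transitions: dict mapping player id -> list of (s, a, r, s', done).

    Transitions are appended per player, so s' of one tuple is always the same
    player's next state and returns/TD targets never chain across players.
    """
    for player_id, transitions in per_player_transitions.items():
        for s, a, r, s_next, done in transitions:
            buffer.append((s, a, r, s_next, done))
```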