luwo9 / bomberman_rl

Reinforcement learning for Bomberman: Machine Learning Essentials lecture 2024 final project

Multiple runs of main.py #10

Closed luwo9 closed 3 months ago

luwo9 commented 4 months ago

As of yet, it is unclear whether training a well-crafted agent in a single configuration is sufficient. That is, it may be necessary to first train in, say, a peaceful environment or against weaker agents, and only then in harder environments.

However, even though the code allows training several rounds at once, to my knowledge it does not allow switching environments during that time. This means the main.py script needs to be run several times, which has some consequences:

  1. Any "number of games played" counter provided by the bomberman framework is not usable across runs (and not compatible with, e.g., the length of the training set)
  2. A rewarder needs to respect this: a "training wheels" phase may require different rewards etc. than a "harsh/real-world" game
    • Either the rewarder needs to change its behaviour after a certain number of rounds (= the number of rounds after which the environment changes); see the sketch after this list
    • Or one needs different rewarders, meaning a code change is probably required in between the main.py calls
  3. Maybe other things like batch size etc. change as well, which might require code changes too
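
A minimal sketch of the first option, a rewarder that switches its behaviour after a fixed number of rounds. All names here (`PhasedRewarder`, `compute_reward`, `CoinRewarder`, `WinRewarder`) are hypothetical and not part of the existing code:

```python
# Hypothetical sketch: a rewarder that delegates to different base rewarders
# depending on how many rounds have been played so far.

class PhasedRewarder:
    def __init__(self, phases):
        # phases: list of (rounds_until_switch, rewarder) pairs, e.g.
        # [(100, CoinRewarder()), (200, WinRewarder())]; the last entry is
        # used for all remaining rounds.
        self._phases = phases
        self._rounds_played = 0

    def new_round(self):
        # Must be called once per finished round; this replaces any counter
        # provided by the bomberman framework (see point 1 above).
        self._rounds_played += 1

    def compute_reward(self, old_state, action, new_state, events):
        for switch_round, rewarder in self._phases:
            if self._rounds_played < switch_round:
                return rewarder.compute_reward(old_state, action, new_state, events)
        # Past the last switch point: keep using the final rewarder.
        return self._phases[-1][1].compute_reward(old_state, action, new_state, events)
```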

This could mean that an agent is not simply groupable as a collection of its parts (say, regression model, rewarder, training set, sampler), but that these parts may change for an agent over its training process. In that case some thought needs to be put into how to save and package agents, and how to (still) allow for streamlined, automated training.

This relates to #7 in that sense.

luwo9 commented 4 months ago

The best way seems to be to really define the environment changes after, e.g., 100 and 200 full games, and then create Samplers, Rewarders, etc. that are aware of this and change their internal settings after that many steps. The same goes, e.g., for the training memory, so that it resets, say, after the environment changes. Those objects are the least general and thus should be adapted to the environment and picked at the top level, while the most general objects (Q agents, Q handlers) should not need to be adapted to such a case.
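
A rough sketch of what such phase-aware objects could look like, with the phase boundaries defined in exactly one place that the other components query. All class and method names here are made up for illustration:

```python
# Hypothetical sketch: a single schedule object shared by rewarder, sampler,
# training memory, etc., so that boundaries like 100 and 200 games are
# defined only once.

class PhaseSchedule:
    def __init__(self, boundaries):
        # boundaries: e.g. [100, 200] -> phase 0 for games 0-99,
        # phase 1 for games 100-199, phase 2 afterwards.
        self._boundaries = boundaries
        self.games_played = 0

    def game_finished(self):
        self.games_played += 1

    @property
    def phase(self):
        for i, boundary in enumerate(self._boundaries):
            if self.games_played < boundary:
                return i
        return len(self._boundaries)


class ResettingMemory:
    """Training memory that clears itself whenever the phase changes."""

    def __init__(self, schedule):
        self._schedule = schedule
        self._last_phase = schedule.phase
        self._transitions = []

    def add(self, transition):
        if self._schedule.phase != self._last_phase:
            self._transitions.clear()  # reset at the environment switch
            self._last_phase = self._schedule.phase
        self._transitions.append(transition)
```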

But is this really the best way: defining, e.g., a rewarder that changes its behaviour from one base implementation to another (~ coin rewarder -> win rewarder)?

luwo9 commented 3 months ago

Thinking more about this, maybe a Bundle in bombermans.py should be redesigned as follows:

This would require small changes, such that an object is only responsible for saving/loading itself, not the objects it holds onto.

Such custom adjustments could, e.g., include:

This would also allow, e.g., saving a final agent version without the training set for submission, reducing the file size.
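
As a rough sketch of how such a redesigned Bundle could look: each component is stored as its own file, and components can be excluded on saving or swapped out on loading. This is not the current bombermans.py API; the per-component pickling is an assumption for illustration:

```python
import os
import pickle


# Hypothetical sketch of a redesigned Bundle: components are saved/loaded
# individually instead of as one monolithic object.
class Bundle:
    def __init__(self, **components):
        # e.g. Bundle(model=..., rewarder=..., memory=..., sampler=...)
        self.components = components

    def save(self, directory, exclude=()):
        os.makedirs(directory, exist_ok=True)
        for name, component in self.components.items():
            if name in exclude:  # e.g. exclude=("memory",) for a submission version
                continue
            with open(os.path.join(directory, f"{name}.pkl"), "wb") as f:
                pickle.dump(component, f)

    @classmethod
    def load(cls, directory, override=None):
        components = {}
        for filename in os.listdir(directory):
            if filename.endswith(".pkl"):
                with open(os.path.join(directory, filename), "rb") as f:
                    components[filename[:-4]] = pickle.load(f)
        # override lets a run swap in, e.g., a newly designed rewarder.
        components.update(override or {})
        return cls(**components)
```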

For this, it must be informed what the current run of main.py intends, requiring a solution in sync with #7.

One could, e.g., write a configuration file that specifies what should be done (= loaded/saved/swapped out) in main.py, and possibly even the different configurations intended for the corresponding main.py run. Maybe a custom script could then automate the main.py runs as well.
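
A hedged sketch of what such a driver script could look like; the config keys, the `--config` flag on main.py, and the file names are all made up and would have to be supported by main.py first:

```python
import json
import subprocess

# Hypothetical per-phase configurations: what to load, what to save,
# which rewarder to use, and how many rounds to train.
phases = [
    {"load": None,           "save": "agent_phase1", "rewarder": "coin", "rounds": 100},
    {"load": "agent_phase1", "save": "agent_phase2", "rewarder": "win",  "rounds": 100},
]

for i, phase in enumerate(phases):
    config_path = f"phase_{i}.json"
    with open(config_path, "w") as f:
        json.dump(phase, f)
    # Assumes main.py is extended to accept such a config file.
    subprocess.run(["python", "main.py", "--config", config_path], check=True)
```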

One should think about the consequences of using the bombermans.py Bundles when, e.g., wanting to continue training a model but with a newly designed rewarder.
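
With a Bundle along the lines of the sketch above, that case could look like this (purely illustrative; `NewRewarder` is a placeholder for whatever replacement gets written):

```python
class NewRewarder:
    """Placeholder for a newly designed rewarder."""

    def compute_reward(self, old_state, action, new_state, events):
        return 0.0


# Reuse everything that was saved, but swap in the new rewarder.
bundle = Bundle.load("agent_phase2", override={"rewarder": NewRewarder()})
```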