luwo9 / bomberman_rl

Reinforcement learning for Bomberman: Machine Learning Essentials lecture 2024 final project

Reward shaping for kills #46

Open luwo9 opened 4 weeks ago

luwo9 commented 4 weeks ago

While not clear yet, it is likely that killing opponents or laying bombs next to them will rarely happen during normal training. In that case one might need suitably shaped rewards (that leave the optimal policy invariant), e.g., rewarding getting closer to an opponent player.
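For reference, a minimal sketch of potential-based shaping (Ng et al., 1999), which is the standard way to keep the optimal policy invariant. `GAMMA`, the state-dict keys, and the potential function are illustrative assumptions for the sketch, not our actual code:

```python
# Sketch: potential-based reward shaping (Ng et al., 1999).
# F(s, s') = gamma * phi(s') - phi(s) leaves the optimal policy unchanged.
# GAMMA, the state keys and the potential below are illustrative assumptions.

GAMMA = 0.99

def potential(agent_pos, opponent_positions):
    """Higher potential the closer the nearest opponent is (Manhattan distance)."""
    if not opponent_positions:
        return 0.0
    ax, ay = agent_pos
    nearest = min(abs(ax - ox) + abs(ay - oy) for ox, oy in opponent_positions)
    return -float(nearest)  # closer opponents -> higher potential

def shaping_reward(old_state, new_state):
    """Added on top of the environment reward; policy-invariant by construction."""
    phi_old = potential(old_state["self_pos"], old_state["opponents"])
    phi_new = potential(new_state["self_pos"], new_state["opponents"])
    return GAMMA * phi_new - phi_old
```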

RuneRost commented 3 weeks ago

I'm not sure it is useful to reward getting closer to another player, since this might also increase the chance of getting killed. It might be worth a try, however, to give this reward at the beginning of training and later leave it out.

Additional ideas for rewarding:

- reward if an enemy is in bomb range (higher reward if the bomb's countdown is lower); see the sketch after this list
- reward placing bombs when enemies are close
- one could think about rewarding placing bombs in places where enemies often move (e.g. near coins) - might however be rather complicated
- one could reward placing bombs close to the explosion range of other bombs, so that when enemies try to evade that region, they run into the newly placed bomb
- one could try to predict enemy movement (e.g. a neural network that predicts their next move(s)) and then reward placing a bomb at the position where they move
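A rough sketch of the first idea. The board encoding, blast radius, and reward scale are assumptions for the sketch, not the framework's actual API:

```python
# Sketch of the "enemy in bomb range" reward from the first bullet above.
# Walls as a set of (x, y) tiles, bombs as (pos, countdown) pairs, and the
# fixed blast radius are assumptions about the environment.

BLAST_RADIUS = 3  # assumed blast length in tiles

def tiles_in_blast(bomb_pos, walls):
    """Tiles a bomb at bomb_pos would hit, stopping at the first wall per direction."""
    bx, by = bomb_pos
    tiles = {(bx, by)}
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        for step in range(1, BLAST_RADIUS + 1):
            tile = (bx + dx * step, by + dy * step)
            if tile in walls:
                break
            tiles.add(tile)
    return tiles

def bomb_threat_reward(bombs, opponents, walls, max_countdown=4):
    """Small reward per threatened opponent, growing as the countdown shrinks."""
    reward = 0.0
    for bomb_pos, countdown in bombs:
        blast = tiles_in_blast(bomb_pos, walls)
        for opp in opponents:
            if opp in blast:
                reward += 0.1 * (max_countdown - countdown + 1) / max_countdown
    return reward
```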

Still, I think we should reward surviving the most, and also punish badly placed bombs.

RuneRost commented 3 weeks ago

After rethinking, I think we should definitely at least try rewarding moving closer to enemies in the beginning. The question is how we efficiently calculate the distance to enemies in this maze to check whether it decreases.
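One candidate would be a plain breadth-first search from our own tile; a rough sketch below (the boolean `free` grid and position tuples are just an assumed board encoding). Not sure yet whether this is fast enough to run every step:

```python
# Sketch: BFS over walkable tiles to get the true maze distance to the
# nearest enemy. free[x][y] is True for walkable tiles (assumed encoding).
from collections import deque

def maze_distance(start, targets, free):
    """Shortest path length from start to any target tile, or None if unreachable."""
    targets = set(targets)
    if start in targets:
        return 0
    width, height = len(free), len(free[0])
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        (x, y), dist = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height \
                    and free[nx][ny] and (nx, ny) not in visited:
                if (nx, ny) in targets:
                    return dist + 1
                visited.add((nx, ny))
                queue.append(((nx, ny), dist + 1))
    return None  # no reachable enemy
```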

luwo9 commented 3 weeks ago

Exactly, I agree. Rewarding closeness should be a policy-invariant reward shaping that just helps the agent get more data about killing opponents.

The other ideas are also really good.

I am not sure about rewarding survival and whether that may lead to loops/waiting.

About the distance: I am also not sure there is a straightforward/easy way to do that. Maybe it would just be enough to compute the Manhattan distance $|x_1-x_2|+|y_1-y_2|$ and be cautious with rewarding it. Maybe it helps to not reward getting closer if the distance is already only, say, 5 anyway. But maybe tests with the peaceful agent etc. show that this reward is not strictly necessary.
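Assuming the potential-based formulation from earlier in the thread, the "don't reward below distance 5" rule can be expressed as a clipped potential, so it stays policy-invariant. A sketch (the cutoff value and function name are just the example from this comment):

```python
# Sketch: clip the Manhattan-distance potential so that moving closer than
# CUTOFF tiles earns no further shaping reward. Plugged into the
# gamma * phi(s') - phi(s) shaping term, this remains policy-invariant.

CUTOFF = 5  # the "say, 5" value from above; would need tuning

def clipped_potential(agent_pos, opponent_positions):
    """Negative Manhattan distance to the nearest opponent, flat below CUTOFF."""
    if not opponent_positions:
        return 0.0
    ax, ay = agent_pos
    nearest = min(abs(ax - ox) + abs(ay - oy) for ox, oy in opponent_positions)
    return -float(max(nearest, CUTOFF))
```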