Open zhengtiantian opened 2 years ago
Hi @zhengtiantian , thanks!
The overall behaviour of the policy that an RL agent learns depends on what is optimal for the underlying MDP (i.e., the environment/gym/aviary class). If the environment terminates on the boundary and only gives negative rewards there, it is unlikely that an agent will learn how to stop: it can simply "hack the reward" by reaching a high-reward/low-penalty point in the state space and then terminating the episode early by leaving the arena.
Depending on which type of behaviour you are trying to learn, you should carefully choose the reward and done signals.
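To make the idea above concrete, here is a minimal, hypothetical sketch of how reward and done signals could be shaped so that "escaping the arena" is never the optimal policy. None of these names (`TARGET`, `ARENA_RADIUS`, the functions themselves) come from the library's actual API; this is only an illustration of the principle: give a dense reward for hovering near the goal, and make the boundary termination carry a penalty large enough to outweigh anything the agent could collect before leaving.

```python
import numpy as np

# Assumed, illustrative constants -- not from the library.
TARGET = np.array([0.0, 0.0, 1.0])   # hypothetical goal position
ARENA_RADIUS = 3.0                   # hypothetical arena half-width in x/y
BOUNDARY_PENALTY = -100.0            # large terminal penalty for leaving

def compute_reward(pos: np.ndarray) -> float:
    """Dense shaping: reward is positive only near TARGET, so the agent
    is paid for staying there rather than for passing through it."""
    dist = float(np.linalg.norm(pos - TARGET))
    return max(0.0, 2.0 - dist)

def compute_terminated(pos: np.ndarray) -> tuple[bool, float]:
    """Terminate only on a boundary violation, with a penalty that
    dominates any reward accumulated beforehand, so early termination
    is never a profitable 'reward hack'."""
    out_of_bounds = bool(np.any(np.abs(pos[:2]) > ARENA_RADIUS) or pos[2] < 0.0)
    if out_of_bounds:
        return True, BOUNDARY_PENALTY
    return False, 0.0
```

With this shaping, the return-maximizing behaviour is to reach the target and hover there until the time limit, because there is no positive terminal state to escape into.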
Hi: First of all, thank you for creating such a great work.
The action type I used is vel, and I trained with A2C. The drone flies in the right direction, but it keeps flying after reaching the goal. How can I make it stop? I set a reward of -1 on both borders.