instadeepai / Mava

🦁 A research-friendly codebase for fast experimentation of multi-agent reinforcement learning in JAX
Apache License 2.0
709 stars 83 forks source link

[BUG] Potential Recurrent IPPO Bug #1006

Closed EdanToledo closed 3 months ago

EdanToledo commented 7 months ago

Describe the bug

Its possible there is a slight error in the resetting of hidden states for IPPO. Instead of using the done flag to reset, it would be simpler and ensure correctness to use the timestep.first() flag to reset the hidden state. I've linked below a picture of a rware tiny-2ag run using the fix and using the current develop branch. Although both are increasing there is roughly a 100% or 2x performance at each interval point.

Below is the Actor Mean Return: Screenshot 2024-02-07 at 11 24 49

Below is the Evaluator Mean Return: Screenshot 2024-02-07 at 11 26 06

Although this could be variance, nothing was changed between the experiments besides the fix. I've pushed the branch that ran the orange line experiment and it is called fix/recurrent-ppo. This would be a worthwhile investigation.

RuanJohn commented 7 months ago

A link to the branch with this fix.

RuanJohn commented 7 months ago

An update on this, the suggested fix uses timestep.first() to reset the hidden states, but since no timestep will ever be a timestep.first() when using Jumanji's auto reset wrapper the hidden states were just never getting reset. Not ever resetting the hidden states does lead to far better performance for the recurrent systems though so this is a whole other question to consider in itself.

sash-a commented 3 months ago

Bug was fixed in #1076