pranz24 / pytorch-soft-actor-critic

PyTorch implementation of soft actor critic
MIT License

Training policy for more complex tasks, converges to sub-optimal solutions #45

Open rosa-wolf opened 9 months ago

rosa-wolf commented 9 months ago

I recently implemented a gym environment in which a robot should learn to push different boxes, conditioned on different skills, receiving only sparse rewards. I wanted to train the agent using the SAC implementation from this repository. I observed that for more complex problems the agent quickly converges to a non-optimal policy, where it gets no reward, or only a little reward in the case where I used reward shaping. I therefore used the exact same gym environment and trained the agent with the SAC implementation from Stable Baselines3, making sure that all hyperparameters matched those I used with this implementation. The following plot shows that the latter training performs much better. In the middle task, where the agent has to learn to push only one box from different initial positions, the agent trained with this implementation performs quite well. However, in the right task, where the agent has to push four boxes from different initial positions, it does not show any improvement.
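For anyone trying to reproduce this comparison, a minimal sketch of how one might diff the two configurations before training, to make sure the hyperparameters really do match. The parameter names below are illustrative examples, not taken verbatim from either codebase:

```python
def diff_configs(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose
    values differ, or that is present on only one side."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

# Hypothetical hyperparameter sets for the two SAC trainings.
this_repo = {"gamma": 0.99, "tau": 0.005, "lr": 3e-4,
             "batch_size": 256, "automatic_entropy_tuning": True}
sb3 = {"gamma": 0.99, "tau": 0.005, "lr": 3e-4,
       "batch_size": 256, "ent_coef": "auto"}

# Only the differently-named entropy settings show up in the diff,
# which is exactly the kind of mismatch that is easy to miss by eye.
print(diff_configs(this_repo, sb3))
```

Even when the numeric values agree, the two codebases name some options differently (e.g. entropy-coefficient handling), so a mechanical diff like this is safer than checking by hand.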

For other, less complex tasks, one of which is shown in the left plots, I also managed to successfully train an agent with this implementation. I observed that with this implementation the agent seems to explore much less than with the Stable Baselines3 implementation, which might be why the results are so poor on more complex tasks, where more exploratory behavior is necessary to find good states. Unfortunately, however, I was not able to find any specific part of the code that might contain an error.
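One way to check the exploration claim quantitatively is to log the policy's entropy during training in both implementations. For a diagonal Gaussian policy head (before any tanh squash), the differential entropy has a closed form, and SAC's automatic temperature tuning commonly targets -|A|. A minimal sketch, with function names of my own choosing:

```python
import math

def gaussian_entropy(log_std):
    """Differential entropy of a diagonal Gaussian policy:
    H = sum_i (0.5 * log(2*pi*e) + log_std_i).
    Higher values mean broader action distributions, i.e. more exploration."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e) + s for s in log_std)

def sac_target_entropy(action_dim):
    """Common heuristic target for SAC's automatic entropy tuning: -|A|."""
    return -float(action_dim)

# Example: a 4-dimensional action space with unit-std policy heads.
print(gaussian_entropy([0.0] * 4))  # entropy of N(mu, I) in 4 dims
print(sac_target_entropy(4))        # -4.0
```

If one implementation's logged entropy collapses toward (or below) the target much earlier in training, that would be concrete evidence for the reduced-exploration hypothesis, and would point at the temperature (alpha) update as a place to look.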

[Figure: "Comparison" — training curves for this implementation vs. Stable Baselines3 on the three tasks]

I am no longer looking for a solution, but I would be very interested if someone finds the reason why this implementation performs poorly compared to the one from Stable Baselines3.

This is more of a disclaimer that the implementation might not work for all tasks than a problem I need help with.