Updates before start_steps

unrealwill commented 5 years ago

Hello,

In sac.py,

        if t > start_steps:
            a = get_action(o)
        else:
            a = env.action_space.sample()

You use a random policy before start_steps, but you nevertheless start updating the model parameters immediately, using a small replay memory dataset. A more cautious approach would update the model parameters only once a sufficient dataset has been collected.

Currently we do start_steps model updates on a small dataset, which means we risk initially overfitting the parameters to this small dataset, and that may take a long time to recover from.

It is particularly insidious because with a slow network architecture you won't see a problem, but once you try a faster architecture you will overfit to the small dataset and take a long time to recover. It is also environment-dependent and may depend on the luck of the first few episodes.
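
A minimal sketch of the gating idea, as a toy loop rather than the actual sac.py code. The `update_after` parameter, the `run` function, and the placeholder actions are hypothetical names introduced for illustration; the real environment, `get_action`, and the gradient step are replaced by stand-ins.

```python
import random


def run(total_steps=2000, start_steps=500, update_after=500,
        batch_size=100, seed=0):
    """Toy loop: count updates and record the smallest buffer used for one.

    update_after is a hypothetical threshold (an assumption, not taken
    from sac.py): no update happens before that many steps have elapsed.
    """
    rng = random.Random(seed)
    replay_buffer = []
    n_updates = 0
    min_buffer_at_update = None

    for t in range(total_steps):
        if t > start_steps:
            a = 0.0  # stand-in for get_action(o)
        else:
            a = rng.uniform(-1.0, 1.0)  # stand-in for env.action_space.sample()
        replay_buffer.append((t, a))

        # Gate: skip updates until enough experience has been collected.
        if t >= update_after and len(replay_buffer) >= batch_size:
            batch = rng.sample(replay_buffer, batch_size)
            assert len(batch) == batch_size  # a real gradient step would go here
            n_updates += 1
            if min_buffer_at_update is None:
                min_buffer_at_update = len(replay_buffer)

    return n_updates, min_buffer_at_update
```

With `update_after=0` the loop updates as soon as one batch is available, which approximates the behaviour described above; with `update_after=start_steps` the first update sees a buffer of roughly `start_steps` transitions.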

jachiam commented 5 years ago

Can you identify an environment where this happens? So far this isn't a problem for the environments tested.

unrealwill commented 5 years ago

I can't say for sure what happens in your code; I haven't run it yet.

I'm currently fiddling with my own code, loosely inspired by yours, toggling extra terms and varying some parameters to get a feel for the algorithm.

I'm mostly playing with simple environments ('Pendulum-v0', 'LunarLanderContinuous-v2', 'BipedalWalkerHardcore-v2'), but it'll probably have more impact when you have short episodes (and a big batch_size).

In Pendulum-v0, I observed that it sometimes didn't converge with networks of a larger layer size (300), whereas it did when the layer size was 100. It was spinning it up very fast, always in the same direction. In BipedalWalker, I observed that it would stiffen its leg, with the action becoming saturated at the border of the action space; it would then "fall" and start exploring from the border of the action space.
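
One way to check for the saturation described above is to log how many action components sit at the bounds. This is a rough sketch: `saturation_fraction` and its `tol` parameter are hypothetical helpers, not part of sac.py, and it assumes every action component shares the same scalar bounds.

```python
def saturation_fraction(actions, low, high, tol=1e-3):
    """Return the fraction of action components within tol of a bound.

    actions: iterable of actions, each an iterable of floats.
    low, high: scalar bounds assumed to apply to every component.
    """
    total = 0
    saturated = 0
    for action in actions:
        for x in action:
            total += 1
            if x <= low + tol or x >= high - tol:
                saturated += 1
    return saturated / total if total else 0.0
```

A value that stays close to 1.0 over recent steps would suggest the policy is stuck at the border of the action space.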

Not doing updates before start_steps had some impact and helped mitigate those undesired behaviours.

My code is still buggy, so maybe yours can recover from it, but I figure this issue is quite orthogonal and makes sense, so you are probably affected too :)

openai / spinningup

Updates before start_steps #48