openai / spinningup

An educational resource to help anyone learn deep reinforcement learning.
https://spinningup.openai.com/
MIT License

Order of loss computation and gradient step seems wrong in vpg.py #211

Open gauravjain14 opened 4 years ago

gauravjain14 commented 4 years ago
    def update():
        data = buf.get()

        # Get loss and info values before update
        pi_l_old, pi_info_old = compute_loss_pi(data)
        pi_l_old = pi_l_old.item()
        v_l_old = compute_loss_v(data).item()

        # Train policy with a single step of gradient descent
        pi_optimizer.zero_grad()
        loss_pi, pi_info = compute_loss_pi(data)
        loss_pi.backward()
        mpi_avg_grads(ac.pi)    # average grads across MPI processes
        pi_optimizer.step()

Here the comment says to train the policy with a single step of gradient descent, and then the loss delta is calculated.

However, from the code, it seems that the backward pass and step() are performed after both the old and the "new" loss values have already been computed, so the logged delta between them would always be zero. Shouldn't the order be the following:

        # Train policy with a single step of gradient descent
        pi_optimizer.zero_grad()
        loss_pi, pi_info_old = compute_loss_pi(data)    # loss before the update (same value as pi_l_old)
        loss_pi.backward()                              # backprop
        mpi_avg_grads(ac.pi)                            # average grads across MPI processes (before the step)
        pi_optimizer.step()                             # perform gradient descent
        loss_pi, pi_info = compute_loss_pi(data)        # compute the new loss after the update

Or is there something wrong in my fundamental understanding?
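
To make the point concrete outside of spinningup, here is a minimal, self-contained PyTorch sketch (a toy quadratic loss, not the actual vpg.py code) showing that a delta taken from a loss evaluated before optimizer.step() is always zero, while re-evaluating the loss after the step gives the real change:

    import torch

    # Toy "policy": a single parameter w, with loss (w - 3)^2
    w = torch.nn.Parameter(torch.tensor(0.0))
    optimizer = torch.optim.SGD([w], lr=0.1)

    def compute_loss():
        return (w - 3.0) ** 2

    loss_old = compute_loss().item()        # loss before the update

    optimizer.zero_grad()
    loss = compute_loss()                   # evaluated BEFORE the step
    loss.backward()
    optimizer.step()

    delta_before_recompute = loss.item() - loss_old             # always 0.0
    delta_after_recompute = compute_loss().item() - loss_old    # the actual change

    print(delta_before_recompute, delta_after_recompute)        # e.g. 0.0, approx -3.24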

jachiam commented 4 years ago

I believe you are right that the calculation of the delta is bugged, and will always yield zero in the current setup. I'll try fixing it over the weekend.
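
For reference, one possible shape of such a fix (a sketch only, not necessarily the patch that was eventually applied; pi_l_new and delta_loss_pi are illustrative names) is to keep the current update order and simply re-evaluate the loss after the optimizer step before logging the delta:

        pi_optimizer.zero_grad()
        loss_pi, pi_info = compute_loss_pi(data)
        loss_pi.backward()
        mpi_avg_grads(ac.pi)    # average grads across MPI processes
        pi_optimizer.step()

        # Re-evaluate the policy loss after the parameter update (no gradients needed)
        with torch.no_grad():
            pi_l_new, _ = compute_loss_pi(data)
        delta_loss_pi = pi_l_new.item() - pi_l_old    # now reflects the actual change from the step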

gauravjain14 commented 4 years ago

Thanks!

I am not sure if you want a PR for this. I could do that as well if the order I mentioned seems right to you.

jachiam commented 4 years ago

No need, I'll get it figured out.