def train_policy_parallel(env, num_episodes=1000, num_simulations=4):
    """Parallel policy training function."""
    policy = Policy(env)
    simulations = [SimulationActor.remote() for _ in range(num_simulations)]
    policy_ref = ray.put(policy)
    for _ in range(num_episodes):
        experiences = [sim.rollout.remote(policy_ref) for sim in simulations]
        while len(experiences) > 0:
            finished, experiences = ray.wait(experiences)
            for xp in ray.get(finished):
                update_policy(policy, xp)
    return policy
In the train_policy_parallel function from ch_03 (https://github.com/maxpumperla/learning_ray/blob/main/notebooks/ch_03_core_app.ipynb), if I'm not mistaken, each episode appears to use the initially created policy rather than the updated one: policy_ref is put into the object store once, before the training loop, so every rollout keeps receiving that initial snapshot even after update_policy has changed the local policy.
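If that reading is right, one possible adjustment is to re-put the policy into the object store at the start of each episode, so that episode's rollouts see the latest state. Below is a minimal sketch of that idea; it assumes the Policy, SimulationActor, update_policy, and ray definitions from the ch_03 notebook, and it is not necessarily how the book intends the example to work:

import ray

def train_policy_parallel(env, num_episodes=1000, num_simulations=4):
    """Parallel policy training, re-sharing the updated policy each episode."""
    policy = Policy(env)
    simulations = [SimulationActor.remote() for _ in range(num_simulations)]
    for _ in range(num_episodes):
        # Snapshot the current (possibly updated) policy into the object store,
        # so this episode's rollouts use the latest state rather than the
        # reference created once before the loop.
        policy_ref = ray.put(policy)
        experiences = [sim.rollout.remote(policy_ref) for sim in simulations]
        while len(experiences) > 0:
            finished, experiences = ray.wait(experiences)
            for xp in ray.get(finished):
                update_policy(policy, xp)
    return policy

The only extra cost is one ray.put per episode, which should be small next to the rollouts themselves; an alternative would be to have the simulation actors hold the policy and push updates to them, but that changes the structure of the example more.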