def train_policy_parallel(env, num_episodes=1000, num_simulations=4):
    """Parallel policy training function."""
    policy = Policy(env)
    simulations = [SimulationActor.remote() for _ in range(num_simulations)]
    policy_ref = ray.put(policy)
    for _ in range(num_episodes):
        experiences = [sim.rollout.remote(policy_ref) for sim in simulations]
        while len(experiences) > 0:
            finished, experiences = ray.wait(experiences)
            for xp in ray.get(finished):
                update_policy(policy, xp)
    return policy
In the train_policy_parallel function from ch_03 (https://github.com/maxpumperla/learning_ray/blob/main/notebooks/ch_03_core_app.ipynb), if I'm not mistaken, each episode appears to use the initially created policy rather than the updated one: policy_ref is put into the object store once, before the training loop, so every rollout keeps receiving that initial snapshot even after update_policy has changed the local policy.
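If that reading is right, one possible adjustment is to re-put the policy into the object store at the start of each episode, so that episode's rollouts see the latest state. Below is a minimal sketch of that idea; it assumes the Policy, SimulationActor, update_policy, and ray definitions from the ch_03 notebook, and it is not necessarily how the book intends the example to work:

import ray

def train_policy_parallel(env, num_episodes=1000, num_simulations=4):
    """Parallel policy training, re-sharing the updated policy each episode."""
    policy = Policy(env)
    simulations = [SimulationActor.remote() for _ in range(num_simulations)]
    for _ in range(num_episodes):
        # Snapshot the current (possibly updated) policy into the object store,
        # so this episode's rollouts use the latest state rather than the
        # reference created once before the loop.
        policy_ref = ray.put(policy)
        experiences = [sim.rollout.remote(policy_ref) for sim in simulations]
        while len(experiences) > 0:
            finished, experiences = ray.wait(experiences)
            for xp in ray.get(finished):
                update_policy(policy, xp)
    return policy

The only extra cost is one ray.put per episode, which should be small next to the rollouts themselves; an alternative would be to have the simulation actors hold the policy and push updates to them, but that changes the structure of the example more.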