A question about "building a distributed Ray Training" section in Chapter 3

sagebei commented 2 years ago

Hi, thank you for making this book available before publishing. It is a treasure for learning ray. I have a doubt about a function in Chapter 3 at section "building a distributed Ray Training" as copied below. This function synchronously updates the policy, which is in essence a Q table, with 4 tasks each of which has an experience list. I wonder if two tasks are updating the same Q value, would this cause a conflict?

def update_policy_task(policy_ref, experiences_list): """Remote Ray task for updating a policy with experiences in parallel.""" [update_policy(policy_ref, ray.get(xp)) for xp in experiences_list] return policy_ref

If the policy is a neural network, do we need to lock the weight for each update? Because the fact that each task updates the weights independently might lead to the problem that one task updates the weights on the ones that have been updated by another task, not the weights the gradients were calculated on. This should have already been dealt with in the Ray Train, but I do not know how it is handled. This problem has been haunting in my mind for a long while. Could you please share your thoughts about it? Thank you very much. I am looking forward to your reply.

maxpumperla commented 2 years ago

@sagebei sorry for the late reply, I sometimes get lost in GitHub notifications. You're right that you can in principle have race conditions there. and yes, in practice you will have to take care of that (there are different ways of ensuring that, acquiring a lock is one of them). the worst thing that can happen above is that some updates simply get lost.

sagebei commented 2 years ago

@maxpumperla Thank you very much for your reply! I closed the issue because I found there is no problem in the code. My understanding (which might not be correct) is that although the update_policy_task is invoked by num_episodes times in the for-loop and runs in parallel, the update_policy_tasks are "chained" together. as the policy_ref in the parameter comes from the policy_ref returned from the prior function call. Inside the update_policy_task, Ray actually does two extra things under the hood for us, which are ray.get() and ray.put() as shown below.

@ray.remote def update_policy_task(policy_ref, experiences_list): # policy = ray.get(policy_ref) policy_ref: ObjectRef(7df446e0be2f9350ffffffffffffffffffffffff0100000001000000) [update_policy(policy_ref, ray.get(xp)) for xp in experiences_list] # policy_ref = ray.put(policy) policy_ref: ObjectRef(80f450872c2ccadaffffffffffffffffffffffff0100000001000000) return policy_ref

As ray.get is a waiting function, the function must be wait until the execution of the prior function gets finished. I have been fiddling with code for a while, and still cannot make sure that I understand the code correctly. Please correct me if my understanding is wrong. Much appreciated!

maxpumperla commented 1 year ago

@sagebei apologies for the long turnaround. we've now updated the example (https://github.com/maxpumperla/learning_ray/blob/main/notebooks/ch_03_core_app.ipynb) to only do rollouts in parallel, not the actual update step, as this was both confusing (e.g. the race conditions you mentioned) and unnecessary. also note that this pattern (distributed rollouts, central updates to a policy on the driver) is how RLlib currently does things as well.

Hope that helps!

maxpumperla / learning_ray

A question about "building a distributed Ray Training" section in Chapter 3 #2