kfstorm opened this issue 3 years ago
This may relate to #9551 and https://github.com/ray-project/ray/issues/11715#issuecomment-729329913
Oh, I've also seen users who hit this issue.
Will you push the fix?
Nope. I don't know how to fix it.
Actually, this might just make sense. We have 4 unique random bytes, so there are 2^32 = 4,294,967,296 possible values. Using the birthday paradox formula, roughly 29,000–30,000 IDs are enough to reach about a 10% probability of a collision, and with 100,000 actors the collision probability is already well over 50% and rapidly approaches 100% as the count grows.
In [10]: sqrt((9999/10000)*2*4294967295)
Out[10]: 92677.26580203476
The reference is here in case I made a mistake in the calculation.
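As a sanity check, here is a minimal sketch of my own (not Ray code) that evaluates the usual birthday-paradox approximation P(collision) ≈ 1 - exp(-n^2 / (2m)) for a 4-byte ID space, using the counts discussed above:

import math

ID_SPACE = 2 ** 32  # 4 random bytes

def collision_probability(n, m=ID_SPACE):
    # Birthday-paradox approximation: P(collision) ~= 1 - exp(-n^2 / (2m))
    return 1 - math.exp(-n * n / (2 * m))

print(collision_probability(30_000))   # ~0.10 -> about a 10% chance of a duplicate
print(collision_probability(100_000))  # ~0.69 -> a duplicate is more likely than not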
@rkooo567 I agree with your analysis. I think we need to either come up with a better ID generation algorithm (I'm not sure if that's possible) or add some kind of communication between nodes/workers to reduce the chance of conflicts. Although 100,000 actors seems like a lot, we still need to be cautious, because a small group of users may have already been suffering from this for a while. And in Ant, we have some applications which create and destroy actors periodically, so it's only a matter of time.
Based on https://github.com/ray-project/ray/blob/master/src/ray/design_docs/id_specification.md, every job would have different actor IDs though. So each job can create around ~3,000 actors with a < 0.1% probability of a duplicated ID. How common is it for a single job to have more than 3,000 actors? (If that's common, we should definitely figure out a way to improve this, I guess.)
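To double-check the ~3,000 figure, here is a tiny sketch of my own (not from the design doc) that inverts the same birthday approximation used above:

import math

# n ~= sqrt(2m * ln(1 / (1 - p))) with m = 2**32 and p = 0.001
n = math.sqrt(2 * 2**32 * math.log(1 / (1 - 0.001)))
print(n)  # ~2932, i.e. roughly 3,000 actors per job for a 0.1% collision chance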
We have been suffering from this issue for months. It is costly, and we are working on an external patch that automates restarts to avoid this issue. The user experience is that Ray is unstable, unfortunately.
We run an evolutionary algorithm that does fitness calculations as Ray actors. These actors are distributed across 100+ 2-CPU EC2 instances, with 200+ actors created multiple times per second. We run this workload for days at a time, so running 50,000,000+ actors on one cluster for one workload is very common for us.
Is there a known workaround to not create a new actor ID for each new actor?
Just to confirm, your workload requires creating 50,000,000+ workers within a single driver? Also, do you check for failures occasionally? Do you mind sharing your error messages?
Is there a known workaround to not create a new actor ID for each new actor?
I think we just create a random actor id for each new actor.
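One possible mitigation on the application side, offered only as a rough sketch (ray.util.ActorPool is real Ray API, but the Evaluator class and the make_batches helper below are hypothetical stand-ins): create the actors once and reuse them across generations, so a new actor ID is generated once per actor rather than once per generation.

import ray
from ray.util import ActorPool

@ray.remote
class Evaluator:  # hypothetical evaluation actor
    def evaluate(self, batch):
        # Placeholder fitness computation; replace with the real evaluation function.
        return [sum(individual) for individual in batch]

ray.init()

# Create the pool once; actor IDs are only generated here.
pool = ActorPool([Evaluator.remote() for _ in range(100)])

for generation in range(100_000):
    batches = make_batches(generation)  # hypothetical helper producing work batches
    fitness = list(pool.map_unordered(
        lambda actor, batch: actor.evaluate.remote(batch), batches))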
Here is some very simplified pseudocode of the workload we run. Each generation loop takes 2-3 seconds for a large workload, and we generate about 5 actors per worker node. The worker nodes are reused, but the actors are recreated in each generation loop.
def run_evolution(func):
    for generation in range(100000):
        # Update population
        population = select_population(last_generation_population, number=5000)
        population = population.mutate_and_mate()

        n_workers = 100  # number of worker nodes currently available in reality
        batches = split_population_into_batches(population, n_batches=500)
        id_for_reorder = range(len(batches))

        # Actors are recreated in every generation loop (map_function is the evaluation actor class)
        actor_pool = ActorPool([map_function.remote() for _ in range(n_workers)])
        unordered_results = list(
            actor_pool.map_unordered(
                lambda actor, input_tuple: actor.ray_remote_eval_batch.remote(func, input_tuple),
                zip(batches, id_for_reorder)))

        # Restore the original batch order using the batch id attached to each result
        ordered_fitness_results = [batch for batch_id in id_for_reorder
                                   for batch in unordered_results
                                   if batch_id == batch[0][1]]
        population.fitness = [item[0] for sublist in ordered_fitness_results for item in sublist]
        hall_of_fame.update(population)
    return hall_of_fame[0]
This is the error we get at random points with many hours in between:
@rkooo567 We have a running test version of our code here, stripped down as much as I dared without removing the source of the error. I'm not sure it still causes the error at this level actually, since it has not been run for long enough yet, but it should give you an idea of the workload in more detail:
https://gitlab.com/kvsoldal/ray-deap-evolution-crash-test/-/tree/master
I see.
There are two comments here:
@rkooo567 If you look at the stack trace, it crashed when creating a new actor. So I believe the root cause is still duplicate actor IDs.
This issue will be eased by https://github.com/ray-project/ray/pull/12894
What is the problem?
The string generated by GenerateUniqueBytes is not really unique. https://github.com/ray-project/ray/blob/0178d6318ead643fff6d9ed873c88f71fdac52e7/src/ray/common/id.cc#L40-L53

Ray version and other system information (Python version, TensorFlow version, OS): 1.0.1
Reproduction (REQUIRED)
Apply the patch below and run bazel run //:id_test, and you will see the error:
If we cannot run your script, we cannot fix your issue.
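The actual patch and the test output are not included above. As a rough stand-in illustration only (my own sketch, not the original reproduction), the collision behaviour of 4-byte random identifiers can also be observed directly from Python with the standard library:

import os

seen = set()
for i in range(100_000):
    unique_bytes = os.urandom(4)  # stand-in for 4 random bytes, as in the GenerateUniqueBytes discussion
    if unique_bytes in seen:
        print(f"duplicate after {i} draws: {unique_bytes.hex()}")
        break
    seen.add(unique_bytes)
else:
    print("no duplicate in 100,000 draws (happens in only ~31% of runs)")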