ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Duplicated IDs are generated #12197

Open kfstorm opened 3 years ago

kfstorm commented 3 years ago

What is the problem?

The string generated by GenerateUniqueBytes is not really unique. https://github.com/ray-project/ray/blob/0178d6318ead643fff6d9ed873c88f71fdac52e7/src/ray/common/id.cc#L40-L53

Ray version and other system information (Python version, TensorFlow version, OS): 1.0.1

Reproduction (REQUIRED)

Apply the patch below and run bazel run //:id_test:

diff --git a/src/ray/common/id_test.cc b/src/ray/common/id_test.cc
index 926e6fbd11..f1274d92a5 100644
--- a/src/ray/common/id_test.cc
+++ b/src/ray/common/id_test.cc
@@ -51,6 +51,15 @@ TEST(ActorIDTest, TestActorID) {
     const ActorID actor_id = ActorID::Of(kDefaultJobId, kDefaultDriverTaskId, 1);
     ASSERT_EQ(kDefaultJobId, actor_id.JobId());
   }
+
+  {
+    // test no duplicated ID
+    std::unordered_set<ActorID> ids;
+    for (size_t i = 0; i < 1000000; i++) {
+        auto id = ActorID::Of(kDefaultJobId, kDefaultDriverTaskId, i);
+        RAY_CHECK(ids.insert(id).second) << "Duplicated ID generated: " << id;
+    }
+  }
 }

 TEST(TaskIDTest, TestTaskID) {

You will see the error:

[2020-11-20 05:37:07,442 C 104416 104416] id_test.cc:60:  Check failed: ids.insert(id).second Duplicated ID generated: 584a7e60c7000000

If we cannot run your script, we cannot fix your issue.

kfstorm commented 3 years ago

This may be related to #9551 and https://github.com/ray-project/ray/issues/11715#issuecomment-729329913.

rkooo567 commented 3 years ago

Oh, I also saw users who had this issue.

rkooo567 commented 3 years ago

Will you push the fix?

kfstorm commented 3 years ago

Nope. I don't know how to fix it.

rkooo567 commented 3 years ago

Actually, this might just be expected. We have 4 unique bytes, which gives 2^32 = 4294967296 possible values. Using the birthday-paradox formula, it takes only about 29000 IDs to reach a 10% probability of a conflict in a 4-byte space. With 100000 actors, the probability of at least one conflict is already around 70%, and it rapidly approaches 100% beyond that.

rkooo567 commented 3 years ago

[Screenshot: birthday-paradox calculation in IPython]
In [10]: sqrt((9999/10000)*2*4294967295)                                                                                                          
Out[10]: 92677.26580203476

Reference is here, just in case I made a mistake in the calculation.
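
For reference, here is a quick Python sketch of the same birthday-paradox estimate (illustrative only, not part of Ray; the exact numbers depend slightly on which approximation is used, but the conclusion is the same):

import math

ID_SPACE = 2 ** 32  # 4 unique bytes of ID entropy

def collision_probability(n, d=ID_SPACE):
    # Approximate probability that n IDs drawn uniformly from d values contain a duplicate.
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * d))

def ids_for_probability(p, d=ID_SPACE):
    # Approximate number of IDs needed to reach collision probability p.
    return math.sqrt(2.0 * d * math.log(1.0 / (1.0 - p)))

print(ids_for_probability(0.10))          # ~30000 IDs for a 10% chance of a duplicate
print(collision_probability(100_000))     # ~0.69
print(collision_probability(50_000_000))  # effectively 1.0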

kfstorm commented 3 years ago

@rkooo567 I agree with your analysis. I think we need to either come out with a better ID generation algorithm (I'm not sure if it's possible) or add some kind of communication between nodes/workers to reduce the chance of conflict. Although 100000 seems pretty large as the number of actors, we still need to be cautious because a small group of users may have already suffered from this for a while. And in Ant, we have some applications which create and destroy actors periodically. So it's only a matter of time.

rkooo567 commented 3 years ago

Based on this https://github.com/ray-project/ray/blob/master/src/ray/design_docs/id_specification.md, actor IDs are scoped per job, so IDs from different jobs won't conflict with each other. That means each job can create roughly 3000 actors while keeping the probability of a duplicated ID below 0.1%. How common is it for a single job to have more than 3000 actors? (If it's common, we should definitely figure out a way to improve this, I guess.)
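
As a rough sanity check on the ~3000 figure (same birthday-paradox approximation as above, nothing Ray-specific):

import math

# Number of 32-bit IDs before the duplicate probability reaches 0.1%.
print(math.sqrt(2 * 2**32 * math.log(1 / (1 - 0.001))))  # ~2930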

Dadle commented 3 years ago

We have been suffering from this issue for months. It is costly, and we are working on an external patch that automates restarts to avoid this issue. Unfortunately, the resulting user experience is that Ray is unstable.

We run an evolutionary algorithm that does fitness calculations as Ray actors. These actors are distributed across 100+ 2-CPU EC2 instances, with 200+ actors created multiple times per second, and we run this workload for days at a time. So for us, running 50,000,000+ actors on one cluster for one workload is very common.

Is there a known workaround to avoid creating a new actor ID for each new actor?

rkooo567 commented 3 years ago

Just to confirm, your workload requires creating 50,000,000+ workers within a single driver?

Also, do you see failures occasionally? Do you mind sharing your error messages?

Is there a known workaround to avoid creating a new actor ID for each new actor?

I think we just generate a random actor ID for each new actor.
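
One application-side mitigation (a hypothetical sketch, assuming the evaluation actors can be kept alive between generations; EvalActor and the dummy population are illustrative names, not from Ray or this workload) is to create the actor pool once and reuse it across generations, so far fewer actor IDs are generated over the lifetime of the job:

import ray
from ray.util import ActorPool

@ray.remote
class EvalActor:
    # Long-lived evaluation actor: created once and reused for every generation.
    def eval_batch(self, func, batch):
        # Placeholder fitness evaluation for one batch of individuals.
        return [func(individual) for individual in batch]

def run_evolution_with_reused_actors(func, n_workers=4, n_generations=10, n_batches=20):
    # Actors (and therefore actor IDs) are created once, outside the generation loop.
    pool = ActorPool([EvalActor.remote() for _ in range(n_workers)])
    population = [float(i) for i in range(1000)]  # dummy population
    for generation in range(n_generations):
        # Split the population into batches and evaluate them on the shared pool.
        batches = [population[i::n_batches] for i in range(n_batches)]
        fitness_batches = list(pool.map_unordered(
            lambda actor, batch: actor.eval_batch.remote(func, batch), batches))
        # ... select / mutate the population based on the fitness results ...
    return population

if __name__ == "__main__":
    ray.init()
    run_evolution_with_reused_actors(lambda x: -abs(x - 42.0))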

Dadle commented 3 years ago

Here is some very simplified pseudocode of the workload we run. Each generation loop takes 2-3 seconds for a large workload, and we create ~5 actors per worker node. The worker nodes are reused, but the actors are recreated in each generation loop.

def run_evolution(func):
    generation = 0
    while generation <= 100000:
        # Update population
        population = select_population(last_generation_population, number=5000)
        population = population.mutate_and_mate()

        n_workers = 100   # number of worker nodes currently available in reality
        n_batches = 500
        batches = split_into_batches(population, n_batches)  # pseudocode helper
        id_for_reorder = range(len(batches))

        # Actors are recreated every generation loop.
        actor_pool = ActorPool([map_function.remote() for _ in range(n_workers)])
        unordered_results = list(
            actor_pool.map_unordered(
                lambda actor, input_tuple: actor.ray_remote_eval_batch.remote(func, input_tuple),
                zip(batches, id_for_reorder)))

        # Re-order the results by batch id.
        ordered_fitness_results = [
            batch for batch_id in id_for_reorder
            for batch in unordered_results if batch_id == batch[0][1]]
        population.fitness = [item[0] for sublist in ordered_fitness_results for item in sublist]

        hall_of_fame.update(population)
        generation += 1
    return hall_of_fame[0]

This is the error we get at random points, with many hours in between: [screenshot of the error message]

Dadle commented 3 years ago

@rkooo567 We have a running test version of our code here, stripped down as much as I dared without removing the source of the error. I'm not sure it still triggers the error at this level, actually, since it has not been run for long enough yet, but it should give you an idea of the workload in more detail:

https://gitlab.com/kvsoldal/ray-deap-evolution-crash-test/-/tree/master

rkooo567 commented 3 years ago

I see.

There are 2 comments here:

kfstorm commented 3 years ago

@rkooo567 If you look at the stack trace, it crashed when creating a new actor. So I believe the root cause is still duplicate actor IDs.

kfstorm commented 3 years ago

This issue will be mitigated by https://github.com/ray-project/ray/pull/12894.