ray-project / ray-legacy

An experimental distributed execution engine
BSD 3-Clause "New" or "Revised" License

Error when starting too many workers. #431

Open robertnishihara opened 8 years ago

robertnishihara commented 8 years ago

If I start 150 workers on my laptop and run the following, Ray crashes.

import ray

ray.init(start_ray_local=True, num_workers=150)

@ray.remote
def f():
  pass

It fails with

E0915 18:03:15.745690000 123145305522176 wakeup_fd_pipe.c:53] pipe creation failed (24): Too many open files

mehrdadn commented 8 years ago

Options:

  1. setrlimit(RLIMIT_NOFILE)
  2. Use e.g. sockets + ports instead of pipes and connect them dynamically.

Though I don't think you should have that many workers on a single machine...
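The first option can be sketched with Python's standard resource module (POSIX only; the target value below is illustrative, not a figure from Ray):

```python
import resource

# Query the current soft and hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raise the soft limit; an unprivileged process can only go up to the
# hard limit, so clamp to it. 4096 is an arbitrary illustrative target.
desired = min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (desired, hard))
```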

robertnishihara commented 8 years ago

Thanks! There's no need to have that many workers on my laptop, but on a machine with 100+ cores it makes sense. I suppose we either want to raise the limit or at least figure out the limit and then throw a Python exception if the user tries to start too many workers.
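The "figure out the limit and throw a Python exception" idea might look roughly like the following sketch. The per-worker descriptor count is an assumption for illustration, not a number from Ray:

```python
import resource

# Assumed number of file descriptors each worker consumes (pipes, sockets,
# log files); this is a placeholder, not a figure measured from Ray.
FDS_PER_WORKER = 10

def check_worker_limit(num_workers):
    """Raise a Python exception up front instead of crashing mid-start."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = num_workers * FDS_PER_WORKER
    if needed > soft:
        raise RuntimeError(
            "Starting {} workers needs ~{} file descriptors, but the "
            "current limit is {}. Raise it (e.g. ulimit -n) or start "
            "fewer workers.".format(num_workers, needed, soft))
```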

mehrdadn commented 8 years ago

Related note: you need 1 worker process per dependency depth in the computation graph, right?

robertnishihara commented 8 years ago

Not quite. The problem is when a remote function calls ray.get, because it then blocks while waiting for another remote function to execute. So it's possible for every worker to be executing a remote function that is blocked in a get, in which case the program will hang. As an upper bound, we could need a number of workers equal to one plus the number of tasks that call get and could be executing at the same time.
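The blocking scenario can be illustrated outside Ray with a fixed-size thread pool standing in for the worker pool (the names here are illustrative): if every worker is occupied by a task that itself waits on a subtask, the subtask has nowhere to run.

```python
from concurrent.futures import ThreadPoolExecutor

def inner():
    return 42

def outer(pool):
    # Like a remote function calling ray.get: submit a subtask and block
    # on its result while still occupying a worker.
    return pool.submit(inner).result()

# With 2 workers, inner() runs on the free worker and outer() completes.
# With max_workers=1, outer() would occupy the only worker while waiting
# for inner(), and the program would hang forever.
pool = ThreadPoolExecutor(max_workers=2)
result = pool.submit(outer, pool).result()
pool.shutdown()
```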

mehrdadn commented 8 years ago

Yeah, ok. IMO this should not be a constraint at all -- you should be able to run any program to completion with just 1 worker, albeit more slowly. There's at least one way (maybe more) of doing this, but it can require a redesign of the programming model, so you may want to look into that sooner rather than later.

robertnishihara commented 8 years ago

How would you do it?

mehrdadn commented 8 years ago

Well personally I would do it using the completion-based notification model I'd proposed earlier, which prevents blocking control flow altogether. Another possibility is fibers (cooperative multithreading), so that wait() actually suspends the current fiber and switches to a new one to service new requests without creating a new OS process/thread, but I'm a bit hesitant about this one.
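The fiber idea can be approximated in modern Python with coroutines: await suspends the current task rather than blocking the OS thread, so a single worker can service new requests while results are pending. A minimal sketch (asyncio here is a stand-in for a fiber library; none of this is Ray API):

```python
import asyncio

async def inner():
    return 42

async def outer():
    # await suspends this coroutine instead of blocking the thread, so
    # the single event loop can run other tasks in the meantime.
    return await inner()

async def main():
    # Many "tasks" all run to completion on one OS thread.
    return await asyncio.gather(*(outer() for _ in range(100)))

results = asyncio.run(main())
```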

ptyshevs commented 4 years ago

@mehrdadn any updates on fibers? Is there any option currently to prevent blocking control flow during get?

mehrdadn commented 4 years ago

Uh I'm not sure unfortunately, it's been 4 years. :\