Open dwiel opened 6 years ago
This needs either a fix in execnet, or work on supporting multiprocessing/mitogen as a backend.
What do you think about the work involved in supporting alternative backends? Everything seems fairly tightly coupled to execnet right now, though perhaps not as much as I think.
I didn't even start the initial analysis, but I did decide to stop working on execnet myself.
any progress on this?
nope
This seems like it would be amenable to a concurrent.futures thread pool for the setup, assuming the underlying setup_node is thread safe?
There's currently no analysis on that.
Based on a brief look, however, I would guess that it is absolutely not thread safe.
Yeah, that's the problem with the current method. A more fundamental change would be required to make it thread safe.
That problem makes xdist quite inconvenient to use. I often end up with xdist being slower on a machine with more cores than running the tests serially without parallelism. Any workarounds?
No, there's currently no way
Hey gang - is there any planned work around this? It would be amazing if it could yield each worker as soon as it's set up.
This needs some work in execnet, which is currently bus-factored on me, and I'm on paternity leave.
Ah, that's very fair. I'm assuming this isn't the kind of thing someone can pick up quickly?
It's probably possible, but it needs some gut digging; a hack using a thread pool may be enough.
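A thread-pool hack along those lines might look like the sketch below. Note that `setup_node` here is a hypothetical stand-in for xdist's real per-node setup (which, as discussed above, hasn't been shown to be thread safe), with a sleep simulating gateway startup latency:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def setup_node(spec):
    """Hypothetical stand-in for xdist's per-worker setup.

    The real setup would bootstrap an execnet gateway; the sleep
    simulates that handshake latency. Assumed thread safe here,
    which is exactly the unverified assumption in this thread.
    """
    time.sleep(0.1)
    return f"worker-{spec}"

specs = range(8)

# Sequential setup: total time grows linearly with worker count.
start = time.monotonic()
workers_seq = [setup_node(s) for s in specs]
seq_elapsed = time.monotonic() - start

# Thread-pool setup: the slow handshakes overlap, so total time
# stays close to the cost of a single setup.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(specs)) as pool:
    workers_par = list(pool.map(setup_node, specs))
par_elapsed = time.monotonic() - start

print(f"sequential: {seq_elapsed:.2f}s, threaded: {par_elapsed:.2f}s")
```

With 8 workers at ~0.1 s of setup latency each, the sequential loop takes roughly 0.8 s while the pooled version stays near 0.1 s; the gap widens with core count and network latency.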
It seems that currently, initializing workers is blocking and done sequentially:
This means that starting a large number of workers, even on a single machine with many CPU cores, takes a long time. For example, it takes about 45 seconds to start one worker per CPU core on my machine with 88 cores. The effect is even more extreme when workers are started on remote machines, with network latency added on top.
With the advent of a larger number of cores per machine and more frequent access to large clusters of machines, it would be nice to quickly horizontally scale to a large number of workers, even in cases where this would be the difference between 5 minutes on 8 workers versus 8 seconds on 300 workers. Obviously it isn't quite that simple, but there is clearly room theoretically for improvement.
As you can see from the code posted above, I've tried using multiprocessing to parallelize the setup of new nodes; however, the execnet objects used aren't picklable, so this naïve solution does not work.
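For illustration, the multiprocessing failure mode can be reproduced with any object that wraps a live OS resource, since multiprocessing pickles everything it sends to child processes. `Gateway` below is a toy stand-in for an execnet gateway, not the real class:

```python
import pickle
import threading

class Gateway:
    """Toy stand-in for an execnet gateway.

    A real gateway holds live resources (sockets, receiver threads,
    locks); a bare threading.Lock is enough to make pickling fail.
    """
    def __init__(self):
        self._lock = threading.Lock()

gw = Gateway()
try:
    pickle.dumps(gw)
except TypeError as exc:
    print("cannot pickle:", exc)
```

This is why handing the already-constructed node objects to a multiprocessing pool fails, regardless of how the pool itself is configured.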
I've also spent a little bit of time investigating the use of something like ray for rapid distribution across an admittedly homogeneous cluster, but I ran into trouble where `Function` objects were not picklable even by dill and cloudpickle.
Has anyone looked into how else this problem could be solved? Perhaps my identification of the problem is also incorrect. Are there other critical factors preventing the use of a large number of workers?