rapidsai / ucx-py

Python bindings for UCX
https://ucx-py.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

UCPContext: Separate UCX-py configurations for different libraries running on same process #437

Open cjnolet opened 4 years ago

cjnolet commented 4 years ago

Currently, UCX-py seems to provide a single global context which has to be shared by all the libraries that use ucp in the Python layer. This makes it hard when, for example, one library wants to use a slightly different UCX configuration, or wants to call reset independently of the others.

It also makes failure handling hard: if one library hits a failure and wants to refresh UCX-py, the other library will be affected by the same reset.

Further, only one library is allowed to call ucp.init, and if another library calls ucp.reset then all other libraries will be affected. It would be nice if UCX-py could provide something like a UCPContext, returned by ucp.init, with the ucp session automatically finalized when the context goes out of scope (or when some finalize method is invoked on the context).

I got the impression from the OpenUCX UCP C library that multiple ucp_context_h handles can be created per process, so I think this should be possible.
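
To make the request concrete, here is a purely hypothetical sketch of what such an API could look like. None of this exists in UCX-py today: the `options` argument, the returned context object, and its `create_listener`/`create_endpoint`/`finalize` methods are illustrative only.

```python
import ucp

# Hypothetical: ucp.init() returns a per-library context instead of
# mutating a single process-global one.
ctx_a = ucp.init(options={"TLS": "tcp,cuda_copy,cuda_ipc"})  # library A
ctx_b = ucp.init(options={"TLS": "tcp"})                     # library B

# Each library would own its endpoints and listeners through its context,
# e.g. ctx_a.create_listener(...) or await ctx_b.create_endpoint(host, port).

# Tearing down one context would not affect the other, and a context could
# also be finalized automatically when it goes out of scope.
ctx_a.finalize()
```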

pentschev commented 4 years ago

I don't know whether it's indeed possible to have multiple UCP contexts within the same process; you're right that the documentation is not absolutely clear on whether this is allowed. Regardless of whether it is possible, could you describe the different configurations you would like to have within the same process and why you need them?

cjnolet commented 4 years ago

@pentschev

I should clarify that my use of the term "configuration" above refers to the set of endpoints and listeners initialized by each library, not to configuration properties, though I believe one side effect of enabling separate contexts is that even the configuration properties could be separated if needed.

This need is mostly related to UCX-py being used directly within cuML when a Dask cluster has also been started with protocol='ucx'. With only a single global context, it's hard to know who is responsible for calling the lifecycle functions like ucp.init() and ucp.reset(). For example, if a Dask cluster is not started with protocol='ucx', is cuML now responsible for initializing and finalizing the ucp session on the process? How do the properties differ between Dask and cuML? Is either going to initialize ucp with different configuration properties internally? This last question is mostly about Dask, because cuML just relies on the environment variables and default settings at this time.

More importantly, since ucp.reset() will cancel and close all connected endpoints and listeners, it's hard to know when it's safe to call it when Dask and cuML each have their own set of endpoints and listeners. All of cuML's distributed algorithms that require point-to-point communication use UCX whether or not the underlying Dask cluster is using it. Should we be checking for this on the cuML side? If cuML is fully reliant on Dask to call ucp.init, can users expect a slightly different set of default configuration properties with and without protocol='ucx' in Dask?

(Many of these questions demonstrate my ignorance of how exactly UCX-py is being used in Dask, but they illustrate the concerns that arise from having every library share a single context.)

pentschev commented 4 years ago

@cjnolet you have valid concerns, and I think we should be clearer about how things are handled.

First of all, we are currently assuming that Dask will manage the UCP context: when you initialize Dask, the comm module is initialized very early, AFAIK before any other libraries get initialized. Would it be a solution for cuML if you could just check whether there's already a UCP context created and, in that case, not attempt to call ucp.init/ucp.reset?
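
A rough sketch of that guard, written from the cuML side. One assumption here that I haven't verified against a specific UCX-py version: calling ucp.init() a second time in the same process raises an exception rather than silently re-initializing.

```python
import ucp

_owns_ucx = False  # does this library own the process-global UCP context?

def ensure_ucx_initialized():
    """Initialize UCX-py only if nobody else (e.g. Dask) has already done so."""
    global _owns_ucx
    try:
        ucp.init()
        _owns_ucx = True   # we created the context, so we may reset it later
    except Exception:
        _owns_ucx = False  # Dask (or another library) already owns the context

def maybe_reset_ucx():
    """Reset the context only if we were the ones who created it."""
    if _owns_ucx:
        ucp.reset()
```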

The creation of multiple contexts is potentially dangerous. To cite one example of why that's the case, assume you're enabling NVLink and relying on RMM. In that case, RMM creates a single allocation, which then needs to be mapped with cuIpcOpenMemHandle, and that mapping is only allowed once per process.
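
For illustration, a minimal sketch of the single-context setup described above; the calls below are the public RMM/UCX-py entry points as I understand them and may differ across versions.

```python
import os
import rmm
import ucp

# One RMM pool for the whole process; all device buffers come out of this
# single large allocation. With NVLink (cuda_ipc) enabled, that allocation
# is what gets mapped via cuIpcOpenMemHandle, and as noted above such a
# mapping is only allowed once per process.
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)

# Enable NVLink (cuda_ipc) before the UCP context is created.
os.environ.setdefault("UCX_TLS", "tcp,cuda_copy,cuda_ipc")

# A single ucp.init() per process keeps one context; a second, independent
# context would run into the one-mapping-per-process limitation.
ucp.init()
```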

In fact, this is tangential to this issue, but while writing the previous paragraph I realized that this may be the exact case you're running into in https://github.com/openucx/ucx/issues/3192#issuecomment-594172335. Would you be able to test that again without ucp.init/ucp.reset on the cuML side? Of course, I'm assuming you're also using Dask in that same issue.

Finally, let me just say that I'm not sure what the best course of action is here; maybe it is to create a new context, maybe it isn't -- I'm more inclined to believe the latter. In either case, I think we need better documentation, as this is sure to confuse more people in the future.

cjnolet commented 4 years ago

@pentschev, thank you for taking the time to write up this explanation!

I wish https://github.com/openucx/ucx/issues/3192#issuecomment-594172335 were related to separately calling ucp.init / ucp.reset, but cuML is not calling these at all currently.

pentschev commented 4 years ago

~I think you've seen this already @cjnolet , but maybe others will benefit. The issue is probably with multiple endpoints, rather than contexts: https://github.com/openucx/ucx/issues/3192#issuecomment-598412255 .~

Sorry, I think I've messed threads up.

pentschev commented 4 years ago

I think the API splitting work from https://github.com/rapidsai/ucx-py/issues/429 may be helpful here. I'm not totally certain about it, but maybe @madsbk could say something about it.

madsbk commented 4 years ago

> I think the API splitting work from #429 may be helpful here. I'm not totally certain about it, but maybe @madsbk could say something about it.

Yes, I expect that splitting the API is going to make this feature easy to implement :)