openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

question about fine-grained transport selection for multi-node env #9560

Open qelk123 opened 9 months ago

qelk123 commented 9 months ago

Hi all, I am utilizing UCX for inter-process communication across multiple nodes in my environment, and I've observed that UCX selects different transport configurations based on the communication pattern. During my testing, the configurations were displayed as follows:

ucp_worker.c:1783 UCX INFO ep_cfg[1]: tag(sysv/memory cma/memory cuda_copy/cuda)

ucp_worker.c:1783 UCX INFO ep_cfg[2]: tag(sysv/memory cma/memory tcp/ib1)

ucp_worker.c:1783 UCX INFO ep_cfg[3]: tag(tcp/ib1 tcp/ib0)

My first question concerns the interpretation of these ep_cfg entries. I presume ep_cfg[0] is the transport configuration for a process communicating with itself, ep_cfg[1] is for communication between processes on different CPU cores or GPUs within one server node, and ep_cfg[2] is for inter-node communication between different server nodes. Am I interpreting these correctly?

Furthermore, my primary concern is whether I can specify transport constraints individually for different communication patterns (or for different ep_cfg entries). As it stands, I am only able to set constraints globally using UCX_TLS, which affects all ep_cfg entries and could result in suboptimal configurations for certain communication patterns. Is there a way to configure the transport layer with finer granularity for distinct communication patterns?

Regards, Micheal
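
For readers following along, here is a minimal sketch of the global approach described above: UCX_TLS is set once in the environment and therefore applies to every ep_cfg entry of the process. The helper name `ucx_env` and the transport list are hypothetical and chosen only for illustration. UCX_TLS and UCX_LOG_LEVEL are the standard UCX environment variables.

```python
import os
from typing import Dict, Iterable, Optional


def ucx_env(transports: Iterable[str],
            base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of an environment with a global UCX transport list.

    `ucx_env` is a hypothetical helper, not part of UCX. UCX_TLS is
    process-wide, so the same list constrains self, intra-node, and
    inter-node endpoint configurations alike.
    """
    env = dict(os.environ if base is None else base)
    # Comma-separated list of allowed transports, applied globally.
    env["UCX_TLS"] = ",".join(transports)
    # The ep_cfg[...] lines quoted in this thread are logged at INFO level.
    env["UCX_LOG_LEVEL"] = "info"
    return env


# Example (the transport list is an assumption taken from the logs above):
env = ucx_env(["tcp", "sysv", "cma", "cuda_copy"], base={})
print(env["UCX_TLS"])
```

The returned dictionary could then be passed as `env=` to `subprocess.run` when launching the application. There is no per-configuration variant of this in the sketch, which matches the limitation raised in the question.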

yosefe commented 9 months ago

The different configurations usually correspond to self, intra-node, and inter-node transports, but not always: they can also depend on endpoint creation parameters, on the different MPI components that use UCX, and so on. In newer versions this log message also prints the type of configuration. Currently there is no way to set transport constraints per process topology. In which use case do you see a suboptimal selection by default?
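
To compare what was selected for each configuration, one option is to parse the ep_cfg lines out of the log. The following is a rough sketch that assumes the exact line format quoted earlier in this thread. Newer UCX versions print additional information on this line, so the pattern may need adjusting. The function name `parse_ep_cfg` is hypothetical, and only the first `feature(...)` group on each line is read.

```python
import re
from typing import Dict, List, Tuple

# Matches lines in the format quoted in this thread, for example:
#   ucp_worker.c:1783 UCX INFO ep_cfg[2]: tag(sysv/memory cma/memory tcp/ib1)
EP_CFG_RE = re.compile(r"ep_cfg\[(\d+)\]:\s*\w+\(([^)]*)\)")


def parse_ep_cfg(log_text: str) -> Dict[int, List[Tuple[str, str]]]:
    """Map each ep_cfg index to its list of (transport, device) pairs."""
    configs: Dict[int, List[Tuple[str, str]]] = {}
    for match in EP_CFG_RE.finditer(log_text):
        index = int(match.group(1))
        lanes = []
        for lane in match.group(2).split():
            # Each lane is written as "transport/device".
            transport, _, device = lane.partition("/")
            lanes.append((transport, device))
        configs[index] = lanes
    return configs


sample = """\
ucp_worker.c:1783 UCX INFO ep_cfg[1]: tag(sysv/memory cma/memory cuda_copy/cuda)
ucp_worker.c:1783 UCX INFO ep_cfg[2]: tag(sysv/memory cma/memory tcp/ib1)
ucp_worker.c:1783 UCX INFO ep_cfg[3]: tag(tcp/ib1 tcp/ib0)
"""

for index, lanes in sorted(parse_ep_cfg(sample).items()):
    print(index, [transport for transport, _ in lanes])
```

Listing the transports per index this way makes it easier to point at the specific configuration where the default selection looks suboptimal.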