Open pentschev opened 5 years ago
This is especially important as we start to broaden the use of UCX to less-sophisticated Python users with UCX-Py and Dask. If this is successful then we might get many more folks with much less experience asking questions like "What does UCX_TLS mean?" and "How do I set up my environment variables so that things work?". I anticipate getting a lot of these questions on the Dask issue tracker (we're getting them now internally) and it would be great to have a good resource to which we can direct them.
You can use:
ucx_info -f
- it'd print all possible env variables with a brief description and their possible values. The documentation can also be built locally with:
./configure --with-docs-only && make docs
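If it helps, that output can also be filtered to look up a single variable; a rough sketch (the -B context count is arbitrary, and the exact layout may vary between UCX versions):
# show the description lines printed around UCX_TLS
ucx_info -f | grep -B 8 UCX_TLS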
I'm glad to hear it and thank you for the timely response. However, the potential users that we're talking about may not be sufficiently technical to do even this. For them, it would be best to point them to a public webpage.
Is serving online documentation in-scope for the UCX project?
@shamisp, can you please comment?
Thanks for the reply @brminich. I was looking through the output of ucx_info -f and the docs, and while they are helpful, I would say they are both incomplete. For instance, neither really explains what rdmacm or sockcm are, nor whether they can be used together or are mutually exclusive. Please note that I'm not trying to make a case for these two parameters in particular, I'm only using them as examples.
In general, I think from a user's perspective (like mine), the documentation is not enough to get started with OpenUCX without guidance from developers.
In fact, my experience so far has been that I had to ask @Akshay-Venkatesh many times about what parameters I should use, and I still can't really explain why I'm using some of them at the moment. I think more robust documentation would generally help users get started with OpenUCX, and potentially prevent them from asking questions similar to this one on GitHub in the future.
Indeed, we can do a better job with respect to documentation. Historically we have been very focused on API documentation, and documentation of the library parameters has been a bit neglected.
Generally speaking, UCX is supposed to pick the connection manager (rdmacm/sockcm) and other “best” tunings automatically, without user intervention. First of all, I would like to figure out why users have to select a CM explicitly. Second, we can probably start gathering more complex use cases and documenting those on the wiki.
Adding @tonycurtis, who has been working on man pages. Adding @hiroyuki-sato, who has been documenting a bunch of useful stuff related to UCX.
It is probably a good time to start consolidating everything on the wiki.
Generally speaking, UCX is supposed to pick the connection manager (rdmacm/sockcm) and other “best” tunings automatically, without user intervention. First of all, I would like to figure out why users have to select a CM explicitly.
Agreed. I raised a similar issue downstream here: https://github.com/rapidsai/ucx-py/issues/245
It is probably a good time to start consolidating everything on the wiki.
Sounds good. Eventually, you might also consider hosting docs somewhere. Personally, I and much of the Python community enjoy using readthedocs.org. It's pretty trivial to turn markdown or rst files into hosted documentation that gets updated whenever you push to GitHub. I find that it's about a 20-minute investment to set up, and that developers find it motivating when you shorten the time between writing documentation source files and having them live on the web.
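For reference, the initial setup is roughly as follows (a sketch only; the commands are illustrative, and the hosting side is configured on readthedocs.org itself):
# generate a Sphinx skeleton for rst sources (mkdocs is the usual choice for markdown)
pip install sphinx sphinx_rtd_theme
sphinx-quickstart docs
# build locally to check the output
make -C docs html
# then import the GitHub repository on readthedocs.org; it rebuilds and
# republishes the documentation on every push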
UCX environment variables are intended mostly for expert/intermediate level debugging and tuning, or as workarounds for issues found by users. The general approach is that users should not set any environment variables to make UCX work. For example, regular users should not set UCX_TLS and should not be aware of sockcm/rdmacm existence.
@mrocklin @Akshay-Venkatesh did you have cases where UCX would not work without specific environment variables?
I'm creating an OpenUCX glossary: https://github.com/hiroyuki-sato/openucx-docs/blob/master/glossaries.txt
(I'm also creating example ucx_info output: https://github.com/hiroyuki-sato/openucx-docs/blob/master/ucx_info.md)
These documents are always PR-ready. If anyone wants to use them, please let me know.
did you have cases where UCX would not work without specific environment variables?
@pentschev maybe you can say a bit here about what you've had to specify to make things work?
Yes, here's a list of what we have to specify right now (and my understanding of their functionality):
- UCX_TLS=rc (enables InfiniBand)
- UCX_TLS=tcp,sockcm UCX_SOCKADDR_TLS_PRIORITY=sockcm (allows establishing connections via TCP sockets)
- UCX_TLS=cuda_copy (allows CUDA transfers)
- UCT_TLS=cuda_ipc (enables NVLink)
- UCX_NET_DEVICES=mlx5_X:1 (where X is the InfiniBand interface number; needs to be set for each process running on a specific GPU so it knows the closest IB interface to use)

Do you have a sense of why these are necessary? For example, what happens if we specify UCX_TLS=all or don't specify UCX_NET_DEVICES at all?
I don't recall the exact behavior with UCX_TLS=all, so I will need to check that again. Without specifying UCX_NET_DEVICES and with InfiniBand enabled, transfers would not go through the expected interface, but rather through a random one that would change depending on the machine.
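For reference, here's roughly how we end up combining those variables when launching a single process (a sketch only; the device name, the GPU-to-NIC pairing, and the worker script are placeholders that depend on the machine):
# hypothetical launch of one worker pinned to GPU 0 and its closest IB device
CUDA_VISIBLE_DEVICES=0 \
UCX_TLS=rc,tcp,sockcm,cuda_copy,cuda_ipc \
UCX_SOCKADDR_TLS_PRIORITY=sockcm \
UCX_NET_DEVICES=mlx5_0:1 \
python worker.py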
My sense is that if we give the OpenUCX team more detailed feedback about what fails when we don't specify environment variables then they might be able to help us understand why that is, or hopefully fix things upstream.
Totally, and that's part of why I'm asking about documentation. I feel that if I understood what the parameters really mean, I could be more specific about the tests I run and about whether the behavior matches what we would expect.
@pentschev I think that UCX_NET_DEVICES is the only variable that needs to be documented for users (we should probably improve this page). Can you try not specifying UCT_TLS? The default value is "all", which means all transports are enabled by default.
BTW, what is the use case for selecting specific devices? If the intent is to use the device closest to a particular GPU, for example, then we are planning to fix UCX to use the closest device by default according to user buffer memory locality; today it's done by CPU locality.
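One way to see what UCX picks on its own might be something like this (just a sketch; the extra logging is only meant to expose the default selection):
# list the transports/devices UCX detects on this node
ucx_info -d
# run without UCX_TLS/UCT_TLS so the defaults apply; a higher log level may
# print extra detail about what gets selected
UCX_LOG_LEVEL=info ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13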
@pentschev I think that UCX_NET_DEVICES is the only variable that needs to be documented for users (we should probably improve this page). Can you try not specifying UCT_TLS? The default value is "all", which means all transports are enabled by default.
BTW, what is the use case for selecting specific devices? If the intent is to use the device closest to a particular GPU, for example, then we are planning to fix UCX to use the closest device by default according to user buffer memory locality; today it's done by CPU locality.
+1
First, sorry for the delay in replying here. I had the chance to try this now, and using UCX_TLS=all works. However, there's a difference in bandwidth: with ucx_perftest, I see that UCX_TLS=all is faster than specifying all the variables explicitly (UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm, to keep it minimal):
CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=all ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
[1571261708.999172] [dgx13:38020:0] perftest.c:1376 UCX WARN CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
100 0.000 54.829 54.829 17393.57 17393.57 18238 18238
CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
[1571262244.045134] [dgx13:60481:0] perftest.c:1376 UCX WARN CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
100 0.000 76.861 76.861 12407.72 12407.72 13010 13010
Interestingly, @quasiben has been updating a benchmark we have for UCX-Py (https://github.com/rapidsai/ucx-py/pull/254), where I see the opposite behavior: UCX_TLS=all is slower:
CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=all python benchmarks/recv-into-client.py -r recv_into -o cupy --n-bytes 1000Mb -p 13338 -s 10.33.227.163
CUDA RUNTIME DEVICE: 0
Roundtrip benchmark
-------------------
n_iter | 10
n_bytes | 1000.00 MB
recv | recv_into
object | cupy
inc | False
===================
19.35 GB / s
===================
[1571261981.226035] [dgx13:47668:0] rc_ep.c:321 UCX WARN destroying rc ep 0x564691960878 with uncompleted operation 0x564694cfaf00
[1571261981.462779] [dgx13:47668:0] mpool.c:43 UCX WARN object 0x5646926c4d80 was not returned to mpool ucp_requests
[1571261981.462797] [dgx13:47668:0] callbackq.c:447 UCX WARN 0 fast-path and 1 slow-path callbacks remain in the queue
CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm python benchmarks/recv-into-client.py -r recv_into -o cupy --n-bytes 1000Mb -p 13338 -s 10.33.227.163
CUDA RUNTIME DEVICE: 0
Roundtrip benchmark
-------------------
n_iter | 10
n_bytes | 1000.00 MB
recv | recv_into
object | cupy
inc | False
===================
23.72 GB / s
===================
We also see some errors during destruction in our UCX-Py benchmark when using UCX_TLS=all that we don't see when specifying the parameters explicitly. Any ideas on what could be the reason for both the bandwidth difference and the errors we're seeing?
One of the differences is that ucx-py runs in blocking mode, whereas ucx_perftest runs in polling mode. Maybe you can report blocking ucx_perftest results with the patch I've given previously.
Pasting the link to Akshay’s patch for completeness.
ref: https://gist.github.com/Akshay-Venkatesh/f7ba35d13410da60f5e131afa0a738eb
Using the blocking mode patch, this is what I get now:
CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=all ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
[1571300259.760885] [dgx13:48924:0] perftest.c:1376 UCX WARN CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
100 0.000 91.431 91.431 10430.52 10430.52 10937 10937
CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
[1571300211.884216] [dgx13:46822:0] perftest.c:1376 UCX WARN CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
100 0.000 127.230 127.230 7495.69 7495.69 7860 7860
Just as in the non-blocking mode, UCX_TLS=all performs better than UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm.
I think the more important thing is first to understand why specifying UCX_TLS=all for ucx_perftest is so much faster; this may help us understand whether we're doing something wrong in UCX-Py.
I don't know if this is potentially related to the bandwidth computation, but I notice cudaMalloc is also considerably faster when running nvprof with UCX_TLS=all (result below in non-blocking mode, but blocking mode shows similar call times):
CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=all /usr/local/cuda/bin/nvprof ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
[1571301256.441254] [dgx13:10600:0] perftest.c:1376 UCX WARN CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
==10600== NVPROF is profiling process 10600, command: ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
100 0.000 62.861 62.861 15171.05 15171.05 15908 15908
==10600== Profiling application: ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
==10600== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 4.7046ms 110 42.769us 42.715us 44.475us [CUDA memcpy PtoP]
API calls: 97.26% 456.51ms 2 228.25ms 7.5950us 456.50ms cudaMalloc
CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm /usr/local/cuda/bin/nvprof ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
[1571301464.867149] [dgx13:19129:0] perftest.c:1376 UCX WARN CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
==19129== NVPROF is profiling process 19129, command: ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
100 0.000 96.030 96.030 9930.98 9930.98 10413 10413
==19129== Profiling application: ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
==19129== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 4.7065ms 110 42.786us 42.715us 44.796us [CUDA memcpy PtoP]
API calls: 97.26% 658.65ms 2 329.33ms 7.8960us 658.65ms cudaMalloc
UCX_TLS=all being faster isn't surprising, because the rc and sm transports will be used for control message exchange for rndv messages. Otherwise just tcp is used.
cudaMalloc isn't in the critical path of the bandwidth measurement and there are just 2 calls. Ignore the difference.
Running nvprof on the ucx-py benchmark may be able to point to why you get 23 GB/s without UCX_TLS=all as opposed to 19 GB/s with it.
UCX_TLS=all being faster isn't surprising, because the rc and sm transports will be used for control message exchange for rndv messages. Otherwise just tcp is used.
Is it possible to simulate that by passing the proper variables so we can really see how they affect execution and perhaps understand them better?
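For example, would something along these lines come close (only a guess at the transport names, based on the rc/sm mention above)?
# hypothetical attempt at spelling out what UCX_TLS=all ends up selecting
UCX_TLS=rc,sm,tcp,sockcm,cuda_copy,cuda_ipc ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13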
cudaMalloc isn't in the critical path of the bandwidth measurement and there are just 2 calls. Ignore the difference.
Sure, I don't think they are in the critical path either; I just found it curious that there's a substantial and consistent difference in the cudaMalloc time across multiple runs, as I wouldn't expect to see any changes in memory allocation time.
Running nvprof on the ucx-py benchmark may be able to point to why you get 23 GB/s without UCX_TLS=all as opposed to 19 GB/s with it.
I just did that, and the only difference is also in memory allocation, where UCX_TLS=all is still somewhat faster than specifying the parameters manually.
UCX_TLS=all being faster isn't surprising, because the rc and sm transports will be used for control message exchange for rndv messages. Otherwise just tcp is used.
Is it possible to simulate that by passing the proper variables so we can really see how they affect execution and perhaps understand them better?
It may be possible, but I don't know how straightforward that is.
cudaMalloc isn't in the critical path of the bandwidth measurement and there are just 2 calls. Ignore the difference.
Sure, I don't think they are in the critical path either; I just found it curious that there's a substantial and consistent difference in the cudaMalloc time across multiple runs, as I wouldn't expect to see any changes in memory allocation time.
Interesting but not sure why.
Running nvprof on the ucx-py benchmark may be able to point to why you get 23 GB/s without UCX_TLS=all as opposed to 19 GB/s with it.
I just did that, and the only difference is also in memory allocation, where UCX_TLS=all is still somewhat faster than specifying the parameters manually.
Wait. Are you saying UCX_TLS=all is faster than UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm now for the ucx-py benchmark? The original issue says UCX_TLS=all is slower.
If you meant that UCX_TLS=all is slower: I just recollected that previously we had seen cases within the node (where NVLink was applicable) where removing IB UCTs resulted in performance gains. For this configuration, @nsakharnykh and his intern found out that disabling IB resulted in time savings from not registering memory for IB, which was taking up considerable time. You can run nvprof with cpu-profiling turned on and check if this is the case you're seeing as well. UCX will try to register with all transports that are capable of moving data, so IB registration occurs even though just CUDA IPC is used for moving data.
Wait. Are you saying UCX_TLS=all is faster than UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm now for the ucx-py benchmark? The original issue says UCX_TLS=all is slower.
Sorry, my mistake. UCX_TLS=all is slower for ucx-py, but faster for ucx_perftest.
If you meant that UCX_TLS=all is slower: I just recollected that previously we had seen cases within the node (where NVLink was applicable) where removing IB UCTs resulted in performance gains. For this configuration, @nsakharnykh and his intern found out that disabling IB resulted in time savings from not registering memory for IB, which was taking up considerable time. You can run nvprof with cpu-profiling turned on and check if this is the case you're seeing as well. UCX will try to register with all transports that are capable of moving data, so IB registration occurs even though just CUDA IPC is used for moving data.
Thanks for the hint, I will try that out today and report results back here.
@Akshay-Venkatesh have you used nvprof's cpu-profiling with UCX before? If so, could you tell me whether I should set any specific flags? I've been trying for quite some time already with --cpu-profiling on, and the process always hangs. I've been trying now with ucx_perftest; additionally, I've tried decreasing the sampling frequency and exporting results to a file, but the process always hangs. Just to be clear, "hangs" means 10-30 minutes at least.
I have been trying to understand what the various environment variable options mean, but this seems to be mostly undocumented, with the exception of a few that I could find on the wiki here and here.
Only as examples, I can't find documentation on what rdmacm and sockcm are, even though the names are self-explanatory to some extent. As a regular user (not an OpenUCX developer deeply involved with the project), I have the feeling it is difficult to understand what they mean exactly, or how/when I should use them.
In general, what I would expect to see from documentation is:
The GitHub wiki is also not particularly good for navigation/searching, which makes it even more difficult to find information. As an example, if you go to the wiki and "Find a page", it will only find keywords if they're contained in a page's title, not in its content.
Is this something that has been worked already, and if not, is there a timeline for it?
In case this is something that already exists and I just did not manage to find, could you point me to that?
cc @Akshay-Venkatesh @quasiben @madsbk @mrocklin