openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Documentation for environment variables #4273

Open pentschev opened 5 years ago

pentschev commented 5 years ago

I have been trying to understand what the various environment variable options mean, but this seems to be mostly undocumented, with the exception of a few that I could find on the wiki here and here.

Just as examples, I can't find documentation on what rdmacm and sockcm are, even though the names are self-explanatory to some extent. As a regular user (not an OpenUCX developer deeply involved with the project), I have the feeling it is difficult to understand exactly what they mean, or how and when I should use them.

In general, what I would expect to see from documentation is:

  1. What each parameter for the environment variables means;
  2. When to use them (e.g., is specific hardware needed, are parameters exclusive, etc.).

The GitHub wiki is also not particularly good for navigation/searching, which makes it even more difficult to find information. As an example, if you go to the wiki and "Find a page", it will only find keywords contained in a page's title, not its content.

Is this something that has been worked on already, and if not, is there a timeline for it?

In case this is something that already exists and I just did not manage to find, could you point me to that?

cc @Akshay-Venkatesh @quasiben @madsbk @mrocklin

mrocklin commented 5 years ago

This is especially important as we start to broaden the use of UCX to less-sophisticated Python users with UCX-Py and Dask. If this is successful then we might get many more folks with much less experience asking questions like "What does UCX_TLS mean?" and "How do I set up my environment variables so that things work?". I anticipate getting a lot of these questions on the Dask issue tracker (we're getting them now internally) and it would be great to have a good resource to which we can direct them.

brminich commented 5 years ago

You can use:
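Judging from the later reply that mentions ucx_info -f, the suggestion here was presumably that command, which prints every UCX configuration variable together with a built-in description. A hedged sketch (flags per the ucx_info help output):

```shell
# List all UCX configuration variables with their built-in descriptions:
ucx_info -f

# Narrow down to one variable of interest, e.g. UCX_TLS:
ucx_info -f | grep -A 5 'UCX_TLS'
```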

mrocklin commented 5 years ago

I'm glad to hear it and thank you for the timely response. However, the potential users that we're talking about may not be sufficiently technical to do even this. For them, it would be best to point them to a public webpage.

Is serving online documentation in-scope for the UCX project?

brminich commented 5 years ago

@shamisp, can you please comment?

pentschev commented 5 years ago

Thanks for the reply @brminich. I was looking through the output of ucx_info -f and the docs, and while they are helpful, I would say they are both incomplete. For instance, neither of them really answers what rdmacm or sockcm is, or whether they can be used together or are mutually exclusive. Please note that I'm not trying to make a case for these two parameters in particular, I'm only using them as examples.

In general, I think from a user's perspective (like mine), the documentation is not enough to get started with OpenUCX without guidance from developers.

In fact, my experience so far has been that I had to ask @Akshay-Venkatesh many times about what parameters I should use, and I can't really explain why I'm using some of them at the moment. I think more robust documentation may generally help users get started with OpenUCX, and potentially prevent them from asking questions similar to this one on GitHub in the future.

shamisp commented 5 years ago

Indeed, we can do a better job with respect to documentation. Historically we have been very focused on API documentation, and documentation of the library parameters has been a bit neglected.

Generally speaking, UCX is supposed to pick the connection manager (rdmacm/sockcm) and other “best” tunings automatically, without user intervention. First of all, I would like to figure out why users have to select the CM explicitly. Second, we can probably start gathering more complex use cases and documenting them on the wiki.

Adding @tonycurtis, who has been working on man pages. Adding @hiroyuki-sato, who has been documenting a bunch of useful material related to UCX.

It is probably a good time to start consolidating everything on the wiki.

mrocklin commented 5 years ago

Generally speaking, UCX is supposed to pick the connection manager (rdmacm/sockcm) and other “best” tunings automatically, without user intervention. First of all, I would like to figure out why users have to select the CM explicitly.

Agreed. I raised a similar issue downstream here: https://github.com/rapidsai/ucx-py/issues/245

It is probably a good time to start consolidating everything on the wiki.

Sounds good. Eventually, you might also consider hosting docs somewhere. Personally, I (and much of the Python community) enjoy using readthedocs.org. It's pretty trivial to turn markdown or rst files into hosted documentation that gets updated whenever you push to GitHub. I find that it's about a 20-minute investment to set up, and that developers find it motivating when you shorten the time between writing documentation source files and having them live on the web.

yosefe commented 5 years ago

UCX environment variables are intended mostly for expert/intermediate level debugging and tuning, or as workarounds for issues found by users. The general approach is that users should not set any environment variables to make UCX work. For example, regular users should not set UCX_TLS and should not be aware of sockcm/rdmacm existence.

@mrocklin @Akshay-Venkatesh did you have cases where UCX would not work without specific environment variables?

hiroyuki-sato commented 5 years ago

I'm creating OpenUCX glossaries. https://github.com/hiroyuki-sato/openucx-docs/blob/master/glossaries.txt

(And also I'm creating ucx_info example output) https://github.com/hiroyuki-sato/openucx-docs/blob/master/ucx_info.md

These documents are always PR-ready. If anyone wants to use them, please let me know.

mrocklin commented 5 years ago

did you have cases where UCX would not work without specific environment variables?

@pentschev maybe you can say a bit here about what you've had to specify to make things work?

pentschev commented 5 years ago

Yes, here's a list of what we have to specify right now (and my understanding of their functionality):

mrocklin commented 5 years ago

Do you have a sense of why these are necessary? For example, what happens if we specify UCX_TLS=all or don't specify UCX_NET_DEVICES at all?

pentschev commented 5 years ago

I don't recall the exact behavior with UCX_TLS=all, so I will need to check that again. Without specifying UCX_NET_DEVICES and with InfiniBand enabled, there would be no transfer at the expected interface, but rather a random one which would change depending on the machine.
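For illustration, the kind of device pinning described above looks like this (a sketch; mlx5_0:1 is a hypothetical InfiniBand device/port pair, and the real name is machine-specific):

```shell
# Pin UCX to a specific InfiniBand device and port instead of letting it
# pick an arbitrary interface (list available devices with `ucx_info -d`
# or, on InfiniBand systems, `ibv_devices`):
export UCX_NET_DEVICES=mlx5_0:1
```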

mrocklin commented 5 years ago

My sense is that if we give the OpenUCX team more detailed feedback about what fails when we don't specify environment variables then they might be able to help us understand why that is, or hopefully fix things upstream.

pentschev commented 5 years ago

Totally, and that's part of why I'm asking about documentation. I feel that if I understood what the parameters really mean, I could be more specific about the tests I do and about whether the behavior matches what we would expect.

yosefe commented 5 years ago

@pentschev I think that UCX_NET_DEVICES is the only variable that needs to be documented for users (we should probably improve this page). Can you try not specifying UCX_TLS? The default value is "all", which means all transports are enabled by default.

BTW, what is the use case for selecting specific devices? If the intent is to use the device closest to a particular GPU, for example, then we are planning to fix UCX to use the closest device by default according to user buffer memory locality; today it's done by CPU locality.

Akshay-Venkatesh commented 5 years ago

@pentschev I think that UCX_NET_DEVICES is the only variable that needs to be documented for users (we should probably improve this page). Can you try not specifying UCX_TLS? The default value is "all", which means all transports are enabled by default.

BTW, what is the use case for selecting specific devices? If the intent is to use the device closest to a particular GPU, for example, then we are planning to fix UCX to use the closest device by default according to user buffer memory locality; today it's done by CPU locality.

+1

pentschev commented 5 years ago

First, sorry for the delay in replying here. I had the chance to try this now, and using UCX_TLS=all works. However, there's a difference in bandwidth: with ucx_perftest, I see that UCX_TLS=all is faster than specifying all the variables explicitly (UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm, to keep it minimal):

 CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=all ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
[1571261708.999172] [dgx13:38020:0]       perftest.c:1376 UCX  WARN  CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
           100     0.000    54.829    54.829   17393.57   17393.57       18238       18238
CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
[1571262244.045134] [dgx13:60481:0]       perftest.c:1376 UCX  WARN  CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
           100     0.000    76.861    76.861   12407.72   12407.72       13010       13010

Interestingly, @quasiben has been updating a benchmark we have for UCX-Py (https://github.com/rapidsai/ucx-py/pull/254), where I see the opposite behavior: UCX_TLS=all is slower:

CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=all python benchmarks/recv-into-client.py -r recv_into -o cupy --n-bytes 1000Mb -p 13338 -s 10.33.227.163
CUDA RUNTIME DEVICE:  0
Roundtrip benchmark
-------------------
n_iter   | 10
n_bytes  | 1000.00 MB
recv     | recv_into
object   | cupy
inc      | False

===================
19.35 GB / s
===================
[1571261981.226035] [dgx13:47668:0]          rc_ep.c:321  UCX  WARN  destroying rc ep 0x564691960878 with uncompleted operation 0x564694cfaf00
[1571261981.462779] [dgx13:47668:0]          mpool.c:43   UCX  WARN  object 0x5646926c4d80 was not returned to mpool ucp_requests
[1571261981.462797] [dgx13:47668:0]      callbackq.c:447  UCX  WARN  0 fast-path and 1 slow-path callbacks remain in the queue
CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm python benchmarks/recv-into-client.py -r recv_into -o cupy --n-bytes 1000Mb -p 13338 -s 10.33.227.163
CUDA RUNTIME DEVICE:  0
Roundtrip benchmark
-------------------
n_iter   | 10
n_bytes  | 1000.00 MB
recv     | recv_into
object   | cupy
inc      | False

===================
23.72 GB / s
===================

We also see some errors during destruction in our UCX-Py benchmark when using UCX_TLS=all that we don't when specifying the parameters. Any ideas on what could be the reason for both the bandwidth difference and errors that we're seeing?

Akshay-Venkatesh commented 5 years ago

One of the differences is that UCX-Py runs in blocking mode whereas ucx_perftest runs in polling mode. Maybe you can report blocking-mode ucx_perftest numbers with the patch I've given previously.

jakirkham commented 5 years ago

Pasting the link to Akshay’s patch for completeness.

ref: https://gist.github.com/Akshay-Venkatesh/f7ba35d13410da60f5e131afa0a738eb

pentschev commented 5 years ago

Using the blocking mode patch, this is what I get now:

CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=all ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
[1571300259.760885] [dgx13:48924:0]       perftest.c:1376 UCX  WARN  CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
           100     0.000    91.431    91.431   10430.52   10430.52       10937       10937
CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
[1571300211.884216] [dgx13:46822:0]       perftest.c:1376 UCX  WARN  CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
           100     0.000   127.230   127.230    7495.69    7495.69        7860        7860

Just as in the non-blocking mode, UCX_TLS=all performs better than UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm.

I think the more important thing is first to understand why UCX_TLS=all is so much faster for ucx_perftest; this may help us understand whether we're doing something wrong in UCX-Py.

I don't know if this is potentially related to the bandwidth computation, but I notice cudaMalloc is also considerably faster when running nvprof with UCX_TLS=all (result below in non-blocking mode, but blocking mode shows similar call times):

 CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=all /usr/local/cuda/bin/nvprof ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
[1571301256.441254] [dgx13:10600:0]       perftest.c:1376 UCX  WARN  CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
==10600== NVPROF is profiling process 10600, command: ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
           100     0.000    62.861    62.861   15171.05   15171.05       15908       15908
==10600== Profiling application: ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
==10600== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  4.7046ms       110  42.769us  42.715us  44.475us  [CUDA memcpy PtoP]
      API calls:   97.26%  456.51ms         2  228.25ms  7.5950us  456.50ms  cudaMalloc
CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm /usr/local/cuda/bin/nvprof ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
[1571301464.867149] [dgx13:19129:0]       perftest.c:1376 UCX  WARN  CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
==19129== NVPROF is profiling process 19129, command: ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
           100     0.000    96.030    96.030    9930.98    9930.98       10413       10413
==19129== Profiling application: ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
==19129== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  4.7065ms       110  42.786us  42.715us  44.796us  [CUDA memcpy PtoP]
      API calls:   97.26%  658.65ms         2  329.33ms  7.8960us  658.65ms  cudaMalloc

Akshay-Venkatesh commented 5 years ago

UCX_TLS=all being faster isn't surprising, because the rc and sm transports will be used for control-message exchange for rndv (rendezvous) messages. Otherwise just tcp is used.
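If that is the mechanism, one way to approximate UCX_TLS=all would be to add those transports to the explicit list and re-run the same benchmark (a sketch; rc requires InfiniBand hardware, and sm is intra-node shared memory):

```shell
# Explicit transport list extended with rc (InfiniBand RC) and sm
# (shared memory), approximating what UCX_TLS=all selects here:
CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n \
    UCX_TLS=rc,sm,tcp,cuda_copy,cuda_ipc,sockcm \
    ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
```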

cudaMalloc isn't in the critical path of the bandwidth measurement, and there are just 2 calls. Ignore the difference.

Running nvprof on the UCX-Py benchmark may be able to point to why you get 23 GB/s without UCX_TLS=all as opposed to 19 GB/s with it.

pentschev commented 5 years ago

UCX_TLS=all being faster isn't surprising, because the rc and sm transports will be used for control-message exchange for rndv (rendezvous) messages. Otherwise just tcp is used.

Is it possible to simulate that by passing the proper variables so we can really see how they affect execution and perhaps understand them better?

cudaMalloc isn't in the critical path of the bandwidth measurement, and there are just 2 calls. Ignore the difference.

Sure, I don't think they are in the critical path either; I just found it curious that there's a substantial and consistent difference in cudaMalloc time across multiple runs, as I wouldn't expect memory allocation time to change.

Running nvprof on the UCX-Py benchmark may be able to point to why you get 23 GB/s without UCX_TLS=all as opposed to 19 GB/s with it.

I just did that and the only difference is also memory allocation, where UCX_TLS=all is still somewhat faster than specifying parameters manually.

Akshay-Venkatesh commented 5 years ago

UCX_TLS=all being faster isn't surprising, because the rc and sm transports will be used for control-message exchange for rndv (rendezvous) messages. Otherwise just tcp is used.

Is it possible to simulate that by passing the proper variables so we can really see how they affect execution and perhaps understand them better?

It may be possible, but I don't know how straightforward that is.

cudaMalloc isn't in the critical path of the bandwidth measurement, and there are just 2 calls. Ignore the difference.

Sure, I don't think they are in the critical path either; I just found it curious that there's a substantial and consistent difference in cudaMalloc time across multiple runs, as I wouldn't expect memory allocation time to change.

Interesting but not sure why.

Running nvprof on the UCX-Py benchmark may be able to point to why you get 23 GB/s without UCX_TLS=all as opposed to 19 GB/s with it.

I just did that and the only difference is also memory allocation, where UCX_TLS=all is still somewhat faster than specifying parameters manually.

Wait. Are you saying UCX_TLS=all is faster than UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm now for the ucxpy benchmark? The original issue says UCX_TLS=all is slower.

If you meant UCX_TLS=all is slower: I just recollected that we had previously seen cases within the node (where NVLink was applicable) in which removing the IB UCTs resulted in performance gains. For that configuration, @nsakharnykh and his intern found that disabling IB saved the time otherwise spent registering memory for IB, which was considerable. You can run nvprof with CPU profiling turned on and check whether this is the case you're seeing as well. UCX will try to register with all transports that are capable of moving the data, so IB registration occurs even though just CUDA IPC is used for moving the data.
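As a sketch, the CPU-profiling run suggested here could look like the following (flag names as in the nvprof documentation; the profiling-mode option is optional):

```shell
# CPU-side profile of the UCX_TLS=all run, to check whether IB memory
# registration shows up as a hotspot:
CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=all \
    /usr/local/cuda/bin/nvprof --cpu-profiling on \
    --cpu-profiling-mode top-down \
    ucx_perftest -t tag_bw -m cuda -s 1000000 -n 100 dgx13
```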

pentschev commented 5 years ago

Wait. Are you saying UCX_TLS=all is faster than UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm now for the ucxpy benchmark? The original issue says UCX_TLS=all is slower.

Sorry, my mistake. UCX_TLS=all is slower for ucx-py, but faster for ucx_perftest.

If you meant UCX_TLS=all is slower, I just recollected that previously we had seen cases within the node (where NVLINK was applicable) that removing IB UCTs resulted in performance gains. For this configuration @nsakharnykh and his intern found out that disabling IB resulted in time savings from not registering memory for IB which was taking up considerable time. You can run nvprof with cpu-profiling turned on and check if this is the case you're seeing as well. UCX will try and register with all transports that are capable of moving data so IB registration occurs even though just CUDA-IPC is used for moving data.

Thanks for the hint, I will try that out today and report results back here.

pentschev commented 5 years ago

@Akshay-Venkatesh have you used nvprof's CPU profiling with UCX before? If so, could you tell me whether I should set any specific flags? I've been trying for quite some time with --cpu-profiling on, and the process always hangs. I've been trying now with ucx_perftest; additionally, I've tried decreasing the sampling frequency and exporting results to a file, but the process always hangs. Just to be clear, "hangs" means waiting at least 10-30 minutes.