rapidsai / ucx-py

Python bindings for UCX
https://ucx-py.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
115 stars 55 forks

Failing NVLINK test #186

Closed quasiben closed 4 years ago

quasiben commented 4 years ago

We have a test which passes data between two processes in PR #174

The cupy test:

py.test -s -v "tests/test_send_recv_two_workers.py::test_send_recv_cudf[cupy]"

passes when using @Akshay-Venkatesh's branch https://github.com/Akshay-Venkatesh/ucx/tree/ucx-cuda-backup but does not pass on the now-updated branch https://github.com/Akshay-Venkatesh/ucx/tree/ucx-cuda

Does this suggest a change in the logic UCX uses to choose between IB and NVLink/IPC?
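For reference, one way to tell the two paths apart is to constrain transport selection explicitly via `UCX_TLS`, as done later in this thread. A sketch (the `rc` transport for InfiniBand is an assumption here; the exact value depends on the system):

```shell
# Force the NVLink/CUDA IPC path (no InfiniBand):
UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm py.test -s -v "tests/test_send_recv_two_workers.py::test_send_recv_cudf[cupy]"

# Force the InfiniBand path (no CUDA IPC):
UCX_TLS=rc,cuda_copy,sockcm py.test -s -v "tests/test_send_recv_two_workers.py::test_send_recv_cudf[cupy]"
```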

cc @Akshay-Venkatesh @pentschev

pentschev commented 4 years ago

I can confirm that: for me, it also works when running https://github.com/Akshay-Venkatesh/ucx/tree/ucx-cuda-backup, but it doesn't if I run https://github.com/Akshay-Venkatesh/ucx/tree/ucx-cuda.

pentschev commented 4 years ago

I also checked the same test we were looking at during our conference call earlier today with https://github.com/Akshay-Venkatesh/ucx/tree/ucx-cuda-backup, and I see some NVLink transfer there, which is not the case with the other branch, even though it apparently hangs at the same point.

One more thing (I don't know if it's still relevant): with the https://github.com/Akshay-Venkatesh/ucx/tree/ucx-cuda-backup branch, both workers hang in the same place; see the traceback below:

#0  0x00007f227fae126d in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f227fadae42 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f227bb63bd0 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f227bcaad60 in cuStreamAddCallback () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f21a5955754 in uct_cuda_ipc_iface_event_fd_arm (tl_iface=0x5615dcba25b0, events=3)
    at ../../../../src/uct/cuda/cuda_ipc/cuda_ipc_iface.c:214
#5  0x00007f227ce601ac in uct_iface_event_arm (iface=0x5615dcba25b0, events=3)
    at ../../../src/uct/base/uct_iface.c:178
#6  0x00007f227d0afcd6 in ucp_worker_arm (worker=0x5615dcc96830) at ../../../src/ucp/core/ucp_worker.c:1737
#7  0x00007f227d3f78b7 in ucp_py_worker_progress_wait () at ucp/_libs/src/ucp_py_ucp_fxns.c:420
#8  0x00007f227d3900a8 in __pyx_pf_6ucp_py_12on_activity_cb (__pyx_self=<optimized out>) at ucp/_libs/ucp_py.c:18013
#9  __pyx_pw_6ucp_py_13on_activity_cb (__pyx_self=<optimized out>, unused=<optimized out>)
    at ucp/_libs/ucp_py.c:17916
#10 0x00005615da6a9651 in _PyMethodDef_RawFastCallKeywords ()
    at /home/conda/feedstock_root/build_artifacts/python_1562015400360/work/Objects/call.c:629
// Continues with other Python calls
jakirkham commented 4 years ago

I can confirm that, for me it also works when running https://github.com/Akshay-Venkatesh/ucx/tree/ucx-cuda-backup, but it doesn't if I run https://github.com/Akshay-Venkatesh/ucx/tree/ucx-cuda.

@pentschev, have you tried running git bisect to identify which commit introduced the problem?
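A toy, self-contained sketch of the `git bisect run` workflow being suggested here, using a throwaway repository and a `grep` check as a stand-in for the failing NVLink test (the repository layout and file names are invented for illustration):

```shell
#!/bin/sh
# Build a throwaway repo where one commit breaks a check, then let
# `git bisect run` find the breaking commit automatically.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect-demo

# Three known-good commits...
for i in 1 2 3; do
    echo "ok $i" > status.txt
    git add status.txt
    git commit -qm "good commit $i"
done
# ...then the breaking change, then one more commit on top.
echo "fail" > status.txt
git commit -qam "breaking commit"
echo "fail again" > status.txt
git commit -qam "later commit"

# HEAD is known bad, the root commit is known good; the test command
# exits 0 (good) when status.txt still contains "ok".
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"
git bisect run grep -q ok status.txt > bisect_out.txt
git bisect reset
grep "is the first bad commit" bisect_out.txt
```

The same pattern applies to real branches once a common good ancestor is known, e.g. `git bisect start ucx-cuda $(git merge-base ucx-cuda ucx-cuda-backup)` with the NVLink test as the run command.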

quasiben commented 4 years ago

I've submitted https://github.com/rapidsai/ucx-py/pull/233 as a test which should pass when NVLink is working correctly

quasiben commented 4 years ago

@pentschev and I confirmed that the IPC/NVLink tests are working when using ucx-cuda-backup

pentschev commented 4 years ago

@Akshay-Venkatesh after checking branches with @quasiben, we found out that there are a lot of commits in your ucx-cuda-backup branch that are not contained in ucx-cuda. I started to merge both branches, but I hit a point where there was too much logic for me to tell which side to keep or how to merge them, since I don't have a deep understanding of the code.

Would you be able to merge the branches together so that we can have it in a state where both InfiniBand and NVLink work in the same branch?

jakirkham commented 4 years ago

FWIW there’s no diff between OpenUCX master and ucx-cuda.

If there are additional changes in ucx-cuda-backup, it would be nice to also send them upstream. We'd like to start basing our packaging work off of 1.7.0, and upstreaming this content soon would really help. 😁

jakirkham commented 4 years ago

Also @pentschev given you have found a working commit and a not working commit, I’m curious what the results of git bisect yield. Is there a single commit that makes the difference? If so, we could extract that and use it in builds as well as send it upstream for integration. 😀

pentschev commented 4 years ago

Also @pentschev given you have found a working commit and a not working commit, I’m curious what the results of git bisect yield. Is there a single commit that makes the difference? If so, we could extract that and use it in builds as well as send it upstream for integration. 😀

As I said before, the commits in ucx-cuda-backup are not contained in ucx-cuda, so git bisect serves no purpose here. I think that if we merge ucx-cuda-backup into ucx-cuda things will work, but there are some code paths I can't merge by myself, as the diff is too complex for someone without knowledge of how things work, specifically the changes to sockcm.

jakirkham commented 4 years ago

I don’t think that is accurate. At some point ucx-cuda and ucx-cuda-backup have a common origin. If the common origin does work, then we should be able to find the commit in ucx-cuda that broke things. If the common origin doesn’t work, we should be able to identify the commit in ucx-cuda-backup that fixed things. Either way we should be able to generate a patch from a single commit that reverts the broken pieces or adds the working pieces. This is what git bisect should be able to help with.

pentschev commented 4 years ago

At some point ucx-cuda and ucx-cuda-backup have a common origin.

This is true, I'm not claiming otherwise.

If the common origin does work, then we should be able to find the commit in ucx-cuda that broke things. If the common origin doesn’t work, we should be able to identify the commit in ucx-cuda-backup that fixed things.

This is what I'm saying: I don't think there's a common origin that works which some commit(s) then broke. It's clear that some of the commits in ucx-cuda-backup were never part of ucx-cuda, and the branches should be merged together. We can see the commits that aren't part of the ucx-cuda branch here, and given that many commit messages mention CUDA IPC, that leads me to believe those are the commits that add support for/fix NVLink. Finally, since the fork of OpenUCX and some of the older commits date back to May and are about 1500 commits behind, I can't really make sense of all the differences myself, so I think @Akshay-Venkatesh will be much quicker (and safer) handling it.

pentschev commented 4 years ago

IOW, finding the commits that work is likely not the problem; merging them into the current code is (for me, at least; for Akshay that's likely going to be easy).

pentschev commented 4 years ago

After a long conversation with @Akshay-Venkatesh (thanks very much!), we found out that NVLink on the ucx-cuda branch does work. One way to verify that is to build that branch with this patch and follow these steps.

I could see on nvprof's output the following line (which is the relevant one):

 GPU activities:  100.00%  4.7071ms       110  42.791us  42.716us  43.900us  [CUDA memcpy PtoP]

I could also see traffic when looking at nvidia-smi nvlink -g 0/nvidia-smi nvlink -g 1.

This seems to point to a bug in ucx-py itself. I have no knowledge of the relevant code, but I believe @quasiben and @madsbk can use this information to help verify that.

jakirkham commented 4 years ago

One thing Akshay pointed out is that using UCX in blocking mode seems to exhibit this behavior. I'm not sure what would be involved in using UCX in non-blocking mode from ucx-py, but that would presumably fix this issue.

quasiben commented 4 years ago

I just tested with the latest UCX and, confusingly, I saw traffic on NVLink. This is the version of UCX I am on:

(cudf_dev10.1) bzaitlen@dgx15:~/GitRepos/ucx-py$ ucx_info -v
# UCT version=1.8.0 revision 4042503
# configured with: --prefix=/home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1 --enable-debug --with-cuda=/usr/local/cuda-10.1/ --enable-mt CPPFLAGS=-I///usr/local/cuda-10.1//include

and I ran this command:

UCXPY_IFNAME=enp1s0f0 CUDA_VISIBLE_DEVICES=0,1 UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm UCX_SOCKADDR_TLS_PRIORITY=sockcm py.test -s -v "tests/test_send_recv_two_workers.py::test_send_recv_cudf[dataframe]"

jakirkham commented 4 years ago

IIUC this is expected. Namely UCX already supports NVLINK. Our issues with NVLINK are not UCX related, but ucx-py related. So we will need to figure out what in ucx-py is causing us issues.

The lead Akshay gave us (assuming I’m not missing something) was to get ucx-py to use UCX in non-blocking mode. @madsbk, do you have a sense of what this would take?

quasiben commented 4 years ago

What I'm saying (though I could be wrong) is that it looks like ucx-py is working now, passing data back and forth across two GPUs over NVLink. Previously, the NVLink test test_send_recv_two_workers.py was not working; now it is, and I don't know exactly why.

pentschev commented 4 years ago

@madsbk and I tried to do some more debugging this morning; our finding is that ucx master also works with ucx-py, so the ucx-cuda branch is the only one that doesn't work with ucx-py. We both agree that we should use ucx master in favor of any other branches and continue with that.

It is still not clear why the ucx-cuda branch is the only one that ucx-py doesn't work with. For me personally this is not relevant at this point, but I'll leave it as an open question in case someone thinks it's important to check.

After my debugging session with @madsbk, I did some more tests with dask-cuda and was able to verify that ucx master works in all combinations (TCP only, InfiniBand only, NVLink only and InfiniBand+NVLink), which is what I expected to see.

I will tentatively close this issue, but feel free to reopen if there are still unanswered questions that should be discussed.

Akshay-Venkatesh commented 4 years ago

@jakirkham it's wrong to conclude that non-blocking mode should be used. The test I provided showed that UCX works fine in blocking mode. Blocking mode is needed to support coroutine-based execution. If you guys have other means to support that, please enlighten me

jakirkham commented 4 years ago

our finding is that ucx master also works with ucx-py, so the ucx-cuda branch is the only one that doesn't work with ucx-py.

There is no difference between ucx-cuda and master though. The branches (last I checked) were identical.

jakirkham commented 4 years ago

@jakirkham it's wrong to conclude that non-blocking mode should be used. The test I provided concluded that UCX works fine in blocking mode.

Thanks for correcting me.

Blocking mode is needed to support coroutine based execution. If you guys have other means to support that, please enlighten me

Does this mean we should be using blocking mode? If I understand @madsbk's offline comment correctly, we are actually using non-blocking mode. Is this right, @madsbk?

jakirkham commented 4 years ago

I should add that ATM I'm very confused about what the cause of this problem was and how it was fixed. It would be great if we could get some clarity on the cause.

Akshay-Venkatesh commented 4 years ago

The code points to blocking mode use -

https://github.com/rapidsai/ucx-py/blob/ad25135fea4ef68644efe65ee111ca0dc254adf3/ucp/_libs/core.pyx#L356

madsbk commented 4 years ago

@Akshay-Venkatesh what do you mean by non-blocking mode, exactly?

Is non-blocking mode where we wait using epoll, and blocking mode where we would use active polling?
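To illustrate the distinction being asked about: in one mode the process sleeps in epoll on the worker's event fd (cf. `ucp_worker_arm` in the traceback above, with the fd obtained from `ucp_worker_get_efd`) until there is work; in the other it spins on progress. A minimal toy analogy in plain Python, with a socketpair standing in for the worker's event fd (this is not ucx-py's actual code):

```python
import select
import socket
import threading
import time

# Hypothetical stand-in: the receive end of a socketpair plays the role
# of the event fd a UCX worker exposes, and the delayed send plays the
# role of the worker having new events after being armed.
r, w = socket.socketpair()

def peer():
    time.sleep(0.1)
    w.send(b"\x00")  # "work arrived" on the event fd

t = threading.Thread(target=peer)
t.start()

# Epoll-wait style: sleep in the kernel until the event fd fires,
# burning no CPU while idle.
ep = select.epoll()
ep.register(r.fileno(), select.EPOLLIN)
events = ep.poll()  # blocks here until peer() sends
print("woke up with", len(events), "event(s)")

# Active-polling style would instead look like:
#   while not worker_made_progress():  # spin: lowest latency, burns a core
#       pass

t.join()
ep.close()
r.close()
w.close()
```

The epoll-wait pattern is what lets an event loop (asyncio in ucx-py's case) schedule other coroutines while waiting for network activity.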

madsbk commented 4 years ago

The code points to blocking mode use -

https://github.com/rapidsai/ucx-py/blob/ad25135fea4ef68644efe65ee111ca0dc254adf3/ucp/_libs/core.pyx#L356

OK I see, in that case we are using blocking mode in both the old and the new ucx-py :)

jakirkham commented 4 years ago

Ok so we are using blocking mode, but that is ok (I misunderstood @Akshay-Venkatesh earlier).

jakirkham commented 4 years ago

Am still unclear on why NVLINK did not work previously, but now works.

jakirkham commented 4 years ago

@pentschev , would you be able to come up with a few tests for NVLINK with ucx-py? Hopefully we can integrate these so we don’t accidentally lose this functionality later.

pentschev commented 4 years ago

@pentschev , would you be able to come up with a few tests for NVLINK with ucx-py? Hopefully we can integrate these so we don’t accidentally lose this functionality later.

@quasiben already did that in https://github.com/rapidsai/ucx-py/pull/233