Running dask_cudf on 2 GPUs and everything hangs

pseudotensor commented 4 years ago

See for setup details: https://github.com/dmlc/xgboost/issues/6232

Last thing in logs before hang is:

[1603364135.700317] [mr-dl10:19153:0]           sock.c:344  UCX  ERROR recv(fd=145) failed: Connection reset by peer

If I poke the faulthandler in python to see where things are hung for a process using 100% of 1 core, I see:

Thread 0x000014b686414700 (most recent call first):
  File "/home/jon/minicondadai/lib/python3.6/multiprocessing/popen_fork.py", line 28 in poll
  File "/home/jon/minicondadai/lib/python3.6/multiprocessing/popen_fork.py", line 50 in wait
  File "/home/jon/minicondadai/lib/python3.6/multiprocessing/process.py", line 124 in join
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/process.py", line 232 in _watch_process
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 864 in run
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x000014b671fff700 (most recent call first):
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 295 in wait
  File "/home/jon/minicondadai/lib/python3.6/queue.py", line 164 in get
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/process.py", line 217 in _watch_message_queue
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 864 in run
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x000014b671dfe700 (most recent call first):
  File "/home/jon/minicondadai/lib/python3.6/multiprocessing/popen_fork.py", line 28 in poll
  File "/home/jon/minicondadai/lib/python3.6/multiprocessing/popen_fork.py", line 50 in wait
  File "/home/jon/minicondadai/lib/python3.6/multiprocessing/process.py", line 124 in join
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/process.py", line 232 in _watch_process
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 864 in run
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x000014b6717fb700 (most recent call first):
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 295 in wait
  File "/home/jon/minicondadai/lib/python3.6/queue.py", line 164 in get
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/process.py", line 217 in _watch_message_queue
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 864 in run
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x000014b6719fc700 (most recent call first):
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/profile.py", line 269 in _watch
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 864 in run
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x000014b671bfd700 (most recent call first):
  File "/home/jon/minicondadai/lib/python3.6/contextlib.py", line 157 in helper
  File "/home/jon/minicondadai/lib/python3.6/contextlib.py", line 52 in inner
  File "/home/jon/minicondadai/lib/python3.6/site-packages/ucp/continuous_ucx_progress.py", line 81 in _fd_reader_callback
  File "/home/jon/minicondadai/lib/python3.6/asyncio/events.py", line 145 in _run
  File "/home/jon/minicondadai/lib/python3.6/asyncio/base_events.py", line 1462 in _run_once
  File "/home/jon/minicondadai/lib/python3.6/asyncio/base_events.py", line 442 in run_forever
  File "/home/jon/minicondadai/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 149 in start
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/utils.py", line 418 in run_loop
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 864 in run
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x000014b686dd5700 (most recent call first):
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 295 in wait
  File "/home/jon/minicondadai/lib/python3.6/queue.py", line 164 in get
  File "/home/jon/minicondadai/lib/python3.6/concurrent/futures/thread.py", line 67 in _worker
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 864 in run
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x000014b7948bd700 (most recent call first):
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/profile.py", line 269 in _watch
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 864 in run
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x000014b78662c700 (most recent call first):
  File "/home/jon/minicondadai/lib/python3.6/contextlib.py", line 52 in inner
  File "/home/jon/minicondadai/lib/python3.6/site-packages/ucp/continuous_ucx_progress.py", line 81 in _fd_reader_callback
  File "/home/jon/minicondadai/lib/python3.6/asyncio/events.py", line 145 in _run
  File "/home/jon/minicondadai/lib/python3.6/asyncio/base_events.py", line 1462 in _run_once
  File "/home/jon/minicondadai/lib/python3.6/asyncio/base_events.py", line 442 in run_forever
  File "/home/jon/minicondadai/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 149 in start
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/utils.py", line 418 in run_loop
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 864 in run
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x000014b8a3dfe700 (most recent call first):
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 299 in wait
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 551 in wait
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/utils.py", line 336 in sync
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/deploy/cluster.py", line 163 in sync
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/deploy/cluster.py", line 364 in __exit__
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/deploy/spec.py", line 409 in __exit__
  File "/data/jon/h2oai.fullcondatest3/h2oaicore/train.py", line 1428 in predict
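
For anyone who wants to collect a dump like the one above from a hung process, here is a minimal sketch using the standard-library faulthandler module. The helper names (install_dump_handler, dump_all_threads) and the choice of SIGUSR1 are assumptions for illustration, and this is not necessarily how the dump above was produced.

```python
import faulthandler
import signal
import sys


def install_dump_handler(file=None, signum=signal.SIGUSR1):
    """Dump the tracebacks of all threads whenever `signum` arrives (Unix only).

    After calling this early in the program, `kill -USR1 <pid>` from a shell
    makes the process write where every thread currently is.
    """
    if file is None:
        file = sys.stderr
    faulthandler.register(signum, file=file, all_threads=True)


def dump_all_threads(file=None):
    """Write the tracebacks of all threads to `file` immediately."""
    if file is None:
        file = sys.stderr
    faulthandler.dump_traceback(file=file, all_threads=True)
```

The handler has to be installed in each process of interest (the client and every worker), because each one is a separate Python interpreter.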

I run dask_cudf like:

import os
import pandas as pd
os.environ["DASK_RMM__POOL_SIZE"] = "1GB"
os.environ["DASK_UCX__CUDA_COPY"] = "True"  # os.environ needs using strings, not Python True/False
os.environ["DASK_UCX__TCP"] = "True"
os.environ["DASK_UCX__NVLINK"] = "False"
os.environ["DASK_UCX__INFINIBAND"] = "False"

def fit():

    from dask.distributed import Client, wait
    from dask_cuda import LocalCUDACluster

    kwargs = dict(n_workers=None, threads_per_worker=1, processes=True, memory_limit='auto', device_memory_limit=None, CUDA_VISIBLE_DEVICES=None, data=None, local_directory=None, protocol='ucx', enable_tcp_over_ucx=True, enable_infiniband=False, enable_nvlink=False, enable_rdmacm=False, ucx_net_devices='auto', rmm_pool_size='1GB')

    with LocalCUDACluster(**kwargs) as cluster:
        with Client(cluster) as client:

            import xgboost as xgb
            import dask_cudf

            target = "default payment next month"
            Xpd = pd.read_csv("creditcard.csv")
            Xpd = Xpd[['AGE', target]]
            Xpd.to_csv("creditcard.csv")
            X = dask_cudf.read_csv("creditcard.csv")
            y = X[target]
            X = X.drop(target, axis=1)

            kwargs_fit = {}
            kwargs_cudf_fit = kwargs_fit.copy()

            valid_X = dask_cudf.read_csv("creditcard.csv")
            valid_y = valid_X[target]
            valid_X = valid_X.drop(target, axis=1)
            kwargs_cudf_fit['eval_set'] = [(valid_X, valid_y)]

            params = {}  # copy.deepcopy(self.model.get_params())
            params['tree_method'] = 'gpu_hist'

            dask_model = xgb.dask.DaskXGBClassifier(**params)
            dask_model.fit(X, y, verbose=True)

def predict():

    from dask.distributed import Client, wait
    from dask_cuda import LocalCUDACluster

    kwargs = dict(n_workers=None, threads_per_worker=1, processes=True, memory_limit='auto', device_memory_limit=None, CUDA_VISIBLE_DEVICES=None, data=None, local_directory=None, protocol='ucx', enable_tcp_over_ucx=True, enable_infiniband=False, enable_nvlink=False, enable_rdmacm=False, ucx_net_devices='auto', rmm_pool_size='1GB')

    with LocalCUDACluster(**kwargs) as cluster:
        with Client(cluster) as client:

            import xgboost as xgb
            import dask_cudf

            target = "default payment next month"
            Xpd = pd.read_csv("creditcard.csv")
            Xpd = Xpd[['AGE', target]]
            Xpd.to_csv("creditcard.csv")
            X = dask_cudf.read_csv("creditcard.csv")
            y = X[target]
            X = X.drop(target, axis=1)

            kwargs_fit = {}
            kwargs_cudf_fit = kwargs_fit.copy()

            valid_X = dask_cudf.read_csv("creditcard.csv")
            valid_y = valid_X[target]
            valid_X = valid_X.drop(target, axis=1)
            kwargs_cudf_fit['eval_set'] = [(valid_X, valid_y)]

            params = {}  # copy.deepcopy(self.model.get_params())
            params['tree_method'] = 'gpu_hist'

            dask_model = xgb.dask.DaskXGBClassifier(**params)
            dask_model.predict_proba(X)

if __name__ == '__main__':
    model = fit()
    preds = predict()

I'm only giving a schematic of the code. It is not an MRE yet. But, for me here the hang is during the predict.

I'm new to using ucx, and hadn't seen this kind of hang before when using default options.

Any ideas?

Thanks!

quasiben commented 4 years ago

When you run with just TCP do you still see this error ?

pentschev commented 4 years ago

When using LocalCUDACluster, the environment variables you have at the top are redundant; I would suggest removing them and just using the kwargs from LocalCUDACluster. I also see you're using ucx_net_devices="auto" without IB; please refer to this warning: https://github.com/rapidsai/dask-cuda/blob/302d1b8d422dbb981a0e36b1c7f14941cfd80ef7/dask_cuda/local_cuda_cluster.py#L96-L101

I would suggest you try ucx_net_devices="eth0", replacing "eth0" with the Ethernet interface your system is using. Your RMM pool size also looks very small; you're probably better off with a pool that's 90-95% of your GPU's memory, although I'm not sure if xgboost is using RMM or if it'll just compete for memory with Dask-CUDA, in which case you may see other issues. It would also help if you post the output of conda list so we can see what versions of packages you're using.
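
A rough sketch of what that suggestion could look like in code. The helper names, the "eth0" interface and the 16 GB GPU size are placeholders, and rounding the pool down to whole gigabytes is an assumption of this sketch; only the LocalCUDACluster keyword arguments themselves are taken from the code posted above.

```python
def ucx_tcp_cluster_kwargs(interface, gpu_memory_gb, pool_fraction=0.9):
    """Build LocalCUDACluster keyword arguments for UCX over TCP only.

    `interface` is the Ethernet interface name and `gpu_memory_gb` the
    memory of one GPU in GB; both depend on the machine.
    """
    if not 0 < pool_fraction <= 1:
        raise ValueError("pool_fraction must be in (0, 1]")
    pool_gb = int(gpu_memory_gb * pool_fraction)  # round down to whole GB
    return dict(
        protocol="ucx",
        enable_tcp_over_ucx=True,
        enable_nvlink=False,
        enable_infiniband=False,
        ucx_net_devices=interface,     # explicit interface instead of "auto"
        rmm_pool_size=f"{pool_gb}GB",  # sized relative to one GPU
    )


def start_cluster(interface, gpu_memory_gb):
    """Start a cluster with the kwargs above (needs dask-cuda, UCX-Py, a GPU)."""
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    cluster = LocalCUDACluster(**ucx_tcp_cluster_kwargs(interface, gpu_memory_gb))
    return cluster, Client(cluster)
```

For example, ucx_tcp_cluster_kwargs("eth0", 16) asks for a "14GB" pool. If xgboost allocates outside RMM, a pool this large may leave it too little memory, so the fraction may need lowering.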

pseudotensor commented 4 years ago

@pentschev Thanks, I assumed they were redundant, except some didn't match, like cuda_copy one is not in the kwargs explicitly at least.

I was just trying to follow your docs as closely as possible, where there is also such redundancy:

https://dask-cuda.readthedocs.io/en/latest/ucx.html#starting-a-local-cluster-single-node-only

pseudotensor commented 4 years ago

When you run with just TCP do you still see this error ?

No such hangs or errors appear in tcp mode. Note the message is only about ucx.

pseudotensor commented 4 years ago

When using LocalCUDACluster, the environment variables you have at the top are redundant; I would suggest removing them and just using the kwargs from LocalCUDACluster. I also see you're using ucx_net_devices="auto" without IB; please refer to this warning: https://github.com/rapidsai/dask-cuda/blob/302d1b8d422dbb981a0e36b1c7f14941cfd80ef7/dask_cuda/local_cuda_cluster.py#L96-L101

Yes I'm aware of the warning and the choice that default is not ucx in rapids dask_cudf. I'm using conda install with ucx-py with matched versions, however, so I thought it would work. I'm not using infiniband, this is just a simple single node system with 2 GPUs.

I would suggest you try ucx_net_devices="eth0", replacing "eth0" with the Ethernet interface your system is using. Your RMM pool size also looks very small; you're probably better off with a pool that's 90-95% of your GPU's memory, although I'm not sure if xgboost is using RMM or if it'll just compete for memory with Dask-CUDA, in which case you may see other issues.

Yes, the docs were not entirely clear. E.g the link I posted: https://dask-cuda.readthedocs.io/en/latest/ucx.html#starting-a-local-cluster-single-node-only shows 1GB RMM but then passes 24GB RMM pool. I guess I didn't follow if one was per worker and the other was across all workers or what.

I'm also not sure xgboost uses anything related to RMM, but the docs seem to stress that using RMM is required to avoid slowness. I can have it use most of GPU memory, but again it wasn't clear from docs if it was per worker-GPU or per cluster or per node etc.

It would also help if you post the output of conda list so we can see what versions of packages you're using.

Yes, as the setup link shows I'm using rapids 0.14 as I can't quite update to >cuda 10.0 yet. So perhaps some things have been fixed already.

Here are the explicit packages used in my conda environment, which was constructed as a self-consistent conda solution (i.e. no conflicts). Note, however, that rapids 0.14 fails to work properly if one goes by the conda versions alone, as I discovered when trying to use rapids 0.14. E.g. one cannot use new dask with rapids 0.14, despite version limits suggesting one can and conda happily installing it. There are many things like that (critically pandas, numpy, fastavro, fsspec, dask, and distributed). So I had to choose versions that go back to when 0.14 was released by NVIDIA in order to reconstruct what conda would have done then. This leads to essentially all (but a few) rapids tests passing (i.e. cudf, cuml, cugraph, cusignal, rmm, ucx, pyarrow, dask_cuda, etc.). Once I move to rapids 0.15 with cuda 10.2 and python38 I can see if dask/distributed updates help things.

conda_explicit.txt.zip

pseudotensor commented 4 years ago

BTW, it's highly repeatable. I'll have to stop using ucx for now. Do you have any other suggestions apart from RMM? Note that xgboost uses much more than 1GB in some cases, but basically I'm asking specifically how to debug the specific ucx error I see that precedes the hang:

[1603408455.594336] [mr-dl10:15591:0]           sock.c:344  UCX  ERROR recv(fd=145) failed: Connection reset by peer
[1603408455.644012] [mr-dl10:26561:0]          mpool.c:43   UCX  WARN  object 0x14e9deffef40 was not returned to mpool ucp_am_bufs
[1603408455.593751] [mr-dl10:26558:0]          mpool.c:43   UCX  WARN  object 0x151b527fcec0 was not returned to mpool ucp_am_bufs
[1603408455.593766] [mr-dl10:26558:0]          mpool.c:43   UCX  WARN  object 0x151b52ffcf40 was not returned to mpool ucp_am_bufs
[1603408455.593770] [mr-dl10:26558:0]          mpool.c:43   UCX  WARN  object 0x151b537fcfc0 was not returned to mpool ucp_am_bufs

However, I also see times when those messages are shown, but there was no hang or issue, even the ERROR one.
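
To see which process each message came from, the line prefix can be parsed. This is a minimal sketch, assuming the second bracket is host:pid:thread (my reading of the format, not checked against the UCX documentation); the function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: [timestamp] [host:pid:tid] source.c:line UCX LEVEL message
UCX_LINE = re.compile(
    r"^\[(?P<timestamp>\d+\.\d+)\]\s+"
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):(?P<tid>\d+)\]\s+"
    r"(?P<source>\S+):(?P<lineno>\d+)\s+"
    r"UCX\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)


def parse_ucx_line(line):
    """Return the fields of one UCX log line as a dict, or None if it does not match."""
    match = UCX_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_by_pid_and_level(lines):
    """Count messages per (pid, level), skipping lines that are not UCX log lines."""
    counts = Counter()
    for line in lines:
        record = parse_ucx_line(line)
        if record is not None:
            counts[(int(record["pid"]), record["level"])] += 1
    return counts
```

On the five lines above, the ERROR comes from PID 15591 while the warnings come from PIDs 26558 and 26561, so matching those PIDs against the client and worker processes would show which side saw the connection reset.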

quasiben commented 4 years ago

@pseudotensor I rewrote your fit example and ran it with both 0.15 and the latest nightly, and didn't see any errors like the one reported:

def fit():
    from dask.distributed import Client, wait
    from dask_cuda import LocalCUDACluster
    import pandas as pd

    with LocalCUDACluster(CUDA_VISIBLE_DEVICES=[0,1],
                          protocol="ucx",
                          enable_nvlink=True,
                          device_memory_limit="28GB",
                          rmm_pool_size="30GB") as cluster:
        with Client(cluster) as client:

            import xgboost as xgb
            import dask_cudf

            target = "default payment next month"
            Xpd = pd.read_csv("creditcard.csv")
            Xpd = Xpd[['AGE', target]]
            Xpd.to_csv("creditcard.csv")
            X = dask_cudf.read_csv("creditcard.csv")
            y = X[target]
            X = X.drop(target, axis=1)

            kwargs_fit = {}
            kwargs_cudf_fit = kwargs_fit.copy()

            valid_X = dask_cudf.read_csv("creditcard.csv")
            valid_y = valid_X[target]
            valid_X = valid_X.drop(target, axis=1)
            kwargs_cudf_fit['eval_set'] = [(valid_X, valid_y)]

            params = {}  # copy.deepcopy(self.model.get_params())
            params['tree_method'] = 'gpu_hist'

            dask_model = xgb.dask.DaskXGBClassifier(**params)
            dask_model.fit(X, y, verbose=True)

jakirkham commented 4 years ago

Guessing it is mainly the small initial pool size that results in issues. A larger initial pool size likely avoids it.

pentschev commented 4 years ago

@pentschev Thanks, I assumed they were redundant, except some didn't match, like cuda_copy one is not in the kwargs explicitly at least.

Yes, cuda_copy is implicit when you enable any of the UCX transports via enable_*, as this is required in all cases.

I was just trying to follow your docs as closely as possible, where there is also such redundancy:

https://dask-cuda.readthedocs.io/en/latest/ucx.html#starting-a-local-cluster-single-node-only

I can totally understand the confusion, I think this is a bit outdated now and at the time we weren't sure ourselves if this was necessary, as we tried to state with this sentence in that page: "One possible exception is DASK_RMM__POOL_SIZE, at this time it’s unclear whether this is necessary or not, but using that should not cause any issues nevertheless.".

Yes I'm aware of the warning and the choice that default is not ucx in rapids dask_cudf. I'm using conda install with ucx-py with matched versions, however, so I thought it would work. I'm not using infiniband, this is just a simple single node system with 2 GPUs.

In that case you should not use ucx_net_devices="auto", but instead pass your Ethernet interface. If you try that, does that change anything for you?

Yes, the docs were not entirely clear. E.g the link I posted: https://dask-cuda.readthedocs.io/en/latest/ucx.html#starting-a-local-cluster-single-node-only shows 1GB RMM but then passes 24GB RMM pool. I guess I didn't follow if one was per worker and the other was across all workers or what.

In the doc, 1GB is meant specifically for the client process only, whereas 24GB will be passed down to each worker, and that 24GB is per worker (GPU). Unfortunately, we can't control the client's memory pool in any other way, as we do for the workers in LocalCUDACluster.
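
To make the per-process accounting concrete, here is a small sketch that adds up the pools described above. The function names are hypothetical, and treating "GB" as 10**9 bytes is an assumption of this sketch.

```python
import re

# Decimal units are an assumption of this sketch.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def to_bytes(size):
    """Convert a string such as "24GB" to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", size.upper())
    if match is None:
        raise ValueError(f"cannot parse size: {size!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def rmm_pool_footprint(client_pool, worker_pool, n_workers):
    """Total bytes reserved: one pool for the client plus one pool per worker."""
    return to_bytes(client_pool) + n_workers * to_bytes(worker_pool)
```

With the values from the doc and two GPUs, rmm_pool_footprint("1GB", "24GB", 2) gives 49 * 10**9 bytes: 24GB on each worker plus 1GB for the client.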

I'm also not sure xgboost uses anything related to RMM, but the docs seem to stress that using RMM is required to avoid slowness. I can have it use most of GPU memory, but again it wasn't clear from docs if it was per worker-GPU or per cluster or per node etc.

Using a memory pool is indeed very important for performance in RAPIDS as a whole, and even more for UCX-Py. However, I don't know if there's a way you can balance xgboost memory usage with that of RAPIDS/Dask.

Here are my explicit packages used in my conda environment that was constructed as a self-consistent conda solution (i.e. no conflicts). Note, however, that rapids 0.14 fails to work properly if go by the conda versions alone, as I discovered when trying to use rapids 0.14. E.g. one cannot use new dask with rapids 0.14, despite version limits suggesting one can and conda happily installing it. There are many things like that (critically pandas, numpy, fastavro, fsspec, dask, and distributed). So I had to choose versions that go back to when 0.14 was released by NVIDIA in order to reconstruct what conda would have done then. This leads to essentially (all but a few) rapids tests passing (i.e. cudf, cuml, cugraph, cusignal, rmm, ucx, pyarrow, dask_cuda, etc.). Once I move to rapids 0.15 with cuda 10.2 and python38 I can see if dask/distributed updates help things

Yes, unfortunately we can't pin maximum versions of those libraries, as we never know when a fix will be added to them for some bug that wasn't known at the time of the RAPIDS release. We only test stable and development versions, meaning that users who can't upgrade to one of those versions are unfortunately out of luck, as we no longer check whether there are non-backwards-compatible changes. Note that RAPIDS 0.16 was released a couple of days ago, so I suggest going to stable and not 0.15 (which is now legacy).

BTW, it's highly repeatable. I'll have to stop using ucx for now. Do you have any other suggestions apart from RMM? Note that xgboost using much more than 1GB in some cases, but basically I'm asking specifically for how to debug the specific ucx error I see that precedes the hang:
[1603408455.594336] [mr-dl10:15591:0]           sock.c:344  UCX  ERROR recv(fd=145) failed: Connection reset by peer
[1603408455.644012] [mr-dl10:26561:0]          mpool.c:43   UCX  WARN  object 0x14e9deffef40 was not returned to mpool ucp_am_bufs
[1603408455.593751] [mr-dl10:26558:0]          mpool.c:43   UCX  WARN  object 0x151b527fcec0 was not returned to mpool ucp_am_bufs
[1603408455.593766] [mr-dl10:26558:0]          mpool.c:43   UCX  WARN  object 0x151b52ffcf40 was not returned to mpool ucp_am_bufs
[1603408455.593770] [mr-dl10:26558:0]          mpool.c:43   UCX  WARN  object 0x151b537fcfc0 was not returned to mpool ucp_am_bufs
However, I also see times when those messages are shown, but there was no hang or issue, even the ERROR one.

The warnings suggest this is at the very end and the error just happens at exit time, is that correct? If that's the case, I think this is related to a bug in dask/distributed that I was able to identify yesterday: https://github.com/dask/distributed/issues/4181 . Other than that, I would like to know whether using the Ethernet interface instead of ucx_net_devices="auto" changes anything for you. Apart from that I don't see anything wrong; maybe https://github.com/rapidsai/ucx-py/issues/655#issuecomment-714855775 could also be something to test, just to see if there's any difference from your original code.

quasiben commented 4 years ago

It's also worth noting that as of 1.20 (I think) RMM should be available as an allocator in XGBoost: https://github.com/dmlc/xgboost/pull/5873

pseudotensor commented 4 years ago

Thanks, I'll try specifying the ethernet device, this is just a bit awkward to have to do when only using a single node. ucx would be better if it worked on single node without having to specify this.

Yes, I'm not sure about RMM vs. xgboost. I'm using 1.2 lately, but I'm not sure what it means to use the RMM pool. Is it always using that pool? Does the user have to do something? It's not clear to me.

Thanks for notice on rapids 0.16 being released, definitely will jump to that instead.

jakirkham commented 4 years ago

Yes, I'm not sure about RMM vs. xgboost. I'm using 1.2 lately, but I'm not sure what it means to use the RMM pool. Is it always using that pool? Does the user have to do something? It's not clear to me.

I'm not totally sure either. Pinged an XGBoost dev offline, but it looks like he may be out today.

pentschev commented 4 years ago

Thanks, I'll try specifying the ethernet device, this is just a bit awkward to have to do when only using a single node. ucx would be better if it worked on single node without having to specify this.

You normally don't need to, I suggested it because you had ucx_net_devices already, which is likely because of the example you based your code on. Unless you have multiple ethernet devices that are connected, you're likely fine just removing that entirely.

jakirkham commented 4 years ago

Yes, I'm not sure about RMM vs. xgboost. I'm using 1.2 lately, but I'm not sure what it means to use the RMM pool. Is it always using that pool? Does the user have to do something? It's not clear to me.

I'm not totally sure either. Pinged an XGBoost dev offline, but it looks like he may be out today.

Following up on this, so this should already be enabled in 0.15. This should just work automatically. Though it does require that one is getting xgboost from the rapidsai Conda channel. So may be worth rechecking how xgboost was installed.

pseudotensor commented 4 years ago

Yes, I'm not sure about RMM vs. xgboost. I'm using 1.2 lately, but I'm not sure what it means to use the RMM pool. Is it always using that pool? Does the user have to do something? It's not clear to me.

I'm not totally sure either. Pinged an XGBoost dev offline, but it looks like he may be out today.

Following up on this, so this should already be enabled in 0.15. This should just work automatically. Though it does require that one is getting xgboost from the rapidsai Conda channel. So may be worth rechecking how xgboost was installed.

Thanks, I had been wondering about the rapids-xgboost and if/how different from xgboost. Is there some reason why the changes nvidia makes are not just already in dmlc? It's confusing, but also I thought many people at nvidia are making changes to dmlc directly anyways.

trivialfis commented 4 years ago

@pseudotensor

I had been wondering about the rapids-xgboost and if/how different from xgboost

The difference is in how it's built. On rapids channel the CMake flag USE_RMM is enabled but disabled on dmlc binary release. I don't think there's any other difference.

Hanging is usually the result of another process aborting. A segfault leads to a restart of the worker, and XGBoost will then try to establish the allreduce tree/ring with the rest of the workers, which is impossible because the others are in different states.

I'm confused by this issue. So:

On TCP socket it works fine but not on UCX.
Specifying a large RMM pool works even with UCX.
In the cross-referenced xgboost issue https://github.com/dmlc/xgboost/issues/6279 , there is a backtrace suggesting xgboost has an invalid memory access.

Is this a fair summary?
