Open pseudotensor opened 4 years ago
When you run with just TCP do you still see this error ?
When using LocalCUDACluster
, the environment variables you have at the top are redundant, I would suggest removing them and just using the kwargs from LocalCUDACluster
. I also see you're using ucx_net_devices="auto"
without IB, please see refer to this warning: https://github.com/rapidsai/dask-cuda/blob/302d1b8d422dbb981a0e36b1c7f14941cfd80ef7/dask_cuda/local_cuda_cluster.py#L96-L101
I would suggest you try ucx_net_devices="eth0"
, replacing "eth0"
to the Ethernet interface your system is using. Your RMM pool size looks also very small, you're probably better off with a pool that's 90-95% of your GPU's memory, although I'm not sure if xgboost is using RMM or if it'll just compete for memory with Dask-CUDA, in that case you may see other issues. It would also help if you post the output of conda list
so we can see what versions of packages you're using.
@pentschev Thanks, I assumed they were redundant, except some didn't match, like cuda_copy one is not in the kwargs explicitly at least.
I was just trying to follow your docs as closely as possible, where there is also such redundancy:
https://dask-cuda.readthedocs.io/en/latest/ucx.html#starting-a-local-cluster-single-node-only
When you run with just TCP do you still see this error ?
No such hangs or errors appear in tcp mode. Note the message is only about ucx.
When using
LocalCUDACluster
, the environment variables you have at the top are redundant, I would suggest removing them and just using the kwargs fromLocalCUDACluster
. I also see you're usingucx_net_devices="auto"
without IB, please see refer to this warning: https://github.com/rapidsai/dask-cuda/blob/302d1b8d422dbb981a0e36b1c7f14941cfd80ef7/dask_cuda/local_cuda_cluster.py#L96-L101
Yes I'm aware of the warning and the choice that default is not ucx in rapids dask_cudf. I'm using conda install with ucx-py with matched versions, however, so I thought it would work. I'm not using infiniband, this is just a simple single node system with 2 GPUs.
I would suggest you try
ucx_net_devices="eth0"
, replacing"eth0"
to the Ethernet interface your system is using. Your RMM pool size looks also very small, you're probably better off with a pool that's 90-95% of your GPU's memory, although I'm not sure if xgboost is using RMM or if it'll just compete for memory with Dask-CUDA, in that case you may see other issues.
Yes, the docs were not entirely clear. E.g the link I posted: https://dask-cuda.readthedocs.io/en/latest/ucx.html#starting-a-local-cluster-single-node-only shows 1GB RMM but then passes 24GB RMM pool. I guess I didn't follow if one was per worker and the other was across all workers or what.
I'm also not sure xgboost uses anything related to RMM, but the docs seem to stress that using RMM is required to avoid slowness. I can have it use most of GPU memory, but again it wasn't clear from docs if it was per worker-GPU or per cluster or per node etc.
It would also help if you post the output of
conda list
so we can see what versions of packages you're using.
Yes, as the setup link shows I'm using rapids 0.14 as I can't quite update to >cuda 10.0 yet. So perhaps some things have been fixed already.
Here are my explicit packages used in my conda environment that was constructed as a self-consistent conda solution (i.e. no conflicts). Note, however, that rapids 0.14 fails to work properly if go by the conda versions alone, as I discovered when trying to use rapids 0.14. E.g. one cannot use new dask with rapids 0.14, despite version limits suggesting one can and conda happily installing it. There are many things like that (critically pandas, numpy, fastavro, fsspec, dask, and distributed). So I had to choose versions that go back to when 0.14 was released by NVIDIA in order to reconstruct what conda would have done then. This leads to essentially (all but a few) rapids tests passing (i.e. cudf, cuml, cugraph, cusignal, rmm, ucx, pyarrow, dask_cuda, etc.). Once I move to rapids 0.15 with cuda 10.2 and python38 I can see if dask/distributed updates help things
BTW, it's highly repeatable. I'll have to stop using ucx for now. Do you have any other suggestions apart from RMM? Note that xgboost using much more than 1GB in some cases, but basically I'm asking specifically for how to debug the specific ucx error I see that precedes the hang:
[1603408455.594336] [mr-dl10:15591:0] sock.c:344 UCX ERROR recv(fd=145) failed: Connection reset by peer
[1603408455.644012] [mr-dl10:26561:0] mpool.c:43 UCX WARN object 0x14e9deffef40 was not returned to mpool ucp_am_bufs
[1603408455.593751] [mr-dl10:26558:0] mpool.c:43 UCX WARN object 0x151b527fcec0 was not returned to mpool ucp_am_bufs
[1603408455.593766] [mr-dl10:26558:0] mpool.c:43 UCX WARN object 0x151b52ffcf40 was not returned to mpool ucp_am_bufs
[1603408455.593770] [mr-dl10:26558:0] mpool.c:43 UCX WARN object 0x151b537fcfc0 was not returned to mpool ucp_am_bufs
However, I also see times when those messages are shown, but there was no hang or issue, even the ERROR one.
@pseudotensor i re-wrote your fit example and ran with both 0.15 and latest nightly and didn't see any errors like the one reported:
def fit():
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import pandas as pd
with LocalCUDACluster(CUDA_VISIBLE_DEVICES=[0,1],
protocol="ucx",
enable_nvlink=True,
device_memory_limit="28GB",
rmm_pool_size="30GB") as cluster:
with Client(cluster) as client:
import xgboost as xgb
import dask_cudf
target = "default payment next month"
Xpd = pd.read_csv("creditcard.csv")
Xpd = Xpd[['AGE', target]]
Xpd.to_csv("creditcard.csv")
X = dask_cudf.read_csv("creditcard.csv")
y = X[target]
X = X.drop(target, axis=1)
kwargs_fit = {}
kwargs_cudf_fit = kwargs_fit.copy()
valid_X = dask_cudf.read_csv("creditcard.csv")
valid_y = valid_X[target]
valid_X = valid_X.drop(target, axis=1)
kwargs_cudf_fit['eval_set'] = [(valid_X, valid_y)]
params = {} # copy.deepcopy(self.model.get_params())
params['tree_method'] = 'gpu_hist'
dask_model = xgb.dask.DaskXGBClassifier(**params)
dask_model.fit(X, y, verbose=True)
Guessing it is mainly the small initial pool sizes that results in issues. A larger initial pool size likely avoids it.
@pentschev Thanks, I assumed they were redundant, except some didn't match, like cuda_copy one is not in the kwargs explicitly at least.
Yes, cuda_copy
is implicit when you enable any of the UCX transports via `enable_*, as this is required in all cases.
I was just trying to follow your docs as closely as possible, where there is also such redundancy:
https://dask-cuda.readthedocs.io/en/latest/ucx.html#starting-a-local-cluster-single-node-only
I can totally understand the confusion, I think this is a bit outdated now and at the time we weren't sure ourselves if this was necessary, as we tried to state with this sentence in that page: "One possible exception is DASK_RMM__POOL_SIZE, at this time it’s unclear whether this is necessary or not, but using that should not cause any issues nevertheless.".
Yes I'm aware of the warning and the choice that default is not ucx in rapids dask_cudf. I'm using conda install with ucx-py with matched versions, however, so I thought it would work. I'm not using infiniband, this is just a simple single node system with 2 GPUs.
In that case you should use should not use ucx_net_devices="auto"
, but instead pass your Ethernet interface. If you try that, does that change anything for you?
Yes, the docs were not entirely clear. E.g the link I posted: https://dask-cuda.readthedocs.io/en/latest/ucx.html#starting-a-local-cluster-single-node-only shows 1GB RMM but then passes 24GB RMM pool. I guess I didn't follow if one was per worker and the other was across all workers or what.
In the doc, 1GB is meant specifically for the client process only, whereas 24GB will be passed down to each worker, and that 24GB is per worker (GPU). Unfortunately, we can't control the client's memory pool in any other way, as we do for the workers in LocalCUDACluster
.
I'm also not sure xgboost uses anything related to RMM, but the docs seem to stress that using RMM is required to avoid slowness. I can have it use most of GPU memory, but again it wasn't clear from docs if it was per worker-GPU or per cluster or per node etc.
Using a memory pool is indeed very important for performance in RAPIDS as a whole, and even more for UCX-Py. However, I don't know if there's a way you can balance xgboost memory usage with that of RAPIDS/Dask.
Here are my explicit packages used in my conda environment that was constructed as a self-consistent conda solution (i.e. no conflicts). Note, however, that rapids 0.14 fails to work properly if go by the conda versions alone, as I discovered when trying to use rapids 0.14. E.g. one cannot use new dask with rapids 0.14, despite version limits suggesting one can and conda happily installing it. There are many things like that (critically pandas, numpy, fastavro, fsspec, dask, and distributed). So I had to choose versions that go back to when 0.14 was released by NVIDIA in order to reconstruct what conda would have done then. This leads to essentially (all but a few) rapids tests passing (i.e. cudf, cuml, cugraph, cusignal, rmm, ucx, pyarrow, dask_cuda, etc.). Once I move to rapids 0.15 with cuda 10.2 and python38 I can see if dask/distributed updates help things
Yes, unfortunately we can't pin maximum versions of those libraries as we never know when a fix will be added to those libraries for some bug that wasn't known at the time of RAPIDS release. We only test stable and development versions, meaning that users that can't upgrade to one of those versions are unfortunately out of luck, as we don't anymore check whether there are non-backwards compatible changes. Note that RAPIDS 0.16 was released a couple days ago, so I suggest going to stable and not the 0.15 (which is now legacy).
BTW, it's highly repeatable. I'll have to stop using ucx for now. Do you have any other suggestions apart from RMM? Note that xgboost using much more than 1GB in some cases, but basically I'm asking specifically for how to debug the specific ucx error I see that precedes the hang:
[1603408455.594336] [mr-dl10:15591:0] sock.c:344 UCX ERROR recv(fd=145) failed: Connection reset by peer [1603408455.644012] [mr-dl10:26561:0] mpool.c:43 UCX WARN object 0x14e9deffef40 was not returned to mpool ucp_am_bufs [1603408455.593751] [mr-dl10:26558:0] mpool.c:43 UCX WARN object 0x151b527fcec0 was not returned to mpool ucp_am_bufs [1603408455.593766] [mr-dl10:26558:0] mpool.c:43 UCX WARN object 0x151b52ffcf40 was not returned to mpool ucp_am_bufs [1603408455.593770] [mr-dl10:26558:0] mpool.c:43 UCX WARN object 0x151b537fcfc0 was not returned to mpool ucp_am_bufs
However, I also see times when those messages are shown, but there was no hang or issue, even the ERROR one.
The warnings suggest this is at the very end and the error just happens at exit time, is that correct? If that's the case, I think this is related to a bug in dask/distributed that I was able to identify yesterday: https://github.com/dask/distributed/issues/4181 . Other than that, I would like to know whether using the Ethernet interface instead of ucx_net_devices="auto"
changes anything for you, but apart from that I don't see anything wrong, maybe https://github.com/rapidsai/ucx-py/issues/655#issuecomment-714855775 could also be something you could test just to see if there's any difference from your original code.
It's also worth noting that as of 1.20 (I think) RMM should be available as an allocator in XGBoost: https://github.com/dmlc/xgboost/pull/5873
Thanks, I'll try specifying the ethernet device, this is just a bit awkward to have to do when only using a single node. ucx would be better if it worked on single node without having to specify this.
Yes, I'm not sure about RMM vs. xgboost. I'm using 1.2 lately, but I'm not sure what it means to use the RMM pool. Is it always using that pool? Does the user have to do something? It's not clear to me.
Thanks for notice on rapids 0.16 being released, definitely will jump to that instead.
Yes, I'm not sure about RMM vs. xgboost. I'm using 1.2 lately, but I'm not sure what it means to use the RMM pool. Is it always using that pool? Does the user have to do something? It's not clear to me.
I'm not totally sure either. Pinged an XGBoost dev offline, but it looks like he may be out today.
Thanks, I'll try specifying the ethernet device, this is just a bit awkward to have to do when only using a single node. ucx would be better if it worked on single node without having to specify this.
You normally don't need to, I suggested it because you had ucx_net_devices
already, which is likely because of the example you based your code on. Unless you have multiple ethernet devices that are connected, you're likely fine just removing that entirely.
Yes, I'm not sure about RMM vs. xgboost. I'm using 1.2 lately, but I'm not sure what it means to use the RMM pool. Is it always using that pool? Does the user have to do something? It's not clear to me.
I'm not totally sure either. Pinged an XGBoost dev offline, but it looks like he may be out today.
Following up on this, so this should already be enabled in 0.15. This should just work automatically. Though it does require that one is getting xgboost
from the rapidsai
Conda channel. So may be worth rechecking how xgboost
was installed.
Yes, I'm not sure about RMM vs. xgboost. I'm using 1.2 lately, but I'm not sure what it means to use the RMM pool. Is it always using that pool? Does the user have to do something? It's not clear to me.
I'm not totally sure either. Pinged an XGBoost dev offline, but it looks like he may be out today.
Following up on this, so this should already be enabled in 0.15. This should just work automatically. Though it does require that one is getting
xgboost
from therapidsai
Conda channel. So may be worth rechecking howxgboost
was installed.
Thanks, I had been wondering about the rapids-xgboost and if/how different from xgboost. Is there some reason why the changes nvidia makes are not just already in dmlc? It's confusing, but also I thought many people at nvidia are making changes to dmlc directly anyways.
@pseudotensor
I had been wondering about the rapids-xgboost and if/how different from xgboost
The difference is in how it's built. On rapids channel the CMake flag USE_RMM
is enabled but disabled on dmlc binary release. I don't think there's any other difference.
Hanging is usually the result of another abortion. Segfault leads to restart of worker and XGBoost will try to establish the allreduce tree/ring with rest of the workers, which is impossible as others are in different states.
I'm confused by this issue. So:
Is this a fair summary?
See for setup details: https://github.com/dmlc/xgboost/issues/6232
Last thing in logs before hang is:
If I poke the faulthandler in python to see where things are hung for a process using 100% of 1 core, I see:
I run dask_cudf like:
I'm only giving a schematic of the code. It is not an MRE yet. But, for me here the hang is during the predict.
I'm new to using ucx, and hadn't seen this kind of hang before when using default options.
Any ideas?
Thanks!