Open · beckernick opened this issue 3 years ago
As a baseline, we ran every query 5 times on a standard cluster of 8 GPUs on a DGX-2, with a 15GB device memory limit and a 30GB RMM pool.
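For reference, that setup corresponds roughly to a dask-cuda cluster started along the following lines (a sketch only; the benchmark uses its own cluster startup scripts, so the exact flags may differ):
from dask_cuda import LocalCUDACluster
from distributed import Client

# Sketch of the baseline configuration described above: 8 GPUs,
# a 15GB device memory (spill) limit per worker, and a 30GB RMM pool.
cluster = LocalCUDACluster(
    n_workers=8,
    device_memory_limit="15GB",
    rmm_pool_size="30GB",
)
client = Client(cluster)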
With DASK_JIT_UNSPILL=True, Q02 hit memory issues in some runs with UCX, and in other runs with TCP it raised the following:
QUERY=02; cd queries/q$QUERY; python tpcx_bb_query_$QUERY\.py --config_file ../../benchmark_runner/benchmark_config.yaml ; cd ../../
Using default arguments
{
"type": "Scheduler",
"id": "Scheduler-d57bdb26-1dd2-4478-82f7-92fb51a39c09",
"address": "tcp://10.33.228.70:8786",
"services": {
"dashboard": 8787
},
"started": 1611349272.8086827,
"workers": {}
}
Connected!
Encountered Exception while running query
Traceback (most recent call last):
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/xbb_tools/utils.py", line 280, in run_dask_cudf_query
config=config,
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/xbb_tools/utils.py", line 61, in benchmark
result = func(*args, **kwargs)
File "tpcx_bb_query_02.py", line 143, in main
result_df = result_df.head(q02_limit)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/dask/dataframe/core.py", line 1036, in head
return self._head(n=n, npartitions=npartitions, compute=compute, safe=True)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/dask/dataframe/core.py", line 1069, in _head
result = result.compute()
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/dask/base.py", line 279, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/dask/base.py", line 561, in compute
results = schedule(dsk, keys, **kwargs)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/distributed/client.py", line 2681, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/distributed/client.py", line 1996, in gather
asynchronous=asynchronous,
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/distributed/client.py", line 837, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/distributed/utils.py", line 340, in sync
raise exc.with_traceback(tb)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/distributed/utils.py", line 324, in f
result[0] = yield future
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/distributed/client.py", line 1855, in _gather
raise exception.with_traceback(traceback)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/dask/dataframe/shuffle.py", line 1162, in shuffle_group
ind = hash_object_dispatch(df[cols] if cols else df, index=False)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/pandas/core/series.py", line 906, in __getitem__
return self._get_with(key)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/pandas/core/series.py", line 946, in _get_with
return self.loc[key]
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/pandas/core/indexing.py", line 879, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/pandas/core/indexing.py", line 1099, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/pandas/core/indexing.py", line 1037, in _getitem_iterable
keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/pandas/core/indexing.py", line 1254, in _get_listlike_indexer
self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122/lib/python3.7/site-packages/pandas/core/indexing.py", line 1298, in _validate_read_indexer
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['wcs_user_sk'], dtype='object')] are in the [index]"
conda list | grep "rapids\|blazing\|dask\|distr\|pandas"
# packages in environment at /raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210122:
blazingsql 0.18.0a0 pypi_0 pypi
cudf 0.18.0a210122 cuda_10.2_py37_g6c116e382f_191 rapidsai-nightly
cuml 0.18.0a210122 cuda10.2_py37_g29f7d08e9_82 rapidsai-nightly
dask 2021.1.0 pyhd8ed1ab_0 conda-forge
dask-core 2021.1.0 pyhd8ed1ab_0 conda-forge
dask-cuda 0.18.0a201211 py37_39 http://conda-mirror.gpuci.io/rapidsai-nightly
dask-cudf 0.18.0a210122 py37_g6c116e382f_191 http://conda-mirror.gpuci.io/rapidsai-nightly
distributed 2021.1.0 py37h89c1867_1 conda-forge
faiss-proc 1.0.0 cuda http://conda-mirror.gpuci.io/rapidsai-nightly
libcudf 0.18.0a210122 cuda10.2_g6c116e382f_191 rapidsai-nightly
libcuml 0.18.0a210122 cuda10.2_g29f7d08e9_82 rapidsai-nightly
libcumlprims 0.18.0a201203 cuda10.2_gff080f3_0 http://conda-mirror.gpuci.io/rapidsai-nightly
librmm 0.18.0a210122 cuda10.2_g1502058_24 rapidsai-nightly
pandas 1.1.5 py37hdc94413_0 conda-forge
rmm 0.18.0a210122 cuda_10.2_py37_g1502058_24 http://conda-mirror.gpuci.io/rapidsai-nightly
ucx 1.9.0+gcd9efd3 cuda10.2_0 http://conda-mirror.gpuci.io/rapidsai-nightly
ucx-proc 1.0.0 gpu http://conda-mirror.gpuci.io/rapidsai-nightly
ucx-py 0.18.0a210122 py37_gcd9efd3_10 http://conda-mirror.gpuci.io/rapidsai-nightly
Sorry for the late reply, I wasn't aware of this issue :/
For some reason rapidsai-nightly contains an old version of dask-cuda (0.18.0a201211), so setting DASK_JIT_UNSPILL=True uses the old JIT spilling from last year.
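For anyone checking their own environment, the quickest way to see which implementation DASK_JIT_UNSPILL will pick up is to look at the installed dask-cuda build (the version strings here mirror the conda list output above):
import dask_cuda

# Builds dated 0.18.0a2012xx (December 2020) still carry the old JIT spilling;
# the January 2021 nightlies (0.18.0a2101xx) include the new implementation.
print(dask_cuda.__version__)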
Having said that, I am debugging a deadlock with the new JIT spilling that Q02 triggers when the device limit is 15GB (as opposed to the 20GB, which I have been using when testing). Will let you know when I have a fix.
CC: @ChrisJar for awareness.
Having said that, I am debugging a deadlock with the new JIT spilling that Q02 triggers when the device limit is 15GB (as opposed to the 20GB, which I have been using when testing). Will let you know when I have a fix.
The deadlock issue should be fixed in the latest version of dask-cuda: https://github.com/rapidsai/dask-cuda/pull/501
This functionality is now optionally available in nightlies, and we should evaluate how this affects performance (particularly with UCX).
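For that evaluation, a minimal sketch of enabling the new JIT unspilling together with UCX on a LocalCUDACluster might look as follows (parameter names as in recent dask-cuda; the benchmark's own cluster scripts may expose them differently):
from dask_cuda import LocalCUDACluster
from distributed import Client

# Sketch: new JIT unspilling plus UCX, for the TCP vs. UCX comparison above.
cluster = LocalCUDACluster(
    protocol="ucx",             # UCX instead of TCP
    enable_tcp_over_ucx=True,
    enable_nvlink=True,         # NVLink transfers between GPUs on the DGX-2
    jit_unspill=True,           # same effect as DASK_JIT_UNSPILL=True
    device_memory_limit="15GB",
    rmm_pool_size="30GB",
)
client = Client(cluster)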