Note that running with `compute-sanitizer --tool initcheck` we noticed various places where bitmask-related code is not initcheck-clean, but all the ones we saw are, I think, benign (uninitialized data are read and then masked off). I was not able to catch the bug in the act with initcheck (it's just too noisy; see discussion in #12667). The code is memcheck-clean.
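For context, the benign pattern that trips initcheck looks roughly like the following (a minimal sketch with a hypothetical helper name, not actual cudf code): a whole bitmask word is read even though some of its bits were never written, and the uninitialized bits are masked off before they can influence the result.

```cuda
// Minimal sketch (hypothetical, not cudf code) of a benign initcheck report:
// read a whole 32-bit bitmask word, then mask away any bits that were never
// initialized before using the value.
__device__ int count_valid_bits_in_word(unsigned const* mask,
                                        int num_bits,
                                        int word_index)
{
  unsigned const word = mask[word_index];  // may read uninitialized slack bits
  // Number of meaningful bits in this word (32 except in a partial last word).
  int const bits_in_word = (word_index == num_bits / 32) ? num_bits % 32 : 32;
  unsigned const keep    = (bits_in_word == 32) ? ~0u : ((1u << bits_in_word) - 1u);
  return __popc(word & keep);  // uninitialized bits cannot affect the result
}
```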
Here is an updated reproducer that (for me) consistently fails after two iterations (with the pool allocator off) and four iterations (with the pool allocator on). There is no randomness in the input data here.

Note that the dask computation is doing a `cumsum`, whereas the things we are comparing are doing a `cumcount`. I've also modified things such that the check for correctness first verifies that the NA values match up in the cudf and pandas results, and then checks for equality of the dataframes. It is always the NA-values check that fires.
```python
from itertools import count

import cudf
import dask_cudf
import numpy as np
import rmm
from cudf.testing._utils import assert_eq

if __name__ == "__main__":
    rmm.reinitialize(pool_allocator=False)
    for i in count():
        size = 10_000
        gdf_original = cudf.DataFrame(
            {
                "xx": np.arange(size, dtype="int32") % 5,
                "x": np.zeros(size, dtype="int32"),
                "y": np.zeros(size, dtype="int32"),
            },
        )
        # Insert nulls into the key column in alternating positions
        # (deterministic, no randomness).
        gdf_original["xx"] = gdf_original.xx.mask(
            (np.arange(size, dtype="int32") % 2).astype("bool")
        )
        pdf = gdf_original.to_pandas(nullable=False)
        gdf = gdf_original
        assert_eq(gdf, pdf)
        gdf_grouped = gdf.groupby("xx")
        cudf_result = gdf_grouped.cumcount()
        # Although we don't look at the data, this seems pretty
        # crucial to provoking the issue.
        # Notice this never touches the gdf_original data! And the
        # computation is _done_ by the time we're here, so one
        # suspicion is that there is garbage left lying around for the
        # next iteration.
        ddf = dask_cudf.from_cudf(cudf.from_pandas(pdf), npartitions=5).persist()
        ddf_grouped = ddf.groupby("xx")
        dask_cudf_result = ddf_grouped.cumsum().compute(scheduler="sync")
        pandas_result = pdf.groupby("xx").cumcount()
        print(i)
        # This occasionally fails
        cudf_na = cudf_result.isna().values_host
        pandas_na = pandas_result.isna().values
        difference = np.where(cudf_na != pandas_na)
        assert (cudf_na == pandas_na).all(), difference
        # Bug in cumcount return value name
        cudf_result.name = None
        assert_eq(cudf_result, pandas_result)
        # This was for checking when we removed the reindex call in
        # groupby._mimic_pandas_order:
        # mask = ~pdf.xx.isna()
        # pandas_result_no_nulls = pandas_result.loc[mask]
        # assert_eq(cudf_result.sort_index(), pandas_result_no_nulls, check_dtype=False)
```
Here's the output for me:
```
ipython --pdb bug.py
0
1
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
File ~/doodles/python/dask-cudf/bug.py:47
     45 pandas_na = pandas_result.isna().values
     46 difference = np.where(cudf_na != pandas_na)
---> 47 assert (cudf_na == pandas_na).all(), difference
     49 # Bug in cumcount return value name
     50 cudf_result.name = None

AssertionError: (array([9924, 9934, 9944, 9954, 9964, 9974, 9984, 9994]),)
> /home/wence/Documents/src/rapids/doodles/python/dask-cudf/bug.py(47)<module>()
     45 pandas_na = pandas_result.isna().values
     46 difference = np.where(cudf_na != pandas_na)
---> 47 assert (cudf_na == pandas_na).all(), difference
     48
     49 # Bug in cumcount return value name

ipdb>
```
It always seems to be the same values that are different.
From some investigations by others, it seems this issue might be somewhat hardware-specific; here's the output of `print_env.sh`:
**git*** commit 87a8ede8dcd9b6cd6e38c41f74daec316f48e7db (HEAD -> branch-23.08, upstream/branch-23.08) Author: Robert MaynardDate: Tue May 30 16:47:59 2023 -0400 Ensure cccl packages don't clash with upstream version (#13235) Depends on: https://github.com/rapidsai/rapids-cmake/pull/393 Once the above PR is merged, this updated logic ensures that cudf places the custom versions of cccl packages in correct places, and can find them once installed. Authors: - Robert Maynard (https://github.com/robertmaynard) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) URL: https://github.com/rapidsai/cudf/pull/13235 **git submodules*** ***OS Information*** DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS" PRETTY_NAME="Ubuntu 22.04.1 LTS" NAME="Ubuntu" VERSION_ID="22.04" VERSION="22.04.1 LTS (Jammy Jellyfish)" VERSION_CODENAME=jammy ID=ubuntu ID_LIKE=debian HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" UBUNTU_CODENAME=jammy Linux shallot 5.19.0-42-generic #43~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 21 16:51:08 UTC 2 x86_64 x86_64 x86_64 GNU/Linux ***GPU Information*** Thu Jun 1 09:51:26 2023 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. 
| |===============================+======================+======================| | 0 NVIDIA RTX A6000 Off | 00000000:17:00.0 Off | Off | | 30% 42C P8 21W / 300W | 6MiB / 49140MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 1 NVIDIA RTX A6000 Off | 00000000:B3:00.0 On | Off | | 30% 48C P5 34W / 300W | 1747MiB / 49140MiB | 9% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | 0 N/A N/A 3157 G /usr/lib/xorg/Xorg 4MiB | | 1 N/A N/A 3157 G /usr/lib/xorg/Xorg 862MiB | | 1 N/A N/A 3538 G /usr/bin/gnome-shell 302MiB | | 1 N/A N/A 5896 G evolution 5MiB | | 1 N/A N/A 6055 G ...0/usr/lib/firefox/firefox 284MiB | | 1 N/A N/A 6425 G ...veSuggestionsOnlyOnDemand 185MiB | +-----------------------------------------------------------------------------+ ***CPU*** Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 46 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Vendor ID: GenuineIntel Model name: Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz CPU family: 6 Model: 85 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 1 Stepping: 7 CPU max MHz: 3900.0000 CPU min MHz: 1200.0000 BogoMIPS: 5800.00 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities Virtualization: VT-x L1d cache: 512 KiB (16 instances) L1i cache: 512 KiB (16 instances) L2 cache: 16 MiB (16 instances) L3 cache: 22 MiB (1 instance) NUMA node(s): 1 NUMA node0 CPU(s): 0-31 Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable Vulnerability Retbleed: Mitigation; Enhanced IBRS Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Mitigation; TSX disabled ***CMake*** /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/bin/cmake cmake version 3.26.4 CMake suite maintained and supported by Kitware (kitware.com/cmake). 
***g++*** /usr/local/sbin/g++ g++ (conda-forge gcc 11.3.0-19) 11.3.0 Copyright (C) 2021 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. ***nvcc*** /usr/local/sbin/nvcc nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2022 NVIDIA Corporation Built on Wed_Sep_21_10:33:58_PDT_2022 Cuda compilation tools, release 11.8, V11.8.89 Build cuda_11.8.r11.8/compiler.31833905_0 ***Python*** /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/bin/python Python 3.10.11 ***Environment Variables*** PATH : /usr/local/sbin:/usr/local/bin:/home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/bin:/home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/cuda/bin LD_LIBRARY_PATH : /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib:/home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/lib:/home/wence/Documents/src/rapids/rmm/build/release:/home/wence/Documents/src/rapids/cudf/cpp/build/release:/home/wence/Documents/src/rapids/raft/cpp/build/release:/home/wence/Documents/src/rapids/cuml/cpp/build/release:/home/wence/Documents/src/rapids/cugraph/cpp/build/release:/home/wence/Documents/src/rapids/cuspatial/cpp/build/release NUMBAPRO_NVVM : NUMBAPRO_LIBDEVICE : CONDA_PREFIX : /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids PYTHON_PATH : ***conda packages*** /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/bin/conda # packages in environment at /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids: # # Name Version Build Channel _libgcc_mutex 0.1 conda_forge conda-forge _openmp_mutex 4.5 2_kmp_llvm conda-forge _sysroot_linux-64_curr_repodata_hack 3 h69a702a_13 conda-forge accessible-pygments 0.0.4 pyhd8ed1ab_0 conda-forge aiobotocore 2.5.0 pyhd8ed1ab_0 conda-forge aiohttp 3.8.4 py310h1fa729e_0 conda-forge aioitertools 0.11.0 pyhd8ed1ab_0 conda-forge aiosignal 1.3.1 pyhd8ed1ab_0 conda-forge alabaster 0.7.13 pyhd8ed1ab_0 conda-forge anyio 3.7.0 pyhd8ed1ab_1 conda-forge argon2-cffi 21.3.0 pyhd8ed1ab_0 conda-forge argon2-cffi-bindings 21.2.0 py310h5764c6d_3 conda-forge arrow-cpp 11.0.0 ha770c72_21_cpu conda-forge arsenal 3.0 pypi_0 pypi asttokens 2.2.1 pyhd8ed1ab_0 conda-forge astunparse 1.6.3 pypi_0 pypi async-timeout 4.0.2 pyhd8ed1ab_0 conda-forge attrs 23.1.0 pyh71513ae_1 conda-forge aws-c-auth 0.6.27 he072965_1 conda-forge aws-c-cal 0.5.26 hf677bf3_1 conda-forge aws-c-common 0.8.19 hd590300_0 conda-forge aws-c-compression 0.2.16 hbad4bc6_7 conda-forge aws-c-event-stream 0.2.20 hb4b372c_7 conda-forge aws-c-http 0.7.7 h2632f9a_4 conda-forge aws-c-io 0.13.21 h9fef7b8_5 conda-forge aws-c-mqtt 0.8.11 h2282364_1 conda-forge aws-c-s3 0.3.0 hcb5a9b2_2 conda-forge aws-c-sdkutils 0.1.9 hbad4bc6_2 conda-forge aws-checksums 0.1.14 hbad4bc6_7 conda-forge aws-crt-cpp 0.20.2 he0fdcb3_0 conda-forge aws-sam-translator 1.55.0 pyhd8ed1ab_0 conda-forge aws-sdk-cpp 1.10.57 h059227d_13 conda-forge aws-xray-sdk 2.12.0 pyhd8ed1ab_0 conda-forge babel 2.12.1 pyhd8ed1ab_1 conda-forge backcall 0.2.0 pyh9f0ad1d_0 conda-forge backports 1.0 pyhd8ed1ab_3 conda-forge backports.functools_lru_cache 1.6.4 pyhd8ed1ab_0 conda-forge backports.zoneinfo 0.2.1 py310hff52083_7 conda-forge bcrypt 3.2.2 py310h5764c6d_1 conda-forge beautifulsoup4 4.12.2 pyha770c72_0 
conda-forge binutils 2.39 hdd6e379_1 conda-forge binutils_impl_linux-64 2.39 he00db2b_1 conda-forge binutils_linux-64 2.39 h5fc0e48_13 conda-forge blas 1.0 mkl conda-forge bleach 6.0.0 pyhd8ed1ab_0 conda-forge blinker 1.6.2 pyhd8ed1ab_0 conda-forge bokeh 2.4.3 pyhd8ed1ab_3 conda-forge boto3 1.26.76 pyhd8ed1ab_0 conda-forge botocore 1.29.76 pyhd8ed1ab_0 conda-forge brotlipy 0.7.0 py310h5764c6d_1005 conda-forge bzip2 1.0.8 h7f98852_4 conda-forge c-ares 1.19.1 hd590300_0 conda-forge c-compiler 1.5.2 h0b41bf4_0 conda-forge ca-certificates 2023.5.7 hbcca054_0 conda-forge cachetools 5.3.0 pyhd8ed1ab_0 conda-forge certifi 2023.5.7 pyhd8ed1ab_0 conda-forge cffi 1.15.1 py310h255011f_3 conda-forge cfgv 3.3.1 pyhd8ed1ab_0 conda-forge cfn-lint 0.75.1 pyhd8ed1ab_0 conda-forge charset-normalizer 2.1.1 pyhd8ed1ab_0 conda-forge click 8.1.3 unix_pyhd8ed1ab_2 conda-forge cloudpickle 2.2.1 pyhd8ed1ab_0 conda-forge cmake 3.26.4 hcfe8598_0 conda-forge cmake_setuptools 0.1.3 py_0 rapidsai colorama 0.4.6 pyhd8ed1ab_0 conda-forge comm 0.1.3 pyhd8ed1ab_0 conda-forge commonmark 0.9.1 py_0 conda-forge contourpy 1.0.7 pypi_0 pypi coverage 7.2.7 py310h2372a71_0 conda-forge cryptography 41.0.0 py310h75e40e8_0 conda-forge cubinlinker 0.3.0 py310hfdf336d_0 rapidsai cuda-python 11.8.1 py310h01a121a_2 conda-forge cuda-sanitizer-api 11.8.86 0 nvidia cudatoolkit 11.8.0 h37601d7_11 conda-forge cupy 12.0.0 py310h9216885_1 conda-forge cursor 1.3.5 pypi_0 pypi cxx-compiler 1.5.2 hf52228f_0 conda-forge cycler 0.11.0 pypi_0 pypi cyrus-sasl 2.1.27 h9033bb2_6 conda-forge cython 0.29.35 py310hc6cd4ac_0 conda-forge cytoolz 0.12.0 py310h5764c6d_1 conda-forge dask 2023.5.0 pypi_0 pypi dask-cuda 23.4.0a0+39.g06fb4e2.dirty pypi_0 pypi dask-glm 0.2.1.dev52+g1daf4c5 pypi_0 pypi dataclasses 0.8 pyhc8e2a94_3 conda-forge datasets 2.12.0 pyhd8ed1ab_0 conda-forge debugpy 1.6.7 py310heca2aa9_0 conda-forge decopatch 1.4.10 pyhd8ed1ab_0 conda-forge decorator 5.1.1 pyhd8ed1ab_0 conda-forge defusedxml 0.7.1 pyhd8ed1ab_0 conda-forge dill 0.3.6 pyhd8ed1ab_1 conda-forge distlib 0.3.6 pyhd8ed1ab_0 conda-forge distributed 2023.5.0 pypi_0 pypi distro 1.8.0 pyhd8ed1ab_0 conda-forge dlpack 0.5 h9c3ff4c_0 conda-forge docker-py 6.1.0 pyhd8ed1ab_0 conda-forge docutils 0.19 py310hff52083_1 conda-forge doxygen 1.8.20 had0d8f1_0 conda-forge ecdsa 0.18.0 pyhd8ed1ab_1 conda-forge entrypoints 0.4 pyhd8ed1ab_0 conda-forge et-xmlfile 1.1.0 pypi_0 pypi exceptiongroup 1.1.1 pyhd8ed1ab_0 conda-forge execnet 1.9.0 pyhd8ed1ab_0 conda-forge executing 1.2.0 pyhd8ed1ab_0 conda-forge expat 2.5.0 hcb278e6_1 conda-forge fancycompleter 0.9.1 pypi_0 pypi fastavro 1.7.4 py310h2372a71_0 conda-forge fastrlock 0.8 py310hd8f1fbe_3 conda-forge filelock 3.12.0 pyhd8ed1ab_0 conda-forge flask 2.3.2 pyhd8ed1ab_0 conda-forge flask_cors 3.0.10 pyhd3deb0d_0 conda-forge flit-core 3.9.0 pyhd8ed1ab_0 conda-forge fmt 9.1.0 h924138e_0 conda-forge fonttools 4.39.0 pypi_0 pypi freetype 2.12.1 hca18f0e_1 conda-forge frozenlist 1.3.3 py310h5764c6d_0 conda-forge fsspec 2023.5.0 pyh1a96a4e_0 conda-forge future 0.18.3 pyhd8ed1ab_0 conda-forge gcc 11.3.0 h02d0930_13 conda-forge gcc_impl_linux-64 11.3.0 hab1b70f_19 conda-forge gcc_linux-64 11.3.0 he6f903b_13 conda-forge gcovr 5.2 pyhd8ed1ab_0 conda-forge gdb 12.1 py310hd73dadb_0 conda-forge gflags 2.2.2 he1b5a44_1004 conda-forge gitdb 4.0.10 pypi_0 pypi gitpython 3.1.31 pypi_0 pypi glog 0.6.0 h6f12383_0 conda-forge gmock 1.13.0 ha770c72_1 conda-forge gmp 6.2.1 h58526e2_0 conda-forge gmpy2 2.1.2 py310h3ec546c_1 conda-forge graphql-core 3.2.3 pyhd8ed1ab_0 
conda-forge greenlet 2.0.2 py310hc6cd4ac_1 conda-forge gtest 1.13.0 h00ab1b0_1 conda-forge gxx 11.3.0 h02d0930_13 conda-forge gxx_impl_linux-64 11.3.0 hab1b70f_19 conda-forge gxx_linux-64 11.3.0 hc203a17_13 conda-forge halo 0.0.29 pypi_0 pypi huggingface_hub 0.14.1 pyhd8ed1ab_0 conda-forge hypothesis 6.75.7 pyha770c72_0 conda-forge icu 72.1 hcb278e6_0 conda-forge identify 2.5.24 pyhd8ed1ab_0 conda-forge idna 3.4 pyhd8ed1ab_0 conda-forge imagesize 1.4.1 pyhd8ed1ab_0 conda-forge importlib-metadata 6.6.0 pyha770c72_0 conda-forge importlib_metadata 6.6.0 hd8ed1ab_0 conda-forge iniconfig 2.0.0 pyhd8ed1ab_0 conda-forge ipykernel 6.23.1 pyh210e3f2_0 conda-forge ipython 8.13.2 pyh41d4057_0 conda-forge ipython_genutils 0.2.0 py_1 conda-forge itsdangerous 2.1.2 pyhd8ed1ab_0 conda-forge jedi 0.18.2 pyhd8ed1ab_0 conda-forge jinja2 3.1.2 pyhd8ed1ab_1 conda-forge jmespath 1.0.1 pyhd8ed1ab_0 conda-forge joblib 1.2.0 pyhd8ed1ab_0 conda-forge jschema-to-python 1.2.3 pyhd8ed1ab_0 conda-forge jsondiff 2.0.0 pyhd8ed1ab_0 conda-forge jsonpatch 1.32 pyhd8ed1ab_0 conda-forge jsonpickle 2.2.0 pyhd8ed1ab_0 conda-forge jsonpointer 2.0 py_0 conda-forge jsonschema 3.2.0 pyhd8ed1ab_3 conda-forge junit-xml 1.9 pyh9f0ad1d_0 conda-forge jupyter-cache 0.6.1 pyhd8ed1ab_0 conda-forge jupyter_client 8.2.0 pyhd8ed1ab_0 conda-forge jupyter_core 5.3.0 py310hff52083_0 conda-forge jupyter_events 0.6.3 pyhd8ed1ab_0 conda-forge jupyter_server 2.6.0 pyhd8ed1ab_0 conda-forge jupyter_server_terminals 0.4.4 pyhd8ed1ab_1 conda-forge jupyterlab_pygments 0.2.2 pyhd8ed1ab_0 conda-forge kernel-headers_linux-64 3.10.0 h4a8ded7_13 conda-forge keyutils 1.6.1 h166bdaf_0 conda-forge kiwisolver 1.4.4 pypi_0 pypi krb5 1.20.1 h81ceb04_0 conda-forge lcms2 2.15 haa2dc70_1 conda-forge ld_impl_linux-64 2.39 hcc3a1bd_1 conda-forge lerc 4.0.0 h27087fc_0 conda-forge libabseil 20230125.2 cxx17_h59595ed_2 conda-forge libarrow 11.0.0 h96638e8_21_cpu conda-forge libblas 3.9.0 16_linux64_mkl conda-forge libbrotlicommon 1.0.9 h166bdaf_8 conda-forge libbrotlidec 1.0.9 h166bdaf_8 conda-forge libbrotlienc 1.0.9 h166bdaf_8 conda-forge libcblas 3.9.0 16_linux64_mkl conda-forge libcrc32c 1.1.2 h9c3ff4c_0 conda-forge libcst 0.4.9 pypi_0 pypi libcufile 1.4.0.31 0 nvidia libcufile-dev 1.4.0.31 0 nvidia libcurand 10.3.0.86 0 nvidia libcurand-dev 10.3.0.86 0 nvidia libcurl 8.1.2 h409715c_0 conda-forge libdeflate 1.18 h0b41bf4_0 conda-forge libedit 3.1.20191231 he28a2e2_2 conda-forge libev 4.33 h516909a_1 conda-forge libevent 2.1.12 h3358134_0 conda-forge libexpat 2.5.0 hcb278e6_1 conda-forge libffi 3.4.2 h7f98852_5 conda-forge libgcc-devel_linux-64 11.3.0 h210ce93_19 conda-forge libgcc-ng 12.2.0 h65d4601_19 conda-forge libgfortran-ng 12.2.0 h69a702a_19 conda-forge libgfortran5 12.2.0 h337968e_19 conda-forge libgomp 12.2.0 h65d4601_19 conda-forge libgoogle-cloud 2.10.1 hac9eb74_1 conda-forge libgrpc 1.54.2 hb20ce57_2 conda-forge libhwloc 2.9.1 hf98c7e7_1 conda-forge libiconv 1.17 h166bdaf_0 conda-forge libjpeg-turbo 2.1.5.1 h0b41bf4_0 conda-forge libkvikio 23.08.00a cuda11_230530_g9481f89_7 rapidsai-nightly liblapack 3.9.0 16_linux64_mkl conda-forge libllvm14 14.0.6 hcd5def8_2 conda-forge libnghttp2 1.52.0 h61bc06f_0 conda-forge libnsl 2.0.0 h7f98852_0 conda-forge libntlm 1.4 h7f98852_1002 conda-forge libnuma 2.0.16 h0b41bf4_1 conda-forge libpng 1.6.39 h753d276_0 conda-forge libprotobuf 3.21.12 h3eb15da_0 conda-forge librdkafka 1.9.2 ha5a0de0_2 conda-forge libsanitizer 11.3.0 h239ccf8_19 conda-forge libsodium 1.0.18 h36c2ea0_1 conda-forge libsqlite 3.42.0 h2797004_0 
conda-forge libssh2 1.10.0 hf14f497_3 conda-forge libstdcxx-devel_linux-64 11.3.0 h210ce93_19 conda-forge libstdcxx-ng 12.2.0 h46fd767_19 conda-forge libthrift 0.18.1 h8fd135c_1 conda-forge libtiff 4.5.0 ha587672_6 conda-forge libutf8proc 2.8.0 h166bdaf_0 conda-forge libuuid 2.38.1 h0b41bf4_0 conda-forge libuv 1.44.2 h166bdaf_0 conda-forge libwebp-base 1.3.0 h0b41bf4_0 conda-forge libxcb 1.15 h0b41bf4_0 conda-forge libxml2 2.10.4 hfdac1af_0 conda-forge libxslt 1.1.37 h873f0b0_0 conda-forge libzlib 1.2.13 h166bdaf_4 conda-forge livereload 2.6.3 pyh9f0ad1d_0 conda-forge llvm-openmp 16.0.4 h4dfa4b3_0 conda-forge llvmlite 0.40.0 py310h1b8f574_0 conda-forge locket 1.0.0 pyhd8ed1ab_0 conda-forge log-symbols 0.0.14 pypi_0 pypi lxml 4.9.2 py310hbdc0903_0 conda-forge lz4 4.3.2 py310h0cfdcf0_0 conda-forge lz4-c 1.9.4 hcb278e6_0 conda-forge makefun 1.15.1 pyhd8ed1ab_0 conda-forge markdown 3.4.3 pyhd8ed1ab_0 conda-forge markdown-it-py 2.2.0 pyhd8ed1ab_0 conda-forge markupsafe 2.1.2 py310h1fa729e_0 conda-forge matplotlib 3.7.1 pypi_0 pypi matplotlib-inline 0.1.6 pyhd8ed1ab_0 conda-forge mdit-py-plugins 0.3.5 pyhd8ed1ab_0 conda-forge mdurl 0.1.0 pyhd8ed1ab_0 conda-forge mimesis 10.1.0 pyhd8ed1ab_0 conda-forge mistune 2.0.5 pyhd8ed1ab_0 conda-forge mkl 2022.2.1 h84fe81f_16997 conda-forge mmh3 3.0.0 pypi_0 pypi moto 4.1.10 pyhd8ed1ab_0 conda-forge mpc 1.3.1 hfe3b2da_0 conda-forge mpfr 4.2.0 hb012696_0 conda-forge msgpack-python 1.0.5 py310hdf3cbec_0 conda-forge multidict 6.0.4 py310h1fa729e_0 conda-forge multiprocess 0.70.14 py310h5764c6d_3 conda-forge mypy-extensions 1.0.0 pypi_0 pypi myst-nb 0.17.2 pyhd8ed1ab_0 conda-forge myst-parser 0.18.1 pyhd8ed1ab_0 conda-forge nbclassic 1.0.0 pyhb4ecaf3_1 conda-forge nbclient 0.7.4 pyhd8ed1ab_0 conda-forge nbconvert 7.2.9 pyhd8ed1ab_0 conda-forge nbconvert-core 7.2.9 pyhd8ed1ab_0 conda-forge nbconvert-pandoc 7.2.9 pyhd8ed1ab_0 conda-forge nbformat 5.8.0 pyhd8ed1ab_0 conda-forge nbsphinx 0.9.2 pyhd8ed1ab_0 conda-forge ncurses 6.3 h27087fc_1 conda-forge nest-asyncio 1.5.6 pyhd8ed1ab_0 conda-forge networkx 2.8.8 pyhd8ed1ab_0 conda-forge ninja 1.11.1 h924138e_0 conda-forge no-implicit-optional 1.3 pypi_0 pypi nodeenv 1.8.0 pyhd8ed1ab_0 conda-forge notebook 6.5.4 pyha770c72_0 conda-forge notebook-shim 0.2.3 pyhd8ed1ab_0 conda-forge numba 0.57.0 py310h0f6aa51_0 conda-forge numpy 1.24.3 py310ha4c1d20_0 conda-forge numpydoc 1.5.0 pyhd8ed1ab_0 conda-forge nvcc_linux-64 11.8 h41dc85b_22 conda-forge nvtx 0.2.5 py310h1fa729e_0 conda-forge openapi-schema-validator 0.2.3 pyhd8ed1ab_0 conda-forge openapi-spec-validator 0.4.0 pyhd8ed1ab_1 conda-forge openjpeg 2.5.0 hfec8fc6_2 conda-forge openpyxl 3.1.2 pypi_0 pypi openssl 3.1.1 hd590300_1 conda-forge orc 1.8.3 hfdbbad2_0 conda-forge orderedset 2.0.3 pypi_0 pypi overrides 7.3.1 pyhd8ed1ab_0 conda-forge packaging 23.1 pyhd8ed1ab_0 conda-forge pandas 1.5.3 py310h9b08913_1 conda-forge pandoc 3.1.2 h32600fe_1 conda-forge pandocfilters 1.5.0 pyhd8ed1ab_0 conda-forge paramiko 3.2.0 pyhd8ed1ab_0 conda-forge parquet-cpp 1.5.1 2 conda-forge parso 0.8.3 pyhd8ed1ab_0 conda-forge partd 1.4.0 pyhd8ed1ab_0 conda-forge path 16.6.0 pypi_0 pypi path-py 12.5.0 pypi_0 pypi pbr 5.11.1 pyhd8ed1ab_0 conda-forge pdbpp 0.10.3 pypi_0 pypi pexpect 4.8.0 pyh1a96a4e_2 conda-forge pickleshare 0.7.5 py_1003 conda-forge pillow 9.5.0 py310h582fbeb_1 conda-forge pip 23.1.2 pyhd8ed1ab_0 conda-forge platformdirs 3.5.1 pyhd8ed1ab_0 conda-forge pluggy 1.0.0 pyhd8ed1ab_5 conda-forge pooch 1.7.0 pyha770c72_3 conda-forge pre-commit 3.3.2 pyha770c72_0 conda-forge 
prometheus_client 0.17.0 pyhd8ed1ab_0 conda-forge prompt-toolkit 3.0.38 pyha770c72_0 conda-forge prompt_toolkit 3.0.38 hd8ed1ab_0 conda-forge protobuf 4.21.12 py310heca2aa9_0 conda-forge psutil 5.9.5 py310h1fa729e_0 conda-forge pthread-stubs 0.4 h36c2ea0_1001 conda-forge ptxcompiler 0.8.1 py310h01a121a_0 conda-forge ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge pure_eval 0.2.2 pyhd8ed1ab_0 conda-forge py-cpuinfo 9.0.0 pyhd8ed1ab_0 conda-forge py-spy 0.3.14 pypi_0 pypi pyarrow 11.0.0 py310he6bfd7f_21_cpu conda-forge pyasn1 0.4.8 py_0 conda-forge pycparser 2.21 pyhd8ed1ab_0 conda-forge pydata-sphinx-theme 0.13.3 pyhd8ed1ab_0 conda-forge pygments 2.15.1 pyhd8ed1ab_0 conda-forge pyinstrument 4.4.0 pypi_0 pypi pynacl 1.5.0 py310h5764c6d_2 conda-forge pynvml 11.4.1 pyhd8ed1ab_0 conda-forge pyopenssl 23.2.0 pyhd8ed1ab_1 conda-forge pyorc 0.8.0 py310hd52fb3e_4 conda-forge pyparsing 3.0.9 pyhd8ed1ab_0 conda-forge pyrepl 0.9.0 pypi_0 pypi pyrsistent 0.19.3 py310h1fa729e_0 conda-forge pysocks 1.7.1 pyha2e5f31_6 conda-forge pytest 7.3.1 pyhd8ed1ab_0 conda-forge pytest-benchmark 4.0.0 pyhd8ed1ab_0 conda-forge pytest-cases 3.6.14 pyhd8ed1ab_0 conda-forge pytest-cov 4.1.0 pyhd8ed1ab_0 conda-forge pytest-faulthandler 2.0.1 pypi_0 pypi pytest-repeat 0.9.1 pypi_0 pypi pytest-rerunfailures 11.1.2 pypi_0 pypi pytest-timeout 2.1.0 pypi_0 pypi pytest-xdist 3.3.1 pyhd8ed1ab_0 conda-forge python 3.10.11 he550d4f_0_cpython conda-forge python-confluent-kafka 1.9.2 py310h5764c6d_2 conda-forge python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge python-fastjsonschema 2.17.1 pyhd8ed1ab_0 conda-forge python-jose 3.3.0 pyh6c4a22f_1 conda-forge python-json-logger 2.0.7 pyhd8ed1ab_0 conda-forge python-snappy 0.6.1 py310hcee4d7c_0 conda-forge python-xxhash 3.2.0 py310h1fa729e_0 conda-forge python_abi 3.10 3_cp310 conda-forge pytorch 1.11.0 py3.10_cpu_0 pytorch pytorch-mutex 1.0 cpu pytorch pytz 2023.3 pyhd8ed1ab_0 conda-forge pywin32-on-windows 0.1.0 pyh1179c8e_3 conda-forge pyyaml 6.0 py310h5764c6d_5 conda-forge pyzmq 25.1.0 py310h5bbb5d0_0 conda-forge rapids-dependency-file-generator 1.2.0 pypi_0 pypi rdma-core 28.9 h59595ed_1 conda-forge re2 2023.03.02 h8c504da_0 conda-forge readline 8.2 h8228510_1 conda-forge recommonmark 0.7.1 pyhd8ed1ab_0 conda-forge regex 2023.5.5 py310h2372a71_0 conda-forge remote-pdb 2.1.0 pypi_0 pypi requests 2.31.0 pyhd8ed1ab_0 conda-forge responses 0.18.0 pyhd8ed1ab_0 conda-forge rfc3339-validator 0.1.4 pyhd8ed1ab_0 conda-forge rfc3986-validator 0.1.1 pyh9f0ad1d_0 conda-forge rhash 1.4.3 h166bdaf_0 conda-forge rich 13.3.4 pypi_0 pypi rsa 4.9 pyhd8ed1ab_0 conda-forge s2n 1.3.44 h06160fa_0 conda-forge s3fs 2023.5.0 pyhd8ed1ab_0 conda-forge s3transfer 0.6.1 pyhd8ed1ab_0 conda-forge sacremoses 0.0.53 pyhd8ed1ab_0 conda-forge sarif-om 1.0.4 pyhd8ed1ab_0 conda-forge scalene 1.5.20 pypi_0 pypi scikit-build 0.17.5 pyh4af843d_0 conda-forge scipy 1.10.1 py310ha4c1d20_3 conda-forge seaborn 0.12.2 pypi_0 pypi sed 4.8 he412f7d_0 conda-forge send2trash 1.8.2 pyh41d4057_0 conda-forge setuptools 67.7.2 pyhd8ed1ab_0 conda-forge six 1.16.0 pyh6c4a22f_0 conda-forge smmap 5.0.0 pypi_0 pypi snappy 1.1.10 h9fff704_0 conda-forge sniffio 1.3.0 pyhd8ed1ab_0 conda-forge snowballstemmer 2.2.0 pyhd8ed1ab_0 conda-forge sortedcontainers 2.4.0 pyhd8ed1ab_0 conda-forge soupsieve 2.3.2.post1 pyhd8ed1ab_0 conda-forge spdlog 1.11.0 h9b3ece8_1 conda-forge sphinx 5.3.0 pyhd8ed1ab_0 conda-forge sphinx-autobuild 2021.3.14 pyhd8ed1ab_0 conda-forge sphinx-copybutton 0.5.2 pyhd8ed1ab_0 conda-forge sphinx-markdown-tables 0.0.17 pyh6c4a22f_0 
conda-forge sphinxcontrib-applehelp 1.0.4 pyhd8ed1ab_0 conda-forge sphinxcontrib-devhelp 1.0.2 py_0 conda-forge sphinxcontrib-htmlhelp 2.0.1 pyhd8ed1ab_0 conda-forge sphinxcontrib-jsmath 1.0.1 py_0 conda-forge sphinxcontrib-qthelp 1.0.3 py_0 conda-forge sphinxcontrib-serializinghtml 1.1.5 pyhd8ed1ab_2 conda-forge sphinxcontrib-websupport 1.2.4 pyhd8ed1ab_1 conda-forge spinners 0.0.24 pypi_0 pypi sqlalchemy 2.0.15 py310h2372a71_0 conda-forge sshpubkeys 3.3.1 pyhd8ed1ab_0 conda-forge stack_data 0.6.2 pyhd8ed1ab_0 conda-forge streamz 0.6.4 pyh6c4a22f_0 conda-forge sysroot_linux-64 2.17 h4a8ded7_13 conda-forge tabulate 0.8.10 pypi_0 pypi tbb 2021.9.0 hf52228f_0 conda-forge tblib 1.7.0 pyhd8ed1ab_0 conda-forge termcolor 2.3.0 pypi_0 pypi terminado 0.17.1 pyh41d4057_0 conda-forge thrift 0.13.0 pypi_0 pypi tinycss2 1.2.1 pyhd8ed1ab_0 conda-forge tk 8.6.12 h27826a3_0 conda-forge tokenizers 0.13.1 py310h633acb5_2 conda-forge toml 0.10.2 pyhd8ed1ab_0 conda-forge tomli 2.0.1 pyhd8ed1ab_0 conda-forge toolz 0.12.0 pyhd8ed1ab_0 conda-forge tornado 6.3.2 py310h2372a71_0 conda-forge tqdm 4.65.0 pyhd8ed1ab_1 conda-forge traitlets 5.9.0 pyhd8ed1ab_0 conda-forge transformers 4.24.0 pyhd8ed1ab_0 conda-forge typing-extensions 4.6.2 hd8ed1ab_0 conda-forge typing-inspect 0.8.0 pypi_0 pypi typing_extensions 4.6.2 pyha770c72_0 conda-forge typing_utils 0.1.0 pyhd8ed1ab_0 conda-forge tzdata 2023c h71feb2d_0 conda-forge ucx 1.14.1 h4a2ce2d_1 conda-forge ucxx 0.0.1 pypi_0 pypi ukkonen 1.0.1 py310hbf28c38_3 conda-forge urllib3 1.26.15 pyhd8ed1ab_0 conda-forge virtualenv 20.23.0 pyhd8ed1ab_0 conda-forge wcwidth 0.2.6 pyhd8ed1ab_0 conda-forge webencodings 0.5.1 py_1 conda-forge websocket-client 1.5.2 pyhd8ed1ab_0 conda-forge werkzeug 2.3.4 pyhd8ed1ab_0 conda-forge wheel 0.40.0 pyhd8ed1ab_0 conda-forge wmctrl 0.4 pypi_0 pypi wrapt 1.15.0 py310h1fa729e_0 conda-forge xdf 0.1 pypi_0 pypi xmltodict 0.13.0 pyhd8ed1ab_0 conda-forge xorg-libxau 1.0.11 hd590300_0 conda-forge xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge xxhash 0.8.1 h0b41bf4_0 conda-forge xz 5.2.6 h166bdaf_0 conda-forge yaml 0.2.5 h7f98852_2 conda-forge yarl 1.9.2 py310h2372a71_0 conda-forge zeromq 4.3.4 h9c3ff4c_1 conda-forge zict 3.0.0 pyhd8ed1ab_0 conda-forge zipp 3.15.0 pyhd8ed1ab_0 conda-forge zlib 1.2.13 h166bdaf_4 conda-forge zstd 1.5.2 h3eb15da_6 conda-forge
`size=8192` looks like a magic number: with `size=8192` I can't repro, but as soon as `size=8193` I reproduce consistently. I can also reduce the number of partitions in the dask part to `npartitions=2` and still reproduce (with `size=8193`).
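That threshold is suggestive: 8192 bits fills 32-bit bitmask words exactly, whereas 8193 is the first size whose final bitmask word contains unused ("slack") bits. A quick back-of-envelope check (standalone sketch, not cudf code):

```cuda
// Illustrative arithmetic only: how many 32-bit bitmask words does a column
// of `size` rows need, and how many slack bits sit in the last word?
#include <cstdio>

int main()
{
  constexpr int word_size = 32;  // bits per cudf bitmask word
  for (int size : {8192, 8193}) {
    int const num_words = (size + word_size - 1) / word_size;  // ceil(size/32)
    int const num_slack = num_words * word_size - size;  // unused bits, last word
    std::printf("size=%d -> %d words, %d slack bits\n", size, num_words, num_slack);
  }
  // size=8192 -> 256 words, 0 slack bits
  // size=8193 -> 257 words, 31 slack bits
  return 0;
}
```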
`compute-sanitizer --tool racecheck python bug.py` reports something that might be relevant. It says:
```
========= Error: Race reported between Write access at 0xe70 in void cudf::detail::valid_if_n_kernel<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, cudf::detail::gather_bitmask_functor<(cudf::detail::gather_bitmask_op)2, thrust::transform_iterator<cudf::detail::gather(const cudf::table_view &, const cudf::column_view &, cudf::out_of_bounds_policy, cudf::detail::negative_index_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource *)::[lambda(int) (instance 1)], cudf::detail::input_indexalator, thrust::use_default, thrust::use_default>>, (int)256>(T1, T2, T3, unsigned int **, int, int, int *)
=========     and Read access at 0xef0 in void cudf::detail::valid_if_n_kernel<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, cudf::detail::gather_bitmask_functor<(cudf::detail::gather_bitmask_op)2, thrust::transform_iterator<cudf::detail::gather(const cudf::table_view &, const cudf::column_view &, cudf::out_of_bounds_policy, cudf::detail::negative_index_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource *)::[lambda(int) (instance 1)], cudf::detail::input_indexalator, thrust::use_default, thrust::use_default>>, (int)256>(T1, T2, T3, unsigned int **, int, int, int *) [924 hazards]
=========
```
`valid_if_n_kernel` is only called from `gather_bitmask`, which is run during the `_mimic_pandas_order`/`reindex` phase.
If I switch to using the `scatter`-based implementation in `_mimic_pandas_order` (diff in the original issue description), failures are still reproducible and the racecheck report is similar. Here's a more detailed run with `compute-sanitizer --tool racecheck --racecheck-report all python bug.py` (note that the code goes through `scatter`, which again calls `gather_bitmask`):
```
========= Error: Potential WAR hazard detected at __shared__ 0x18 in block (6,0,0) :
=========     Read Thread (6,0,0) at 0x590 in void cudf::detail::valid_if_n_kernel<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, cudf::detail::gather_bitmask_functor<(cudf::detail::gather_bitmask_op)1, const int *>, (int)256>(T1, T2, T3, unsigned int **, int, int, int *)
=========     Write Thread (192,0,0) at 0x510 in void cudf::detail::valid_if_n_kernel<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, cudf::detail::gather_bitmask_functor<(cudf::detail::gather_bitmask_op)1, const int *>, (int)256>(T1, T2, T3, unsigned int **, int, int, int *)
=========     Current Value : 16, Incoming Value : 16
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: [0x304fd2]
=========                    in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========         Host Frame: [0x1488c]
=========                    in /usr/local/cuda/lib64/libcudart.so.11.0
=========         Host Frame:cudaLaunchKernel [0x6c318]
=========                    in /usr/local/cuda/lib64/libcudart.so.11.0
=========         Host Frame:void cudf::detail::gather_bitmask<(cudf::detail::gather_bitmask_op)1, int const*>(cudf::table_device_view, int const*, unsigned int**, int, int, int*, rmm::cuda_stream_view) [0x10e16e4]
=========                    in /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libcudf.so
=========         Host Frame:void cudf::detail::gather_bitmask<int const*>(cudf::table_view const&, int const*, std::vector<std::unique_ptr<cudf::column, std::default_delete<cudf::column> >, std::allocator<std::unique_ptr<cudf::column, std::default_delete<cudf::column> > > >&, cudf::detail::gather_bitmask_op, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x10e5138]
=========                    in /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libcudf.so
=========         Host Frame:cudf::detail::scatter(cudf::table_view const&, cudf::column_view const&, cudf::table_view const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x111513e]
=========                    in /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libcudf.so
=========         Host Frame:cudf::scatter(cudf::table_view const&, cudf::column_view const&, cudf::table_view const&, rmm::mr::device_memory_resource*) [0x11158cd]
=========                    in /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libcudf.so
=========         Host Frame:__pyx_pf_4cudf_4_lib_7copying_12scatter(_object*, _object*, __pyx_obj_4cudf_4_lib_6column_Column*, _object*, bool) [clone .constprop.0] [0x4cb9b]
=========                    in /home/wence/Documents/src/rapids/cudf/python/cudf/cudf/_lib/copying.cpython-310-x86_64-linux-gnu.so
```
Reverting #13372, I cannot reproduce the bug; however, `gather_bitmask` is still not racecheck-clean in that scenario.

Plausibly the racecheck errors are a(nother) false positive. If I compile with debug info, I get a little more information:
```
========= Error: Potential WAR hazard detected at __shared__ 0x8 in block (6,0,0) :
=========     Read Thread (2,0,0) at 0xec0 in /home/wence/Documents/src/rapids/cudf/cpp/include/cudf/detail/utilities/cuda.cuh:106:T3 cudf::detail::single_lane_block_sum_reduce<(int)256, (int)0, int>(T3)
=========     Write Thread (64,0,0) at 0xc20 in /home/wence/Documents/src/rapids/cudf/cpp/include/cudf/detail/utilities/cuda.cuh:98:T3 cudf::detail::single_lane_block_sum_reduce<(int)256, (int)0, int>(T3)
=========     Current Value : 16, Incoming Value : 16
```
Here's that function with line numbers added:
```cuda
 87 template <int32_t block_size, int32_t leader_lane = 0, typename T>
 88 __device__ T single_lane_block_sum_reduce(T lane_value)
 89 {
 90   static_assert(block_size <= 1024, "Invalid block size.");
 91   static_assert(std::is_arithmetic_v<T>, "Invalid non-arithmetic type.");
 92   constexpr auto warps_per_block{block_size / warp_size};
 93   auto const lane_id{threadIdx.x % warp_size};
 94   auto const warp_id{threadIdx.x / warp_size};
 95   __shared__ T lane_values[warp_size];
 96
 97   // Load each lane's value into a shared memory array
 98   if (lane_id == leader_lane) { lane_values[warp_id] = lane_value; }
 99   __syncthreads();
100
101   // Use a single warp to do the reduction, result is only defined on
102   // threadId.x == 0
103   T result{0};
104   if (warp_id == 0) {
105     __shared__ typename cub::WarpReduce<T>::TempStorage temp;
106     lane_value = (lane_id < warps_per_block) ? lane_values[lane_id] : T{0};
107     result = cub::WarpReduce<T>(temp).Sum(lane_value);
108   }
109   return result;
110 }
```
So the claim is that there's a WAR conflict between the write on line 98 and the read on line 106. But AIUI, the `__syncthreads()` on line 99 is there exactly to avoid such a conflict.
Let's consider the calling context for `single_lane_block_sum_reduce` (in `valid_if_n_kernel`):
```cuda
__global__ void valid_if_n_kernel(InputIterator1 begin1,
                                  InputIterator2 begin2,
                                  BinaryPredicate p,
                                  bitmask_type* masks[],
                                  size_type mask_count,
                                  size_type mask_num_bits,
                                  size_type* valid_counts)
{
  for (size_type mask_idx = 0; mask_idx < mask_count; mask_idx++) {
    auto const mask = masks[mask_idx];
    if (mask == nullptr) { continue; }
    auto block_offset     = blockIdx.x * blockDim.x;
    auto warp_valid_count = static_cast<size_type>(0);
    while (block_offset < mask_num_bits) {
      auto const thread_idx    = block_offset + threadIdx.x;
      auto const thread_active = thread_idx < mask_num_bits;
      auto const arg_1         = *(begin1 + mask_idx);
      auto const arg_2         = *(begin2 + thread_idx);
      auto const bit_is_valid  = thread_active && p(arg_1, arg_2);
      auto const warp_validity = __ballot_sync(0xffff'ffffu, bit_is_valid);
      auto const mask_idx      = word_index(thread_idx);
      if (thread_active && threadIdx.x % warp_size == 0) { mask[mask_idx] = warp_validity; }
      warp_valid_count += __popc(warp_validity);
      block_offset += blockDim.x * gridDim.x;
    }
    auto block_valid_count = single_lane_block_sum_reduce<block_size, 0>(warp_valid_count);
    if (threadIdx.x == 0) { atomicAdd(valid_counts + mask_idx, block_valid_count); }
  }
}
```
While the `__syncthreads` call ensures that, within a single iteration of the loop in `valid_if_n_kernel`, there is no data race between the write on line 98 and the read on line 106, I don't think that is sufficient to rule out data races between iterations: without further (outside) synchronisation, the previous iteration's read on line 106 can race with the current iteration's write on line 98.
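To make the hazard concrete, here is a stripped-down sketch (hypothetical names, not the cudf kernel) of the same shape: a block-level shared-memory reduction invoked in a loop, with no barrier at the boundary between invocations:

```cuda
// Sketch only: reproduces the *shape* of the WAR hazard, not cudf's code.
__device__ int block_sum(int v)
{
  __shared__ int lane_values[32];
  int const warp_id = threadIdx.x / 32;
  int const lane_id = threadIdx.x % 32;
  // (1) the leader lane of each warp publishes its warp's value
  if (lane_id == 0) { lane_values[warp_id] = v; }
  __syncthreads();  // orders (1) before (2) *within this call only*
  int result = 0;
  if (threadIdx.x == 0) {
    // (2) thread 0 reads every warp's slot
    for (int w = 0; w < blockDim.x / 32; ++w) { result += lane_values[w]; }
  }
  return result;  // note: no trailing __syncthreads()
}

__global__ void loop_kernel(int const* in, int* out, int n_iters)
{
  // Mirrors the mask_idx loop in valid_if_n_kernel: because block_sum ends
  // without a barrier, a warp other than warp 0 can re-enter block_sum for
  // iteration i+1 and perform write (1) while thread 0 is still performing
  // read (2) for iteration i -- exactly the WAR hazard racecheck reports.
  for (int i = 0; i < n_iters; ++i) {
    int const s = block_sum(in[i * blockDim.x + threadIdx.x]);
    if (threadIdx.x == 0) { atomicAdd(out + i, s); }
  }
}
```

A `__syncthreads()` at the start (or end) of the reduction closes that window, which is what the one-line patch below does.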
Applying this patch:
```diff
diff --git a/cpp/include/cudf/detail/utilities/cuda.cuh b/cpp/include/cudf/detail/utilities/cuda.cuh
index cdbc26701d..fe5ac6d42f 100644
--- a/cpp/include/cudf/detail/utilities/cuda.cuh
+++ b/cpp/include/cudf/detail/utilities/cuda.cuh
@@ -93,7 +93,7 @@ __device__ T single_lane_block_sum_reduce(T lane_value)
   auto const lane_id{threadIdx.x % warp_size};
   auto const warp_id{threadIdx.x / warp_size};
   __shared__ T lane_values[warp_size];
-
+  __syncthreads();
   // Load each lane's value into a shared memory array
   if (lane_id == leader_lane) { lane_values[warp_id] = lane_value; }
   __syncthreads();
```
This makes the reproducer racecheck-clean, good! But it doesn't fix the original issue, bad!
@wence- I just traced this, following along with your messages above, and reached the same conclusion. I would support including that one-line patch in a standalone PR, though it doesn't solve the main problem here.
I see something curious where setting a single element to null in the input (rather than every other element) results in a runtime error. So in Lawrence's snippet above, I changed:

```python
gdf_original["xx"] = gdf_original.xx.mask(
    (np.arange(size, dtype="int32") % 2).astype("bool")
)
```

to:

```python
gdf_original["xx"].iloc[8192] = None  # 8192 not significant; 8000 fails too
```

and I get:
```
Traceback (most recent call last):
  File "/home/ashwin/tmp.py", line 27, in <module>
    cudf_result = gdf_grouped.cumcount()
  File "/home/ashwin/miniconda3/envs/all_cuda-118_arch-x86_64/lib/python3.10/site-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/home/ashwin/workspace/cudf/python/cudf/cudf/core/groupby/groupby.py", line 404, in cumcount
    .agg("cumcount")
  File "/home/ashwin/workspace/cudf/python/cudf/cudf/core/groupby/groupby.py", line 2323, in agg
    result = super().agg(func)
  File "/home/ashwin/miniconda3/envs/all_cuda-118_arch-x86_64/lib/python3.10/site-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/home/ashwin/workspace/cudf/python/cudf/cudf/core/groupby/groupby.py", line 537, in agg
    ) = self._groupby.aggregate(columns, normalized_aggs)
  File "groupby.pyx", line 325, in cudf._lib.groupby.GroupBy.aggregate
  File "groupby.pyx", line 292, in cudf._lib.groupby.GroupBy.scan_internal
RuntimeError: CUDF failure at: /opt/conda/conda-bld/work/cpp/src/copying/slice.cu:44: Slice range out of bounds.
```
@shwina I wasn't able to reproduce that RuntimeError with a single null, but @wence-'s reproducer that alternates null/valid does fail consistently after a small number of iterations (usually 2 or 4) for me.
> I would support including that one-line patch in a standalone PR, though it doesn't solve the main problem here.
Opened #13485 to discuss approaches here (and for a second set of brains on whether or not my analysis is correct).
It seems there might be something funky going on with the construction/caching of the `keys_bitmask_column` in `sort_groupby_helper`.
If I apply this patch:
```diff
diff --git a/cpp/src/groupby/sort/sort_helper.cu b/cpp/src/groupby/sort/sort_helper.cu
index 082cf58ed2..c8257688bd 100644
--- a/cpp/src/groupby/sort/sort_helper.cu
+++ b/cpp/src/groupby/sort/sort_helper.cu
@@ -237,11 +237,16 @@ column_view sort_groupby_helper::unsorted_keys_labels(rmm::cuda_stream_view stre
 column_view sort_groupby_helper::keys_bitmask_column(rmm::cuda_stream_view stream)
 {
-  if (_keys_bitmask_column) return _keys_bitmask_column->view();
-
   auto [row_bitmask, null_count] =
     cudf::detail::bitmask_and(_keys, stream, rmm::mr::get_current_device_resource());
+  if (_keys_bitmask_column) {
+    auto count = _keys_bitmask_column->null_count();
+
+    return _keys_bitmask_column->view();
+  }
+
+
   _keys_bitmask_column = make_numeric_column(
     data_type(type_id::INT8), _keys.num_rows(), std::move(row_bitmask), null_count, stream);
```
and break conditionally if `count != null_count`, then I see that on iteration 2 or 3 of the reproducer the `null_count` as seen by `_keys_bitmask_column` is incorrect.
The `offset_bitmask_binop` kernel may be the culprit. Here is a diff to try:
```diff
diff --git a/cpp/include/cudf/detail/null_mask.cuh b/cpp/include/cudf/detail/null_mask.cuh
index 3ff3bb4cf3..35d0c7aa67 100644
--- a/cpp/include/cudf/detail/null_mask.cuh
+++ b/cpp/include/cudf/detail/null_mask.cuh
@@ -69,7 +69,10 @@ __global__ void offset_bitmask_binop(Binop op,
   constexpr auto const word_size{detail::size_in_bits<bitmask_type>()};

   auto const tid = threadIdx.x + blockIdx.x * blockDim.x;
-  size_type thread_count = 0;
+  size_type thread_count          = 0;
+  size_type const last_bit_index  = source_size_bits - 1;
+  size_type const num_slack_bits  = word_size - (last_bit_index % word_size) - 1;
+  size_type const last_word_index = cudf::word_index(last_bit_index);

   for (size_type destination_word_index = tid; destination_word_index < destination.size();
        destination_word_index += blockDim.x * gridDim.x) {
@@ -88,15 +91,8 @@ __global__ void offset_bitmask_binop(Binop op,
     destination[destination_word_index] = destination_word;
     thread_count += __popc(destination_word);
-  }
-
-  // Subtract any slack bits from the last word
-  if (tid == 0) {
-    size_type const last_bit_index = source_size_bits - 1;
-    size_type const num_slack_bits = word_size - (last_bit_index % word_size) - 1;
-    if (num_slack_bits > 0) {
-      size_type const word_index = cudf::word_index(last_bit_index);
-      thread_count -= __popc(destination[word_index] & set_most_significant_bits(num_slack_bits));
+    if (destination_word_index == last_word_index) {
+      thread_count -= __popc(destination_word & set_most_significant_bits(num_slack_bits));
     }
   }
```
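For intuition (this is my reading of the pre-patch code, so treat it as a hypothesis): before the change, global thread 0 re-reads `destination[word_index]` after the grid-stride loop in order to subtract the slack bits, but in a multi-block launch that last word may be written by a thread in a *different* block, and nothing synchronises that write with thread 0's read. A stripped-down sketch of the racy shape:

```cuda
// Illustrative sketch (not cudf code) of the hazard the diff removes. Each
// thread popcounts the destination words it writes; pre-patch, global thread
// 0 then re-reads the *last* word from global memory to subtract slack bits.
__global__ void count_set_bits_racy(unsigned const* src, unsigned* dst,
                                    int num_words, int num_slack_bits,
                                    int* count)
{
  int const tid    = threadIdx.x + blockIdx.x * blockDim.x;
  int thread_count = 0;
  for (int w = tid; w < num_words; w += blockDim.x * gridDim.x) {
    unsigned const word = src[w];  // stand-in for the binop of source masks
    dst[w] = word;
    thread_count += __popc(word);
  }
  if (tid == 0 && num_slack_bits > 0) {  // assumes 0 < num_slack_bits < 32
    // RACE: the last word may be written by a thread in another block, and
    // there is no grid-wide barrier between the loop above and this read, so
    // thread 0 can see whatever garbage the (pool-)allocated buffer held.
    unsigned const slack_mask = ~0u << (32 - num_slack_bits);
    thread_count -= __popc(dst[num_words - 1] & slack_mask);
  }
  atomicAdd(count, thread_count);
}
```

The diff sidesteps the read-back entirely: whichever thread computes the last word subtracts the slack bits from its own register copy (`destination_word`). If this reading is right, it would also explain the `size=8193` threshold, since with `size=8192` there are no slack bits and the racy branch never executes.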
That seems to do the trick for me, @davidwendt.

I can confirm this appears to fix the bug for me, too. Let's open a PR. @davidwendt, would you do the honors?
Describe the bug

Running the reproducer script with a recent enough cudf nightly sometimes produces assertion errors when checking correctness. This exhibits as a few entries in the grouped dataframe columns being marked NULL when they should not be. In post-mortem debugging, re-executing the offending code tends to produce the correct result; one normally has to run the script a few times, interrupting it, to see the failure.

This has sometimes been causing the nightly actions to fail; the first such failure is https://github.com/rapidsai/cudf/actions/runs/5065561229/jobs/9094279670, which is the first nightly that contained #13372.
Some existing investigation with @shwina provides the following information:
- #13389 fixed this code so that it would run at all, and introduced a `dataframe.reindex` call inside `_mimic_pandas_order`. If we remove the `reindex` call (applying this patch) and uncomment the alternate error-checking code in the bug script, we do not see failures.
- `reindex` goes through `join` and hence `gather`. However, we also tried reimplementing the reordering using `scatter` (like so) and still observe the bug.