Were there any expected updates needed in the JNI layer due to rapidsai/rmm#851? cc @jrhemstad @cwharris @harrism

I tried to reproduce this, both in a conda env with the latest RMM and cudf branch-21.10 code installed and in the CentOS 7 Docker environment under cudf/java/ci/. I was unable to reproduce the issue in either case, even when disabling the RMM memory pooling (i.e., using the CUDA default memory resource in RMM) and running the Java tests under cuda-memcheck.

@pxLi is there a specific GPU involved here, or is there a way to set up a more interactive environment where this problem can be reproduced?
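For reference, "disabling the RMM memory pooling" above means selecting the plain CUDA memory resource instead of a pool resource layered on top of it. A minimal C++ sketch of the two configurations using RMM's public API (the Java tests make the equivalent choice through `Rmm.initialize` rather than direct C++ calls):

```cpp
// Sketch only: choosing the RMM device memory resource for the current device.
// "Pooling on" sub-allocates from a pool carved out of the CUDA resource;
// "pooling off" forwards every allocation straight to cudaMalloc/cudaFree.
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

int main()
{
  static rmm::mr::cuda_memory_resource cuda_mr;

  // Pooling on (the usual test configuration):
  static rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{&cuda_mr};
  rmm::mr::set_current_device_resource(&pool_mr);

  // Pooling off, i.e. the CUDA default memory resource mentioned above:
  // rmm::mr::set_current_device_resource(&cuda_mr);

  // ... run the workload / tests ...
  return 0;
}
```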
> Were there any expected updates needed in the JNI layer due to rapidsai/rmm#851?

No, it was a non-breaking change. It's possible I could have introduced a runtime bug, but I'll need a repro in order to figure that out.
> @pxLi is there a specific GPU involved here, or is there a way to set up a more interactive environment where this problem can be reproduced?
@jlowe I could reproduce this on different instances with Tesla T4, V100, A30, and TITAN V GPUs (most of our nightly CIs use those) just by following the CI README doc. Our nightly cudfjni build pipeline reproduces this 100% of the time.

With CUDA drivers 460.32.03, 460.58, 465.07, 460.91.03, and 470.57.02 plus CUDA toolkit 11.2.152, I saw the same issue.

Steps that reproduce it on our local dev machines (TITAN V) and our EGX machines (Tesla T4):
```bash
docker build -f java/ci/Dockerfile.centos7 --build-arg CUDA_VERSION=11.2.2 -t cudf-build:11.2.2-devel-centos7 .
nvidia-docker run -it cudf-build:11.2.2-devel-centos7 bash
git clone --recursive https://github.com/rapidsai/cudf.git -b branch-21.10
cd cudf
export WORKSPACE=`pwd`
export PARALLEL_LEVEL=16
export SKIP_JAVA_TESTS=false
scl enable devtoolset-9 "java/ci/build-in-docker.sh"
```
Ah, I think the key missing piece in my setup was per-thread default stream (PTDS). I had inadvertently built with PTDS off, but when I build with PTDS ON, I'm able to reproduce the issue.

More than that, most C++ tests fail with PTDS enabled. Here's how I'm building it in a conda environment with the latest RMM source installed. Note that I'm using a V100:
```bash
cmake -GNinja -DCMAKE_CUDA_ARCHITECTURES=70 -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX -DPER_THREAD_DEFAULT_STREAM=ON -DCUDF_USE_ARROW_STATIC=ON -DCUDF_ENABLE_ARROW_S3=OFF -DUSE_NVTX=1 -DRMM_LOGGING_LEVEL=OFF -DBUILD_BENCHMARKS=OFF -DBUILD_TESTS=ON ..
```
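For context, `PER_THREAD_DEFAULT_STREAM=ON` essentially builds libcudf with CUDA's per-thread default stream, so each host thread's "stream 0" work runs on its own stream instead of the single legacy default stream that implicitly synchronizes everything. A standalone sketch (not cudf code) of what that flag changes:

```cuda
// Standalone sketch of per-thread default stream (PTDS) semantics. Compile with:
//   nvcc --default-stream per-thread ptds_demo.cu -o ptds_demo
// With the flag, each host thread's use of stream 0 refers to its own default
// stream, so the two workers below can overlap; without it they share the one
// legacy default stream, which serializes them and can hide missing syncs.
#include <cuda_runtime.h>
#include <thread>

__global__ void fill(int* p, int v) { p[threadIdx.x] = v; }

void worker(int v)
{
  int* d = nullptr;
  cudaMalloc(&d, 256 * sizeof(int));
  fill<<<1, 256>>>(d, v);        // launched on this thread's default stream
  cudaStreamSynchronize(0);      // under PTDS this only waits for this thread's stream
  cudaFree(d);
}

int main()
{
  std::thread a(worker, 1);
  std::thread b(worker, 2);
  a.join();
  b.join();
  return 0;
}
```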
Per @pxLi's analysis, I'm also able to verify that the C++ tests and Java unit tests all pass even with PTDS enabled if I keep the cudf source the same but install RMM from commit b458233ae5b41314ad474c88c60069918c52c7b3, which is just before rapidsai/rmm#851. Therefore it does seem related to that RMM change.

Given that this isn't localized to just the Java bindings but is also seen in the C++ unit tests as long as PTDS is enabled, I'll update the issue title accordingly.
Some sample failures I saw when running the unit tests on a build with PTDS=ON:
```
[ RUN      ] TypedScalarTest/8.ConstructNull
../tests/scalar/scalar_test.cpp:50: Failure
Value of: s.is_valid()
  Actual: true
Expected: false
[  FAILED  ] TypedScalarTest/8.ConstructNull, where TypeParam = bool (0 ms)
[ RUN      ] TypedScalarTest/9.DefaultValidity
../tests/scalar/scalar_test.cpp:41: Failure
Value of: s.is_valid()
  Actual: false
Expected: true
[  FAILED  ] TypedScalarTest/9.DefaultValidity, where TypeParam = float (0 ms)
```
And test failures like this one, which imply there's a race condition:
```
[ RUN      ] TypedScalarTestWithoutFixedPoint/9.SetValue
../tests/scalar/scalar_test.cpp:61: Failure
Expected equality of these values:
  value
    Which is: 9
  s.value()
    Which is: 9
[  FAILED  ] TypedScalarTestWithoutFixedPoint/9.SetValue, where TypeParam = float (0 ms)
```
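For context on why an equality failure with two seemingly identical values points at ordering rather than arithmetic: if the device-to-host read of a value is not synchronized against the stream that produced it, the comparison can intermittently observe stale data. A minimal, hypothetical sketch of that kind of missing cross-stream ordering (not cudf's actual scalar code):

```cpp
// Hypothetical sketch of a cross-stream ordering bug: the value is written on
// one stream and read back on another with nothing ordering the two, and the
// host checks the pinned buffer without synchronizing the copy stream.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  cudaStream_t producer, consumer;
  cudaStreamCreate(&producer);
  cudaStreamCreate(&consumer);

  float* d_value = nullptr;
  float* h_value = nullptr;  // pinned, so the copy below is genuinely asynchronous
  cudaMalloc(&d_value, sizeof(float));
  cudaMallocHost(&h_value, sizeof(float));
  *h_value = 0.f;

  float nine = 9.f;
  cudaMemcpyAsync(d_value, &nine, sizeof(float), cudaMemcpyHostToDevice, producer);

  // BUG 1: the device-to-host copy on `consumer` is not ordered after the write
  //        on `producer`, so it can read the old device contents.
  // BUG 2: the host checks *h_value without cudaStreamSynchronize(consumer), so
  //        it can also see the not-yet-updated host buffer.
  cudaMemcpyAsync(h_value, d_value, sizeof(float), cudaMemcpyDeviceToHost, consumer);
  if (*h_value != 9.f) {
    // The copy may well have landed by the time this line runs, in which case
    // the message prints "got 9" -- as confusing as the gtest output above.
    std::printf("expected 9, got %g\n", *h_value);
  }

  cudaStreamSynchronize(consumer);
  cudaFreeHost(h_value);
  cudaFree(d_value);
  cudaStreamDestroy(producer);
  cudaStreamDestroy(consumer);
  return 0;
}
```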
Along with some tests that crash with an invalid address:
```
[ RUN      ] BinaryOperationNullTest.Vector_Null_Vector_Valid
unknown file: Failure
C++ exception with description "CUDA error at: /home/jlowe/miniconda3/envs/cudf_dev/include/rmm/cuda_stream_view.hpp:95: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] BinaryOperationNullTest.Vector_Null_Vector_Valid (109 ms)
```
I can have a look today.
~I am unable to reproduce locally. I built with all the flags you specified above, with the latest 21.10 RMM, and ran on my V100 machine. All tests pass. :(~
I have a repro.
The problem is that when there are multiple free lists, the same pointer is being returned multiple times. I believe this is a bug that has been latent since before rapidsai/rmm#851, because it seems to be caused by using `auto` rather than `auto&`: we are copying an active `free_list` rather than referencing it, so after the call the free list returns to its previous state.

It was previously hidden by the fact that we always merged free lists after stealing.
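To make the `auto` vs `auto&` point concrete, here is a distilled sketch with simplified stand-in names (not RMM's actual classes): taking the free list by value means the erase only touches a temporary copy, so a later caller can be handed the very same block.

```cpp
// Distilled illustration of the auto-vs-auto& diagnosis above. free_list,
// free_lists, and steal_block_* are simplified stand-ins, not RMM's real types.
#include <cstdio>
#include <map>
#include <set>

using stream_id = int;
using free_list = std::set<void*>;            // blocks currently free for a stream
std::map<stream_id, free_list> free_lists;    // one free list per stream

void* steal_block_buggy(stream_id victim)
{
  auto blocks = free_lists[victim];   // BUG: `auto` copies the victim's free list
  if (blocks.empty()) return nullptr;
  void* b = *blocks.begin();
  blocks.erase(blocks.begin());       // only the copy is modified...
  return b;                           // ...so the victim's real list still holds b
}

void* steal_block_fixed(stream_id victim)
{
  auto& blocks = free_lists[victim];  // reference: the erase actually sticks
  if (blocks.empty()) return nullptr;
  void* b = *blocks.begin();
  blocks.erase(blocks.begin());
  return b;
}

int main()
{
  static int block;                   // pretend this is a device allocation
  free_lists[0].insert(&block);

  void* a = steal_block_buggy(0);
  void* b = steal_block_buggy(0);
  std::printf("%s\n", (a != nullptr && a == b) ? "same block handed out twice" : "ok");
  return 0;
}
```

With `auto&`, the second call finds the list empty and returns nullptr instead of handing out the same block again.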
**Describe the bug**
cudfjni unit tests failed in the nightly build; this could be related to https://github.com/rapidsai/rmm/pull/851.

We saw random assert failures in different runs, and all other tests failed with
`ai.rapids.cudf.CudaException: an illegal memory access was encountered`
or
`ai.rapids.cudf.CudaException: misaligned address`

08/28 run with commit 1d4a2fb09a89d591af0b0df7706866b9b67b5b47
08/29 run with commit 1d4a2fb09a89d591af0b0df7706866b9b67b5b47

**Steps/Code to reproduce bug**
Follow https://github.com/rapidsai/cudf/blob/branch-21.10/java/ci/README.md and run with `SKIP_JAVA_TESTS=false`.

**Expected behavior**
Unit tests pass with no failures or errors.