rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Diagnose `pytest cuml-dask` hang in CUDA 12.5 wheel CI tests #6050

Closed by divyegala 1 month ago

divyegala commented 2 months ago

As first reported by @jakirkham in PR https://github.com/rapidsai/cuml/pull/6031, `pytest cuml-dask` now hangs in the CUDA 12.5 wheel CI jobs. Until we find the root cause, we will temporarily disable that test suite.

Reference to the hang: CI job link

cc @dantegd

jakirkham commented 2 months ago

It is worth noting that this hang also shows up in a no-change PR: https://github.com/rapidsai/cuml/pull/6047#issuecomment-2316020277

viclafargue commented 2 months ago

I can no longer reproduce the issue on an L40 machine with the latest commit hash. For those interested in reproducing it elsewhere, here are the commands I used:

```bash
docker run --gpus all --pull always --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -u 0 --entrypoint bash rapidsai/citestwheel:cuda12.5.1-ubuntu22.04-py3.10
```

```bash
export RAPIDS_BUILD_TYPE=nightly
export RAPIDS_REPOSITORY=rapidsai/cuml
export RAPIDS_REF_NAME=branch-24.10
export RAPIDS_SHA=d87b0ce
export RAPIDS_NIGHTLY_DATE=2024-09-05

git clone https://github.com/rapidsai/gha-tools.git
(git clone https://github.com/rapidsai/cuml.git && cd cuml && git checkout $RAPIDS_SHA)

mkdir -p ./dist
GHA_TOOLS_DIR=gha-tools/tools
RAPIDS_PY_CUDA_SUFFIX="$($GHA_TOOLS_DIR/rapids-wheel-ctk-name-gen ${RAPIDS_CUDA_VERSION})"
RAPIDS_PY_WHEEL_NAME="cuml_${RAPIDS_PY_CUDA_SUFFIX}" $GHA_TOOLS_DIR/rapids-download-wheels-from-s3 ./dist

python -m pip install $(echo ./dist/cuml*.whl)[test]

bash cuml/ci/run_cuml_dask_pytests.sh
```
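
If the hang does reproduce, the last command would block the shell indefinitely. As a rough sketch (not part of the CI scripts), the run could be wrapped in a wall-clock limit so that a hang turns into a visible failure. This assumes GNU coreutils `timeout` is available in the container; `run_with_limit` is a hypothetical helper name, and the 30-minute limit in the usage comment is an arbitrary choice.

```bash
# Hypothetical helper (not from the cuml repo): run a command under a
# wall-clock limit so that a hang fails fast instead of blocking forever.
# Assumes GNU coreutils `timeout` is installed.
run_with_limit() {
    local limit status
    limit="$1"
    shift
    status=0
    # Send SIGTERM once the limit is reached, then SIGKILL 5 seconds later
    # if the process is still alive.
    timeout --kill-after=5s "$limit" "$@" || status=$?
    # GNU timeout exits with 124 when the limit was reached and the command
    # stopped on SIGTERM, and with 137 when it had to be killed with SIGKILL.
    if [ "$status" -eq 124 ] || [ "$status" -eq 137 ]; then
        echo "command exceeded ${limit} and was stopped (possible hang)" >&2
    fi
    return "$status"
}

# Example usage with the test script from the commands above:
#   run_with_limit 30m bash cuml/ci/run_cuml_dask_pytests.sh
```

An exit status of 124 or 137 then separates a run that was stopped for exceeding the limit from an ordinary test failure, which would return the test runner's own non-zero status. Note that 137 can also come from a SIGKILL sent by something other than `timeout`.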

divyegala commented 1 month ago

Closed by #6051