rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Invalid memory access in dask cudf concat. #7722

Closed trivialfis closed 3 years ago

trivialfis commented 3 years ago

This has happened in our CI a few times, but I haven't been able to reproduce it deterministically yet. The error log at the end is copied from one of the failing Jenkins runs. I'm opening an issue to see if someone else can reproduce it more reliably. Sorry for the noise.

The cudf version used is 0.18.

[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] terminate called after throwing an instance of 'thrust::system::system_error'
[2021-03-25T08:11:20.565Z]   what():  for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
[2021-03-25T08:11:20.565Z] Fatal Python error: Aborted
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36acffd700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 300 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/queue.py", line 179 in get
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/threadpoolexecutor.py", line 51 in _worker
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36ad7fe700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 28 in poll
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 48 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/process.py", line 140 in join
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 233 in _watch_process
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36adfff700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 296 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/queue.py", line 170 in get
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 218 in _watch_message_queue
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36f0f89700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 28 in poll
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 48 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/process.py", line 140 in join
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 233 in _watch_process
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36f3ffb700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 296 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/queue.py", line 170 in get
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 218 in _watch_message_queue
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36f47fc700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 28 in poll
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 48 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/process.py", line 140 in join
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 233 in _watch_process
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36f57fe700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 296 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/queue.py", line 170 in get
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 218 in _watch_message_queue
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36f5fff700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 28 in poll
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 48 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/process.py", line 140 in join
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 233 in _watch_process
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36ff46b700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 296 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/queue.py", line 170 in get
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 218 in _watch_message_queue
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f3702ffd700 (most recent call first):
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 300 in wait
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/queue.py", line 179 in get
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/threadpoolexecutor.py", line 51 in _worker
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.566Z] 
[2021-03-25T08:11:20.566Z] Thread 0x00007f37037fe700 (most recent call first):
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/profile.py", line 269 in _watch
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.566Z] 
[2021-03-25T08:11:20.566Z] Thread 0x00007f3703fff700 (most recent call first):
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/selectors.py", line 468 in select
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/asyncio/base_events.py", line 1750 in _run_once
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/asyncio/base_events.py", line 541 in run_forever
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 199 in start
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/utils.py", line 428 in run_loop
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.566Z] 
[2021-03-25T08:11:20.566Z] Thread 0x00007f3803fff700 (most recent call first):
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/concurrent/futures/thread.py", line 78 in _worker
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.566Z] 
[2021-03-25T08:11:20.566Z] Current thread 0x00007f3a53a72740 (most recent call first):
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/cudf/core/column/column.py", line 278 in _concat
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/cudf/core/series.py", line 1734 in _concat
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/cudf/core/reshape.py", line 378 in concat
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask_cudf/backends.py", line 210 in concat_cudf
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/dataframe/methods.py", line 422 in concat
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/dataframe/core.py", line 102 in _concat
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/dataframe/core.py", line 107 in finalize
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/base.py", line 566 in <listcomp>
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/base.py", line 566 in compute
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/base.py", line 283 in compute
[2021-03-25T08:11:20.566Z]   File "tests/python/test_with_dask.py", line 186 in run_boost_from_prediction
[2021-03-25T08:11:20.566Z]   File "/workspace/tests/python-gpu/test_gpu_with_dask.py", line 183 in test_boost_from_prediction
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/python.py", line 1641 in runtest
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/runner.py", line 255 in <lambda>
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/runner.py", line 311 in from_call
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/runner.py", line 255 in call_runtest_hook
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/runner.py", line 215 in call_and_report
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/runner.py", line 126 in runtestprotocol
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/main.py", line 323 in _main
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/main.py", line 269 in wrap_session
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/config/__init__.py", line 163 in main
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/config/__init__.py", line 185 in console_main
trivialfis commented 3 years ago

cc @hcho3

kkraus14 commented 3 years ago

@trivialfis any chance you can run the workload through cuda-memcheck and see if it pops out anything actionable? Right now there's not really any information for us to act on here.

trivialfis commented 3 years ago

@kkraus14 Thanks for the reply, I will try to run it next week.

trivialfis commented 3 years ago

Hmm ... running it locally with memcheck/sanitizer on my 2-GPU system works fine, but it's somehow easier to reproduce on CI (another log). Even on CI, though, it's still not deterministically reproducible. I will try to run it on other clusters.

kkraus14 commented 3 years ago

Hmm, well it looks to be concatenation-related:

[2021-03-29T12:36:59.078Z] Current thread 0x00007f0381574740 (most recent call first):
[2021-03-29T12:36:59.078Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/cudf/core/column/column.py", line 278 in _concat
[2021-03-29T12:36:59.078Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/cudf/core/series.py", line 1734 in _concat
[2021-03-29T12:36:59.078Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/cudf/core/reshape.py", line 378 in concat
trivialfis commented 3 years ago

Let's see if it's caused by https://github.com/dmlc/xgboost/pull/6798 .

trivialfis commented 3 years ago

> Let's see if it's caused by dmlc/xgboost#6798.

Nope.

trivialfis commented 3 years ago

Hi @kkraus14 . I got a memcheck log by running it on our CI. Hopefully that can clear things up a little bit. log.txt

Full log in here: https://xgboost-ci.net/blue/rest/organizations/jenkins/pipelines/xgboost/branches/PR-6815/runs/3/nodes/226/log/?start=0

CUDA: 10.2. Conda env:

    conda create -n gpu_test -c rapidsai-nightly -c rapidsai -c nvidia -c conda-forge -c defaults \
        python=3.7 cudf=0.18* rmm=0.18* cudatoolkit=10.2 dask dask-cuda dask-cudf cupy \
        numpy pytest scipy scikit-learn pandas matplotlib wheel python-kubernetes urllib3 graphviz hypothesis
kkraus14 commented 3 years ago

Relevant piece:

[2021-03-31T19:09:52.917Z] ========= Invalid __global__ read of size 8
[2021-03-31T19:09:52.917Z] =========     at 0x00000460 in void cudf::detail::fused_concatenate_kernel<long, int=256, bool=0>(cudf::column_device_view const *, unsigned long const *, int, cudf::mutable_column_device_view, int*)
[2021-03-31T19:09:52.917Z] =========     by thread (19,0,0) in block (2,0,0)
[2021-03-31T19:09:52.917Z] =========     Address 0x7f471c0006f8 is out of bounds
[2021-03-31T19:09:52.917Z] =========     Device Frame:void cudf::detail::fused_concatenate_kernel<long, int=256, bool=0>(cudf::column_device_view const *, unsigned long const *, int, cudf::mutable_column_device_view, int*) (void cudf::detail::fused_concatenate_kernel<long, int=256, bool=0>(cudf::column_device_view const *, unsigned long const *, int, cudf::mutable_column_device_view, int*) : 0x460)
trivialfis commented 3 years ago

The input is a small dataset from scikit-learn:

import dask.dataframe as dd
from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def test_boost_from_prediction(local_cuda_cluster: LocalCUDACluster) -> None:
    import cudf
    from sklearn.datasets import load_breast_cancer

    with Client(local_cuda_cluster) as client:
        X_, y_ = load_breast_cancer(return_X_y=True)
        X = dd.from_array(X_, chunksize=100).map_partitions(cudf.from_pandas)
        y = dd.from_array(y_, chunksize=100).map_partitions(cudf.from_pandas)

Concatenation is used to combine all the data partitions within each worker into a single frame; a rough sketch of that dispatch is below.
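
Roughly (a simplified, hypothetical helper for illustration; not the actual xgboost code, which is linked later in this thread):

```python
import cudf
import pandas as pd


def concat_partitions(parts):
    """Combine the list of partitions a worker received into one frame."""
    if isinstance(parts[0], (cudf.DataFrame, cudf.Series)):
        # GPU path: this cudf.concat call is what reaches
        # cudf::detail::fused_concatenate_kernel in the memcheck output above.
        return cudf.concat(parts)
    return pd.concat(parts)
```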

kkraus14 commented 3 years ago

@trivialfis how many workers is CI running in this situation? Could you point us to the code that xgboost uses to concatenate the parts on each worker?

trivialfis commented 3 years ago

Thanks.

> how many workers is CI running in this situation?

8 GPUs I believe.

> Could you point us to the code that xgboost uses to concatenate the parts on each worker?

This is the concat function: https://github.com/dmlc/xgboost/blob/905fdd3e08d91077aada776346c7e49e4ff69334/python-package/xgboost/dask.py#L170 called by https://github.com/dmlc/xgboost/blob/905fdd3e08d91077aada776346c7e49e4ff69334/python-package/xgboost/dask.py#L743 .

The test starts here: https://github.com/dmlc/xgboost/blob/905fdd3e08d91077aada776346c7e49e4ff69334/tests/python-gpu/test_gpu_with_dask.py#L176 and the actual logic of the test is defined here: https://github.com/dmlc/xgboost/blob/905fdd3e08d91077aada776346c7e49e4ff69334/tests/python/test_with_dask.py#L165 (so it can be shared between the CPU and GPU tests).

Docker file: https://github.com/dmlc/xgboost/blob/master/tests/ci_build/Dockerfile.gpu

trivialfis commented 3 years ago

@hcho3 tried to re-enable the test by upgrading to the nightly in https://github.com/dmlc/xgboost/pull/6825. A new error trace appears, and cupy is also complaining about a memory error. I suspect something new in arrow that isn't handled in downstream projects, since both errors come from data construction (map_blocks/map_partitions). Or is it somewhere in rmm?

From cuDF

```
[2021-04-06T04:11:56.613Z] self = Dask DataFrame Structure: 0 1 2 3 4 5 6 7 ... Dask Name: from-dask, 100 tasks
[2021-04-06T04:11:56.613Z] func = , args = (), kwargs = {}
[2021-04-06T04:11:56.613Z]
[2021-04-06T04:11:56.613Z] >       return map_partitions(func, self, *args, **kwargs)
[2021-04-06T04:11:56.613Z] /opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/dataframe/core.py:684:
[2021-04-06T04:11:56.613Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2021-04-06T04:11:56.613Z] >       meta = _emulate(func, *args, udf=True, **kwargs)
[2021-04-06T04:11:56.613Z] /opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/dataframe/core.py:5561:
[2021-04-06T04:11:56.613Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2021-04-06T04:11:56.613Z] >       return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
[2021-04-06T04:11:56.613Z] /opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/dataframe/core.py:5508:
[2021-04-06T04:11:56.613Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2021-04-06T04:11:56.613Z] value = RuntimeError('CUDA error encountered at: ../src/interop/from_arrow.cpp:141: 700 cudaErrorIllegalAddress an illegal memory access was encountered')
[2021-04-06T04:11:56.613Z] >       self.gen.throw(type, value, traceback)
[2021-04-06T04:11:56.613Z] /opt/python/envs/gpu_test/lib/python3.7/contextlib.py:130:
[2021-04-06T04:11:56.613Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2021-04-06T04:11:56.613Z] funcname = 'from_pandas', udf = True
[2021-04-06T04:11:56.613Z] >       raise ValueError(msg) from e
[2021-04-06T04:11:56.613Z] E ValueError: Metadata inference failed in `from_pandas`.
[2021-04-06T04:11:56.613Z] E
[2021-04-06T04:11:56.613Z] E You have supplied a custom function and Dask is unable to
[2021-04-06T04:11:56.613Z] E determine the type of output that that function returns.
[2021-04-06T04:11:56.613Z] E
[2021-04-06T04:11:56.613Z] E To resolve this please provide a meta= keyword.
[2021-04-06T04:11:56.613Z] E The docstring of the Dask function you ran should have more information.
[2021-04-06T04:11:56.613Z] E
[2021-04-06T04:11:56.613Z] E Original error is below:
[2021-04-06T04:11:56.613Z] E ------------------------
[2021-04-06T04:11:56.613Z] E RuntimeError('CUDA error encountered at: ../src/interop/from_arrow.cpp:141: 700 cudaErrorIllegalAddress an illegal memory access was encountered')
[2021-04-06T04:11:56.613Z] E
[2021-04-06T04:11:56.613Z] E Traceback:
[2021-04-06T04:11:56.613Z] E ---------
[2021-04-06T04:11:56.614Z] E   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/dataframe/utils.py", line 180, in raise_on_meta_error
[2021-04-06T04:11:56.614Z] E     yield
[2021-04-06T04:11:56.614Z] E   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/dataframe/core.py", line 5508, in _emulate
[2021-04-06T04:11:56.614Z] E     return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
[2021-04-06T04:11:56.614Z] E   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/cudf/core/dataframe.py", line 7865, in from_pandas
[2021-04-06T04:11:56.614Z] E     return DataFrame.from_pandas(obj, nan_as_null=nan_as_null)
[2021-04-06T04:11:56.614Z] E   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/cudf/core/dataframe.py", line 5527, in from_pandas
[2021-04-06T04:11:56.614Z] E     col_value.array, nan_as_null=nan_as_null
[2021-04-06T04:11:56.614Z] E   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/cudf/core/column/column.py", line 2087, in as_column
[2021-04-06T04:11:56.614Z] E     nan_as_null=nan_as_null,
[2021-04-06T04:11:56.614Z] E   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/cudf/core/column/column.py", line 1881, in as_column
[2021-04-06T04:11:56.614Z] E     col = ColumnBase.from_arrow(arbitrary)
[2021-04-06T04:11:56.614Z] E   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/cudf/core/column/column.py", line 433, in from_arrow
[2021-04-06T04:11:56.614Z] E     return libcudf.interop.from_arrow(data, data.column_names)._data[
[2021-04-06T04:11:56.614Z] E   File "cudf/_lib/interop.pyx", line 167, in cudf._lib.interop.from_arrow
```

From cupy

```
[2021-04-06T04:11:56.618Z] >       dtype = apply_infer_dtype(func, args, original_kwargs, "map_blocks")
[2021-04-06T04:11:56.619Z] /opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/array/core.py:691:
[2021-04-06T04:11:56.619Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2021-04-06T04:11:56.619Z] func = , args = [array([[1.]])], kwargs = {}
[2021-04-06T04:11:56.619Z] funcname = 'map_blocks', suggest_dtype = 'dtype', nout = None
[2021-04-06T04:11:56.619Z] >       raise ValueError(msg)
[2021-04-06T04:11:56.619Z] E ValueError: `dtype` inference failed in `map_blocks`.
[2021-04-06T04:11:56.619Z] E
[2021-04-06T04:11:56.619Z] E Please specify the dtype explicitly using the `dtype` kwarg.
[2021-04-06T04:11:56.619Z] E
[2021-04-06T04:11:56.619Z] E Original error is below:
[2021-04-06T04:11:56.619Z] E ------------------------
[2021-04-06T04:11:56.619Z] E CUDARuntimeError('cudaErrorIllegalAddress: an illegal memory access was encountered')
[2021-04-06T04:11:56.619Z] E
[2021-04-06T04:11:56.619Z] E Traceback:
[2021-04-06T04:11:56.619Z] E ---------
[2021-04-06T04:11:56.619Z] E   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/array/core.py", line 391, in apply_infer_dtype
[2021-04-06T04:11:56.619Z] E     o = func(*args, **kwargs)
[2021-04-06T04:11:56.619Z] E   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/cupy/_creation/from_data.py", line 41, in array
[2021-04-06T04:11:56.619Z] E     return core.array(obj, dtype, copy, order, subok, ndmin)
[2021-04-06T04:11:56.619Z] E   File "cupy/core/core.pyx", line 2016, in cupy.core.core.array
[2021-04-06T04:11:56.619Z] E   File "cupy/core/core.pyx", line 2095, in cupy.core.core.array
[2021-04-06T04:11:56.619Z] E   File "cupy/core/core.pyx", line 2181, in cupy.core.core._send_object_to_gpu
[2021-04-06T04:11:56.619Z] E   File "cupy/cuda/memory.pyx", line 396, in cupy.cuda.memory.MemoryPointer.copy_from_host_async
[2021-04-06T04:11:56.619Z] E   File "cupy_backends/cuda/api/runtime.pyx", line 641, in cupy_backends.cuda.api.runtime.memcpyAsync
[2021-04-06T04:11:56.619Z] E   File "cupy_backends/cuda/api/runtime.pyx", line 247, in cupy_backends.cuda.api.runtime.check_status
```
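
Similarly, the CuPy failure occurs during dask's dtype inference, which simply copies a tiny NumPy array to the GPU, roughly (a sketch of what `apply_infer_dtype` does with the mapped `cupy.array` call):

```python
import numpy as np
import cupy

# apply_infer_dtype calls the mapped function (essentially cupy.array) on
# np.ones((1, 1)); this host-to-device memcpy is what raises
# cudaErrorIllegalAddress in the failing CI run.
cupy.array(np.ones((1, 1)))
```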

kkraus14 commented 3 years ago

What version of Arrow is being used? cuDF is pinned to 1.0.1 as of now so I doubt it's something in Arrow.

In the CuPy test are you setting the CuPy allocator to RMM?

How are you creating the host memory in this case?

trivialfis commented 3 years ago

I don't think we have done anything special in this case. It's still in the test data construction phase. The data either comes from an sklearn dataset or is constructed using da.random.

trivialfis commented 3 years ago

> cuDF is pinned to 1.0.1 as of now so I doubt it's something in Arrow.

Thanks for sharing.

Looking at the rmm configuration in tests, this should be the only parameter rmm_pool_size=2GB for LocalCUDACluster.
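
For reference, the cluster setup amounts to roughly the following (a minimal sketch, assuming dask_cuda's LocalCUDACluster with only the pool size configured):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per visible GPU, each worker pre-allocating a 2 GB RMM pool.
with LocalCUDACluster(rmm_pool_size="2GB") as cluster, Client(cluster) as client:
    ...  # the dask-cudf / xgboost test code runs against this client
```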

kkraus14 commented 3 years ago

> cuDF is pinned to 1.0.1 as of now so I doubt it's something in Arrow.
>
> Thanks for sharing.
>
> Looking at the rmm configuration in tests, this should be the only parameter rmm_pool_size=2GB for LocalCUDACluster.

Do you always import cudf? It implicitly sets the CuPy allocator under the hood.
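
For context, the allocator hookup mentioned above is roughly equivalent to the following (a sketch; the exact hook cudf installs internally may differ):

```python
import cupy
import rmm

# Route CuPy's device allocations through RMM so CuPy and cuDF share one pool.
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)
```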

trivialfis commented 3 years ago

> Do you always import cudf?

I assume yes, since it's used in pytest.mark.skipif(no_cudf()). The no_cudf function tries to import cudf.

kkraus14 commented 3 years ago

Any chance you could try not importing cudf to see if the cupy-related failures go away?

trivialfis commented 3 years ago

Yup. I will try to sort it out tomorrow. Thanks for the suggestion.

trivialfis commented 3 years ago

I will need to try harder to reproduce it locally. Debugging on CI with changes of this size is not easy.

trivialfis commented 3 years ago

I think it's caused by xgboost setting the device to something other than 0 in another test.
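
To illustrate the suspected failure mode (a hypothetical minimal reproducer, not the actual xgboost test):

```python
import cupy
import cudf

df = cudf.DataFrame({"x": [1.0, 2.0, 3.0]})  # buffers allocated while device 0 is current

cupy.cuda.Device(1).use()  # another test switches the current device and never restores it

# cudf kernels now launch on device 1 while df's buffers live on device 0;
# without peer access this can surface as cudaErrorIllegalAddress.
out = cudf.concat([df, df])
```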

trivialfis commented 3 years ago

I resolved the issue by ensuring xgboost doesn't change the device ordinal. Still, it might be better for cuDF to have some guards on the GPU ID. Feel free to close this one if I should open a new issue on that topic. ;-) Thanks for all the replies!