Open krasznaa opened 1 year ago
It's worth adding (I only realised afterwards) that we do exactly the same test in CUDA as well.
Using the CUDA API to perform a copy from a device memory area to a managed one does not produce a runtime error in the same WSL environment. 🤔
[bash][Celeborn]:vecmem > ~/ATLAS/vecmem/build-llvm/bin/vecmem_test_cuda
[==========] Running 23 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 9 tests from cuda_containers_test
[ RUN ] cuda_containers_test.managed_memory
[ OK ] cuda_containers_test.managed_memory (658 ms)
...
[----------] 5 tests from cuda_jagged_vector_view_test
[ RUN ] cuda_jagged_vector_view_test.mutate_in_kernel
[ OK ] cuda_jagged_vector_view_test.mutate_in_kernel (2 ms)
[ RUN ] cuda_jagged_vector_view_test.set_in_kernel
[ OK ] cuda_jagged_vector_view_test.set_in_kernel (4 ms)
[ RUN ] cuda_jagged_vector_view_test.set_in_contiguous_kernel
[ OK ] cuda_jagged_vector_view_test.set_in_contiguous_kernel (6 ms)
[ RUN ] cuda_jagged_vector_view_test.filter
[ OK ] cuda_jagged_vector_view_test.filter (4 ms)
[ RUN ] cuda_jagged_vector_view_test.zero_capacity
[ OK ] cuda_jagged_vector_view_test.zero_capacity (5 ms)
[----------] 5 tests from cuda_jagged_vector_view_test (23 ms total)
...
[----------] Global test environment tear-down
[==========] 23 tests from 4 test suites ran. (1146 ms total)
[ PASSED ] 23 tests.
So there is definitely something SYCL / LLVM specific going on here; it's not simply that CUDA disallows this operation. 🤔
Describe the bug
This is a super obscure error that I bumped into just now. If we can even call it an error...
One of the unit tests of our project tries to copy data between a memory area in managed/shared memory and another one in device memory. There are a fair number of layers between our code and the underlying SYCL code doing that, but that's what's happening here:
https://github.com/acts-project/vecmem/blob/main/tests/sycl/test_sycl_jagged_containers.sycl#L428
This code worked well on all platforms that I have tried until today. But today I tried to make it work on a pretty obscure platform. I'm using a hand-built version of the 2022-12 tag of this repository in WSL, with CUDA 11.7.1 installed in WSL as well, and the latest NVIDIA driver installed on Windows itself. In this definitely non-standard setup that test crashes with the following:

There was no deep thinking behind setting up the test like this, it was just convenient for technical reasons. And as soon as I stop using shared memory there and switch to using host memory, this error disappears. But since the error only shows up on WSL, I thought it would be interesting to share this find. :wink:
To Reproduce
This is a bit difficult. 😦 I described my OS / software setup above. In that environment one can just build https://github.com/acts-project/vecmem/tree/v0.25.0 with its tests, and the error shows up. Unfortunately, neither setting up this build environment nor building the project in it is entirely trivial. So I'd only produce a writeup about it on request...
Pinging @ivorobts.