intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

Error when printing values from reassigned device pointer in function #13219

Open Dart120 opened 7 months ago

Dart120 commented 7 months ago

Describe the bug

Outside the function the pointer is null, and I pass a pointer to this pointer into the function. Inside the function I allocate device memory for a struct, set a field with a kernel, and then assign the address of the allocated struct to the pointer that was passed in. I then print the field from inside the function, and again after the function returns. The first print fails but the second one works, and I'm not sure why :(

To reproduce

Code Snippet

#include <CL/sycl.hpp>
using namespace cl::sycl;
struct MyStruct {
    int value;
};

void manipulateStruct(MyStruct** ptrToDeviceA, queue& q) {
    // Allocate new instance of MyStruct in device memory
    MyStruct* newDeviceA = malloc_device<MyStruct>(1, q);

    // Initialize the new instance with a kernel
    q.submit([&](handler& h) {
        h.single_task([=]() {
            newDeviceA->value = 100; 
        });
    }).wait();

    // If ptrToDeviceA is already pointing to a device allocation, free it
    if (*ptrToDeviceA != nullptr) {
        free(*ptrToDeviceA, q);
    }

    // Redirect ptrToDeviceA to the new device memory allocation
    *ptrToDeviceA = newDeviceA;

    size_t bufferSize = 1024;
    size_t maxStatementSize = 256;
    // This causes an error....
    q.submit([&](handler& h) {
        stream out(256, 1024, h);
        h.single_task([=]() {
            out << (*ptrToDeviceA)->value << sycl::endl;
        });
    }).wait();
}
int main() {
    queue q;

    // Pointer initially meant for host memory but is nullptr
    MyStruct* deviceA = nullptr;

    // Use manipulateStruct to allocate and initialize the struct in device memory
    manipulateStruct(&deviceA, q);
    size_t bufferSize = 1024;
    size_t maxStatementSize = 256;

    //This is completely fine!
    q.submit([&](handler& h) {
        stream out(256, 1024, h);
        h.single_task([=]() {
            out << "This many nnz: " << deviceA->value << sycl::endl;
        });
    }).wait();

    free(deviceA, q);
}

Compiled with

clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda test.cpp -o test
./test

Error message

UR CUDA ERROR:
        Value:           700
        Name:            CUDA_ERROR_ILLEGAL_ADDRESS
        Description:     an illegal memory access was encountered
        Function:        urEnqueueMemBufferRead
        Source Location: /home/temi/sycl_workspace/llvm/build/_deps/unified-runtime-src/source/adapters/cuda/enqueue.cpp:1576

terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)
Aborted (core dumped)

Expected

"This many nnz: 100"

Environment

OS: Linux (Pop!_OS)
Device and Vendor: NVIDIA, RTX 2080 Ti
Compiler: clang version 19.0.0git (https://github.com/intel/llvm db6a05d101b990ead474b23a9c8c8ebc6e5710c9)

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:01:00.0 Off |                  N/A |
| 16%   43C    P8              20W / 260W |    466MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2672      G   /usr/lib/xorg/Xorg                          301MiB |
|    0   N/A  N/A      2807      G   /usr/bin/gnome-shell                         19MiB |
|    0   N/A  N/A      4083      G   firefox                                      77MiB |
|    0   N/A  N/A      8540      G   ...yOnDemand --variations-seed-version       26MiB |
|    0   N/A  N/A    688820      G   ...rker,SpareRendererForSitePerProcess       37MiB |
+---------------------------------------------------------------------------------------+
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 2080 Ti 7.5 [CUDA 12.3]

Platforms: 1
Platform [#1]:
    Version  : CUDA 12.3
    Name     : NVIDIA CUDA BACKEND
    Vendor   : NVIDIA Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : 7.5
        Name       : NVIDIA GeForce RTX 2080 Ti
        Vendor     : NVIDIA Corporation
        Driver     : CUDA 12.3
        Aspects    : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_native_assert ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_interop_memory_import ext_oneapi_interop_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_limited_graph
ur_print: Images are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
        info::device::sub_group_sizes: 32
default_selector()      : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 2080 Ti 7.5 [CUDA 12.3]
accelerator_selector()  : No device of requested type available. -1 (PI_ERRO...
cpu_selector()          : No device of requested type available. -1 (PI_ERRO...
gpu_selector()          : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 2080 Ti 7.5 [CUDA 12.3]
custom_selector(gpu)    : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 2080 Ti 7.5 [CUDA 12.3]
custom_selector(cpu)    : No device of requested type available. -1 (PI_ERRO...
custom_selector(acc)    : No device of requested type available. -1 (PI_ERRO...

Additional context

lmk if you need more info

bader commented 7 months ago

ptrToDeviceA points to the host memory, so dereferencing it on the device is invalid.
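
For illustration, here is what the captured pointer actually refers to (a sketch of the failing case, not the fix):

MyStruct* deviceA = nullptr;    // deviceA itself is a host variable
manipulateStruct(&deviceA, q);  // ptrToDeviceA == &deviceA, i.e. a host address

// The kernel lambda captures the value of ptrToDeviceA, which is the address of
// the host variable deviceA. Reading *ptrToDeviceA on the device therefore touches
// host memory, which is what produces CUDA_ERROR_ILLEGAL_ADDRESS.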

Dart120 commented 7 months ago

ptrToDeviceA points to the host memory, so dereferencing it on the device is invalid.

I thought that since ptrToDeviceA is now pointing to device memory, it would be okay to dereference it on the device?

because of

 // Redirect ptrToDeviceA to the new device memory allocation
    *ptrToDeviceA = newDeviceA;
bader commented 7 months ago

*ptrToDeviceA points to device memory, but not ptrToDeviceA.

Dart120 commented 7 months ago

*ptrToDeviceA points to device memory, but not ptrToDeviceA.

Okay, so ptrToDeviceA is a host-memory pointer to a device-memory pointer, and that's why it doesn't work? In that case, how can I do what I was trying to do in this code without the error? Thank you!

bader commented 7 months ago

I'm not sure what you are trying to do, but if you make the memory allocation for deviceA accessible from the device, the code should work.

Dart120 commented 7 months ago

So change

MyStruct* deviceA = nullptr;

to

 MyStruct* deviceA = malloc_device<MyStruct>(1, q);

I have just tried this and it didn't work.

bader commented 7 months ago

Here is what I mean:

MyStruct** ptrToDeviceA = malloc_shared<MyStruct*>(1, q);
manipulateStruct(ptrToDeviceA, q);
Dart120 commented 7 months ago

Ah okay, is it not possible to just use malloc_device in this case? If not, why? Sorry about all the questions, just trying to wrap my head around it!

bader commented 7 months ago

If I understand it right, you want to overwrite *ptrToDeviceA on the host, and malloc_device allocations are not accessible from the host.

Please read the USM section of the spec: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_kinds_of_unified_shared_memory.
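
For completeness, here is a minimal sketch of the whole program with that change applied (untested here, assuming the SYCL 2020 USM API): the slot that holds the device pointer is itself a malloc_shared allocation, so the host code in manipulateStruct can update it and kernels can dereference it.

#include <sycl/sycl.hpp>
using namespace sycl;

struct MyStruct {
    int value;
};

void manipulateStruct(MyStruct** ptrToDeviceA, queue& q) {
    // The struct itself still lives in device memory and is initialized on the device.
    MyStruct* newDeviceA = malloc_device<MyStruct>(1, q);
    q.single_task([=]() { newDeviceA->value = 100; }).wait();

    // Updating *ptrToDeviceA on the host is fine because the slot it points to
    // is a shared USM allocation, accessible from both host and device.
    if (*ptrToDeviceA != nullptr)
        free(*ptrToDeviceA, q);
    *ptrToDeviceA = newDeviceA;

    // Dereferencing ptrToDeviceA inside a kernel is now legal for the same reason.
    q.submit([&](handler& h) {
        stream out(1024, 256, h);
        h.single_task([=]() { out << (*ptrToDeviceA)->value << endl; });
    }).wait();
}

int main() {
    queue q;

    // Allocate the pointer-to-pointer slot in shared memory and start it at nullptr
    // (malloc_shared does not zero-initialize).
    MyStruct** ptrToDeviceA = malloc_shared<MyStruct*>(1, q);
    *ptrToDeviceA = nullptr;

    manipulateStruct(ptrToDeviceA, q);

    q.submit([&](handler& h) {
        stream out(1024, 256, h);
        h.single_task([=]() {
            out << "This many nnz: " << (*ptrToDeviceA)->value << endl;
        });
    }).wait();

    free(*ptrToDeviceA, q);  // the device-side MyStruct
    free(ptrToDeviceA, q);   // the shared slot holding the pointer
}

An alternative is to keep deviceA as a plain host variable, as in the original code, and simply not dereference ptrToDeviceA inside a kernel (capture newDeviceA, or a host-side copy of *ptrToDeviceA, by value instead).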