intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

[L0][UR] Multi device context buffer memory management does not work #10555

Open hdelan opened 1 year ago

hdelan commented 1 year ago

Using clang version 17.0.0 built from checkout https://github.com/intel/llvm.git (23a6f389c1e45df077c6f15b691835b2976fda4d)

This may be a problem with the L0 adapter or with the L0 driver that I am using. Any advice would be appreciated.

#include <sycl/sycl.hpp>

constexpr size_t n = 16;

class mykernel1;
class mykernel2;

int main() {

  std::vector<int> A(n, 0);

  printf("Before: \n");
  for (auto &a : A)
    printf(" %d ", a);
  printf("\n");

  sycl::device dev1 = sycl::device::get_devices()[0];
  sycl::device dev2 = sycl::device::get_devices()[1];

  sycl::context ctx({dev1, dev2});

  sycl::queue q1{ctx, dev1};
  sycl::queue q2{ctx, dev2};

  {
    sycl::buffer BufA{A};

    q1.submit([&](sycl::handler &cgh) {
      sycl::accessor AccA{BufA, cgh};
      cgh.parallel_for<mykernel1>(
          sycl::nd_range<1>{n, n},
          [=](sycl::nd_item<1> idx) { AccA[idx.get_global_id()] += 1; });
    });
    q2.submit([&](sycl::handler &cgh) {
      sycl::accessor AccA{BufA, cgh};
      cgh.parallel_for<mykernel2>(
          sycl::nd_range<1>{n, n},
          [=](sycl::nd_item<1> idx) { AccA[idx.get_global_id()] += 100; });
    });
  }
  printf("After: \n");
  for (auto &a : A)
    printf(" %d ", a);
  printf("\n");
}

The L0 adapter should manage data migration between devices in the same context for a given buffer/image. I would expect this to print 101 for all values. Instead the L0 driver appears to be segfaulting.
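For reference, a sketch of the expected output, assuming the buffer is migrated correctly between the two devices (each element starts at 0, is incremented by 1 on dev1 and by 100 on dev2, so 0 + 1 + 100 = 101):

After: 
 101  101  101  101  101  101  101  101  101  101  101  101  101  101  101  101 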

$ sycl-ls
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x56a0] 1.3 [1.3.25018]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) UHD Graphics 750 [0x4c8a] 1.3 [1.3.25018]
$ clang++ -fsycl test.cpp
$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./a.out 
Before: 
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
Segmentation fault

Here is a backtrace from gdb:

Thread 1 "a.out" received signal SIGSEGV, Segmentation fault.
0x00007fffef76a94a in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
(gdb) bt
#0  0x00007fffef76a94a in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#1  0x00007fffef73a0a8 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#2  0x00007fffef746280 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#3  0x00007ffff0a0804a in zeCommandListAppendMemoryCopy (hCommandList=0x10350d0, dstptr=0xffffd556aa500000, srcptr=0x4bb1b0, 
    size=64, hSignalEvent=0xf15a60, numWaitEvents=0, phWaitEvents=0x0)
    at /home/hughdelaney/llvm/build/_deps/level-zero-loader-src/source/lib/ze_libapi.cpp:2008
#4  0x00007ffff0b3ad32 in enqueueMemCopyHelper (CommandType=UR_COMMAND_MEM_BUFFER_WRITE, Queue=0x1016c70, 
    Dst=0xffffd556aa500000, BlockingWrite=0 '\000', Size=64, Src=0x4bb1b0, NumEventsInWaitList=0, EventWaitList=0x0, 
    OutEvent=0x101a778, PreferCopyEngine=true)
    at /home/hughdelaney/llvm/sycl/plugins/unified_runtime/ur/adapters/level_zero/memory.cpp:80
#5  0x00007ffff0b3d2b9 in urEnqueueMemBufferWrite (Queue=0x1016c70, hBuffer=0x1002870, blockingWrite=false, offset=0, 
    size=64, pSrc=0x4bb1b0, numEventsInWaitList=0, phEventWaitList=0x0, phEvent=0x101a778)
    at /home/hughdelaney/llvm/sycl/plugins/unified_runtime/ur/adapters/level_zero/memory.cpp:528
#6  0x00007ffff0b89b70 in pi2ur::piEnqueueMemBufferWrite (Queue=0x1016c70, Buffer=0x1002870, BlockingWrite=0, Offset=0, 
    Size=64, Ptr=0x4bb1b0, NumEventsInWaitList=0, EventsWaitList=0x0, OutEvent=0x101a778)
    at /home/hughdelaney/llvm/sycl/plugins/level_zero/../unified_runtime/pi2ur.hpp:3709
#7  0x00007ffff0b81bf9 in piEnqueueMemBufferWrite (Queue=0x1016c70, Buffer=0x1002870, BlockingWrite=0, Offset=0, Size=64, 
    Ptr=0x4bb1b0, NumEventsInWaitList=0, EventWaitList=0x0, Event=0x101a778)
    at /home/hughdelaney/llvm/sycl/plugins/level_zero/pi_level_zero.cpp:715
#8  0x00007ffff1636da7 in sycl::_V1::detail::plugin::call_nocheck<(sycl::_V1::detail::PiApiKind)81, _pi_queue*, _pi_mem*, unsigned int, unsigned long, unsigned long, char*, unsigned long, _pi_event**, _pi_event**> (this=0x4a1810)
    at /home/hughdelaney/llvm/sycl/source/detail/plugin.hpp:208
#9  0x00007ffff16343e8 in sycl::_V1::detail::plugin::call<(sycl::_V1::detail::PiApiKind)81, _pi_queue*, _pi_mem*, unsigned int, unsigned long, unsigned long, char*, unsigned long, _pi_event**, _pi_event**> (this=0x4a1810)
    at /home/hughdelaney/llvm/sycl/source/detail/plugin.hpp:225
#10 0x00007ffff162df63 in sycl::_V1::detail::copyH2D (SYCLMemObj=0x10178c0, SrcMem=0x4bb1b0 "", DimSrc=1, SrcSize=..., 
    SrcAccessRange=..., SrcOffset=..., SrcElemSize=4, DstMem=0x1002870, 
    TgtQueue=std::shared_ptr<sycl::_V1::detail::queue_impl> (use count 12, weak count 12) = {...}, DimDst=1, DstSize=..., 
    DstAccessRange=..., DstOffset=..., DstElemSize=4, DepEvents=std::vector of length 0, capacity 0, 
    OutEvent=@0x101a778: 0x1023ad0) at /home/hughdelaney/llvm/sycl/source/detail/memory_manager.cpp:514
#11 0x00007ffff162f3cd in sycl::_V1::detail::MemoryManager::copy (SYCLMemObj=0x10178c0, SrcMem=0x4bb1b0, 
    SrcQueue=std::shared_ptr<sycl::_V1::detail::queue_impl> (empty) = {...}, DimSrc=1, SrcSize=..., SrcAccessRange=..., 
    SrcOffset=..., SrcElemSize=4, DstMem=0x1002870, TgtQueue=std::shared_ptr<sycl::_V1::detail::queue_impl> (empty) = {...}, 
    DimDst=1, DstSize=..., DstAccessRange=..., DstOffset=..., DstElemSize=4, DepEvents=std::vector of length 0, capacity 0, 
    OutEvent=@0x101a778: 0x1023ad0) at /home/hughdelaney/llvm/sycl/source/detail/memory_manager.cpp:758
#12 0x00007ffff16d9d91 in sycl::_V1::detail::MemCpyCommand::enqueueImp (this=0x6450d0)
    at /home/hughdelaney/llvm/sycl/source/detail/scheduler/commands.cpp:1504
#13 0x00007ffff16d6837 in sycl::_V1::detail::Command::enqueue (this=0x6450d0, EnqueueResult=..., 
    Blocking=sycl::_V1::detail::NON_BLOCKING, ToCleanUp=std::vector of length 0, capacity 0)
    at /home/hughdelaney/llvm/sycl/source/detail/scheduler/commands.cpp:859
#14 0x00007ffff170c8ad in sycl::_V1::detail::Scheduler::GraphProcessor::enqueueCommand (Cmd=0x6450d0, GraphReadLock=..., 
    EnqueueResult=..., ToCleanUp=std::vector of length 0, capacity 0, RootCommand=0xf1ab20, 
    Blocking=sycl::_V1::detail::NON_BLOCKING) at /home/hughdelaney/llvm/sycl/source/detail/scheduler/graph_processor.cpp:154
#15 0x00007ffff170c83e in sycl::_V1::detail::Scheduler::GraphProcessor::enqueueCommand (Cmd=0xf1ab20, GraphReadLock=..., 
    EnqueueResult=..., ToCleanUp=std::vector of length 0, capacity 0, RootCommand=0xf1ab20, 
#16 0x00007ffff1702fbf in sycl::_V1::detail::Scheduler::enqueueCommandForCG (this=0x1018040, 
    NewEvent=std::shared_ptr<sycl::_V1::detail::event_impl> (use count 3, weak count 0) = {...}, 
    AuxiliaryCmds=std::vector of length 0, capacity 0, Blocking=sycl::_V1::detail::NON_BLOCKING)
    at /home/hughdelaney/llvm/sycl/source/detail/scheduler/scheduler.cpp:202
#17 0x00007ffff1702a93 in sycl::_V1::detail::Scheduler::addCG (this=0x1018040, 
    CommandGroup=std::unique_ptr<sycl::_V1::detail::CG> = {...}, 
    Queue=std::shared_ptr<sycl::_V1::detail::queue_impl> (use count 12, weak count 12) = {...}, CommandBuffer=0x0, 
    Dependencies=std::vector of length 0, capacity 0) at /home/hughdelaney/llvm/sycl/source/detail/scheduler/scheduler.cpp:148
#18 0x00007ffff174d5b2 in sycl::_V1::handler::finalize (this=0x7fffffffccc0)
    at /home/hughdelaney/llvm/sycl/source/handler.cpp:445
#19 0x00007ffff16bcef2 in sycl::_V1::detail::queue_impl::finalizeHandler<sycl::_V1::handler> (this=0x1016870, Handler=..., 
    Type=@0x7fffffffccbc: sycl::_V1::detail::CG::Kernel, EventRet=...)
    at /home/hughdelaney/llvm/sycl/source/detail/queue_impl.hpp:724
#20 0x00007ffff16bca3a in sycl::_V1::detail::queue_impl::submit_impl(std::function<void (sycl::_V1::handler&)> const&, std::shared_ptr<sycl::_V1::detail::queue_impl> const&, std::shared_ptr<sycl::_V1::detail::queue_impl> const&, std::shared_ptr<sycl::_V1::detail::queue_impl> const&, sycl::_V1::detail::code_location const&, std::function<void (bool, bool, sycl::_V1::event&)> const*) (this=0x1016870, CGF=..., Self=std::shared_ptr<sycl::_V1::detail::queue_impl> (use count 12, weak count 12) = {...}, 
    PrimaryQueue=std::shared_ptr<sycl::_V1::detail::queue_impl> (use count 12, weak count 12) = {...}, 
    SecondaryQueue=std::shared_ptr<sycl::_V1::detail::queue_impl> (empty) = {...}, Loc=..., PostProcess=0x0)
    at /home/hughdelaney/llvm/sycl/source/detail/queue_impl.hpp:769
#21 0x00007ffff16bc7a8 in sycl::_V1::detail::queue_impl::submit(std::function<void (sycl::_V1::handler&)> const&, std::shared_ptr<sycl::_V1::detail::queue_impl> const&, sycl::_V1::detail::code_location const&, std::function<void (bool, bool, sycl::_V1::event&)> const*) (this=0x1016870, CGF=..., 
    Self=std::shared_ptr<sycl::_V1::detail::queue_impl> (use count 12, weak count 12) = {...}, Loc=..., PostProcess=0x0)
    at /home/hughdelaney/llvm/sycl/source/detail/queue_impl.hpp:398
#22 0x00007ffff1785f68 in sycl::_V1::queue::submit_impl(std::function<void (sycl::_V1::handler&)>, sycl::_V1::detail::code_location const&) (this=0x7fffffffd2c0, CGH=..., CodeLoc=...) at /home/hughdelaney/llvm/sycl/source/queue.cpp:166
#23 0x0000000000403e24 in sycl::_V1::queue::submit<main::{lambda(sycl::_V1::handler&)#1}>(main::{lambda(sycl::_V1::handler&)#1}, sycl::_V1::detail::code_location const&) (this=0x7fffffffd2c0, CGF=..., CodeLoc=...)
    at llvm/build/bin/../include/sycl/queue.hpp:323
#24 0x0000000000403a87 in main () at test.cpp:27
bader commented 1 year ago

I think the discussion in https://github.com/intel/llvm/issues/4381 might be related.

hdelan commented 1 year ago

I am currently working on the multi-device context in the CUDA adapter, however this bug is in the L0 adapter/L0 driver, which is unrelated to the CUDA BE.

bader commented 1 year ago

As far as I can tell, the discussion in #4381 goes beyond the CUDA BE/adapter. I didn't read the whole discussion carefully, so feel free to ignore my comment if it's not relevant.

I am currently working on the multi-device context in the CUDA adapter, however this bug is in the L0 adapter/L0 driver, which is unrelated to the CUDA BE.

Should we close https://github.com/intel/llvm/pull/6446?

hdelan commented 1 year ago

Should we close #6446?

Perhaps wait until I have a PR up ;)

As far as I can tell, the discussion in #4381 is about how to get the same functionality in the CUDA PI/adapter that is currently working (sometimes not working) in the L0 PI/adapter.

jandres742 commented 1 year ago

The L0 adapter should manage data migration between devices in the same context for a given buffer/image

L0 doesn't migrate data between devices. If P2P is available, peer access is used instead of migrating the data.

To check whether P2P is available, the idea is to use sycl::ext::oneapi::peer_access::access_supported. We don't have that support yet in the L0 backend to return the correct value. In the meantime, the suggestion would be to either use L0 interop or to use ze_peer (https://github.com/oneapi-src/level-zero-tests/tree/master/perf_tests/ze_peer) to confirm whether such support is available on your target system.
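For illustration, a minimal sketch of how such a query could look, assuming the sycl_ext_oneapi_peer_access extension interface (device::ext_oneapi_can_access_peer); as noted above, the L0 backend is not yet expected to return the correct value here:

#include <sycl/sycl.hpp>

#include <cstdio>

int main() {
  // Grab the first two GPU devices, as in the reproducer above.
  auto devs = sycl::device::get_devices(sycl::info::device_type::gpu);
  if (devs.size() < 2) {
    printf("Need at least two GPU devices\n");
    return 0;
  }

  // Ask whether devs[0] can directly access memory allocated on devs[1].
  bool p2p = devs[0].ext_oneapi_can_access_peer(
      devs[1], sycl::ext::oneapi::peer_access::access_supported);
  printf("P2P dev0 -> dev1: %s\n", p2p ? "supported" : "not supported");
}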

jandres742 commented 1 year ago

A correction here, from @smaslov-intel's comment here:

https://github.com/intel/llvm/issues/4381#issuecomment-1130716121

we are migrating buffers, which is what the sample here is referring to. My comment above was about USM.

@smaslov-intel : is the idea that L0 backend also migrates images?

smaslov-intel commented 1 year ago

Yes, that was my assumption, see https://github.com/intel/llvm/blob/9b93d96d60ba47cb56bd835d99a0de953c63d284/sycl/plugins/unified_runtime/ur/adapters/level_zero/memory.cpp#L1614

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.