intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.
Other
1.23k stars 735 forks source link

[Graph] Exponential slowdown with multiple barriers #11915

Closed al42and closed 10 months ago

al42and commented 11 months ago

Describe the bug

Multiple (~10) calls to ext_oneapi_submit_barrier within a single graph "recording session" cause significant slowdown, making each subsequent ext_oneapi_submit_barrier call take longer, up to multiple seconds:

$ clang++ -fsycl -O3 test_graph.cpp -o test_graph && ONEAPI_DEVICE_SELECTOR=level_zero:0 ./test_graph 
Intel(R) Arc(TM) A770 Graphics : native
Iteration 0
ext_oneapi_submit_barrier() took 6 µs
ext_oneapi_submit_barrier({e}) took 2 µs
ext_oneapi_submit_barrier() took 3 µs
ext_oneapi_submit_barrier({e}) took 1 µs
Iteration 1
ext_oneapi_submit_barrier() took 4 µs
ext_oneapi_submit_barrier({e}) took 6 µs
ext_oneapi_submit_barrier() took 15 µs
ext_oneapi_submit_barrier({e}) took 2 µs
Iteration 2
ext_oneapi_submit_barrier() took 77 µs
ext_oneapi_submit_barrier({e}) took 2 µs
ext_oneapi_submit_barrier() took 430 µs
ext_oneapi_submit_barrier({e}) took 3 µs
Iteration 3
ext_oneapi_submit_barrier() took 2558 µs
ext_oneapi_submit_barrier({e}) took 4 µs
ext_oneapi_submit_barrier() took 15248 µs
ext_oneapi_submit_barrier({e}) took 6 µs
Iteration 4
ext_oneapi_submit_barrier() took 91701 µs
ext_oneapi_submit_barrier({e}) took 10 µs
ext_oneapi_submit_barrier() took 547211 µs
ext_oneapi_submit_barrier({e}) took 11 µs
Iteration 5
ext_oneapi_submit_barrier() took 3308109 µs
ext_oneapi_submit_barrier({e}) took 16 µs
ext_oneapi_submit_barrier() took 19922365 µs
ext_oneapi_submit_barrier({e}) took 22 µs
   Done!

The (toy) code in question creates two in-order queues, and submits operations to them, interleaved with barriers:

Queue 1: --A-\-----------A----\-------------A--->
              \      /         \         /
Queue 2: --------B--/--------------B----/------->

Each iteration submits two operations (A and B) and four barriers (creating an edge from Q1 to Q2 after A; creating an edge from Q2 to Q1 after B).

Looking at the graph output from the very first iteration, it looks like multiple unnecessary edged are created from a barrier to all subsequent operations:

out

At iteration 4, it is a total mess (while, ideally, we should have a single chain of memcopies):

Click to show ![out](https://github.com/intel/llvm/assets/933873/9645079c-263a-4dc0-b6be-ca74a7785a3b)

To Reproduce

#include <sycl/sycl.hpp>
#include <chrono>

namespace syclex = sycl::ext::oneapi::experimental;

int main() {
  for (const auto &dev : sycl::device::get_devices()) {
    using graph_support = syclex::info::device::graph_support;
    using gsl = syclex::graph_support_level;
    const auto gs = dev.get_info<graph_support>();
    std::cout << dev.get_info<sycl::info::device::name>() << " : "
              << (gs == gsl::unsupported
                      ? "unsupported"
                      : (gs == gsl::emulated ? "emulated" : "native"))
              << std::endl;
    if (gs != gsl::unsupported) {
      sycl::context ctx{dev};
      sycl::queue q1{ctx, dev, {sycl::property::queue::in_order()}};
      sycl::queue q2{ctx, dev, {sycl::property::queue::in_order()}};
      std::vector<sycl::queue> queuesToRecord{q1, q2};

      const sycl::property_list propList{
          syclex::property::graph::no_cycle_check()};
      syclex::command_graph<syclex::graph_state::modifiable> graph(ctx, dev,
                                                                   propList);

      int *value_h = sycl::malloc_host<int>(1, ctx);
      int *value_i = sycl::malloc_device<int>(1, dev, ctx);
      int *value_o = sycl::malloc_device<int>(1, dev, ctx);

      value_h[0] = 1;

      q1.memcpy(value_i, value_h, 1 * sizeof(int)).wait_and_throw();

      bool result = graph.begin_recording(queuesToRecord);
      if (!result) {
        std::cout << "  Could not start the recording" << std::endl;
      }

      for (int i = 0; i < 6; i++)
      {
          std::cout << "Iteration " << i << std::endl;
          q1.memcpy(value_o, value_i, 1 * sizeof(int));

          auto t1a = std::chrono::steady_clock::now();
          sycl::event e1 = q1.ext_oneapi_submit_barrier(); 
          auto t1b = std::chrono::steady_clock::now();
          q2.ext_oneapi_submit_barrier({e1});
          auto t1c = std::chrono::steady_clock::now();
          std::cout << "ext_oneapi_submit_barrier() took " << std::chrono::duration_cast<std::chrono::microseconds>(t1b - t1a).count() << " µs" << std::endl;
          std::cout << "ext_oneapi_submit_barrier({e}) took " << std::chrono::duration_cast<std::chrono::microseconds>(t1c - t1b).count() << " µs" << std::endl;

          q2.memcpy(value_i, value_o, 1 * sizeof(int));;

          auto t2a = std::chrono::steady_clock::now();
          sycl::event e2 = q2.ext_oneapi_submit_barrier(); 
          auto t2b = std::chrono::steady_clock::now();
          q1.ext_oneapi_submit_barrier({e2});
          auto t2c = std::chrono::steady_clock::now();
          std::cout << "ext_oneapi_submit_barrier() took " << std::chrono::duration_cast<std::chrono::microseconds>(t2b - t2a).count() << " µs" << std::endl;
          std::cout << "ext_oneapi_submit_barrier({e}) took " << std::chrono::duration_cast<std::chrono::microseconds>(t2c - t2b).count() << " µs" << std::endl;
      }

      graph.end_recording();
      graph.print_graph("out.dot");
      auto instance = graph.finalize();

      q1.ext_oneapi_graph(instance).wait_and_throw();
      std::cout << "   Done!" << std::endl;
    }
  }
}

Looking at the debugger, most of the time is spent in deep recursion in sycl::_V1::ext::oneapi::experimental::detail::graph_impl::searchDepthFirst:

Click to show backtrace ``` (gdb) bt #0 0x00007ffff2d7626a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #1 0x00007ffff2f2ba65 in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #2 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #3 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #4 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #5 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #6 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #7 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #8 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #9 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #10 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #11 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #12 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #13 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #14 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #15 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #16 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #17 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #18 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #19 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #20 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #21 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 --Type for more, q to quit, c to continue without paging-- #22 0x00007ffff2f2bd40 in sycl::_V1::ext::oneapi::experimental::detail::graph_impl::searchDepthFirst(std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #23 0x00007ffff2f2c313 in sycl::_V1::ext::oneapi::experimental::detail::graph_impl::getExitNodesEvents() () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #24 0x00007ffff305648a in sycl::_V1::handler::finalize() () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #25 0x00007ffff307af81 in sycl::_V1::detail::queue_impl::submit_impl(std::function const&, std::shared_ptr const&, std::shared_ptr const&, std::shared_ptr const&, sycl::_V1::detail::code_location const&, std::function const*) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #26 0x00007ffff307b460 in sycl::_V1::queue::submit_impl(std::function, sycl::_V1::detail::code_location const&) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #27 0x0000000000405a70 in sycl::_V1::queue::submit(sycl::_V1::queue::ext_oneapi_submit_barrier(sycl::_V1::detail::code_location const&)::{lambda(sycl::_V1::handler&)#1}, sycl::_V1::detail::code_location const&) (this=0x7fffffffd700, CGF=..., CodeLoc=...) at /home/aland/intel-sycl/llvm/build/install/bin/../include/sycl/queue.hpp:362 #28 0x0000000000404e08 in sycl::_V1::queue::ext_oneapi_submit_barrier (this=0x7fffffffd700, CodeLoc=...) at /home/aland/intel-sycl/llvm/build/install/bin/../include/sycl/queue.hpp:416 #29 0x00000000004041ef in main () at test_graph.cpp:56 ```

Environment (please complete the following information):

al42and commented 10 months ago

Fixed by https://github.com/intel/llvm/pull/11933.