Multiple (~10) calls to ext_oneapi_submit_barrier within a single graph "recording session" cause significant slowdown, making each subsequent ext_oneapi_submit_barrier call take longer, up to multiple seconds:
$ clang++ -fsycl -O3 test_graph.cpp -o test_graph && ONEAPI_DEVICE_SELECTOR=level_zero:0 ./test_graph
Intel(R) Arc(TM) A770 Graphics : native
Iteration 0
ext_oneapi_submit_barrier() took 6 µs
ext_oneapi_submit_barrier({e}) took 2 µs
ext_oneapi_submit_barrier() took 3 µs
ext_oneapi_submit_barrier({e}) took 1 µs
Iteration 1
ext_oneapi_submit_barrier() took 4 µs
ext_oneapi_submit_barrier({e}) took 6 µs
ext_oneapi_submit_barrier() took 15 µs
ext_oneapi_submit_barrier({e}) took 2 µs
Iteration 2
ext_oneapi_submit_barrier() took 77 µs
ext_oneapi_submit_barrier({e}) took 2 µs
ext_oneapi_submit_barrier() took 430 µs
ext_oneapi_submit_barrier({e}) took 3 µs
Iteration 3
ext_oneapi_submit_barrier() took 2558 µs
ext_oneapi_submit_barrier({e}) took 4 µs
ext_oneapi_submit_barrier() took 15248 µs
ext_oneapi_submit_barrier({e}) took 6 µs
Iteration 4
ext_oneapi_submit_barrier() took 91701 µs
ext_oneapi_submit_barrier({e}) took 10 µs
ext_oneapi_submit_barrier() took 547211 µs
ext_oneapi_submit_barrier({e}) took 11 µs
Iteration 5
ext_oneapi_submit_barrier() took 3308109 µs
ext_oneapi_submit_barrier({e}) took 16 µs
ext_oneapi_submit_barrier() took 19922365 µs
ext_oneapi_submit_barrier({e}) took 22 µs
Done!
The (toy) code in question creates two in-order queues, and submits operations to them, interleaved with barriers:
Each iteration submits two operations (A and B) and four barriers (creating an edge from Q1 to Q2 after A; creating an edge from Q2 to Q1 after B).
Looking at the graph output from the very first iteration, it looks like multiple unnecessary edged are created from a barrier to all subsequent operations:
At iteration 4, it is a total mess (while, ideally, we should have a single chain of memcopies):
Click to show
![out](https://github.com/intel/llvm/assets/933873/9645079c-263a-4dc0-b6be-ca74a7785a3b)
To Reproduce
#include <sycl/sycl.hpp>
#include <chrono>
namespace syclex = sycl::ext::oneapi::experimental;
int main() {
for (const auto &dev : sycl::device::get_devices()) {
using graph_support = syclex::info::device::graph_support;
using gsl = syclex::graph_support_level;
const auto gs = dev.get_info<graph_support>();
std::cout << dev.get_info<sycl::info::device::name>() << " : "
<< (gs == gsl::unsupported
? "unsupported"
: (gs == gsl::emulated ? "emulated" : "native"))
<< std::endl;
if (gs != gsl::unsupported) {
sycl::context ctx{dev};
sycl::queue q1{ctx, dev, {sycl::property::queue::in_order()}};
sycl::queue q2{ctx, dev, {sycl::property::queue::in_order()}};
std::vector<sycl::queue> queuesToRecord{q1, q2};
const sycl::property_list propList{
syclex::property::graph::no_cycle_check()};
syclex::command_graph<syclex::graph_state::modifiable> graph(ctx, dev,
propList);
int *value_h = sycl::malloc_host<int>(1, ctx);
int *value_i = sycl::malloc_device<int>(1, dev, ctx);
int *value_o = sycl::malloc_device<int>(1, dev, ctx);
value_h[0] = 1;
q1.memcpy(value_i, value_h, 1 * sizeof(int)).wait_and_throw();
bool result = graph.begin_recording(queuesToRecord);
if (!result) {
std::cout << " Could not start the recording" << std::endl;
}
for (int i = 0; i < 6; i++)
{
std::cout << "Iteration " << i << std::endl;
q1.memcpy(value_o, value_i, 1 * sizeof(int));
auto t1a = std::chrono::steady_clock::now();
sycl::event e1 = q1.ext_oneapi_submit_barrier();
auto t1b = std::chrono::steady_clock::now();
q2.ext_oneapi_submit_barrier({e1});
auto t1c = std::chrono::steady_clock::now();
std::cout << "ext_oneapi_submit_barrier() took " << std::chrono::duration_cast<std::chrono::microseconds>(t1b - t1a).count() << " µs" << std::endl;
std::cout << "ext_oneapi_submit_barrier({e}) took " << std::chrono::duration_cast<std::chrono::microseconds>(t1c - t1b).count() << " µs" << std::endl;
q2.memcpy(value_i, value_o, 1 * sizeof(int));;
auto t2a = std::chrono::steady_clock::now();
sycl::event e2 = q2.ext_oneapi_submit_barrier();
auto t2b = std::chrono::steady_clock::now();
q1.ext_oneapi_submit_barrier({e2});
auto t2c = std::chrono::steady_clock::now();
std::cout << "ext_oneapi_submit_barrier() took " << std::chrono::duration_cast<std::chrono::microseconds>(t2b - t2a).count() << " µs" << std::endl;
std::cout << "ext_oneapi_submit_barrier({e}) took " << std::chrono::duration_cast<std::chrono::microseconds>(t2c - t2b).count() << " µs" << std::endl;
}
graph.end_recording();
graph.print_graph("out.dot");
auto instance = graph.finalize();
q1.ext_oneapi_graph(instance).wait_and_throw();
std::cout << " Done!" << std::endl;
}
}
}
Looking at the debugger, most of the time is spent in deep recursion in sycl::_V1::ext::oneapi::experimental::detail::graph_impl::searchDepthFirst:
Click to show backtrace
```
(gdb) bt
#0 0x00007ffff2d7626a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#1 0x00007ffff2f2ba65 in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#2 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#3 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#4 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#5 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#6 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#7 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#8 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#9 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#10 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#11 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#12 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#13 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#14 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#15 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#16 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#17 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#18 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#19 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#20 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#21 0x00007ffff2f2b9ef in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptr, std::set, std::less >, std::allocator > >&, std::deque, std::allocator > >&, std::function&, std::deque, std::allocator > >&)>) ()
from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
--Type for more, q to quit, c to continue without paging--
#22 0x00007ffff2f2bd40 in sycl::_V1::ext::oneapi::experimental::detail::graph_impl::searchDepthFirst(std::function&, std::deque, std::allocator > >&)>) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#23 0x00007ffff2f2c313 in sycl::_V1::ext::oneapi::experimental::detail::graph_impl::getExitNodesEvents() () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#24 0x00007ffff305648a in sycl::_V1::handler::finalize() () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#25 0x00007ffff307af81 in sycl::_V1::detail::queue_impl::submit_impl(std::function const&, std::shared_ptr const&, std::shared_ptr const&, std::shared_ptr const&, sycl::_V1::detail::code_location const&, std::function const*) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#26 0x00007ffff307b460 in sycl::_V1::queue::submit_impl(std::function, sycl::_V1::detail::code_location const&) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#27 0x0000000000405a70 in sycl::_V1::queue::submit(sycl::_V1::queue::ext_oneapi_submit_barrier(sycl::_V1::detail::code_location const&)::{lambda(sycl::_V1::handler&)#1}, sycl::_V1::detail::code_location const&) (this=0x7fffffffd700, CGF=..., CodeLoc=...) at /home/aland/intel-sycl/llvm/build/install/bin/../include/sycl/queue.hpp:362
#28 0x0000000000404e08 in sycl::_V1::queue::ext_oneapi_submit_barrier (this=0x7fffffffd700, CodeLoc=...) at /home/aland/intel-sycl/llvm/build/install/bin/../include/sycl/queue.hpp:416
#29 0x00000000004041ef in main () at test_graph.cpp:56
```
Environment (please complete the following information):
OS: Ubuntu 22.04
Target device and vendor: Intel Arc A770; similar behavior with CUDA backend
Describe the bug
Multiple (~10) calls to
ext_oneapi_submit_barrier
within a single graph "recording session" cause significant slowdown, making each subsequentext_oneapi_submit_barrier
call take longer, up to multiple seconds:The (toy) code in question creates two in-order queues, and submits operations to them, interleaved with barriers:
Each iteration submits two operations (A and B) and four barriers (creating an edge from Q1 to Q2 after A; creating an edge from Q2 to Q1 after B).
Looking at the graph output from the very first iteration, it looks like multiple unnecessary edged are created from a barrier to all subsequent operations:
At iteration 4, it is a total mess (while, ideally, we should have a single chain of memcopies):
Click to show
![out](https://github.com/intel/llvm/assets/933873/9645079c-263a-4dc0-b6be-ca74a7785a3b)To Reproduce
Looking at the debugger, most of the time is spent in deep recursion in
sycl::_V1::ext::oneapi::experimental::detail::graph_impl::searchDepthFirst
:Click to show backtrace
``` (gdb) bt #0 0x00007ffff2d7626a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7 #1 0x00007ffff2f2ba65 in sycl::_V1::ext::oneapi::experimental::detail::(anonymous namespace)::visitNodeDepthFirst(std::shared_ptrEnvironment (please complete the following information):