phprus opened this issue 2 years ago
It seems we have two issues:
Broken serialization of worker thread requests (more threads than system concurrency are created)
M1 Max has 10 cores (2 efficient and 8 performance cores). The test creates 10 threads.
> M1 Max has 10 cores (2 efficient and 8 performance cores)
Thank you for the info, I was thinking about 8 cores, so the assumption above is not correct.
Any news?
It seems the testing approach is broken on ARM: any test using utils::SpinBarrier
might hang without a real issue. We are thinking about a better approach for testing.
Is there any news on this issue?
@alexey-katranov
This is not a utils::SpinBarrier
issue on ARM.
I changed the test_arena_constraints test:
1) Replaced utils::SpinBarrier with the C++20 std::barrier (https://en.cppreference.com/w/cpp/thread/barrier).
2) Added output of one character from each thread, under a lock.

Full code:
```cpp
TEST_CASE("Test memory leaks") {
    constexpr size_t num_trials = 1000;
    // To reduce the test session time, only one constraints object is used inside this test.
    // These constraints should use all available settings to cover most of the tbbbind functionality.
    auto constraints = tbb::task_arena::constraints{}
        .set_numa_id(tbb::info::numa_nodes().front())
        .set_core_type(tbb::info::core_types().front())
        .set_max_threads_per_core(1);

    std::mutex m;
    size_t current_memory_usage = 0, previous_memory_usage = 0, stability_counter = 0;
    bool no_memory_leak = false;
    for (size_t i = 0; i < num_trials; i++) {
        { /* All DTORs must be called before the GetMemoryUsage() call */
            tbb::task_arena arena{constraints};
            arena.execute([&m] {
                // ---
                auto max_concurrency = tbb::this_task_arena::max_concurrency();
                std::cerr << std::endl << std::endl << max_concurrency << std::endl << std::endl;
                std::barrier barrier(tbb::this_task_arena::max_concurrency());
                // utils::SpinBarrier barrier;
                // barrier.initialize(tbb::this_task_arena::max_concurrency());
                // ---
                tbb::parallel_for(
                    tbb::blocked_range<size_t>(0, tbb::this_task_arena::max_concurrency()),
                    [&barrier, &m](const tbb::blocked_range<size_t>& r) {
                        // ----
                        auto s = r.end() - r.begin();
                        (void)m;
                        // m.lock();
                        if (s != 1) {
                            std::cerr << "\nInvalid chunk!!!\n";
                            abort();
                        }
                        // std::cerr << r.begin() << "|" << std::flush;
                        std::cerr << r.begin() << std::endl;
                        // m.unlock();
                        barrier.arrive_and_wait();
                        // barrier.wait();
                        // ----
                    }
                );
                // ----
                std::cerr << "END" << std::endl;
                // ----
            });
        }
        current_memory_usage = utils::GetMemoryUsage();
        stability_counter = current_memory_usage == previous_memory_usage ? stability_counter + 1 : 0;
        // If the amount of used memory has not changed during 5% of the executions,
        // then we can assume that the check was successful.
        if (stability_counter > num_trials / 20) {
            no_memory_leak = true;
            break;
        }
        previous_memory_usage = current_memory_usage;
    }
    REQUIRE_MESSAGE(no_memory_leak, "Seems we get memory leak here.");
}
```
Run the test:
```shell
ctest --timeout 18 --output-on-failure -R test_arena_constraints --repeat-until-fail 200
```
Output 1:
Test #68: test_arena_constraints ...........***Timeout 18.04 sec
oneTBB: SPECIFICATION VERSION 1.0
oneTBB: VERSION 2021.8
oneTBB: INTERFACE VERSION 12080
oneTBB: TBB_USE_DEBUG 0
oneTBB: TBB_USE_ASSERT 0
oneTBB: ALLOCATOR scalable_malloc
oneTBB: TOOLS SUPPORT disabled
oneTBB: TBBBIND UNAVAILABLE
TBBmalloc: SPECIFICATION VERSION 1.0
TBBmalloc: VERSION 2021.8
TBBmalloc: INTERFACE VERSION 12080
TBBmalloc: TBB_USE_DEBUG 0
TBBmalloc: TBB_USE_ASSERT 0
TBBmalloc: huge pages not requested
[doctest] doctest version is "2.4.7"
[doctest] run with "--help" for options
10
0
5
1
6
7
9
8
2
3
4
END
10
0
5
7
16
8
2
9
3
0% tests passed, 1 tests failed out of 1
First trial: all 10 threads were called. Second trial: the last thread was not called (only 9 ran).
Output 2:
Test #68: test_arena_constraints ...........***Timeout 18.04 sec
oneTBB: SPECIFICATION VERSION 1.0
oneTBB: VERSION 2021.8
oneTBB: INTERFACE VERSION 12080
oneTBB: TBB_USE_DEBUG 0
oneTBB: TBB_USE_ASSERT 0
oneTBB: ALLOCATOR scalable_malloc
oneTBB: TOOLS SUPPORT disabled
oneTBB: TBBBIND UNAVAILABLE
TBBmalloc: SPECIFICATION VERSION 1.0
TBBmalloc: VERSION 2021.8
TBBmalloc: INTERFACE VERSION 12080
TBBmalloc: TBB_USE_DEBUG 0
TBBmalloc: TBB_USE_ASSERT 0
TBBmalloc: huge pages not requested
[doctest] doctest version is "2.4.7"
[doctest] run with "--help" for options
10
0
5
1
7
6
8
2
9
3
4
END
10
0
215
468
3
7
9
END
10
0
5
7
168924
3
END
10
0
5
2
3
41
6
9
8
7
END
10
0
5
7
2149
8
3
6
END
10
0
5
2
16
3
9847
END
10
0
5
738162
49
END
10
0
5
1
3
7
24
9
6
8
END
10
0
5
7
69
312
8
4
END
10
0
513
84
76
9
2
END
10
0
5
7
2
1
8
9
3
4
6
END
10
0
5
7
2
1
6
84
93
END
10
0
5
7
2
6
3
9
481
END
10
0
5
2
34
96
1
8
7
END
10
0
578
63
2
4
1
9
END
10
0
5
21
87
9
6
3
4
END
10
0
5
327
1
8
4
6
9
END
10
0
5
7
386
9
1
24
END
10
0
583492
6
1
7
END
10
0
58
62
7941
3
END
10
0
537
249
6
8
1
END
10
0
5
7
6
2
891
3
4
END
10
0
5
237
149
68
END
10
0
5
284
36
79
1
END
10
0
5
2
3184
6
79
END
10
0
5
26
978
3
41
END
10
0
5
2
16
74
8
9
3
END
10
0
5
7
2
6
1
4
3
8
9
END
10
0
5
2
1
8463
79
END
10
0
5
7134
9
8
2
6
END
10
0
5
7963
28
41
END
10
0
5
7
628
9
1
4
3
END
10
0
5
3
74289
1
6
END
10
0
2
7
1894
6
3
5
END
10
0
5
7
6
2
8
39
1
4
END
10
0
5
7
698
3
2
1
4
END
10
0
53
71
298
4
6
END
10
0
5
2
6874
3
19
END
10
0
57891
3
2
64
END
10
0
5
7
8
9
2
1
6
3
0% tests passed, 1 tests failed out of 1
Last trial: the last thread was not called (only 9 digits are printed).
cc @kboyarinov, @pavelkumbrasev
Hi @phprus, the problem is that Alex did not write down all the details behind the scenes. It was several months ago (so I might be mistaken in this case), but there is a root cause: the way oneTBB shares tasks across threads uses a sort of weak ordering. I am talking about task spawning and signal propagation inside the internal arena. While this is fine on architectures like x86, it can lead to problems (for example, hangs) on weaker memory models (Apple M1, for example). It also reproduces very rarely, and I think only in cases where we use barriers (it does not matter whether it is the oneTBB barrier or the standard one). And as Alex mentioned, "We are thinking about a better approach for testing" — we have not come to any results yet.
@pavelkumbrasev Thanks for your reply!
On ARM this error occurs very frequently: on average, the test fails within fewer than 50 runs.
I think issue #712 might have the same root cause, and that issue reproduces on x86.
In addition, for the test_collaborative_call_once
test, I found a configuration that hangs 100% of the time.
See my comment https://github.com/oneapi-src/oneTBB/issues/712#issuecomment-1214102772 and commit https://github.com/phprus/oneTBB/commit/eeb0154a8ca95e7ec12e5d4209225cb22195372e with a new CI config. With this config (with -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON), the test_collaborative_call_once
test hangs 100% of the time.
This problem is generally similar to this reduced example:
```cpp
std::atomic<int> counter{0};
tbb::task_group g;
g.run([&] { ++counter; });
while (counter == 0);
```
This example might hang on any system, because oneTBB does not guarantee parallelism; g.wait()
should be called.
This is a key part of the oneTBB design ("weak semantics" of signal propagation), and this part is critical for performance.
However, I hope we will discuss and investigate it more.
Also @isaevil could you please look at #712 and try to confirm that this is similar problem?
@pavelkumbrasev
Is the assumption that tbb::parallel_for
will be executed by tbb::this_task_arena::max_concurrency()
threads wrong?
Is the problem not in utils::SpinBarrier
itself, but in the fact that such a barrier cannot be written for tbb::this_task_arena::max_concurrency()
threads, because the real number of threads is not known?
> the real number of threads is not known?

oneTBB does not guarantee that any thread other than the one that started the parallel_for will come into it. But we often use utils::SpinBarrier in our tests with the assumption that the threads will come into the arena almost 100% of the time.
As you can see sometimes it is not true.
@pavelkumbrasev @isaevil
Fix for failed test 137 - test_malloc_overload_disable (Failed)
(CI commit https://github.com/phprus/oneTBB/commit/eeb0154a8ca95e7ec12e5d4209225cb22195372e):
PR #870. Please review it.
@pavelkumbrasev @phprus test_collaborative_call_once
and conformance_collaborative_call_once
also have test cases that use a barrier inside TBB parallel constructs to check the correctness of the algorithm and to stress-test it. Based on the traces @phprus gave in #712 for the hanging tests, it looks like this is a similar problem.
@isaevil Thanks for your research! If it is a similar problem, then it is reproducible on x86 (on bare metal and in GitHub Actions).
@phprus is this issue still relevant?
@nofuturre Yes, sporadic hangs on ARM are still a relevant issue. And on x86_64 too (#1281).
Hi @phprus, I created a PR to collect contribution ideas: https://github.com/oneapi-src/oneTBB/pull/1411. Would you like to add this as a possible contribution, perhaps at advanced difficulty? I'm not sure we will find time to work on this problem soon, so a contribution could help with it :)
By the way, if you have any other ideas, you are welcome to add them to the PR.
@pavelkumbrasev
I tried researching the cause of this problem and found a hang on x86_64 (#1281). At this point my understanding is no longer enough to solve it.
Do you have any plans to fix x86_64 bug #1281?
> Do you have any plans to fix x86_64 bug #1281?
This PR should fix it - https://github.com/oneapi-src/oneTBB/pull/1436
Commit: cd6a5f9f4a5bae9fc157fa03093c17b9f861c9f2
Compiler:
Debug build.
Test output:
lldb: