Open ghost opened 2 years ago
@llvm/issue-subscribers-openmp
I also have the same assertion failure as @alban-bridonneau-arm but with a different piece of code.
Sadly, I couldn't easily create a small reproducer (note: the reproducer above does not trigger any assertion on my machine).
Here is the piece of code that fails. I'll try to extract it from my code base and make it independent from PCL / FLANN when I have more time:
pcl::KdTreeFLANN<pcl::PointXYZ>::Ptr kdTree;
/** code that initializes kdTree **/
#pragma omp parallel for schedule(dynamic, 1000) num_threads(d->threads)
for (int i = 0; i < static_cast<int>(requestedPoints.size()); ++i) {
    std::vector<int> nn_indices;
    std::vector<float> nn_sqr_dists;
    // requestedPoints is a pcl::PointCloud<pcl::PointXYZ>& passed to the enclosing function
    kdTree->radiusSearch(requestedPoints[i], 0.2, nn_indices, nn_sqr_dists);
}
The OpenMP options come from target_link_libraries(myApp PRIVATE OpenMP::OpenMP_CXX) in CMake.
When run, I get the following assert:
Assertion failure at kmp_dispatch.cpp(1343): victim.
OMP: Error #13: Assertion failure at kmp_dispatch.cpp(1343).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
This code fails on the following configs:
Config 1 OS: Ubuntu 21.10 compiler: clang 12.0.1-8build1 (target: x86_64-pc-linux-gnu) linker: LLD 12.0.1 CPU: AMD Ryzen 9 5900X
Config 2 OS: Ubuntu 20.04.4 LTS VM compiler: clang 12.0.1-++20211029101322+fed41342a82f-1~exp1~20211029221816.4 (target: x86_64-pc-linux-gnu) linker: LLD 12.0.1 CPU: AMD Epyc 7413 through VMWare
Config 3 OS: Arch Linux compiler: clang 12 (I no longer have the codebase on that machine and am now on clang 14, so I can't get the exact build number of the clang version on which I noticed the issue) linker: LLD 12 CPU: Intel i7-8700
I am seeing the same assertion (clang 14.0.6) in kmp_dispatch.cpp:1298 (https://github.com/llvm/llvm-project/blob/llvmorg-14.0.6/openmp/runtime/src/kmp_dispatch.cpp#L1298 ) as @alban-bridonneau-arm on an M1 Macbook Pro with some of our OpenMP code.
For us the assertion is triggered via the following OpenMP pragma in Alpaka https://github.com/alpaka-group/alpaka/blob/0.6.1/include/alpaka/kernel/TaskKernelCpuOmp2Blocks.hpp#L324
And I can confirm that @alban-bridonneau-arm's reproducer also fails on macOS (MacBook Pro (14-inch, 2021), Apple M1 Pro CPU) if run with at least 3 threads, e.g.
int main(int argc, char *argv[])
{
    while (1)
    {
#pragma omp parallel for schedule(dynamic)
        for (long pidx = 0; pidx < 10; pidx++)
            ;
    }
}
/opt/homebrew/opt/llvm/bin/clang++ -fopenmp kmp_repo.cpp
OMP_NUM_THREADS=8 ./a.out
OMP: Error #13: Assertion failure at kmp_dispatch.cpp(1298).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Increasing the number of threads seems to trigger the race faster; one or two threads do not seem to trigger the bug.
@q-p @alban-bridonneau-arm See here for a similar problem with the M1 Ultra in the Mac Studio. Note that the initial symptom was partial utilization due to other services. This was followed by a crash, then the following error:
OMP: Error #13: Assertion failure at kmp_dispatch.cpp(1298).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
zsh: abort R
Reproducible code:
require(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 20, eta = 1, nthread = -1, nrounds = 2000000, objective = "binary:logistic")
Just a matter of time before it triggers the error.
Edit - here's the underlying code which reaches out to OpenMP for the above example:
inline int32_t OmpGetNumThreads(int32_t n_threads) {
  if (n_threads <= 0) {
    n_threads = std::min(omp_get_num_procs(), omp_get_max_threads());
  }
  // OmpGetThreadLimit() is defined elsewhere in xgboost.
  n_threads = std::min(n_threads, OmpGetThreadLimit());
  n_threads = std::max(n_threads, 1);
  return n_threads;
}
@alban-bridonneau-arm Could you please share the compiler and OS version?
Hi, I can't find the exact commit we were using; it was a top-of-tree commit from the time the bug was raised, so shortly before the LLVM 14 release. LLVM itself was built with GCC 11.2.0. The OS was Ubuntu 18.04. I hope that helps, Alban
For what it's worth, on macOS "Monterey" 12.6 with an Apple M1 Pro, using my reproducer from https://github.com/llvm/llvm-project/issues/54422#issuecomment-1239527573 I can trigger the error using LLVM 14.0.6:
> /opt/homebrew/opt/llvm@14/bin/clang++ -v
Homebrew clang version 14.0.6
Target: arm64-apple-darwin21.6.0
Thread model: posix
InstalledDir: /opt/homebrew/opt/llvm@14/bin
> /opt/homebrew/opt/llvm@14/bin/clang++ -fopenmp -std=c++17 -L /opt/homebrew/opt/llvm@14/lib main.cpp
> OMP_NUM_THREADS=8 ./a.out
OMP: Error #13: Assertion failure at kmp_dispatch.cpp(1298).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
fish: Job 1, 'OMP_NUM_THREADS=8 ./a.out' terminated by signal SIGABRT (Abort)
but when using LLVM 15.0.3 the repro doesn't trigger the assert. Note that using clang from 15.0.3 with libomp from 14.0.6 also triggers the assert, so the fix clearly comes from changes in the OpenMP library.
So from my observation it seems fixed, but I haven't yet tracked down any particular change that looks like it might do that...
@q-p Unfortunately I'm still running into the assert with LLVM 15.0.3 (though it takes longer to get there; M1 Ultra). I don't think it's entirely fixed, but it does seem to be mitigated.
Posted a patch for review that fixes the bug - https://reviews.llvm.org/D139373
I suppose https://reviews.llvm.org/D139373 fixed the issue. If not, feel free to reopen it.
Still getting this frequently with libomp 15.0.7.
OMP: Error #13: Assertion failure at kmp_dispatch.cpp(1298).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
zsh: abort R
@leedrake5 The patch was not backported to LLVM 15.0.7. You could give 16 RC1 a shot.
@shiltian Would be happy to - how do I do that? I normally install as brew install libomp
Oh, then you will have to compile OpenMP yourself. Clone the LLVM project, check out the release branch release/16.x, and then configure and build the project:
$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Release --install-prefix=${INSTALL_DIR} -B ${BUILD_DIR} -S ${path-to}/llvm-project/openmp
$ ninja -C ${BUILD_DIR} install
Then set the corresponding environment variable so the loader can find the library. You might also encounter an "unverified developer" warning; I don't know how to deal with that, though you can allow it in System Settings.
Copy, will do that and run it through the tests. I appreciate the guidance.
@shiltian I will keep an eye on it, but a machine learning algorithm successfully ran overnight on an M1 Ultra chip with 16 RC1, where it failed after an hour with 15.0.7. A much more stable build so far.
So I kept an eye on it; it looks like the issue is emerging again with 16.0.3:
OMP: Error #13: Assertion failure at kmp_dispatch.cpp(1298).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Edit: this may be more about how R installs xgboost than about this project; testing this out.
Fails: install.packages("xgboost", type="source")
Works:
cd ~/GitHub/xgboost
mkdir build
cd build
CC=gcc CXX=g++ cmake .. -DR_LIB=ON
sudo make -j20
sudo make install
> So kept an eye on it, looks like it is emerging again with 16.0.3
As far as I can tell, there are no changes between 16.0.2 and 16.0.3 in kmp_dispatch.cpp. Maybe you're seeing some other effect (or end up using the wrong runtime)?
I have this error with LLVM 16.0.6.
I'd suggest adding an empty line in the code just before the assertion line, so that the assertion message changes and people can figure out whether they are really using the fresh library or inadvertently using some old version.
This issue is still present with LLVM 17.0.6. The workaround, however, confuses me.
This happens when nthreads is -1 or any value > 1. The lower the number of threads, the less frequent the failure, but it is still present. However, setting nthreads=1 lets models run. That said, there is still multicore behavior: I see all threads utilized despite specifically asking for single-threaded execution.
What makes the most sense to me is that somehow nthreads=1 has become nthreads=-1, and nthreads=-1 resolves to a multiple of the maximum thread count, doubling (or more) the demands made in parallel. It's as if the implementation indexed to the maximum number of cores and counts up from there. I can't think of any other explanation that would leave nthreads=1 using up all my CPU, with persistent failures at any other value. It is very possible this is an XGBoost problem, but I've seen their OpenMP code and it makes sense to me. Maybe something else is happening? I really don't know.
I have this issue on Apple M1 with Xcode 14.2 and lib openmp from llvm 17.
The reproduction steps above (https://github.com/llvm/llvm-project/issues/54422#issuecomment-1239527573) reproduce the issue for me. I also noticed that using an int loop variable does not reproduce the bug, but using an int64_t does.
It looks like this issue has not been resolved. Reopening it.
Can anyone give me a small reproducer that I can try to debug locally?
Copy of steps above:
int main(int argc, char *argv[])
{
    while (1)
    {
#pragma omp parallel for schedule(dynamic)
        for (long pidx = 0; pidx < 10; pidx++)
            ;
    }
}
Build with OpenMP and run:
OMP_NUM_THREADS=4 ./a.out
In my experience, running this command twice in parallel increases the chance of producing the crash.
Replacing long with int "fixes" the issue on M1.
Unfortunately, I tried multiple values for OMP_NUM_THREADS (2, 4, 6, 8, 16) and ran each 4-5 times, but I never hit a crash on my M2 Ultra. I also tried 18, since the M2 Ultra has 16P+8E cores and I wondered whether that could cause an issue, but still no luck.
I am not sure this is the exact same issue, but I am seeing an assertion in kmp_dispatch.cpp (with clang 15 and clang 18; I can eventually test versions in between). It occurs randomly. Here is a test program that produces the assertion under Ubuntu 24.04 LTS. There is nothing special about the program; it's just a really crude approximation of how my application uses OpenMP and standard threads.
#include <iostream>
#include <thread>
#include <vector>
#include <omp.h>

void runner(int idx)
{
    int sink = 0;
    int iter = 0;
    while (sink != 17)
    {
        for (int tc = 1; tc < 2 * omp_get_max_threads(); ++tc)
        {
            int r = 0;
            const int cs = ((tc + idx) % 3) + 2;
#pragma omp parallel num_threads(tc)
            {
                int tr = 0;
#pragma omp for schedule(dynamic, cs) nowait
                for (int i = 0; i < 301; ++i)
                {
                    tr += i;
                }
#pragma omp critical
                r += tr;
            }
            sink += r;
        }
        ++iter;
        std::cout << "Thread " << idx << " finished iter " << iter << " with result " << sink << std::endl;
    }
}

int main()
{
    std::vector<std::thread> launchers;
    for (int i = 0; i < 3; ++i)
        launchers.emplace_back(&runner, i);
    for (auto & l : launchers)
        l.join();
    return 0;
}
Here are two sample outputs with clang-15:
clang++-15 -fopenmp -O2 /mnt/d/Projects/omp_repro.cpp -o a.out
time ./a.out
Sample output 1:
Thread 2 finished iter 1 with result 1399650
Thread 0 finished iter 1 with result 1399650
Thread 2 finished iter 2 with result 2799300
Thread 1 finished iter 1 with result 1399650
...
Thread 1 finished iter 27 with result 37790550
Thread 2 finished iter 33 with result 46188450
Thread 0 finished iter 30 with result 41989500
Thread 1 finished iter 28 with result 39190200
Thread 2 finished iter 34 with result 47588100
Thread 0 finished iter 31 with result 43389150
Assertion failure at kmp_dispatch.cpp(1456): vnew.p.ub * (UT)chunk <= trip.
OMP: Error #13: Assertion failure at kmp_dispatch.cpp(1456).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Aborted
real 0m39.347s
user 2m47.038s
sys 7m42.292s
Sample output 2:
Thread 2 finished iter 39 with result 54586350
Thread 1 finished iter 40 with result 55986000
Thread 2 finished iter 40 with result 55986000
Thread 0 finished iter 33 with result 46188450
Assertion failure at kmp_dispatch.cpp(1456): vnew.p.ub * (UT)chunk <= trip.
OMP: Error #13: Assertion failure at kmp_dispatch.cpp(1456).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Aborted
real 0m53.228s
user 3m44.133s
sys 10m27.280s
Here are two sample outputs with clang-18 (Ubuntu clang version 18.1.3 (1ubuntu1)); it seems to hit the same assertion (on a different line) much faster:
clang++-18 -fopenmp -O2 /mnt/d/Projects/omp_repro.cpp -o a.out
time ./a.out
Sample run 1:
...
Thread 2 finished iter 17 with result 23794050
Thread 0 finished iter 10 with result 13996500
Thread 1 finished iter 13 with result 18195450
Thread 2 finished iter 18 with result 25193700
Thread 0 finished iter 11 with result 15396150
Thread 1 finished iter 14 with result 19595100
Thread 2 finished iter 19 with result 26593350
Assertion failure at kmp_dispatch.cpp(1617): vnew.p.ub * (UT)chunk <= trip.
OMP: Error #13: Assertion failure at kmp_dispatch.cpp(1617).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://github.com/llvm/llvm-project/issues/.
Aborted
real 0m2.093s
user 0m9.602s
sys 0m23.533s
Sample run 2:
...
Thread 0 finished iter 3 with result 4198950
Thread 0 finished iter 4 with result 5598600
Thread 1 finished iter 4 with result 5598600
Assertion failure at kmp_dispatch.cpp(1617): vnew.p.ub * (UT)chunk <= trip.
OMP: Error #13: Assertion failure at kmp_dispatch.cpp(1617).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://github.com/llvm/llvm-project/issues/.
Aborted
real 0m1.096s
user 0m4.897s
sys 0m12.546s
For comparison, on the same OS and machine, using GCC 13 the code runs noticeably faster and keeps running indefinitely:
g++-13 -fopenmp -O2 /mnt/d/Projects/omp_repro.cpp -o a.out
.... keeps going forever
I tried switching the schedule to static, guided, and monotonic:dynamic in an effort to work around the assertion, but I hit it every time (I was hoping at least one schedule would not use the static-stealing scheduler). I am running under WSL, but the issue I am trying to reproduce also occurs consistently on a native Ubuntu box.
Any suggested work-arounds would be welcome.
@StefanAtev It seems like a different issue. Could you provide the details of the processor used for the test?
> @StefanAtev It seems like a different issue. Could you provide the details of the processor used for the test?
These test results are from: Processor: 11th Gen Intel(R) Core(TM) i9-11950H @ 2.60GHz, 2611 Mhz, 8 Core(s), 16 Logical Processor(s)
It was also verified on: 13th Gen Intel(R) Core(TM) i7-13850HX, 2100 Mhz, 20 Core(s), 28 Logical Processor(s)
The same issue occurs on a server-class machine (dual Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz). Basically, we are a mixed Win/Linux Intel shop; we just switched from targeting Ubuntu 20.04 with clang 9 to targeting 24.04 with clang 18 when the issue was observed during testing.
@StefanAtev I'm not sure this is the exact same issue, but I have a patch (https://github.com/llvm/llvm-project/pull/97120) for review to fix a scheduler bug targeting hybrid systems (e.g., Raptor Lake). If possible could you please apply the patch and check if it resolves the issue.
> @StefanAtev I'm not sure this is the exact same issue, but I have a patch (#97120) for review to fix a scheduler bug targeting hybrid systems (e.g., Raptor Lake). If possible could you please apply the patch and check if it resolves the issue.
I can try; it will take a while to set up to build from sources. But at first glance, the older machine tested and the server-class machine (dual Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz) don't have E-cores, so I am not sure how the patch is related.
I have observed an assertion failure in OpenMP while running a benchmark, and have extracted a reproducer which triggers the assert reliably.
I have compiled that code with
bin/clang -fopenmp -fopenmp-version=50 -mcpu=native kmp_assert_static_steal_reproducer.c -o kmp_assert_static_steal_reproducer
This was tested on AArch64; I don't know if it shows on other platforms. The error message is the kmp_dispatch.cpp assertion failure quoted above. As far as I can see, this is one of several asserts which have been revealed after the following patch landed: https://reviews.llvm.org/D103648. Finally, I observed while making the reproducer that the types are important to the issue: when I changed the long type to int/float, I could no longer see the issue.