Closed: sklam closed this issue 9 months ago
Dear users, I'm trying to run a script for a Python 3 program and I get this error 9 times out of 10. My script works, but randomly. When the code is too long, sometimes the error appears, sometimes it doesn't. So I have to run the script multiple times in the hope that it reaches the end. I paste the error that I get:
Assertion failed: (isInt<33>(Addend) && "Invalid page reloc value."), function encodeAddend, file /Users/ci/miniconda3-arm64/conda-bld/llvmdev_1643905487494/work/lib/ExecutionEngine/RuntimeDyld/Targets/RuntimeDyldMachOAArch64.h, line 210. zsh: abort python3 test2.py
I'm running on a MacBook Pro with an M1 Pro. I have no idea how to solve this error; I don't even know if I can really solve it or if it depends on LLVM. Do you know something about that? Thanks in advance, I hope for an answer...
@Francyrad thank you for asking about this. You do indeed encounter the same error as reported in this issue. You may consider Numba (more precisely, LLVM) to be broken on M1. There is currently no known fix or workaround, and we are not sure if this has been reported upstream to LLVM or if there is a fix in progress. IIRC @sklam also checked LLVM 14 and it appears as though this has not been fixed. My only remaining guess here would be to try to run your script in a Docker container on the M1 using a linux-aarch64 docker image. Performance should not be too bad as the hardware will not be simulated in this case. Note, however, that I am guessing at this and it may very well also not work.
TL;DR: Running Numba on an M1 may cause the aborts you see above, and the only known workaround is to use different hardware.
@esc thank you for your answer, I hope someone will be able to fix it. Please let me know when it is fixed by commenting on this issue.
Thank you again
Yes, we hope so too. If you subscribe to this issue, you will receive updates regarding this quest.
Hi, is there any update on this? I'm on Python 3.9 and LLVM 11.1.0 on an M1 Mac, and am having the same issue right now when running multi-processing of a forecast model (AutoCES) under the statsforecast package. I've tried to bootstrap dev versions of both numba (0.57.0.dev0+1257.gce69f3010) and llvmlite (0.40.0.dev0+70.ge6901e0) from the GitHub repos, but still failed and keep facing this issue.
It seems like the temporary fix by https://github.com/numba/numba/pull/8583 is not working for me.
I have other models tested without issues, but they all use numba in the backend to speed up the computing. The only difference that I can think of is that this specific model uses complex values rather than real-number values.
With numba (0.46) and llvmlite (0.39), exactly the same error is raised when running. However, with the dev versions of numba (0.57.0.dev0+1257.gce69f3010) and llvmlite (0.40.0.dev0+70.ge6901e0), the multiprocessing basically just gets stuck in the terminal without any errors raised. (But I'm pretty sure it's still the same issue.)
Can anyone help here? Thanks @esc @sklam
I still have the issue. Sometimes I waste more of my time trying to run my scripts than actually working.
No, unfortunately not, there is no known workaround; it's broken in LLVM 11 and 14 (the version supported by the next Numba/llvmlite release). I am not aware of anyone working on a fix at present, so your best bet for now will be to use non-M1/Apple silicon hardware, i.e. change hardware. So sorry I don't have better news for you.
@sklam for reference, was this ever reported to the LLVM issue tracker and if so, can you post the issue ID please? Thank you.
Just wanted to mention that I'm having the same issue on a Mac M1, llvm-openmp 16.0.2 and llvmlite 0.40.0! I run into this issue when solving systems of PDEs using py-pde. I've subscribed to this issue and fingers crossed that it will get fixed in the near future.
@iamlll another bug that I encounter is that parallelisation with OpenMP doesn't work with the following chips: M1 Pro, M1 Max and M1 Ultra.
It works only with the M1.
Is there some LLVM tracker where it is possible to file a report?
@iamlll The reason you are seeing this with llvmlite 0.40.0 is because it is based on LLVM 14 and that is indeed buggy.
So how can we solve the problem with OpenMP?
This is a problem of the LLVM JIT that we are using (MCJIT); we need to migrate to OrcJIT (https://github.com/numba/llvmlite/pull/919) so we can use JITLink, and hopefully that will fix it.
This issue is about the M1 LLVM RuntimeDyld "Invalid page reloc value" assertion error -- you are inquiring about a different issue here. In order to keep the noise down, please open a new issue for the OpenMP problems you are seeing, thank you!
The issue is not limited to Apple M1 or macOS. We're seeing it on Neoverse-N1 running Ubuntu 20.04 ever since we upgraded to Numba 0.57. These are server machines, and not just one. Unfortunately, we cannot downgrade Numba because we need CUDA 12.1 support.
Error message:
python: /root/miniconda3/envs/buildenv/conda-bld/llvmdev_1680642098205/work/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:507: void llvm::RuntimeDyldELF::resolveAArch64Relocation(const llvm::SectionEntry&, uint64_t, uint64_t, uint32_t, int64_t): Assertion `isInt<33>(Result) && "overflow check failed for relocation"' failed.
System info:
uname -a: Linux <hostname_redacted> 5.15.0-46-generic #49~20.04.1-Ubuntu SMP Mon Aug 8 18:51:21 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux
cat /etc/os-release | grep PRETTY
PRETTY_NAME="Ubuntu 20.04.4 LTS"
lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 1
Core(s) per socket: 80
Socket(s): 1
NUMA node(s): 1
Vendor ID: ARM
Model: 1
Model name: Neoverse-N1
Stepping: r3p1
Frequency boost: disabled
CPU max MHz: 3000.0000
CPU min MHz: 1000.0000
BogoMIPS: 50.00
L1d cache: 5 MiB
L1i cache: 5 MiB
L2 cache: 80 MiB
NUMA node0 CPU(s): 0-79
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
free -m
total used free shared buff/cache available
Mem: 514318 5282 73564 6 435471 504537
Swap: 2047 213 1834
I can confirm being able to reproduce a similar issue on a non-M1 AArch64 - in general we can overflow relocations. The assertion is a little different because Linux on AArch64 uses RuntimeDyldELF and not RuntimeDyldMachO, but I think the principle (and the root cause) is the same; I need to investigate further to be sure. At present I'm reproducing with DALI like:
/opt/dali/dali/test/python# DALI_EXTRA_PATH=/opt/dali_extra python -m nose2 --verbose --plugin=nose2_test_timer.plugin --with-timer --timer-color --timer-top-n 20 -A '!slow' -s operator_1 test_numba_func
test_numba_func.test_multiple_ins ... ok
test_numba_func.test_split_images_col ... ok
test_numba_func.test_numba_func:1
[(10, 10, 10)], <class 'numpy.uint8'>, <function set_all_values_to_255_batch at ... ok
test_numba_func.test_numba_func:2
[(10, 10, 10)], <class 'numpy.uint8'>, <function set_all_values_to_255_sample a ... python: /root/miniconda3/envs/buildenv/conda-bld/llvmdev_1680642098205/work/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:507: void llvm::RuntimeDyldELF::resolveAArch64Relocation(const llvm::SectionEntry&, uint64_t, uint64_t, uint32_t, int64_t): Assertion `isInt<33>(Result) && "overflow check failed for relocation"' failed.
Aborted (core dumped)
and I need to figure out how to make a Numba-only reproducer.
I'm working on a system very similar to the one reported by @mzient in https://github.com/numba/numba/issues/8567#issuecomment-1556803212 - just some minor OS / kernel version differences.
I couldn't trigger this issue with @sklam's script from https://github.com/numba/numba/issues/8567#issue-1432286236, even after hundreds of runs on a Linux AArch64 system. However, the following (still using DALI, but without needing a test harness) does reproduce the issue pretty reliably:
which gives this on almost every run:
$ python test_standalone.py
python: /root/miniconda3/envs/buildenv/conda-bld/llvmdev_1680642098205/work/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:507: void llvm::RuntimeDyldELF::resolveAArch64Relocation(const llvm::SectionEntry&, uint64_t, uint64_t, uint32_t, int64_t): Assertion `isInt<33>(Result) && "overflow check failed for relocation"' failed.
Aborted (core dumped)
I've broken the Linux-specific variant of this issue into #9001 to avoid spamming everyone here as I post updates whilst I debug. Please watch / subscribe to that if you want to track as I'm working on the issue on Linux AArch64.
I'm finding this surprisingly hard to reproduce on macOS. My environment is:
@sklam Anything I might be missing here compared to the setup you used to reproduce the issue?
This issue recently presented itself to me. Any suggestions where i might dig into when trying to contribute a fix? I see we are somewhere between llvm and llvm lite?
https://github.com/numba/numba/issues/9001#issuecomment-1581424023 describes the issue - the GOT is allocated more than 4GB away from a text section it refers to. If you'd like to start digging in, I'd suggest looking into the RuntimeDyld allocator to devise a strategy that ensures this can't happen. I understand JITLink has a slab allocator already which can help avoid this issue, but I didn't yet get a chance to look into it further.
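To make the arithmetic concrete, here is a small, self-contained Python illustration (not Numba or LLVM code; the addresses below are invented for the example) of why a GOT page more than 4 GiB away from the referencing code cannot be encoded: ADRP-style page relocations carry a sign-extended 33-bit page offset, which is exactly what LLVM's isInt<33> assertion checks.

def is_int33(value: int) -> bool:
    # Mirrors LLVM's isInt<33>(): does the value fit in a signed 33-bit integer?
    return -(1 << 32) <= value < (1 << 32)

def page(addr: int) -> int:
    # ADRP works in units of 4 KiB pages, so relocations compare page addresses.
    return addr & ~0xFFF

code_addr = 0x0000_7000_0000_0000   # hypothetical address of a JITed text page
got_near = code_addr + 0x4000_0000  # ~1 GiB away: encodable
got_far = code_addr + 0x1_2000_0000 # ~4.5 GiB away: overflows the 33-bit addend

for got in (got_near, got_far):
    addend = page(got) - page(code_addr)
    status = "encodable" if is_int33(addend) else "overflow -> assertion fires"
    print(hex(addend), status)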
@carstenr I had a little more thought about this recently... One of the problems that makes it hard to think about a fix is that reproducing the issue is a giant pain at present - if you're able to do anything to take the existing reproducers and simplify them at all, that would help make it easier for someone (or yourself) to understand the issue and work on a fix.
I can give you a script that is able to reproduce it quite often if that can help
Yes please!
please write to me at francyrad.info@gmail.com
The script and the file that it reads are quite big
Alright, that means we have two large cases to reproduce then. We will focus on reducing them as much as possible.
Another thought I think worth sharing - it should be possible to get to a reproducer that doesn't depend on Numba at all - if it's minimised as much as possible, it would just involve calls to llvmlite. (Or even simpler than that, a small C++ source that links to LLVM only, to even take llvmlite out of the loop - but I think the "just llvmlite" case would already be a good starting point)
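As a very rough starting point (an untested sketch on my part, not a known reproducer), something like the following repeatedly JIT-compiles a tiny module through llvmlite's MCJIT binding and keeps every engine alive so that allocations pile up and code sections drift apart in the address space; the IR, the function name add and the iteration count are arbitrary placeholders, and it may well need extra pressure (e.g. much larger functions) before it actually trips the relocation overflow.

import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

target_machine = llvm.Target.from_default_triple().create_target_machine()

# A trivial module; what it computes is irrelevant - each iteration
# JIT-compiles a fresh copy of it via MCJIT.
IR = r"""
define i64 @add(i64 %a, i64 %b) {
entry:
  %sum = add i64 %a, %b
  ret i64 %sum
}
"""

engines = []
for i in range(100000):
    mod = llvm.parse_assembly(IR)
    mod.verify()
    engine = llvm.create_mcjit_compiler(mod, target_machine)
    engine.finalize_object()
    engine.get_function_address("add")
    engines.append(engine)  # keep everything alive so allocations accumulate
    if i % 1000 == 0:
        print(i)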
Might take a while to get there as our developers naturally have a strong python background. We will start with a minimal nixtla setup, which is where this popped up for us. And from there on we will work our way down.
Bump.
I am consistently seeing this on M1 Pro and M2. It's a bit involved, but it occurs with ~30% probability in my code.
Are you still looking for a reproducer @gmarkall ?
FYI by googling I noticed that when porting Julia to ARM they also hit the same bug. Look at https://github.com/JuliaLang/julia/issues/36617 and search in the page for "Assertion failed: (isInt<33>(Addend) && "Invalid page reloc value."),".
Apparently, if this can help at all, the PR that fixed the issue was https://github.com/JuliaLang/julia/pull/43664 ...
The problem is still present
Luckily, and coincidentally, I was working on a reproducer today, and I now have a pretty good one, which I'm going to add to #9001 because I'm tackling the issue on Linux AArch64 at present.
In case you want to try it, it's:
from numba import njit

@njit
def f(x, y):
    return x + y

i = 0
while True:
    print(i)
    t = tuple(range(i))
    f(t, (1j,))
    i += 1
executed with:
$ ulimit -s 1048576
$ python repro.py
gives:
0
1
2
3
4
5
6
7
8
9
python: /opt/conda/conda-bld/llvmdev_1684517249134/work/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:507: void llvm::RuntimeDyldELF::resolveAArch64Relocation(const llvm::SectionEntry&, uint64_t, uint64_t, uint32_t, int64_t): Assertion `isInt<33>(Result) && "overflow check failed for relocation"' failed.
Aborted (core dumped)
It'd be interesting to know if that also triggers the error on your Mac. You might need to do something similar to my ulimit invocation above to increase the stack limit.
I can't set the ulimit to such large numbers on Mac. It errors with:
ulimit: value exceeds hard limit
The largest ulimit I can set is ulimit -s 65520, but it is not crashing for now...
What number did it get to before you stopped it?
That would be great if you could give it a go!
@PhilipVinc Is it still running? :-)
@gmarkall it crashes at 1001 but I think this is due to some check in numba itself?
999
1000
1001
Traceback (most recent call last):
File "/Users/filippo.vicentini/Dropbox/Ricerca/Codes/Python/netket/repro.py", line 12, in <module>
File "/Users/filippo.vicentini/Documents/pythonenvs/netket/python-3.11.2/lib/python3.11/site-packages/numba/core/dispatcher.py", line 471, in _compile_for_args
error_rewrite(e, 'unsupported_error')
File "/Users/filippo.vicentini/Documents/pythonenvs/netket/python-3.11.2/lib/python3.11/site-packages/numba/core/dispatcher.py", line 409, in error_rewrite
raise e.with_traceback(None)
numba.core.errors.UnsupportedError: Failed in nopython mode pipeline (step: ensure features that are in use are in a valid form)
Tuple 'x' length must be smaller than 1000.
Large tuples lead to the generation of a prohibitively large LLVM IR which causes excessive memory pressure and large compile times.
As an alternative, the use of a 'list' is recommended in place of a 'tuple' as lists do not suffer from this problem.
File "repro.py", line 3:
<source missing, REPL/exec in use?>
EDIT: This is with ulimit -s 65520
@PhilipVinc Thanks - indeed, that was a Numba limitation. I think in #9001 and https://github.com/gmarkall/numba-issue-9001 we're getting close to a really good reproducer now, so there's probably no need for additional testing here - thanks for everything you've looked into so far :-)
An LLVM Discourse discussion has been started about a potential fix: https://discourse.llvm.org/t/llvm-rtdyld-aarch64-abi-relocation-restrictions/74616
@Francyrad @PhilipVinc @carstenr It's early work at the moment, but if you're able to build llvmlite from source with the PR https://github.com/numba/llvmlite/pull/1009, and let me know whether you still observe the issue with it (or observe any other issues) that would be good feedback - hopefully this resolves the issue, but there's a lot of testing / review to be done to have confidence in the strategy.
I have experienced this issue repeatedly over the past month, getting errors similar to the following for my ~150 line code for solving a specific PDE:
Assertion failed: (isInt<33>(Addend) && "Invalid page reloc value."), function encodeAddend, file /Users/ci/miniconda3-arm64/conda-bld/llvmdev_1643905487494/work/lib/ExecutionEngine/RuntimeDyld/Targets/RuntimeDyldMachOAArch64.h, line 210.
@gmarkall, I'm not quite sure how to build from source but am happy to try and test it out.
@jacobjivanov Thanks for sharing this info - fortunately you don't need to build from source to test the fix now, as it's part of the llvmlite 0.42 / Numba 0.59 release candidates. You can follow the instructions here to install the Numba and llvmlite release candidates: https://numba.discourse.group/t/ann-numba-0-59-0rc1-and-llvmlite-0-42-0rc1/2329
If you try this, I'd really appreciate if you can let me know whether it appears to have solved the issue for you.
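In case it helps, a quick way to double-check which versions actually ended up active in your environment (just a sanity check, not part of the official instructions) is:

# Print the importable versions; the release candidates should report
# 0.59.0rc1 (numba) and 0.42.0rc1 (llvmlite) or later.
import numba
import llvmlite

print("numba:", numba.__version__)
print("llvmlite:", llvmlite.__version__)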
@gmarkall, I can't confirm whether it'll ever fail, but it no longer fails for the particular script that would fail roughly 50% of the time previously. Ran it ~20 times with different initial conditions.
@gmarkall Your work is greatly appreciated! Switching to the release candidate also solved the issue for one of our packages which would occasionally fail.
With llvmlite now at 0.42.0 and the new memory manager merged, can we close this?
I've not heard of any reports of this issue manifesting in llvmlite 0.42, so I think so.
Alright, let's put a proverbial checkmark behind this issue. We always have the option to re-open it in case it resurfaces.
@gmarkall thank you again for the fix for this, it is much appreciated!
We are seeing an LLVM assertion error occurring randomly in our build farm.
The error message is:
The earliest report is from Gitter on July 15, 2022.
The error can be triggered with the below script on bdb2384. The error usually occurs within 10 iterations.
The error occurs in both LLVM 11 and LLVM 14.
The current hypothesis is that the LLVM RuntimeDyld is mishandling far jumps. To relate this to the reproducer above, the situation can be created by:
1. running test_too_big_to_freeze (the compilation and execution bits in the tests can be commented out and it will still trigger the error);
2. making a large allocation;
3. then running test_fill_diagonal_basic. The assertion error occurs here. The guess is that the JITed code emitted for the stencil tests is reused here, and the large allocation in between helps make sure there is a gap/fragmentation in the memory space such that the fill_diagonal functions are JITed somewhere far away.
The Julia devs are pointing to a broken large code model in the LLVM RuntimeDyld for MachO aarch64. See https://github.com/JuliaLang/julia/issues/42295#issuecomment-1008427270 and https://github.com/JuliaLang/julia/pull/43664.