numba / numba

NumPy aware dynamic Python compiler using LLVM
https://numba.pydata.org/
BSD 2-Clause "Simplified" License
10k stars 1.13k forks source link

[ppc64le] RuntimeDyldELF/TOC problem (segfaults) #6606

Open ghost opened 3 years ago

ghost commented 3 years ago

(spawned from #4026)

Summary

A missing nop after a branch-instruction results in RuntimeDyldELF overwriting an instruction with the TOC restoration code. This happnes only in reloc-mode=static.

The workaround in numba was to change (for ppc) reloc_mode back to pic (https://github.com/numba/numba/commit/31ddd81d559c4aa79fb25e4c7b45b44451df95d5)

See bug-identification by obilaniu within https://github.com/numba/numba/issues/4026#issuecomment-487296000

Reproduction

# DRAFT
# on ppc64le, tested on ubuntu 16.04, Power8, Power9, qemu ppc64le

### Install Miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-ppc64le.sh
bash Miniconda3-latest-Linux-ppc64le.sh 

### Setup Conda Environment

conda create -n ppc-segfault python=3.6 numpy=1.15 llvmlite=0.35.* scipy jinja2 cffi conda-build -y
conda activate ppc-segfault
conda install gcc_linux-ppc64le gxx_linux-ppc64le conda-build make  -y

### use your preferred location

cd ~ && mkdir numba-6606 && cd numba-6606

### get a 0.52.0 version of numba patched to use `static` for `ppc`

git clone -b 0.52.0-static https://github.com/abebeos/numba.git numba-static
cd numba-static
python setup.py build_ext --inplace

### this segfaults

./runtests.py -v numba.tests.npyufunc.test_parallel_ufunc_issues.TestParGUfuncIssues

Suggested Fix

The fix was identified during the work-in to the compiler-side of the problem. Whilst looking for similar work on the relevant (pointed out by nemanjai within https://github.com/numba/numba/issues/4026#issuecomment-657941602) PPCISelLowering.cpp / callsShareTOCBase(), the comments on the following commit indicated its relevance for the problem:

https://reviews.llvm.org/rG2812c1515627904e31605bbd4f25a887a1f8eb12

The backported (llvm-10.x) patch (incl. adopted tests):

Binaries

Patched llvm-10 / llvmlite-0.35.0 available (temporarily, for testing):

conda install llvm -c abebeos
conda install llvmlite -c abebeos

Validation

(numba repo does not automatically test on PowerPC, see https://github.com/numba/llvmlite/pull/684#issuecomment-753946728)

I've validated the fix manually on Power8 hardware, see https://github.com/numba/llvmlite/pull/684#issuecomment-755295463.

The validation should be trivial for any dev with a ready dev-system.

Validation with pre-compiled binaries (After doing the Reproduction steps above):

# Ensure you've executed the reproduction steps

### this llvm contains the fix

conda install llvmlite -c abebeos

### rebuild numba (just to be sure)

cd ~/numba-6606/numba-static
python setup.py build_ext --inplace 

### should work now
./runtests.py -v numba.tests.npyufunc.test_parallel_ufunc_issues.TestParGUfuncIssues

# full tests
./runtests.py -m16 #your machine thread count
ghost commented 3 years ago

We at IBM are concerned that there is a latent bug in RuntimeDyldELF for PPC64LE, but apparently there is no reproducer.

source: https://github.com/numba/numba/issues/4026#issuecomment-705771587

I can say with near 100% confidence that (beside this bug), no other latent bugs should be in the code (re this issue). The compiler-side code I reviewed (PPCISelLowering.cpp) is very clean, with the right amount of in-source documentation.

The work/reviews on e.g. https://reviews.llvm.org/D81126?id=274770 can be easily followed.

With a little bit effort for clarifying RuntimeDyldELF (only the code that directly affects ppc), all should be good. The clarity of RuntimeDyldELF should become similar to that of PPCISelLowering.cpp, even if the future direction will be JITLink

stuartarchibald commented 3 years ago

@abebeos Thanks for the patch(es) and for fixing this! As promised here I raised this issue at yesterday's development meeting. Outcome was that we're going to hopefully roll these fixes into the 0.54 release of Numba along with attempting an upgrade to LLVM 11.

esc commented 3 years ago

@abebeos thank you again for submitting this! We appreciate your efforts to improve Numba. Regarding the timeline: the 0.54 release cycle will most likely begin in a few weeks from now (end of Jan/ beg. of Feb). At that point I can commit to reviewing and validating your patch. I should also note, ahead of time, that we may need this patch to be ported to LLVM 11.0 since we will be attempting an upgrade from LLVM 10 to LLVM 11 as part of the 0.54 release.

esc commented 3 years ago

@abebeos thank you, I will take it into consideration.

esc commented 3 years ago

Dear @abebeos,

Judging from your response of "Ouch!" I assume that you are disappointed about the proposed timeline for validating your proposed patches? If that is the case, I am sorry that the project schedule does to align with your expectations.

Also, thank you for pre-compiling those packages but unfortunately they will be of little use for me. We will want to build and test the Numba stack (llvmdev -> llvmlite -> numba) across the supported versions of Python and Numpy on the corresponding supported architectures as part of our efforts in quality assurance.

Thank you in advance for your patience and understanding in these matters.

-esc