manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License

Illegal instructions on 2.3.4 from conda-forge. #236

Closed rainwoodman closed 3 months ago

rainwoodman commented 3 years ago

General information

The machine has a rather old CPU that supports AVX but not AVX2. Contents of /proc/cpuinfo:

Processor   : 7
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
stepping    : 9
microcode   : 0x21
cpu MHz     : 1252.638
cache size  : 6144 KB
physical id : 0
siblings    : 8
core id     : 3
cpu cores   : 4
apicid      : 7
initial apicid  : 7
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags   : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips    : 5187.87
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:
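
A quick way to check which SIMD instruction sets a machine reports is to parse the flags line of /proc/cpuinfo. The following is a minimal sketch, assuming a Linux system; the helper names cpu_flags and simd_support are made up for this illustration and are not part of Corrfunc or nbodykit.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Match the key exactly so that the separate 'vmx flags' line is skipped.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(flags):
    """Report which of a few SIMD instruction sets appear in the flag set."""
    return {isa: isa in flags for isa in ("sse4_2", "avx", "avx2", "avx512f")}
```

On Linux, calling simd_support(cpu_flags(open("/proc/cpuinfo").read())) on the machine described above should report avx as present and avx2 as absent, since the flags line lists avx but not avx2.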

Issue description

After switching to corrfunc on conda-forge, I am getting an illegal instruction error.

Expected behavior

The test should pass.

Actual behavior

The test crashes with an "Illegal instruction" error, and the stack trace points into Corrfunc.
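
To tell a SIGILL crash apart from an ordinary test failure in a script, one option is to run the command in a child process and inspect its return code. This is a minimal sketch for POSIX systems only; the helper name died_from_sigill is hypothetical.

```python
import signal
import subprocess
import sys


def died_from_sigill(cmd):
    """Run cmd and return True if the child was killed by SIGILL (POSIX only)."""
    result = subprocess.run(cmd)
    # On POSIX, a return code of -N means the child was terminated by signal N.
    return result.returncode == -signal.SIGILL


# Example call (test path taken from the report; not executed here):
# died_from_sigill([sys.executable, "run-tests.py",
#                   "nbodykit/algorithms/pair_counters/tests/test_1d.py"])
```

Running the test in a child process also keeps the parent interpreter alive, which is convenient when comparing several builds of the same package.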

What have you tried so far?

After downgrading to corrfunc 2.3.0 from the bccp channel, the illegal instruction error is gone. The cause could be the 2.3.0 to 2.3.4 update itself, or a difference in how the packages were compiled (e.g. the builds detected different Travis CPU types).

Minimal failing example

conda install -c bccp nbodykit
conda install -c conda-forge corrfunc
git clone https://github.com/bccp/nbodykit;
cd nbodykit
python run-tests.py nbodykit/algorithms/pair_counters/tests/test_1d.py

manodeep commented 3 years ago

@beckermr You might have more info. Do you know how/where the compilation was done?

@rainwoodman Do you know which function produces this error? Is the error somewhere in the gridlink routines?

manodeep commented 3 years ago

This issue might be more appropriate in the feedstock repo. @beckermr What do you prefer?

rainwoodman commented 3 years ago

Yes, it is in gridlink.

I was going to build 2.3.4 locally with the bccp recipe (which would give us a firmer signal), but got stuck on a conda-build cache error caused by my unusual Fedora setup at home (https://github.com/conda/conda/issues/7227).

(bccp) [yfeng1@highland nbodykit]$ python run-tests.py --single nbodykit/algorithms/pair_counters/tests/test_1d.py 
Purging /home/yfeng1/source/nbodykit/build/testenv ...
Purging /home/yfeng1/source/nbodykit/build/test ...
Building, see build.log...
Build OK
========================================================================== test session starts ==========================================================================
platform linux -- Python 3.8.6, pytest-6.1.1, py-1.9.0, pluggy-0.13.1
rootdir: /home/yfeng1/source/nbodykit
collected 26 items                                                                                                                                                      

build/testenv/lib/python3.8/site-packages/nbodykit/algorithms/pair_counters/tests/test_1d.py ........s..s.Fatal Python error: Illegal instruction

Thread 0x00007f97b0c12700 (most recent call first):
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/selectors.py", line 468 in select
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/wurlitzer.py", line 192 in forwarder
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/threading.py", line 870 in run
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f97b1413700 (most recent call first):
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/threading.py", line 302 in wait
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/queue.py", line 170 in get
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/wurlitzer.py", line 172 in flush_main
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/threading.py", line 870 in run
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f97b202b700 (most recent call first):
  File "/home/yfeng1/source/nbodykit/build/testenv/lib/python3.8/site-packages/nbodykit/extern/wurlitzer.py", line 162 in forwarder
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/threading.py", line 870 in run
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00007f97c9b9e740 (most recent call first):
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/Corrfunc/theory/DD.py", line 233 in DD
  File "/home/yfeng1/source/nbodykit/build/testenv/lib/python3.8/site-packages/nbodykit/algorithms/pair_counters/corrfunc/base.py", line 205 in _run
  File "/home/yfeng1/source/nbodykit/build/testenv/lib/python3.8/site-packages/nbodykit/algorithms/pair_counters/corrfunc/base.py", line 133 in run
  File "/home/yfeng1/source/nbodykit/build/testenv/lib/python3.8/site-packages/nbodykit/algorithms/pair_counters/corrfunc/base.py", line 147 in __call__
  File "/home/yfeng1/source/nbodykit/build/testenv/lib/python3.8/site-packages/nbodykit/algorithms/pair_counters/corrfunc/theory.py", line 62 in __call__
  File "/home/yfeng1/source/nbodykit/build/testenv/lib/python3.8/site-packages/nbodykit/algorithms/pair_counters/simbox.py", line 223 in run
  File "/home/yfeng1/source/nbodykit/build/testenv/lib/python3.8/site-packages/nbodykit/algorithms/pair_counters/simbox.py", line 127 in __init__
  File "/home/yfeng1/source/nbodykit/build/testenv/lib/python3.8/site-packages/nbodykit/algorithms/pair_counters/tests/test_1d.py", line 89 in test_sim_nonperiodic_auto
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/runtests/mpi/tester.py", line 139 in wrapped
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/_pytest/python.py", line 184 in pytest_pyfunc_call
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/_pytest/python.py", line 1627 in runtest
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/_pytest/runner.py", line 163 in pytest_runtest_call
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/_pytest/runner.py", line 256 in <lambda>
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/_pytest/runner.py", line 310 in from_call
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/_pytest/runner.py", line 255 in call_runtest_hook
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/_pytest/runner.py", line 216 in call_and_report
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/_pytest/runner.py", line 127 in runtestprotocol
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/_pytest/runner.py", line 110 in pytest_runtest_protocol
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/_pytest/main.py", line 338 in pytest_runtestloop
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/_pytest/main.py", line 313 in _main
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/_pytest/main.py", line 257 in wrap_session
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/_pytest/main.py", line 306 in pytest_cmdline_main
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/runtests/tester.py", line 275 in _test
  File "/home/yfeng1/.conda/envs/bccp/lib/python3.8/site-packages/runtests/mpi/tester.py", line 331 in main
  File "run-tests.py", line 6 in <module>
Illegal instruction (core dumped)

When I run this under gdb, bt points to gridlink:

Thread 1 "python" received signal SIGILL, Illegal instruction.
gridlink_double (NPART=NPART@entry=395, X=X@entry=0x5555572d2320, Y=Y@entry=0x5555572d2f80, Z=Z@entry=0x5555572d3be0, WEIGHTS=WEIGHTS@entry=0x7fffffff46c0, 
    xmin=<optimized out>, xmax=<optimized out>, ymin=<optimized out>, ymax=<optimized out>, zmin=<optimized out>, zmax=<optimized out>, max_x_size=<optimized out>, 
    max_y_size=<optimized out>, max_z_size=<optimized out>, xbin_refine_factor=<optimized out>, ybin_refine_factor=<optimized out>, zbin_refine_factor=<optimized out>, 
    nlattice_x=<optimized out>, nlattice_y=<optimized out>, nlattice_z=<optimized out>, options=<optimized out>) at ../../utils/gridlink_impl_double.c:388
388 ../../utils/gridlink_impl_double.c: No such file or directory.
(gdb) bt
#0  gridlink_double (NPART=NPART@entry=395, X=X@entry=0x5555572d2320, Y=Y@entry=0x5555572d2f80, Z=Z@entry=0x5555572d3be0, WEIGHTS=WEIGHTS@entry=0x7fffffff46c0, 
    xmin=<optimized out>, xmax=<optimized out>, ymin=<optimized out>, ymax=<optimized out>, zmin=<optimized out>, zmax=<optimized out>, max_x_size=<optimized out>, 
    max_y_size=<optimized out>, max_z_size=<optimized out>, xbin_refine_factor=<optimized out>, ybin_refine_factor=<optimized out>, zbin_refine_factor=<optimized out>, 
    nlattice_x=<optimized out>, nlattice_y=<optimized out>, nlattice_z=<optimized out>, options=<optimized out>) at ../../utils/gridlink_impl_double.c:388
#1  0x00007fffdf8299fa in countpairs_double (ND1=ND1@entry=395, X1=X1@entry=0x5555572d2320, Y1=Y1@entry=0x5555572d2f80, Z1=Z1@entry=0x5555572d3be0, ND2=ND2@entry=395, 
    X2=X2@entry=0x5555572cc030, Y2=<optimized out>, Z2=<optimized out>, numthreads=<optimized out>, autocorr=<optimized out>, binfile=<optimized out>, 
    results=<optimized out>, options=<optimized out>, extra=<optimized out>) at countpairs_impl_double.c:252
#2  0x00007fffdf81e73c in countpairs (ND1=ND1@entry=395, X1=X1@entry=0x5555572d2320, Y1=Y1@entry=0x5555572d2f80, Z1=Z1@entry=0x5555572d3be0, ND2=ND2@entry=395, 
    X2=X2@entry=0x5555572cc030, Y2=0x5555572cfe00, Z2=0x5555572d0a60, numthreads=1, autocorr=0, binfile=0x7fffdf503a60 "/tmp/tmp85m2lel1", results=0x7fffffff4690, 
    options=0x7fffffff4ad0, extra=0x7fffffff46c0) at countpairs.c:67
#3  0x00007fffdf81aa3f in countpairs_countpairs (self=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at _countpairs.c:1393
#4  0x000055555568eeee in cfunction_call_varargs (func=0x7fffe01678b0, args=<optimized out>, kwargs=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1602094424782/work/Objects/call.c:742
#5  0x0000555555736222 in PyCFunction_Call (kwargs=0x7fffdf4f69c0, args=0x7fffe01207c0, func=0x7fffe01678b0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1602094424782/work/Objects/call.c:772
#6  do_call_core (kwdict=0x7fffdf4f69c0, callargs=0x7fffe01207c0, func=0x7fffe01678b0, tstate=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1602094424782/work/Python/ceval.c:4983
#7  _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1602094424782/work/Python/ceval.c:3559
#8  0x000055555571cb59 in PyEval_EvalFrameEx (throwflag=0, f=0x5555570cf0d0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1602094424782/work/Python/ceval.c:741
#9  _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, 
    kwnames=0x7fffe0108a58, kwargs=0x7fffdf4e8180, kwcount=14, kwstep=1, defs=0x7fffe01b3598, defcount=18, kwdefs=0x0, closure=0x0, name=0x7fffe4236330, 
    qualname=0x7fffe4236330) at /home/conda/feedstock_root/build_artifacts/python-split_1602094424782/work/Python/ceval.c:4298
#10 0x000055555571da14 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7fffdf4e8180, nargsf=<optimized out>, kwnames=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1602094424782/work/Objects/call.c:435
#11 0x0000555555689c29 in PyVectorcall_Call (kwargs=<optimized out>, tuple=<optimized out>, callable=0x7fffe010fee0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1602094424782/work/Objects/call.c:199
#12 PyObject_Call (callable=0x7fffe010fee0, args=<optimized out>, kwargs=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1602094424782/work/Objects/call.c:227
#13 0x000055555573257a in do_call_core (kwdict=0x7fffe0140740, callargs=0x7fffeac2d040, func=0x7fffe010fee0, tstate=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1602094424782/work/Python/ceval.c:5010
#14 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1602094424782/work/Python/ceval.c:3559
#15 0x000055555571d637 in PyEval_EvalFrameEx (throwflag=0, f=0x5555570cbe30)

beckermr commented 3 years ago

Thanks for bumping me. We can debug here. The first thing to do is to figure out if the bug happens when the package is built outside of conda build and conda forge compilers. If so, then I doubt that anything done in the recipe is the issue.

lgarrison commented 3 years ago

I followed @rainwoodman's example and tested this on an AVX2 machine, and it doesn't seem to trigger the "Illegal Instruction" error. I do get a different error partway through the tests, but it's probably unrelated (fails on os.remove(paircount-test.json); can provide more details if we think it's related, but I probably just ran the tests wrong).

AVX2 isn't a super-helpful data point, but I'm not sure I have an AVX-only machine that I'm set up to test on!

rainwoodman commented 3 years ago

I built 2.3.4 locally (on the machine without AVX2) with the bccp recipe (https://github.com/bccp/conda-channel-bccp/pull/40), and the tests passed.

(BTW The conda-forge recipe may be missing a gsl dependency in build -- the Makefile calls gsl-config on the build env.)

Does the Corrfunc Makefile support cross-compilation? If not, is there a way to disable AVX2 even if the machine compiling the code (the build env) supports it?

Does Conda-forge only target CPUs with AVX2?

manodeep commented 3 years ago

The discussion here regarding optimised builds might be useful.

beckermr commented 3 years ago

> Does Conda-forge only target CPUs with AVX2?

No. It actually disables most SIMD optimizations by default.

beckermr commented 3 years ago

The patch in this PR solves this I hope: https://github.com/conda-forge/corrfunc-feedstock/pull/20

manodeep commented 3 years ago

Hmm, while that patch will solve the bug, it will also remove all of the optimised kernels and produce quite slow code. It might be better to remove the -march=native flag only when compiling the gridlink* code.

Related, is there any environment variable that can be used to detect a "conda build" kind of operation?
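
The idea of dropping -march=native only for the gridlink* sources can be sketched as a per-file flag filter. This is an illustration in Python, not Corrfunc's actual build logic (which lives in its Makefiles); the function cflags_for and its parameters are hypothetical, and the file names are taken from the backtrace above.

```python
import posixpath
from fnmatch import fnmatch


def cflags_for(source, base_flags, portable_patterns=("gridlink*",)):
    """Return base_flags, minus -march=native if the file name matches a portable pattern."""
    name = posixpath.basename(source)
    if any(fnmatch(name, pattern) for pattern in portable_patterns):
        # These files have no runtime dispatch, so keep them at the baseline ISA.
        return [flag for flag in base_flags if flag != "-march=native"]
    return list(base_flags)
```

With base flags ["-O3", "-march=native"], a path such as ../../utils/gridlink_impl_double.c would get only -O3, while countpairs_impl_double.c would keep both flags.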

beckermr commented 3 years ago

Yes, this is the correct effect. conda-forge does not support high-level SIMD optimizations unless you ship a fat binary that detects the available instructions at runtime.

There is an env var CONDA_BUILD which will be set to 1 IIRC.

https://docs.conda.io/projects/conda-build/en/latest/user-guide/environment-variables.html#environment-variables-set-during-the-build-process
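
A build script could use that variable to choose between portable and native flags. The sketch below assumes CONDA_BUILD is set to "1" during a conda-build run, as recalled above; the helper names in_conda_build and arch_flags are made up for this example.

```python
import os


def in_conda_build(environ=None):
    """Return True if the CONDA_BUILD variable marks a conda-build run."""
    env = os.environ if environ is None else environ
    # Assumption: conda-build sets CONDA_BUILD to the string "1".
    return env.get("CONDA_BUILD") == "1"


def arch_flags(environ=None):
    """Pick architecture flags: portable under conda-build, native otherwise."""
    return [] if in_conda_build(environ) else ["-march=native"]
```

Passing the environment as an argument keeps the helper easy to test without modifying os.environ.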

beckermr commented 3 years ago

See this issue: https://github.com/conda-forge/corrfunc-feedstock/issues/2

manodeep commented 3 years ago

> Yes, this is the correct effect. conda-forge does not support high-levels SIMD optimization unless you ship a fat binary that detects the instructions at runtime.

The compute-intensive parts of Corrfunc already have such a fat binary capability out of the box, but removing -march=native will remove all of the SIMD kernels and produce >~ 4x slower code. That's why I am suggesting compiling only the gridlink* files without the -march=native flag.
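
For readers unfamiliar with the "fat binary" approach, the general pattern is to ship several kernels and pick one at runtime based on what the CPU reports. The following is a toy sketch of that pattern in Python, not Corrfunc's actual dispatch code (which is written in C); all function names and the kernel table are hypothetical.

```python
def kernel_avx512f():
    return "avx512f"


def kernel_avx():
    return "avx"


def kernel_sse42():
    return "sse4_2"


def kernel_fallback():
    return "fallback"


# Ordered from fastest to most portable; None means "no requirement".
KERNELS = [
    ("avx512f", kernel_avx512f),
    ("avx", kernel_avx),
    ("sse4_2", kernel_sse42),
    (None, kernel_fallback),
]


def pick_kernel(flags, kernels=KERNELS):
    """Return the first kernel whose required CPU flag is in flags."""
    for required_flag, kernel in kernels:
        if required_flag is None or required_flag in flags:
            return kernel
    raise RuntimeError("no usable kernel")
```

Dispatch of this kind only protects the code paths that go through it. Any file compiled with -march=native but called unconditionally can still contain instructions the running CPU lacks, which is consistent with the crash being reported in gridlink.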

Let's try to figure this out on the other issue.

beckermr commented 3 years ago

Sure sounds great! We can redo the patching as needed to make this work.

manodeep commented 3 months ago

@rainwoodman I assume this is no longer relevant and I am closing the issue. Please feel free to reopen ...