Open fangchenli opened 3 years ago
@fangchenli, I can reproduce the exact same problem on amd64. It seems to be dependent on CXXFLAGS
-- it happens with -march=znver2
but not with whatever default gcc has.
@mgorny could you give more info about your OS and CPU? I couldn't reproduce this on Intel Mac with the -march=znver2
flag. I can only reproduce it on Raspberry Pi (Ubuntu 64bit), but I don't see this on mbp M1. So it might not be an ARM issue.
It's Gentoo Linux.
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 113
Model name: AMD Ryzen 5 3600 6-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 2200.000
CPU max MHz: 4714,4531
CPU min MHz: 2200,0000
BogoMIPS: 7189.53
Virtualization: AMD-V
L1d cache: 192 KiB
L1i cache: 192 KiB
L2 cache: 3 MiB
L3 cache: 32 MiB
NUMA node0 CPU(s): 0-11
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx m
mxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqd
q monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_
legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_
llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibpb stibp vmmcall sev_es fsgsbase bmi1 avx2 smep bmi2 cq
m rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flus
hbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
I'm sorry, I wasn't correct. -march=
is insufficient to reproduce it. I'll have the right flag in a few minutes.
Scratch that. I didn't notice the file's in C++ and I didn't update CXXFLAGS. Apparently you need to have -march=znver2
and -O2
or -Os
in CXXFLAGS. Did you include both on your amd64 system?
In any case, I've been able to pin it down to -mfma
, i.e. can reproduce it with -march=x86-64 -O2 -mfma
.
My results:
FAIL -march=znver2 -O2 -pipe
FAIL -march=znver2 -O2
PASS -march=znver2 -O1
PASS -march=znver2 -Og
# -O0 fails to build with irrelevant linking error
FAIL -march=znver2 -Os
FAIL -march=znver1 -O2
PASS -march=btver2 -O2
# -march=bdver* SIGILLs on me
FAIL -march=btver2 -O2 -mbmi -mbmi2 -mfma -mfsgsbase -mavx2 -mrdseed -mmwaitx -msha -mclzero -mmovbe -mclflushopt -mpopcnt
FAIL -march=btver2 -O2 -mbmi -mbmi2 -mfma -mfsgsbase -mavx2 -mrdseed -mmwaitx -msha -mclzero -mmovbe -mclflushopt -mpopcnt
FAIL -march=btver2 -O2 -mbmi -mbmi2 -mfma -mfsgsbase -mavx2 -mrdseed -mmwaitx -msha -mclzero -mmovbe -mclflushopt
FAIL -march=btver2 -O2 -mbmi -mbmi2 -mfma -mfsgsbase -mavx2 -mrdseed -mmwaitx -msha -mclzero -mmovbe -mpopcnt
FAIL -march=btver2 -O2 -mbmi -mbmi2 -mfma -mfsgsbase -mavx2 -mrdseed -mmwaitx -msha -mclzero -mclflushopt -mpopcnt
FAIL -march=btver2 -O2 -mbmi -mbmi2 -mfma -mfsgsbase -mavx2 -mrdseed -mmwaitx -msha -mmovbe -mclflushopt -mpopcnt
FAIL -march=btver2 -O2 -mbmi -mbmi2 -mfma -mfsgsbase -mavx2 -mrdseed -mmwaitx -mclzero -mmovbe -mclflushopt -mpopcnt
FAIL -march=btver2 -O2 -mbmi -mbmi2 -mfma -mfsgsbase -mavx2 -mrdseed -msha -mclzero -mmovbe -mclflushopt -mpopcnt
FAIL -march=btver2 -O2 -mbmi -mbmi2 -mfma -mfsgsbase -mavx2 -mmwaitx -msha -mclzero -mmovbe -mclflushopt -mpopcnt
FAIL -march=btver2 -O2 -mbmi -mbmi2 -mfma -mfsgsbase -mrdseed -mmwaitx -msha -mclzero -mmovbe -mclflushopt -mpopcnt
FAIL -march=btver2 -O2 -mbmi -mbmi2 -mfma -mavx2 -mrdseed -mmwaitx -msha -mclzero -mmovbe -mclflushopt -mpopcnt
PASS -march=btver2 -O2 -mbmi -mbmi2 -mfsgsbase -mavx2 -mrdseed -mmwaitx -msha -mclzero -mmovbe -mclflushopt -mpopcnt
PASS -march=znver2 -O2 -mno-fma
FAIL -march=x86-64 -O2 -mfma
Not sure how much help is that but I was able to make pandas/tests/window/test_rolling.py::test_rolling_var_numerical_issues[var-1-values0]
by adding __attribute__((target("no-fma")))
to the following three Cython-generated functions:
__pyx_f_6pandas_5_libs_6window_12aggregations_remove_var
__pyx_pw_6pandas_5_libs_6window_12aggregations_5roll_var
__pyx_pf_6pandas_5_libs_6window_12aggregations_4roll_var
So I guess the remove_var
and roll_var
functions are affected.
@mgorny Thank you so much for this input.
@pandas-dev/pandas-core Any thought on this?
Does it always fail? Wondering if these expect a particular alignment.
Yes, it fails reliably. I don't think it's alignment-related, as it produces the wrong result rather than crashing. I'm wondering if it could be a GCC optimization bug.
It definitely sounds like a compiler bug. It is unfortunately really hard to do things like add pragmas or ifdefs in Cython from what I can tell. It might be possible to tell compiler to disable optimizations in these functions using C99 Pragma_ but these are heavy handed. Probably simpler to just pass specific compiler flags for this file on ARM so that it is possible less aggressively optimized (e.g. ,-O1).
It definitely sounds like a compiler bug. It is unfortunately really hard to do things like add pragmas or ifdefs in Cython from what I can tell. It might be possible to tell compiler to disable optimizations in these functions using C99 Pragma_ but these are heavy handed. Probably easiest to just pass specific compiler flags for this file on ARM so that it is possible less aggressively optimized (e.g. ,O1).
Anyone want to build with clang with the same flags rather than gcc to see if it reproduces?
Will do in a minute. I can also try gcc-9, in case version matters (I'm running gcc 10.2.0).
gcc-9.3.0 suffers the same problem, clang++ too. However, note that on clang I had to pass -march=znver2
, -march=x86-64 -mfma
did not reproduce it.
Note that I'm only testing on AMD64. I don't know if AArch64 implements FMA equivalent (as part of NEON, maybe?). But it smells a bit suspicious that two compilers would have the same bug on two different architectures.
Heh, I've just looked at bug #37051 and now I feel stupid. It's not a bug but actually a 'bugfix'. FWICS the test is supposed to test for artifacts due to precision loss, so it obviously fails when the calculations are done with a better precision which is probably what's happening here.
Independently of compiler flags for pandas, I can reproduce it with:
diff --git a/pandas/_libs/window/aggregations.pyx b/pandas/_libs/window/aggregations.pyx
index efacfad..184e666 100644
--- a/pandas/_libs/window/aggregations.pyx
+++ b/pandas/_libs/window/aggregations.pyx
@@ -2,7 +2,7 @@
import cython
-from libc.math cimport round
+from libc.math cimport round, fma
from libcpp.deque cimport deque
import numpy as np
@@ -333,7 +333,8 @@ cdef inline void remove_var(float64_t val, float64_t *nobs, float64_t *mean_x,
compensation[0] = t + mean_x[0] - y
delta = t
mean_x[0] = mean_x[0] - delta / nobs[0]
- ssqdm_x[0] = ssqdm_x[0] - (val - prev_mean) * (val - mean_x[0])
+ # ssqdm_x[0] = ssqdm_x[0] - (val - prev_mean) * (val - mean_x[0])
+ ssqdm_x[0] = fma(prev_mean - val, val - mean_x[0], ssqdm_x[0])
else:
mean_x[0] = 0
ssqdm_x[0] = 0
i.e. this is clearly due to FMA application here, and the compiler is applying it correctly.
So I guess it's better to use FMA explicitly in the code rather than relying on optimization.
This test was added in #37055. @phofl
We also have seen issues with this test on Linux ppc64le
. Would it make sense to update the title here to reflect both architectures have issues?
xref: https://github.com/conda-forge/pandas-feedstock/issues/149 xref: https://github.com/pandas-dev/pandas/pull/50349
FWIW just reproduced this on macOS M1. So think it may be a general ARM issue.
Tests failed on an M2 MacBook Air on latest commit to main: 5cf5c73
Model Name: MacBook Air Model Identifier: Mac14,2 Chip: Apple M2 Total Number of Cores: 8 (4 performance and 4 efficiency) Memory: 16 GB System Firmware Version: 8422.121.1 OS Loader Version: 8422.121.1 System Version: macOS 13.4 (22F66) Kernel Version: Darwin 22.5.0 Secure Virtual Memory: Enabled System Integrity Protection: Enabled
The test has the follower mark:
@pytest.mark.xfail(
(is_platform_arm() and not is_platform_mac()) or is_platform_power(),
reason="GH 38921",
)
@pytest.mark.parametrize(
("func", "third_value", "values"),
[
("var", 1, [5e33, 0, 0.5, 0.5, 2, 0]),
("std", 1, [7.071068e16, 0, 0.7071068, 0.7071068, 1.414214, 0]),
("var", 2, [5e33, 0.5, 0, 0.5, 2, 0]),
("std", 2, [7.071068e16, 0.7071068, 0, 0.7071068, 1.414214, 0]),
],
)
def test_rolling_var_numerical_issues(func, third_value, values):
The not is_platform_mac()
was added in #41982. If this error can be reproduced on another M2, perhaps this should be marked xfail
for Macs as well?
I get everything green while running pytest pandas/tests/window/test_rolling.py -n 7
on a M1 Macbook Air.
Thank you for checking this on your system as well. I just ran the same command, pytest pandas/tests/window/test_rolling.py -n 7
(and with -n auto
as well), and got the same error. I'm not exactly sure what these tests are for... do you think you could help me understand what the issue is, and if it might be on my end?
test_rolling_var_numerical_issues[var-1-values0]
Other relevant info that I could find:
It's a bit difficult to say, from your pytest logs it's not obvious that it has anything to do with the mac chips.
You could try if you have some package that needs updating, or you need to recompile the cython modules. Sorry for not being very helpful.
No worries! Thank you for taking a look at it anyway.
I rm -r
'ed the build
directory and ran python setup.py build_ext -j 4
and python -m pip install -e . --no-build-isolation --no-use-pep517
to build and install pandas according to the contributor guide. The --no-use-pep517
have me the following error:
ERROR: Disabling PEP 517 processing is invalid: project specifies a build backend of mesonpy in pyproject.toml
I ran it without the --no-use-pep517
option. I'm not sure if this has anything to do with the failing tests. I am now on 2.1.0.dev0+958.g5780d4f8f1
and I am still failing all 4 tests for test_rolling_var_numerical_issues
. Any guidance would be appreciated.
This is related to Mac chips (I am seeing the same failures). I agree that we should skip those tests for Mac (I didn't care enough in the past to put up a pr myself)
The test itself is correct, the Mac compilers remove one optimisation step when compiling which introduces numerical instabilities
Is this really a bug or just bad input data? Isn't the input data going beyond any reasonable precision limits for floats?
The idea is that the previous values don't impact windows that come afterwards and don't have these values anymore. So yes this makes sense
I get everything green while running
pytest pandas/tests/window/test_rolling.py -n 7
on a M1 Macbook Air.
I got this test fail on M1 Macbook Pro.
I do actually get the bugs now.
I think the difference is that in my last comment I built using python setup.py build_ext --inplace -j4
while now I've updated the build system to use meson and build using python -m pip install -ve . --no-build-isolation --config-settings editable-verbose=true
.
Can you double check this? I got them with setuptools as well
It wasn't just the build method, but also the environment. I changed the environment when switching to meson. This is the old environment below.
So currently we have:
I'd like to try the setuptools build on M2 to see what happens there. @topper-123, how did you reset your environment and build folder for that? I don't want to miss a step...
how did you reset your environment and build folder for that?
The old environment still exists, it's just that I couldn't get it to build with meson, so I created a completely new one. In the new environment I get the failures both with meson and (after doing git checkout 43f1bc8fb6d7d
) with python setup.py
. In the old environment I didn't get the failures after building with python setup.py
and I'm not able to build/compile pandas with meson, so not able to run the tests with a meson build.
If you're interested I could provide some logs from the old environment from the meson build failure. I created the new environment because I gave up trying to figure out why it was failing, so I don't think I can explain what was happening, but you're welcome to take a look.
If meson and setuptools yield different results then they must not be compiling the same way. Can you check the flags sent to gcc in both methods and see what may be different? Also are there any build warnings?
I am not proficient in gcc or gcc flags, or what to look for in the gcc logs, sorry. If you can give some hints what to look for I could give a shot, but not sure I will make sense of it.
No problem. I'm pretty sure that this test hits the aggregations.pyx file in pandas/_libs/window/aggregations.pyx . So you are looking for a gcc command that generates aggregations.o from the Cython-generated aggregations.cpp file
setuptools prints the commands it uses for compilation directly to stdout. So when you do the traditional setup.py build_ext look for something like this in your terminal:
/home/willayd/mambaforge/envs/pandas-dev/bin/x86_64-conda-linux-gnu-cc -Wno-unused-result -Wsign-compare
-DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/willayd/mambaforge/envs/pandas-dev/include -fPIC -O2
-isystem /home/willayd/mambaforge/envs/pandas-dev/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC
-fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe
-isystem /home/willayd/mambaforge/envs/pandas-dev/include -DNDEBUG -D_FORTIFY_SOURCE=2
-O2 -isystem /home/willayd/mambaforge/envs/pandas-dev/include -fPIC -DNPY_NO_DEPRECATED_API=0
-Ipandas/_libs/include -I/home/willayd/mambaforge/envs/pandas-dev/lib/python3.10/site-packages/numpy/core/include
-I/home/willayd/mambaforge/envs/pandas-dev/include/python3.10 -c pandas/_libs/window/aggregations.cpp
-o build/temp.linux-x86_64-cpython-310/pandas/_libs/window/aggregations.o
In the above command you can see I am using the x86_64-conda-linux-gnu-cc compiler and passing it compilation options like -O2
, -Wall
, -fPIC
, fwrapv
, etc... If you cared to see what these represented the gcc documentation is pretty good (and I think also describes clang behavior most of the time):
https://gcc.gnu.org/onlinedocs/gcc/Option-Summary.html
As far as meson is concerned, it doesn't print out the gcc command by default, but you can specify the compile-args as verbose with --config-settings=compile-args="--verbose"
@lithomas1 might even know a better way
As a complete example, you can try python -m pip install -ve . --no-build-isolation --config-settings editable-verbose=true --config-settings=compile-args="--verbose"
and again check stdout for the compilation command issued for aggregations.pyx. Here is what I get locally for comparison:
/home/willayd/mambaforge/envs/pandas-dev/bin/x86_64-conda-linux-gnu-c++ -shared -Wl,--allow-shlib-undefined
-Wl,-rpath,/home/willayd/mambaforge/envs/pandas-dev/lib -Wl,-rpath-link,/home/willayd/mambaforge/envs/pandas-dev/lib
-L/home/willayd/mambaforge/envs/pandas-dev/lib -Wl,--allow-shlib-undefined
-Wl,-rpath,/home/willayd/mambaforge/envs/pandas-dev/lib -Wl,-rpath-link,/home/willayd/mambaforge/envs/pandas-dev/lib
-L/home/willayd/mambaforge/envs/pandas-dev/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,
-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined
-Wl,-rpath,/home/willayd/mambaforge/envs/pandas-dev/lib
-Wl,-rpath-link,/home/willayd/mambaforge/envs/pandas-dev/lib
-L/home/willayd/mambaforge/envs/pandas-dev/lib -march=nocona -mtune=haswell -ftree-vectorize
-fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe
-isystem /home/willayd/mambaforge/envs/pandas-dev/include -DNDEBUG
-D_FORTIFY_SOURCE=2 -O2
-isystem /home/willayd/mambaforge/envs/pandas-dev/include build/temp.linux-x86_64-cpython-310/pandas/_libs/window/aggregations.o
-o build/lib.linux-x86_64-cpython-310/pandas/_libs/window/aggregations.cpython-310-x86_64-linux-gnu.so
Arguments that follow the pattern -Wl,<something>
are arguments to the linker; I would be very surprised if those make a difference; ignoring those you can see above that the arguments -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe
are shared across setuptools / meson on my computer, so I wouldn't expect any differences. Guessing this may not hold true on your platform if the two compilation methods yield different results
Re: build warnings, definitely keep an out eye for them. If you are compiling serially then the warning should appear directly below the compilation command being used, but in parallel compilation you may need to look further down the output.
When in doubt, paste the entire output of the compilations and we can look at them here
Actually an even easier way to see what meson did is to look at the compile_commands.json
file that it places in your build folder. That is persisted on disk and has the added benefit or removing linker arguments from the output, so its easier to read than what gets written to stdout
I have this failing for months, this is not solely related to meson.
The test
pandas/tests/window/test_rolling.py::test_rolling_var_numerical_issues
has failed on arm64 build.