pyspline doesn't work with ifx

A-CGray commented 5 months ago

Description

pyspline tests are failing on the latest intel docker images that use the ifx compiler, see test outputs here

Copied here for archiving

pyspline
------------------------------------------------------------------------------------------------------------------------
/home/***/repos/pyspline/tests/reg_tests/test_curves.py:Test.test  ... FAIL (00:00:0.05, 177 MB)
Traceback (most recent call last):
  File "/home/***/repos/pyspline/tests/reg_tests/test_curves.py", line 99, in test
    self.regression_test(handler)
  File "/home/***/repos/pyspline/tests/reg_tests/test_curves.py", line 156, in regression_test
    run_project_test(curve, handler, test_name)
  File "/home/***/repos/pyspline/tests/reg_tests/test_curves.py", line 77, in run_project_test
    handler.root_add_val("{} projection test for point {} solution".format(test_name, i), s[i], tol=1e-9)
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/baseclasses/testing/pyRegTest.py", line 182, in root_add_val
    self._add_values(name, values, **kwargs)
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/baseclasses/testing/pyRegTest.py", line 321, in _add_values
    self.assert_allclose(values, db[name], name, rtol, atol, full_name)
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/baseclasses/testing/pyRegTest.py", line 270, in assert_allclose
    np.testing.assert_allclose(actual, reference, rtol=rtol, atol=atol, err_msg=msg)
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/numpy/testing/_private/utils.py", line 1504, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/numpy/testing/_private/utils.py", line 797, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-09, atol=1e-09
Failed value for: LMS test k=4 projection test for point 0 solution
Mismatched elements: 1 / 1 (100%)
Max absolute difference: 0.77068815
Max relative difference: 1.
 x: array(0.)
 y: array(0.770688)

/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/coverage/inorout.py:519: CoverageWarning: Module pyspline was previously imported, but not measured (module-not-measured)
  self.warn(msg, slug="module-not-measured")
/home/***/repos/pyspline/tests/reg_tests/test_surfaces.py:Test.test  ... FAIL (00:00:0.26, 181 MB)
Traceback (most recent call last):
  File "/home/***/repos/pyspline/tests/reg_tests/test_surfaces.py", line 165, in test
    self.regression_test(handler)
  File "/home/***/repos/pyspline/tests/reg_tests/test_surfaces.py", line 185, in regression_test
    run_project_test(surface, handler, test_name)
  File "/home/***/repos/pyspline/tests/reg_tests/test_surfaces.py", line 95, in run_project_test
    handler.root_add_val("{} point {} projection u".format(test_name, pt), u, tol=eps)
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/baseclasses/testing/pyRegTest.py", line 182, in root_add_val
    self._add_values(name, values, **kwargs)
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/baseclasses/testing/pyRegTest.py", line 321, in _add_values
    self.assert_allclose(values, db[name], name, rtol, atol, full_name)
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/baseclasses/testing/pyRegTest.py", line 270, in assert_allclose
    np.testing.assert_allclose(actual, reference, rtol=rtol, atol=atol, err_msg=msg)
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/numpy/testing/_private/utils.py", line 1504, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/numpy/testing/_private/utils.py", line 797, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-08, atol=1e-08
Failed value for: surface with ku=2, kv=4, nCtlu=5, nCtlv=5 point [0, 0, 0] projection u
Mismatched elements: 1 / 1 (100%)
Max absolute difference: 3.71377595e-06
Max relative difference: 0.02136224
 x: array(0.00017)
 y: array(0.000174)

/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/coverage/inorout.py:519: CoverageWarning: Module pyspline was previously imported, but not measured (module-not-measured)
  self.warn(msg, slug="module-not-measured")
/home/***/repos/pyspline/tests/reg_tests/test_volumes.py:Test.test  ... FAIL (00:00:0.50, 179 MB)
Traceback (most recent call last):
  File "/home/***/repos/pyspline/tests/reg_tests/test_volumes.py", line 66, in test
    self.regression_test(handler)
  File "/home/***/repos/pyspline/tests/reg_tests/test_volumes.py", line 2220, in regression_test
    run_project_test(volume, handler, test_name, pts=test_pts_outside, volBounds=volBounds)
  File "/home/***/repos/pyspline/tests/reg_tests/test_volumes.py", line 41, in run_project_test
    handler.root_add_val("{} project point u for pt={}".format(test_name, pt), u, tol=eps)
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/baseclasses/testing/pyRegTest.py", line 182, in root_add_val
    self._add_values(name, values, **kwargs)
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/baseclasses/testing/pyRegTest.py", line 321, in _add_values
    self.assert_allclose(values, db[name], name, rtol, atol, full_name)
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/baseclasses/testing/pyRegTest.py", line 270, in assert_allclose
    np.testing.assert_allclose(actual, reference, rtol=rtol, atol=atol, err_msg=msg)
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/numpy/testing/_private/utils.py", line 1504, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/numpy/testing/_private/utils.py", line 718, in assert_array_compare
    flagged = func_assert_same_pos(x, y, func=isnan, hasval='nan')
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/numpy/testing/_private/utils.py", line 688, in func_assert_same_pos
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-08, atol=1e-08
Failed value for: volume bounds with ku=2, kv=3, kw=4 project point u for pt=[ 0.16  -0.013  0.39 ]
x and y nan location mismatch:
 x: array(nan)
 y: array(0.5)

/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/coverage/inorout.py:519: CoverageWarning: Module pyspline was previously imported, but not measured (module-not-measured)
  self.warn(msg, slug="module-not-measured")

The following tests failed:
test_curves.py:Test.test
test_surfaces.py:Test.test
test_volumes.py:Test.test

Passed:  0
Failed:  3
Skipped: 0

Ran 3 tests using 8 processes
Wall clock time:   00:00:2.60

Name                                                                                         Stmts   Miss  Cover
----------------------------------------------------------------------------------------------------------------
/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/pyspline/pySurface.py     585    247    58%
/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/pyspline/pyCurve.py       401    [219](https://github.com/mdolab/docker/actions/runs/9069190501/job/25147882941#step:13:220)    45%
/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/pyspline/pyVolume.py      486    308    37%
/home/***/.pyenv/versions/3.11.9/lib/python3.11/site-packages/pyspline/utils.py         153     99    35%
----------------------------------------------------------------------------------------------------------------

Steps to reproduce issue

…
…
…

Current behavior

…

Expected behavior

…

Code versions

Operating System:
Python:
OpenMPI:
CGNS:
PETSc:
Compiler:
This repository:

lamkina commented 2 months ago

@eirikurj @A-CGray Here are the details I've uncovered:

src/projections.F90 uses an uppercase 'F' extension and requires compiler pre-processing. This affects the line search because the line search parameters are defined as global constants using a #define statement:
```
! Define some parameters used for all projection functions
#define LSFailMax 2
#define wolfe .001
#define nLine 20
```
Our current compiler setup with ifx doesn't support the necessary pre-processing. I found this thread that somewhat explains the basics of the issue. This thread may also be relevant w.r.t compiler flags for pre-processing.

Either A) We figure out the correct compiler setup for the pre-processing or B) We slightly modify the projections so they don't require pre-processing. Let me know what you think.

A-CGray commented 2 months ago

Nice, thanks for finding this @lamkina . If this is as simple as changing what compiler flags are used to enable pre-processing, and the same flags will work for both ifort and ifx (and gcc?) then I'm fine with just doing that. Otherwise, maybe just switch to defining these as global constants so that we don't need to have another thing depend on the type of compiler we're using (I'm only familiar with doing this in C++ where I'd write something like const int LS_FAIL_MAX = 2;, but I assume you can do something similar in Fortran).

Also, I see adtProjections also has an F90 extension, but I didn't see any preprocessing commands in there, any idea what's going on there?

lamkina commented 2 months ago

I think the cleaner solution is to define these as constants and have all the extensions be lowercase f90.

I don't think the adtProjections need to be F90 so we can test that as part of this fix.

I'll wait for @eirikurj to comment then we can decide on the right way to fix this.

eirikurj commented 2 months ago

I took a look, and I am not convinced that this is a preprocessing issue. To explicitly enable the preprocessor we can include the flag -fpp. This however has no effect and tests still fail. Even removing the define statements by manually search-replace the variables with values should alleviate this issue. However, it does not seem to have any effect as the tests fail.

Disabling compile optimization, i.e., setting -O0, will make the tests pass on an unmodified code, i.e., with the define statements (no -fpp flag used here). Any other optimization setting will make the tests fail. This suggests that optimization is changing the behavior, resulting in a different branch/path being taken (based on values) on the optimized code. Adding the flag -fp-model=precise, disables some optimizations making the floating point comparison more strict and value-safe, and with this option all tests pass regardless of optimization level. I briefly tested this with ifort and the behavior seems to be more inline with what gfortran behavior. Adding the same flag to ifort makes the behavior consistent with ifx using the same flag.

In code, the divergence between ifx and gfortran happens here https://github.com/mdolab/pyspline/blob/b488ed5742edd7dea5c0d2f133974f30b9889e64/src/projections.F90#L153

Running one of the test/points locally, the state of the values are shown below,

Variable	ifx	gfortran
fval	0.973086514811139	0.97308651481113850
nfval	0.973086514811139	0.97308651481113850
pgrad	-4.045624659884222E-021	-4.0456103257696632E-021
step	1.00000000000000	1.00000000000000
c	0.000000000000000E+000	4.0456103257696632E-021

The inputs nfval,fval,step are the equal. pgrad is not equal, but it effectively zero (below machine zero). For ifx since c is zero, the subsequent step is Infinty, effectively breaking the remainder of the iterations, resulting in the test failing. For gfortran the value for c is some very small number. Setting -fp-model=precise for ifx will result in the same value final value as gfortran. However, even if the final result is the same, the path is still not the same. ifx will continue the iteration in the lineLoop until hitting the maximum number of iterations, but not improve on the point. gfortran and ifort both exit the lineLoop earlier, but it seems to be due to inexact numerics.

Aside from compiler tweaks, I think the issue (at least in part) is due to sufficient decrease of the Wolfe condition (see here https://github.com/mdolab/pyspline/blob/b488ed5742edd7dea5c0d2f133974f30b9889e64/src/projections.F90#L139), since technically it should include equals as well, i.e., be a <= comparison, not <. Adding this correction to the if statement thus terminates the loop before this becomes an issue (since the gradient is zero), and the behavior is the same across all compilers ifx, gfortran, and ifort. Granted, this is a small change in behavior, but I think practically there should be no change in final results, and it should avoid unnecessary iterations.

I dont think digging more into the compilers is worth it, aside from just adding the compiler flags I mention above, but feel as I think its important to study ifx since its a new compiler for us. Moving the define to a module is possible, but I dont think is needed. The primary thing is the stopping condition (sufficient decrease), which I have opened a PR for here https://github.com/mdolab/pyspline/pull/75.

Please test this carefully and let me know what you think.

marcomangano commented 2 months ago

Chiming in just to say that this is some incredible detective work! Updating the Wolfe condition here is a no-brainer. I wonder if we should look at other codes and see if there are "risky" if checks like this.

A-CGray commented 2 months ago

Wow @eirikurj this is a great find. I agree with your proposed fix in #75. On the preprocessing side, is -fpp on by default?

eirikurj commented 2 months ago

@A-CGray yes preprocessing is automatically applied to a file if it has one of these extensions shown below (gfortran added for reference also).

ifort - .fpp or .FPP, .f or .F, .f90 or .F90, .for or .FOR, .ftn or .FTN, and .fpp or .FPP. (reference)
ifx - .FPP, .fpp, .F, .F90, .FOR, or .FTN on Linux, and file extension .fpp on Windows. (reference)
gfortran - .F, .FOR, .FTN, .fpp, .FPP, .F90, .F95, .F03 or .F08 (reference)

Since we are using F90 for our source files, we should be good across all compilers we use.

mdolab / pyspline