Severe performance degradation when using AVX2 instructions

Hi there,

We are encountering an issue with odeint-v2 when it is compiled with support for AVX2 instructions. We're aware that many issues of this kind turn out to be compiler bugs rather than problems with odeint-v2 itself. However, in this case we're seeing quantitatively different behaviour from the library when AVX2 is in use, so even if the problem can be attributed to compiler issues we felt it might be useful to report it here.

We are using odeint-v2 to solve a moderately sized system of coupled ODEs (~2000 equations). Each equation is rather complex and is auto-generated via an intermediate computer algebra step. For certain parameter values and most compilers (see below), the controlled error steppers take roughly 100 times more steps when AVX2 is enabled than when we compile for instruction sets up to and including AVX. The integration time per step is slightly better with AVX2 than with AVX, but the large increase in the number of steps means that our total integration times explode. Because the per-step time improves, we do not believe this is a hardware issue, for example one caused by the CPU reducing its clock frequency when AVX2 is in use.

We are seeing this effect with runge_kutta_dopri5, runge_kutta_cash_karp54, runge_kutta_fehlberg78, adams_bashforth_moulton and bulirsch_stoer_dense_out. The final results remain accurate; it is only the amount of work done by odeint-v2 that changes when AVX2 is enabled. The effect is present with Clang 8, gcc 7.2.0, gcc 9.1.0, gcc 10-dev and the Intel compiler icpc 18.0.3. However, it is not present with Clang 10, for which the number of steps remains the same and use of AVX2 gives excellent performance.

We assume this means that, except with Clang 10, use of AVX2 is changing the way odeint-v2 (or perhaps library functions that odeint-v2 depends on) estimates the error in a single step. This presumably leads the stepper to shorten its step size, which increases the total number of steps. However, we haven't managed to work out which parts of the odeint-v2 error control would be susceptible to issues of this kind. Do you think this is a reasonable interpretation of the observed behaviour?

Unfortunately, because our ODE systems are auto-generated we have not yet succeeded in isolating a simple example that exhibits this issue.
