milc-qcd / milc_qcd

MILC collaboration code for lattice QCD calculations

Apparently rapid accumulation of roundoff errors #44

Open daschaich opened 3 years ago

daschaich commented 3 years ago

I'm working with a student (@FelixSpr) on a pure-gauge project, and we've become puzzled by apparently rapid accumulation of roundoff errors in our code based on MILC's pg_ora (over-relaxed quasi-heatbath).

I have now set up a minimal example of this behavior in a new fork of milc_qcd, which you can see as commit 99857446 of daschaich/milc_qcd. Here, after successfully reproducing an existing 4^3x8 test in double precision on a single core (commit 19b2c53), I simply change division by z (a double) to multiplication by 1.0/z.

Despite the tiny lattice volume, after only five 'trajecs' (each with four over-relaxation steps and four quasi-heatbath steps) roundoff is visible in the printed GMES output, and within 20 trajecs this has accumulated to roughly percent-level effects in the plaquette. In earlier tests using our own code that show this behavior, I have checked that this division-to-multiplication step itself produces changes only at the 1e-16 level I would expect in double precision. I have also checked that turning off the over-relaxation steps reduces but doesn't completely remove this roundoff accumulation.

So it looks as though there is some aspect of the pg_ora program that is causing machine-precision roundoff errors to be rapidly blown up by many orders of magnitude. I find this surprising, but I haven't done much prior work with this algorithm. Is this the expected behavior? If not, can you think of any likely culprit(s) deserving more scrutiny? I'll ping @weberjo8 since you seem to have been working on the pure-gauge code relatively recently.

detar commented 3 years ago

Hi David,

This is very strange, indeed. We have not used this part of the code for some 25 years. Is the misbehavior compiler or machine dependent? I'll take a look when I get a chance.

Best,

Carleton


weberjo8 commented 3 years ago

Hi David,

Sorry, I never looked into any details of the ORA algorithm. I just generalized features of the anisotropic version of the pure-gauge codes, tested against existing sample outputs, and repeated some nontrivial unit tests that Alexei had suggested.

Best regards

Johannes


daschaich commented 3 years ago

Thanks Carleton and Johannes, I'm glad I'm not overlooking something immediately obvious.

So far this has occurred wherever we've run and however we've compiled. But that's a short list at the moment---just on laptops and on a local cluster.

By coincidence both my laptop and the cluster were using the same gcc 5.5.0 (with openmpi 1.10.7 for parallel running on the cluster). The cluster has some other compilers available, and on it I just reran the minimal example linked above using gcc 8.3.0 and intel (& intel-mpi) 2019u5.

I see the same qualitative sensitivity to roundoff in all cases, though the specific numbers differ between the gnu and intel compilers. That is, each of {gcc 5.5.0 scalar; gcc 5.5.0 + openmpi 1.10.7 on 2 cores; gcc 8.3.0 scalar} produces the same GMES output as linked above, while {intel 2019u5 scalar; intel 2019u5 + intel-mpi 2019u5 on 2 cores} produce different output that still changes between division by z and multiplication by 1.0/z. In commits 3900ff6 and d71df02 I added the corresponding output for comparison.

detar commented 3 years ago

Hi David,

Divisions typically cost quite a few more processor cycles than multiplications. Is the effect sensitive to the optimization level? Did you try -O3 -ffast-math, or the equivalent, for example? This isn't a bug in the code, of course, but it sounds like a change worth making.

Best,

Carleton


daschaich commented 3 years ago

Sorry for the slow response.

All the tests above used -O3, partly out of habit, partly to introduce as few changes as possible. I have now checked that the gcc 5.5.0 + openmpi 1.10.7 build prints exactly the same output with -O0 (a073b57f).

I also tried turning off -DFAST (ba7388c0, still with -O0 and gcc 5.5.0 + openmpi 1.10.7). Although this did change the output, sensitivity to division by z vs. multiplication by 1.0/z remains (just as happened when testing Intel compilers).

detar commented 3 years ago

Hi David,

If you want to dig deeper, I guess you'd need to look at the assembly code.

Best,

Carleton
