m-a-d-n-e-s-s / madness

Multiresolution Adaptive Numerical Environment for Scientific Simulation
GNU General Public License v2.0

Reproducibility problem in moldft, likely due to race condition #103

Closed: lratcliff closed this issue 2 years ago

lratcliff commented 10 years ago

Two consecutive runs of moldft can have very different convergence behaviour - it starts out the same but by the second or third iteration the energies are very different. They end up at very similar energies, but the differences are such that one calculation might take twice as long as another. It occurs for both LDA (internal rather than libxc) and HF, and it's worse when starting from sto-3g basis rather than 6-31g. I'm running on my desktop without Elemental, compiled with gcc. If I run with MAD_NUM_THREADS=1 rather than 12 it behaves consistently, so I assume there's a race condition somewhere. Here's the input I used:

dft
  xc hf
  aobasis sto-3g
end

geometry
  Na 1.31092303931025942E+01 1.31092303931026493E+01 1.31092303931017025E+01
  Cl 1.78689078546959657E+01 1.78689078546958307E+01 1.78689078546942923E+01
end

robertjharrison commented 10 years ago

Do you have the outputs? What is the first significant difference?

lratcliff commented 10 years ago

The first difference for this particular input is at iteration 1, where the energies are as follows:

run 1:
kinetic 600.56952883
nonlocal psp 0.00000000
nuclear attraction -1506.81368675
coulomb 304.02275807
exchange-correlation -41.16559411
nuclear-repulsion 22.68315473
total -620.70383923

run 2:
kinetic 600.56952883
nonlocal psp 0.00000000
nuclear attraction -1506.81368675
coulomb 304.02275807
exchange-correlation -41.16559411
nuclear-repulsion 22.68315473
total -620.70383923

The difference then increases with each iteration. I'm not sure if there is an easy way to attach long outputs in github, but I'll reply to the mailing list with attachments.

lratcliff commented 10 years ago

Here are the outputs.

lratcliff commented 10 years ago

I've tried to take a closer look at this problem by printing the trace of various quantities to a large number of d.p.s and it seems the differences first appear sooner than I initially thought. When running with 1 thread all quantities are identical to 16 d.p.s, whereas for multiple threads they start off identical until vlocal.reconstruct() in SCF::initial_guess. The difference in the trace of vlocal after the reconstruct is really small (~ 10^-12) but it gradually increases following various operations. There are also other instances where it seems the noise jumps a few orders of magnitude after reconstruct. So either this noise is expected (I'm not sure exactly how reconstruct works) and the algorithm is just really sensitive to the noise, or there's a problem in reconstruct or how/where it's being used.
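
For context on why a ~1e-12 difference after a parallel operation is plausible on its own: floating-point addition is not associative, so any reduction whose grouping depends on the number of threads can legitimately differ in the last bits. The standalone sketch below is plain C++, not MADNESS code; the chunk count of 12 simply mirrors MAD_NUM_THREADS=12. Note this only accounts for rounding-level noise, not the much larger fluctuations discussed later in the thread.

```c++
// Toy illustration (not MADNESS code): summing the same numbers serially and
// in 12 chunks -- a stand-in for a 12-thread reduction -- gives results that
// differ in the last bits, even though neither answer is "wrong".
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    std::mt19937_64 gen(42);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    std::vector<double> x(1 << 20);
    for (double& v : x) v = dist(gen);

    // Serial left-to-right sum (what a single-threaded run effectively does).
    const double serial = std::accumulate(x.begin(), x.end(), 0.0);

    // Same data summed in 12 chunks whose partial sums are then combined,
    // i.e. a different (but equally valid) association of the additions.
    const int nchunks = 12;
    const std::size_t per = x.size() / nchunks;
    double chunked = 0.0;
    for (int c = 0; c < nchunks; ++c) {
        const std::size_t lo = c * per;
        const std::size_t hi = (c == nchunks - 1) ? x.size() : lo + per;
        chunked += std::accumulate(x.begin() + lo, x.begin() + hi, 0.0);
    }

    std::printf("serial  = %.17g\n", serial);
    std::printf("chunked = %.17g\n", chunked);
    std::printf("diff    = %.3e\n", serial - chunked);  // small but typically nonzero
    return 0;
}
```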

robertjharrison commented 10 years ago

Most likely not reconstruct itself but the lack of a barrier somewhere before it. I will look at the outputs over the weekend.

robertjharrison commented 10 years ago

The source of the problem is the Coulomb potential ... it appears that something has broken in the integral operator. The results should indeed be reproducible to within rounding error until either numerical or real randomness causes slightly greater divergence (and there is a random number generator in the localization algorithm that will cause different paths for different numbers of processes/threads).

I will look deeper into the cause.

Robert


lratcliff commented 10 years ago

Alvaro was unable to reproduce the problem with the Intel compiler, so it might be a gcc problem only. I tried with -O0 as well though and that didn't improve matters, so I guess it isn't an optimization problem. Thanks for looking into this.

robertjharrison commented 10 years ago

I was using Intel.

lratcliff commented 10 years ago

Interesting, maybe he used a different version.

robertjharrison commented 10 years ago

It seems to be working if I use a version prior to the merge with seprep.

Could someone else check, please? You will have to go back to the Google Code repo unless you can figure out how to check out an old SVN revision from within GitHub (it is supposed to be there, but ...).


fbischoff commented 10 years ago

Will have a look into it.


fbischoff commented 10 years ago

I've checked some more revisions back to revision 2400, and the error was present in all cases. Revisions before that I could not compile (something with wrap_opaque). If only 1 thread was used the error vanished. So the merge with the seprep branch was not the cause.


justusc commented 10 years ago

@robertjharrison I added a last_svn tag.

robertjharrison commented 10 years ago

The version I was checking was dated to about May 2013 ... sadly I have destroyed the SVN version info on that copy.


justusc commented 10 years ago

The old svn repo is still present on Google Code. The best way to match svn revision numbers is by date/time in the git repo. It helps if you can view the history in a desktop application.

fbischoff commented 10 years ago

I found another example, H2S, for which the same code with different mallocs gives different results:

debug: works
debug/tcmalloc: works
nodebug: fails
nodebug/tcmalloc: works

Each configuration was run 3 times independently. The two debug versions also give identical results, but nodebug/tcmalloc is different (though consistent with itself). For NaCl all versions failed. With only 1 thread everything always works.

To me this looks fairly weird, maybe someone else has some idea?


naromero77 commented 10 years ago

Could it be some "const static" data issue?


fbischoff commented 10 years ago

What would that issue be?


robertjharrison commented 10 years ago

I have confirmed that the old version from circa May 2013 is 100% reproducible, down to +/-2 in the last digit printed in the energy of the first iteration ... and this is what it should be!

This is using icc with either -O1 or -O3 compilation options and no tcmalloc, so optimization does not seem to be the problem. Probably not malloc either, though that might still tie in.

In contrast the current version (with -O3 and no tcmalloc) is all over the map on the first iteration (roughly 1e-4 fluctuations).

Again, it is computation of the initial guess Coulomb potential that is the first symptom.

I have looked at some obvious candidates for computation order or thread-safety dependent changes in the matrix elements but cannot find anything.

So what is left is bisecting through to find the guilty party ... hopefully it will be a clear error instead of a subtle exposure of a deeper issue.

I can hopefully do this over the weekend.

Robert


naromero77 commented 10 years ago

Robert,

Have you narrowed it down to a few files or even functions?

About a year ago, I ran the Intel Static Analyzer on the MADNESS test suite and it identified many things (some of them the same issue over and over again) that were potentially problematic. I can take a look at the reports, but it would be easier to parse them if you have narrowed things down for me. I could also just send you those reports.


fbischoff commented 10 years ago

I also found the Coulomb potential to be the culprit, and I will continue to look at it. However, I don't think that bisecting will work: the error appeared much earlier. This is what I got from revision 2400 of July 2011:

svn info:

Path: .
URL: https://m-a-d-n-e-s-s.googlecode.com/svn/local/trunk
Repository Root: https://m-a-d-n-e-s-s.googlecode.com/svn
Repository UUID: 8324f04f-9933-0410-b0a4-d546c816ed44
Revision: 2400
Node Kind: directory
Schedule: normal
Last Changed Author: justus.c79@gmail.com
Last Changed Rev: 2400
Last Changed Date: 2011-07-03 17:09:32 +0200 (Sun, 03 Jul 2011)

config.status:

configured by /users/flbi/devel/old/r2400/trunk/configure, generated by GNU Autoconf 2.63, with options "'LDFLAGS=-L/usr/lib64 -L/opt/intel/111_20100414/mkl/lib/em64t' 'LIBS=-lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm' 'MPICC=/users/flbi/devel/install/mvapich2-1.8/bin/mpicc' 'MPICXX=/users/flbi/devel/install/mvapich2-1.8/bin/mpicxx' 'CC=/usr/bin/gcc-4.7 -m64' 'CXX=/usr/bin/g++-4.7 -m64' 'F77=/usr/bin/gfortran-4.7 -m64'"

run 1:

Iteration 0 at time 4.1s
warning: boys localization did not fully converge: 376
timer: Boys localize 0.57s 0.57s
timer: Make densities 0.09s 0.09s
timer: Load balancing 0.00s 0.00s
timer: Coulomb 0.25s 0.25s
timer: LDA potential 0.06s 0.06s
timer: V*psi 0.35s 0.35s
timer: Truncate Vpsi 0.40s 0.40s
timer: PE matrix 0.04s 0.04s
timer: KE matrix 0.56s 0.56s
kinetic 602.41910653
nuclear attraction -1507.17761477
coulomb 309.39466089
exchange-correlation -40.26463997
nuclear-repulsion 22.68315473
total -612.94533258

run 2:

Iteration 0 at time 4.2s
warning: boys localization did not fully converge: 674
timer: Boys localize 0.57s 0.57s
timer: Make densities 0.09s 0.09s
timer: Load balancing 0.00s 0.00s
timer: Coulomb 0.28s 0.28s
timer: LDA potential 0.06s 0.06s
timer: V*psi 0.35s 0.35s
timer: Truncate Vpsi 0.40s 0.40s
timer: PE matrix 0.04s 0.04s
timer: KE matrix 0.56s 0.56s
kinetic 602.41911695
nuclear attraction -1507.17759731
coulomb 309.39464838
exchange-correlation -40.26463879
nuclear-repulsion 22.68315473
total -612.94531604


robertjharrison commented 10 years ago

I have no clue what the issue is except that it is tied to the application of the operator. The structure of the resulting tree is correct; it's just that a few boxes have slightly incorrect numbers in them.

I doubt that it is static data that is the problem. Could be, but doubtful.


robertjharrison commented 10 years ago

Going too far back will find other errors and the version I have from 2013 is unambiguously working OK and as expected.


fbischoff commented 10 years ago

In the seprep branch I introduced new algorithms for the operator apply (recursive_apply, etc.). One can switch back to the old one (apply) and get rid of a lot of potentially erroneous code, but that didn't change anything for me. However, I noticed that the Coulomb operator has only 8 significant digits, and after addition to the nuclear potential, which is of the order of 1e5, most digits are gone. Maybe there's a problem with the norm computations and screening in operator.h?

Will continue on this tomorrow.
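
To make the "most digits are gone" concern concrete, here is a deliberately crude toy of my own (invented numbers; it is not MADNESS's actual per-node, wavelet-basis truncation): if coefficients of the combined potential are dropped relative to the norm of the sum, the ~1e5 nuclear part sets the absolute cutoff, and detail carried only by the ~1e2 Coulomb part can fall below it even though that detail is well above the Coulomb function's own 8-significant-digit accuracy.

```c++
// Toy illustration only (invented numbers; NOT the actual MADNESS truncation
// scheme): drop coefficients of the combined potential that are small relative
// to the norm of the sum. The ~1e5 nuclear part fixes the absolute cutoff, so
// fine structure carried only by the ~1e2 Coulomb part is wiped out.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const double thresh = 1e-6;  // hypothetical relative truncation threshold

    // coefficient 0: the large, smooth part; coefficients 1-3: Coulomb-only detail
    std::vector<double> coulomb = {309.39466089, 3.1e-2, 4.7e-3, 8.2e-4};
    std::vector<double> nuclear = {-1.5e5,       0.0,    0.0,    0.0};

    std::vector<double> total(coulomb.size());
    double norm2 = 0.0;
    for (std::size_t i = 0; i < total.size(); ++i) {
        total[i] = coulomb[i] + nuclear[i];
        norm2 += total[i] * total[i];
    }
    const double cutoff = thresh * std::sqrt(norm2);  // ~0.15 with these numbers

    for (std::size_t i = 0; i < total.size(); ++i) {
        const double kept = (std::fabs(total[i]) < cutoff) ? 0.0 : total[i];
        std::printf("coeff %zu: total=% .8e  kept after truncation=% .8e\n",
                    i, total[i], kept);
    }
    // Coefficients 1-3 (pure Coulomb detail at the 1e-2..1e-4 level) all fall
    // below the ~0.15 cutoff inherited from the nuclear term and are zeroed.
    return 0;
}
```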


robertjharrison commented 10 years ago

I agree ... I too switched back to the old algorithm and the problem persisted.

8-sig fig is plenty for the Coulomb operator for the initial DFT ... probably overkill. The normwise error analysis of singular functions is not appropriate ... and empirically we get great agreement with standard tests.

I was also looking at the operator computation and norm estimation in the operator code ... but not for a math error ... for something that would give different answers if things were computed in different orders (and this does happen because of the way operators are cached and 2-scale is used to walk up/down).
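
As one contrived example of the kind of order dependence meant here (my own toy, not how operator.h actually screens): if a contribution is skipped whenever it looks negligible relative to what has already been accumulated, then the order in which large, mutually cancelling pieces arrive decides whether a borderline term survives, and two orderings can differ by far more than round-off.

```c++
// Contrived toy (not operator.h): screen out a term if it is negligible
// relative to the running accumulation. Which borderline terms survive then
// depends on arrival order, so the discrepancy is much larger than round-off.
#include <cmath>
#include <cstdio>
#include <vector>

double screened_sum(const std::vector<double>& terms, double tol) {
    double acc = 0.0;
    for (double t : terms) {
        if (std::fabs(t) < tol * std::fabs(acc)) continue;  // "screened" away
        acc += t;
    }
    return acc;
}

int main() {
    const double tol = 1e-6;  // invented screening tolerance

    // Same four contributions, two arrival orders (e.g. different task schedules).
    std::vector<double> order1 = {1.0e5, 0.05, -1.0e5, 7.0};  // 0.05 arrives while acc ~ 1e5 -> dropped
    std::vector<double> order2 = {0.05, 1.0e5, -1.0e5, 7.0};  // 0.05 arrives first        -> kept

    const double a = screened_sum(order1, tol);
    const double b = screened_sum(order2, tol);
    std::printf("order 1: %.10g\norder 2: %.10g\ndifference: %.3e\n", a, b, a - b);
    // Prints 7 vs 7.05: a 5e-2 discrepancy from ordering alone, not from rounding.
    return 0;
}
```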


naromero77 commented 10 years ago

Just to confirm. Is it correct with a single thread?


robertjharrison commented 10 years ago

Yes. Current version is correct with 1 thread.

fbischoff commented 10 years ago

I did some more tests, and I think we might not have a code bug but rather numerical instabilities.

When I run the current git code (with -O3), but with apply instead of recursive_apply, I get deviations in the trace of the Coulomb potential in the guess of the order of 1e-8:

test_imprecise-operator1.out:  coulomb 8.055184415233437903e+05 (8 threads)
test_imprecise-operator2.out:  coulomb 8.055184418090274557e+05 (8 threads)
test_imprecise-operator_t1.out: coulomb 8.055184414484518347e+05 (1 thread)

When I repeat the runs with more accurate norm estimation in operator.h (always using the block at operator.h:571-584) I get identical results:

test_precise-operator1.out:  coulomb 8.055184442296256311e+05 (8 threads)
test_precise-operator2.out:  coulomb 8.055184442296256311e+05 (8 threads)
test_precise-operator_t1.out: coulomb 8.055184442296256311e+05 (1 thread)

So the error in the Coulomb potential might just be numerical noise, depending on the order in which the displacements are executed in the operator apply.

Also, if I increase the precision from 1e-4 (the default in the compute protocol) to 1e-6, the iterations are more stable. The same is true for molecules with lighter elements. I guess in this case the operator screening is less sensitive with respect to the operator norm, since the source coefficient norms are smaller.
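
A minimal sketch (plain C++ for illustration, not MADNESS code) of the order dependence described above: summing the same contributions in two different orders, as different thread schedules effectively do, gives results that differ in the low-order digits because floating-point addition is not associative.

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // Mimic two large, nearly cancelling terms plus many tiny tail
    // contributions, like the screened displacement sums discussed here.
    std::vector<double> terms = {1.0e5, -1.0e5};
    for (int i = 0; i < 1000; ++i) terms.push_back(1.0e-13 * (i % 7 - 3));

    double forward = std::accumulate(terms.begin(), terms.end(), 0.0);
    std::reverse(terms.begin(), terms.end());
    double reversed = std::accumulate(terms.begin(), terms.end(), 0.0);

    // The two orderings agree to many digits but are not bitwise identical.
    std::printf("forward : %.17e\nreversed: %.17e\ndiff    : %.3e\n",
                forward, reversed, forward - reversed);
    return 0;
}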

robertjharrison commented 10 years ago

Good catch .. Hopefully.

The integral convolution algorithm has no numerical instabilities ... i.e., there are NO places where we can get significant numerical cancellations (at least for Coulomb). The results should be (and WERE) fully reproducible despite computing to low but controlled precision.

Also, if the norm estimation is merely inaccurate, we should still reproducibly get the same wrong answer.

The key thing is whether the imprecise norm computation is also dependent upon the order of computation (e.g. it uses a random number and does not completely converge, or has an uninitialized variable, etc.).

I'll look at the code you point out and see how it relates to the previous code we had. I will also test the new and old codes for speed.

The goal is fast and accurate, and previously this was only accomplished through paranoid testing.
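
A schematic, with hypothetical names rather than the actual operator.h code, of why a last-digit wobble in the norm estimate can matter: screening keeps a contribution only if its estimated size exceeds the truncation tolerance, so a contribution sitting right at the threshold can be kept in one run and dropped in another, changing the result by roughly the tolerance rather than by the size of the wobble itself.

#include <cstdio>

// Hypothetical screening test: apply a displacement only if its estimated
// contribution exceeds the truncation tolerance.
bool keep_contribution(double op_norm_estimate, double source_norm, double tol) {
    return op_norm_estimate * source_norm > tol;
}

int main() {
    const double tol = 1e-4, source_norm = 1.0;
    // Two runs whose norm estimates differ only at the ~1e-8 relative level ...
    const double estimate_run1 = 1.00000001e-4;
    const double estimate_run2 = 0.99999999e-4;
    // ... land on opposite sides of the threshold, so one run keeps the
    // contribution and the other drops it.
    std::printf("run 1 keeps contribution: %d\n",
                keep_contribution(estimate_run1, source_norm, tol));
    std::printf("run 2 keeps contribution: %d\n",
                keep_contribution(estimate_run2, source_norm, tol));
    return 0;
}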

robertjharrison commented 10 years ago

I am still tracking this down and it is tied in with a very substantial slowdown of the code ... in addition to taking many more iterations to converge due to the numerical errors, each iteration is taking over twice as long. The old working code runs NaCl reliably in about 85s with the last iteration taking 10s, whereas the current version takes 398s with the last iteration taking 29s.

Nearly every operation is taking a lot longer, which to me is indicative of numerical noise causing the wave functions to use a lot more coefficients than would be necessary without the noise.

fbischoff commented 10 years ago

To make the 6D code work I had to adapt the thresholds so that they account for the dimensionality in truncate_tol(). That had a significant impact on the performance:

/// here is our handwaving argument:
/// this threshold will give each FunctionNode an error of less than tol. The
/// total error can then be as high as sqrt(#nodes) * tol. Therefore in order
/// to account for higher dimensions: divide tol by about the root of number
/// of siblings (2^NDIM) that have a large error when we refine along a deep
/// branch of the tree.

..
const static double fac=1.0/std::pow(2,NDIM*0.5);
tol*=fac;

Florian
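
A standalone sketch of just the dimensional factor quoted above (the real truncate_tol() in MADNESS may fold in other factors, such as the box level, so treat this purely as an illustration of how much the per-node tolerance tightens with dimension):

#include <cmath>
#include <cstdio>

// Per-node tolerance after applying the 1/2^(NDIM/2) factor discussed above.
double dimensional_truncate_tol(double tol, int ndim) {
    const double fac = 1.0 / std::pow(2.0, 0.5 * ndim);
    return tol * fac;
}

int main() {
    // With the default compute-protocol threshold of 1e-4:
    std::printf("NDIM=3: tol = %.3e\n", dimensional_truncate_tol(1e-4, 3));  // ~3.5e-5
    std::printf("NDIM=6: tol = %.3e\n", dimensional_truncate_tol(1e-4, 6));  //  1.25e-5
    return 0;
}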

robertjharrison commented 10 years ago

Well, we need to remedy that because it is clearly doing a lot of unnecessary work.

Once I find the source of the numerical/thread problems I will try to understand the tolerances.

Minimally, multiplying FAC by 8 will restore the old behavior in 3D.

robertjharrison commented 10 years ago

The reproducibility problem for NaCl has been fixed and it was likely fairly specific to that system ... in the canonical algorithm I had swapped the enforcement of positive phases and the elimination of rotations among degenerate eigenvectors, but I think there was an assumption in the rotation algorithm that the phases were positive. Anyway, I can now run NaCl with canonical orbitals and get reproducible results regardless of thread count.

The code is still very slow and I am chasing this down ... I will start another thread about this.
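
An illustrative sketch of the kind of phase fixing referred to here (not the actual MADNESS routine): eigenvectors are only defined up to sign, so a deterministic convention, such as making the largest-magnitude element of each column positive, has to be applied before any step that assumes positive phases.

#include <cmath>
#include <cstddef>
#include <vector>

// Flip the sign of each eigenvector (stored as column j of an n x n
// column-major matrix) so that its largest-magnitude element is positive.
void enforce_positive_phases(std::vector<double>& evec, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        std::size_t imax = 0;
        for (std::size_t i = 1; i < n; ++i)
            if (std::fabs(evec[j * n + i]) > std::fabs(evec[j * n + imax])) imax = i;
        if (evec[j * n + imax] < 0.0)
            for (std::size_t i = 0; i < n; ++i) evec[j * n + i] = -evec[j * n + i];
    }
}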

lratcliff commented 10 years ago

Has this fix been checked in? I just tried rerunning the test and it's much better than before, but there are still some small differences between runs, of the order of 10^-7 for the total energy and up to 10^-5 in various components. This is about the same level as I've seen for other systems, so maybe this is a separate problem?

robertjharrison commented 10 years ago

Yes, the fix is in and the variation you note is fine. Such variations result from the path of the computation diverging slightly due to different rounding errors (e.g., the localized orbitals being slightly different). The rate and path of convergence should be overall very similar and the converged energy in very good agreement. The energy is more accurate than the dipole, eigenvalues, etc. due to its variational property.

naromero77 commented 9 years ago

Is this still an issue?

robertjharrison commented 9 years ago

Not in any of my tests, but I have not had opportunity to test on large processor counts.

robertjharrison commented 2 years ago

All tests currently seem reproducible.