shankar1729 / jdftx

JDFTx: software for joint density functional theory
http://jdftx.org

compile on NERSC #69

Closed KwonSnow closed 5 years ago

KwonSnow commented 5 years ago

JDFTx-NERSC-results.zip

Hello.

I tried to compile JDFTx on Cori (NERSC). The build output I attached shows no obvious problems, but none of the tests were successful. Would you please take a look? JDFTx-NERSC-results.zip

Sincerely, Soonho

shankar1729 commented 5 years ago

Hi Soonho,

Did you perhaps run "make test" directly on the login / frontend node? If I remember correctly, NERSC prevents MPI-compiled executables from running on the login nodes, which is why all the tests failed immediately. Try running "make test" within a debug job to avoid this. You might also want to set the JDFTX_LAUNCH variable within the script; see http://jdftx.org/Testing.html for more options.
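For example, the sequence could look like this (just a sketch; the queue name, constraint, and time limit here are my assumptions, so check the current NERSC documentation):

```shell
# Sketch: run the test suite from inside an interactive debug job.
# Queue/constraint/time values are assumptions, not NERSC-verified.
salloc -N 1 -q debug -C haswell -t 00:30:00
cd ~/jdftx/build                       # wherever you built JDFTx
export JDFTX_LAUNCH="mpirun -n %d"     # %d gets replaced by the process count
make test
```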

Best, Shankar

KwonSnow commented 5 years ago

Hello, Dr. Shankar.

I tried to test JDFTx through an interactive job with the debug QOS. But even after allocating a single node, I got the same results; they are attached below. When I submitted a regular batch job, it was canceled while the code was allocating the electronic variables. I also tried using fewer processors to reduce the memory requirement, but it still doesn't work.

Thank you.

Sincerely, Soonho

shkwon@nid02518:/global/u1/s/shkwon/jdftx-1.4.2/build> module load gsl cray-fftw
shkwon@nid02518:/global/u1/s/shkwon/jdftx-1.4.2/build> make testclean
Built target testclean
shkwon@nid02518:/global/u1/s/shkwon/jdftx-1.4.2/build> export JDFTX_LAUNCH="mpirun -n %d"
shkwon@nid02518:/global/u1/s/shkwon/jdftx-1.4.2/build> make test
Running tests...
Test project /global/u1/s/shkwon/jdftx-1.4.2/build
    Start  1: openShell
 1/10 Test  #1: openShell ........................Failed    0.20 sec
    Start  2: vibrations
 2/10 Test  #2: vibrations .......................Failed    0.18 sec
    Start  3: moleculeSolvation
 3/10 Test  #3: moleculeSolvation ................Failed    0.18 sec
    Start  4: ionSolvation
 4/10 Test  #4: ionSolvation .....................Failed    0.17 sec
    Start  5: latticeOpt
 5/10 Test  #5: latticeOpt .......................Failed    0.20 sec
    Start  6: metalBulk
 6/10 Test  #6: metalBulk ........................Failed    0.18 sec
    Start  7: plusU
 7/10 Test  #7: plusU ............................Failed    0.17 sec
    Start  8: spinOrbit
 8/10 Test  #8: spinOrbit ........................Failed    0.20 sec
    Start  9: graphene
 9/10 Test  #9: graphene .........................Failed    0.16 sec
    Start 10: metalSurface
10/10 Test #10: metalSurface .....................Failed    0.16 sec

0% tests passed, 10 tests failed out of 10

Total Test time (real) = 2.05 sec

The following tests FAILED:
	  1 - openShell (Failed)
	  2 - vibrations (Failed)
	  3 - moleculeSolvation (Failed)
	  4 - ionSolvation (Failed)
	  5 - latticeOpt (Failed)
	  6 - metalBulk (Failed)
	  7 - plusU (Failed)
	  8 - spinOrbit (Failed)
	  9 - graphene (Failed)
	 10 - metalSurface (Failed)
Errors while running CTest
make: *** [Makefile:84: test] Error 8


KwonSnow commented 5 years ago

I used a single knl node.

shkwon@cori10:/global/u1/s/shkwon/jdftx-1.4.2/build> salloc -N 1 -q debug -C knl
salloc: Pending job allocation 23795504
salloc: job 23795504 queued and waiting for resources
salloc: job 23795504 has been allocated resources
salloc: Granted job allocation 23795504
salloc: Waiting for resource configuration
salloc: Nodes nid02518 are ready for job


shankar1729 commented 5 years ago

From an offline comment from @TiffanyAnn (related to issue #64):

For testing the code on Cori, you need to request the interactive QOS with the command

salloc -N 1 -C haswell -q interactive -t 01:00:00

(use -C knl for the KNL nodes), then run make test, and all the tests should pass.

Let me know if that works for you.

KwonSnow commented 5 years ago

Thank you for your suggestion.

It worked, but it took much longer to finish than the times reported at http://jdftx.org/Testing.html. I followed the exact instructions at http://jdftx.org/Supercomputers.html.

Sincerely, Soonho

=====================================

shkwon@cori05:~/jdftx-1.4.2/build> salloc -N 1 -C knl -q interactive -t 01:00:00
salloc: Granted job allocation 23797661
salloc: Waiting for resource configuration
salloc: Nodes nid02306 are ready for job
shkwon@nid02306:~/jdftx-1.4.2/build> make test
Running tests...
Test project /global/homes/s/shkwon/jdftx-1.4.2/build
    Start  1: openShell
 1/10 Test  #1: openShell ........................   Passed  179.58 sec
    Start  2: vibrations
 2/10 Test  #2: vibrations .......................   Passed  251.97 sec
    Start  3: moleculeSolvation
 3/10 Test  #3: moleculeSolvation ................   Passed  583.34 sec
    Start  4: ionSolvation
 4/10 Test  #4: ionSolvation .....................   Passed  377.57 sec
    Start  5: latticeOpt
salloc: Job 23797661 has exceeded its time limit and its allocation has been revoked.
make: *** [Makefile:84: test] Terminated
Terminated
shkwon@nid02306:~/jdftx-1.4.2/build> exit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: nid02306: task 0: Exited with exit code 143
srun: Terminating job step 23797661.0


shankar1729 commented 5 years ago

Try exporting JDFTX_LAUNCH="mpirun --bind-to none" beforehand. You may also need to mix in NERSC's hybrid MPI/threads job-running instructions; they keep changing these, and I don't have access to these clusters at the moment.

shankar1729 commented 5 years ago

One more thing: I just noticed you were running on the Xeon Phis. Those have much lower per-core performance, and these tests are not set up (and are too small) to run in parallel, so you will end up using at most 4 of the 64 Phi cores. Use the regular CPU nodes for these tests.

More generally, we typically don't get great performance on the Phis (Intel seems to have given up on them too), and have focused our optimizations on CPUs and NVIDIA GPUs.

Best, Shankar

KwonSnow commented 5 years ago

I tried exporting JDFTX_LAUNCH="mpirun --bind-to none" on a Haswell node, but it didn't work (all 10 tests failed). Do you think the recent Cori update might be the problem? https://www.nersc.gov/users/computational-systems/cori/updates-and-status/programming-environment-change-on-cori-in-july-2019/

Thank you for your help.

Sincerely, Soonho


shankar1729 commented 5 years ago

Hi Soonho,

I need a little more info on the failure: could you examine the .out files within the test/* directories and find out what error messages they end in? (And maybe attach a couple of these files.)

Best, Shankar

KwonSnow commented 5 years ago

Since none of the tests was successful, there were no .out files in the test directories. The following is the output from the interactive session.

=====================================
shkwon@cori07:~/jdftx-1.4.2/build> make testclean
Built target testclean
shkwon@cori07:~/jdftx-1.4.2/build> export JDFTX_LAUNCH="mpirun --bind-to none"
shkwon@cori07:~/jdftx-1.4.2/build> salloc -N 1 -C haswell -q interactive -t 01:00:00
salloc: Granted job allocation 23835628
salloc: Waiting for resource configuration
salloc: Nodes nid00120 are ready for job
shkwon@nid00120:~/jdftx-1.4.2/build> make test
Running tests...
Test project /global/homes/s/shkwon/jdftx-1.4.2/build
    Start  1: openShell
 1/10 Test  #1: openShell ........................Failed    0.06 sec
    Start  2: vibrations
 2/10 Test  #2: vibrations .......................Failed    0.18 sec
    Start  3: moleculeSolvation
 3/10 Test  #3: moleculeSolvation ................Failed    0.16 sec
    Start  4: ionSolvation
 4/10 Test  #4: ionSolvation .....................Failed    0.05 sec
    Start  5: latticeOpt
 5/10 Test  #5: latticeOpt .......................Failed    0.04 sec
    Start  6: metalBulk
 6/10 Test  #6: metalBulk ........................Failed    0.05 sec
    Start  7: plusU
 7/10 Test  #7: plusU ............................Failed    0.12 sec
    Start  8: spinOrbit
 8/10 Test  #8: spinOrbit ........................Failed    0.05 sec
    Start  9: graphene
 9/10 Test  #9: graphene .........................Failed    0.07 sec
    Start 10: metalSurface
10/10 Test #10: metalSurface .....................Failed    0.05 sec

0% tests passed, 10 tests failed out of 10

Total Test time (real) = 0.97 sec

The following tests FAILED:
	  1 - openShell (Failed)
	  2 - vibrations (Failed)
	  3 - moleculeSolvation (Failed)
	  4 - ionSolvation (Failed)
	  5 - latticeOpt (Failed)
	  6 - metalBulk (Failed)
	  7 - plusU (Failed)
	  8 - spinOrbit (Failed)
	  9 - graphene (Failed)
	 10 - metalSurface (Failed)
Errors while running CTest
make: *** [Makefile:84: test] Error 8


shankar1729 commented 5 years ago

It might be that the --bind-to none addition to mpirun is not accepted by the MPI implementation you are using on Cori. (Also, it was supposed to be "mpirun --bind-to none -n %d"; we only added the --bind-to none part.)

Looking further at the NERSC instructions for hybrid MPI/threads codes, it appears that the MPI launcher within jobs is supposed to be srun and not mpirun. So you might want to try export JDFTX_LAUNCH="srun -n %d -c1". This may take care of CPU binding as well.
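For clarity, the %d in JDFTX_LAUNCH is a placeholder that gets substituted with the MPI process count when the tests are launched. A rough illustration of that substitution (hypothetical code, not JDFTx's actual test driver):

```python
# Hypothetical illustration of how a JDFTX_LAUNCH-style template expands;
# JDFTx's actual test scripts perform this substitution internally.
def expand_launch(template: str, nprocs: int) -> str:
    """Replace the %d placeholder with the MPI process count."""
    return template % nprocs

print(expand_launch("srun -n %d -c1", 4))               # -> srun -n 4 -c1
print(expand_launch("mpirun --bind-to none -n %d", 2))  # -> mpirun --bind-to none -n 2
```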

KwonSnow commented 5 years ago

I'm sorry to say that even srun didn't work, and there is no output in the test directories. Is there any debug run option to see all the errors in jdftx?

shkwon@cori04:~/jdftx-1.4.2/build> module load gsl cray-fftw
shkwon@cori04:~/jdftx-1.4.2/build> make testclean
Built target testclean
shkwon@cori04:~/jdftx-1.4.2/build> export JDFTX_LAUNCH="srun -n %d -c1"
shkwon@cori04:~/jdftx-1.4.2/build> salloc -N 1 -C haswell -q interactive -t 01:00:00
salloc: Granted job allocation 23841464
salloc: Waiting for resource configuration
salloc: Nodes nid00051 are ready for job
shkwon@nid00051:~/jdftx-1.4.2/build> make test
Running tests...
Test project /global/homes/s/shkwon/jdftx-1.4.2/build
    Start  1: openShell
 1/10 Test  #1: openShell ........................Failed    0.32 sec
    Start  2: vibrations
 2/10 Test  #2: vibrations .......................Failed    0.27 sec

...
Errors while running CTest
make: *** [Makefile:84: test] Error 8


shankar1729 commented 5 years ago

Okay, I'd say we give up on performance-tuning "make test" since it's not worth it. You already found that make test gives the correct results, albeit slowly. Now just run regular jobs with jdftx, which will give the full outputs, and we can discuss whether the performance is reasonable. At least that way you will get some output, and we won't be left wondering why these jobs produce no output.
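A minimal batch-script sketch for such a regular job (the input/output names, process and core counts, and module set below are placeholders of mine; adapt them to NERSC's current hybrid MPI/OpenMP guidelines):

```shell
#!/bin/bash
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -q regular
#SBATCH -t 02:00:00
# Placeholders throughout: adjust paths, counts, and input names to your setup.
module load gsl cray-fftw
JDFTX_BUILD=~/jdftx-1.4.2/build
# e.g. 4 MPI tasks x 8 cores each on a 32-core Haswell node:
srun -n 4 -c 8 --cpu-bind=cores $JDFTX_BUILD/jdftx -i myCalc.in -o myCalc.out
```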