Closed dingld closed 9 years ago
Please attach your input/output.
On Thu, Aug 20, 2015 at 2:05 AM, dingld notifications@github.com wrote:
The program didn't complain about any error, but I am afraid something actually went wrong. I ran the same calculation, DMRG-CASCI-[12e,28o], with different numbers of procs (8 vs 16), but the total time is almost the same (7424.351 s vs 7384.303 s). I compiled openmpi with gcc-4.8.4, and boost-1.5.5 the same way. By the way, the same calculation using Block-0.9.6 took about 2000 s using 16 procs. How should I deal with this?
It might be easier to send me a personal message with the input/output and/or a description of the system that you are trying to run.
SYSTEM--
LSB Version: :core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 6.3 (Santiago)
Release: 6.3
Codename: Santiago
INPUT--
nelec 12
spin 0
irrep 1
point_group d2h
sweep_tol 1.0e-9
schedule default
outputlevel 0
maxiter 24
maxm 2000
twodot
orbitals FCIDUMP1.68
reorder order1.68
scratch ./tmp
hf_occ integral
OUTPUT--last few lines
Block Iteration :: 24
System Block Sites :: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 # states: 110 # states: 110
Environment Block Sites :: 0 1 # states: 10 # states: 10
Total discarded weight 0.0000000000
Total block energy for State [ 0 ] with 2000 States :: -2099.3167257663
Finished Sweep with 2000 states and sweep energy for State [ 0 ] with Spin [ 0 ] :: -2099.3167599819
Largest Error for Sweep with 2000 states is 0.0000161238
M = 2000 state = 0 Largest Discarded Weight = 1.612e-05 Sweep Energy = -2099.3167599819
============================================================================
Elapsed Sweep CPU Time (seconds): 17167.590
Elapsed Sweep Wall Time (seconds): 1089.231
Finished Sweep Iteration 30
BLOCK CPU Time (seconds): 116102.920
BLOCK Wall Time (seconds): 7384.303
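One quick sanity check on the timings above is the ratio of CPU time to wall time, which indicates roughly how many cores were busy on average. The following is a minimal Python sketch that uses only the numbers quoted in this output (presumably the 16-process run, given the order of the times in the report). The function name `busy_cores` is an illustrative choice and not part of Block.

```python
def busy_cores(cpu_seconds: float, wall_seconds: float) -> float:
    """Average number of cores kept busy: total CPU time divided by wall time."""
    if wall_seconds <= 0:
        raise ValueError("wall time must be positive")
    return cpu_seconds / wall_seconds

# Values copied from the output above.
total = busy_cores(116102.920, 7384.303)      # whole run
last_sweep = busy_cores(17167.590, 1089.231)  # final sweep only

print(f"whole run:  {total:.1f} cores busy on average")
print(f"last sweep: {last_sweep:.1f} cores busy on average")
```

A ratio close to the process count suggests that every process was consuming CPU time, but it cannot show that the time was spent on useful work: processes that spin while waiting on communication may also be counted as busy.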
Can you please send the entire output files of the two runs (with 8 and 16 cores)? It's hard to diagnose anything with just the last few lines of the output.
Sandeep.
Are you able to send us the input file + integrals so we can try the calculation? Also, when you ran v0.9.6, did you observe the same number of sweeps etc. in the faster run? It seems impossible that it could run 4 times faster with the same sweep schedule.
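To compare the number of sweeps and the per-sweep wall times between two runs, the relevant lines can be pulled out of each output file. The sketch below is a rough Python example that assumes the log contains lines in the format shown in the output posted in this thread (`Finished Sweep Iteration`, `Elapsed Sweep Wall Time (seconds):`, `BLOCK Wall Time (seconds):`). Whether v0.9.6 prints exactly the same lines is an assumption, and `summarize` is a hypothetical helper name.

```python
import re

# Patterns modelled on the output lines quoted in this thread.
SWEEP_DONE = re.compile(r"Finished Sweep Iteration\s+(\d+)")
SWEEP_WALL = re.compile(r"Elapsed Sweep Wall Time \(seconds\):\s*([0-9.]+)")
TOTAL_WALL = re.compile(r"BLOCK Wall Time \(seconds\):\s*([0-9.]+)")

def summarize(log_text: str) -> dict:
    """Collect sweep counts and wall times from a Block-style output log."""
    sweeps = [int(n) for n in SWEEP_DONE.findall(log_text)]
    sweep_walls = [float(t) for t in SWEEP_WALL.findall(log_text)]
    total = TOTAL_WALL.search(log_text)
    return {
        "last_sweep": max(sweeps) if sweeps else None,
        "sweep_wall_times": sweep_walls,
        "total_wall": float(total.group(1)) if total else None,
    }

# Sample text copied from the end of the output posted above.
sample = """\
Elapsed Sweep CPU Time (seconds): 17167.590
Elapsed Sweep Wall Time (seconds): 1089.231
Finished Sweep Iteration 30
BLOCK CPU Time (seconds): 116102.920
BLOCK Wall Time (seconds): 7384.303
"""
summary = summarize(sample)
print(summary)
```

Running `summarize` on the full output of each run and comparing `last_sweep` and `sweep_wall_times` side by side would show whether the faster run really followed the same sweep schedule.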
It is quite possible that for such a small number of orbitals, on a single node, the speedup between 8 cores and 16 cores will not be good. For comparison, a recent timing with 8 cores for a 10e/41 orbital H2O/ANO-DZ/M=1000 sweep is ~1200 s, while with 16 cores it is about 800 s.
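To put the two sets of timings on a common footing, one rough option is to fit Amdahl's law, T(p) = T1 * (s + (1 - s) / p), to each pair of runs and solve for the serial fraction s. This is only an illustrative model chosen here, not something Block reports, and it ignores communication and memory-bandwidth effects. The function names below are hypothetical.

```python
def relative_efficiency(t_small, t_large, p_small, p_large):
    """Observed speedup divided by the ideal speedup p_large / p_small."""
    return (t_small / t_large) / (p_large / p_small)

def amdahl_serial_fraction(t_small, t_large, p_small, p_large):
    """Serial fraction s implied by Amdahl's law for two timings.

    From T(p) = T1 * (s + (1 - s) / p) and r = t_small / t_large:
        s = a / (a + r - 1),  with  a = 1 / p_small - r / p_large
    A result outside [0, 1] means the simple model does not fit the data.
    """
    r = t_small / t_large
    a = 1.0 / p_small - r / p_large
    return a / (a + r - 1.0)

# H2O example quoted above: ~1200 s on 8 cores, ~800 s on 16 cores.
s_h2o = amdahl_serial_fraction(1200.0, 800.0, 8, 16)
eff_h2o = relative_efficiency(1200.0, 800.0, 8, 16)

# Timings from the original report: 7424.351 s (8 procs), 7384.303 s (16 procs).
s_user = amdahl_serial_fraction(7424.351, 7384.303, 8, 16)
eff_user = relative_efficiency(7424.351, 7384.303, 8, 16)

print(f"H2O:      serial fraction {s_h2o:.3f}, relative efficiency {eff_h2o:.2f}")
print(f"reported: serial fraction {s_user:.3f}, relative efficiency {eff_user:.2f}")
```

Under this model the H2O timings correspond to a small serial fraction, while the reported [12e,28o] timings correspond to a serial fraction close to one, which is another way of saying that the extra 8 processes contributed almost nothing.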
We also observe significant speedups using the Intel compiler. If you are using icpc 15.0.3, you will need to use the latest Block snapshot rather than the v1.0.1 release because of a bug in the compiler.
Marked as closed due to no response from the user.