nwchemgit / nwchem

NWChem: Open Source High-Performance Computational Chemistry
http://nwchemgit.github.io

Simint memory usage is very high compared to Texas #372

Closed glowthrower closed 3 years ago

glowthrower commented 3 years ago

Describe the bug

The required local memory costs for integral calculation with the Simint library are significantly higher than those for the Texas and McMurchie-Davidson integral libraries.

For an SCF calculation on a sample H2O system with a large (cc-pCV6Z) basis set, for example, about 1.4 GB of local memory per process is required to start the SCF calculation successfully (memory total 2000 mb global 600 mb). If I decrease the amount of local memory by 100 MB (so memory total 2000 mb global 700 mb), I see a crash with the following:

fock_rep_txs: insufficient memory            176307258

Decreasing by another 100 MB (memory total 2000 mb global 800 mb) pushes the crash earlier, into the initial Schwarz screening:

schwarz_init: insufficient memory           140344072

By contrast, using Texas, the run requires only around 130MB of local memory to get into the SCF loop successfully (memory total 2000 mb global 1870 mb). When given less than that, we see:

schwarz_init: insufficient memory             9982805

The McMurchie-Davidson integral driver is even more tolerant, and will proceed into the SCF loop with as little as 50MB of local memory (total 2000 mb global 1950 mb). Below that, errors occur, but not the same schwarz_init one.

Describe settings used

NWChem 7.0.2, Intel Parallel Studio 2020 Update 2. Simint compiled in with SIMINT_MAXAM=6.

Attach log files

Here is a log of a failing run with Simint, corresponding to the input deck below:

simint_memory.log

To Reproduce

Execute the following input deck:

echo
start h2o_cc_pcv6z_simint

permanent_dir </path/to/permanent/dir>

memory total 2000 mb global 700 mb

geometry units angstrom noautoz noautosym
O   0.0000000000   0.0000000000   0.0000000000
H   0.7569520600   0.0000000000   0.5858836200
H  -0.7569520600   0.0000000000   0.5858836200
end

basis spherical
  O library cc-pcv6z
  H library cc-pv6z
end

set lindep:n_dep 0

# Enable Simint.
set int:cando_txs f
set int:cando_nw f

scf
  thresh 1.0e-08
  maxiter 50
  direct
end
task scf

Expected behavior

Naively, without deeper knowledge of the underlying libraries, I would expect the amount of workspace local memory required for the various integral calculation libraries to be about the same order of magnitude.

Additional context

I am, and always have been, confused by the distinction between heap and stack memory in NWChem, especially as it's used by different tasks. As a result, I specify memory allocations by giving total and global, and simply let the remainder be allocated according to the compiled-in default ratios.

It's not clear to me if the integral workspace is drawn from the total local memory pool (heap + stack), or if it truly resides in only one of them. Either way, I found the failure limits I've given above by fixing total, modifying global, and leaving stack and heap implicit. So maybe things could be tuned a bit more by explicitly considering stack and heap.

It's worth mentioning that the impact of this is particularly relevant in the context of large TCE runs, where global memory is at a premium. For the kind of runs I'm doing now, for example, I have about 5GB of total memory available per process, for a few thousand processes. Increasing the local memory allocation enough to give Simint room to operate will translate into terabytes less of aggregate global memory available to TCE when compared to Texas or McMurchie-Davidson...
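The aggregate cost can be put on a back-of-the-envelope footing (the process count of 4000 here is a hypothetical stand-in for the "few thousand processes" mentioned above):

```python
# Hypothetical illustration of the aggregate global-memory cost of the
# extra per-process local memory Simint needs compared to Texas.
simint_local_gb = 1.4   # local memory floor observed with Simint
texas_local_gb = 0.13   # local memory floor observed with Texas
nproc = 4000            # assumed; "a few thousand processes"

# Local memory that can no longer be given to the global pool, in TB.
lost_tb = (simint_local_gb - texas_local_gb) * nproc / 1024
assert 4.9 < lost_tb < 5.0  # roughly 5 TB less aggregate global memory
```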

edoapra commented 3 years ago

I can't reproduce this failure with the current master branch: memory total 400 mb global 300 mb works. The last series of changes to https://github.com/edoapra/simint-generator/ might have fixed this

glowthrower commented 3 years ago

Interesting!

Here's something that might be worth looking at. In the original issue, you posted a log of a run with simint and the same basis set. In it, I see the following dump re: memory usage:

  Using Simint Integral package
  num_der                     0
  simint: mem_2e4c                 140344072
  simint: memb_2e4c                154378488
  simint: isz_2e4c                 17543016
  simint: iszb_2e4c                17543016
  screen_method                      2
  screen_tol   1.000000000000000E-022

That matches with mine, from the log posted above:

  Using Simint Integral package
  num_der                     0
  simint: mem_2e4c                 140344072
  simint: memb_2e4c                154378488
  simint: isz_2e4c                 17543016
  simint: iszb_2e4c                17543016
  screen_method                      2
  screen_tol   1.000000000000000E-022

Note in particular that the value for simint: mem_2e4c is exactly the value dumped when my run crashed in schwarz_init. Assuming that's doubles, and converting that into bytes, I get about 1070MB, which seems consistent.
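The conversion is a one-liner (assuming, as above, that the count is in doubles):

```python
mem_2e4c = 140344072          # value dumped at the schwarz_init crash
bytes_needed = mem_2e4c * 8   # assuming the count is 8-byte doubles
mb = bytes_needed / 2**20
assert 1070 < mb < 1071       # about 1070 MB, consistent with the
                              # ~1.4 GB local-memory floor observed
```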

What does this look like in your new and improved version?

edoapra commented 3 years ago

The current output from master is definitely different

  Using Simint Integral package
  num_der                    0
  simint: mem_2e4c                  1609432
  simint: memb_2e4c                 1770384
  simint: isz_2e4c                  614664
  simint: iszb_2e4c                 614664
  screen_method                     2
  screen_tol    1.0000000000000000E-022
glowthrower commented 3 years ago

OK, time for a make clean && make then. :smile: I'll let you know how it goes.

edoapra commented 3 years ago

This could be the commit that fixes the problem https://github.com/edoapra/simint-generator/commit/e1adad2f550d1022fa0217053ab596988929a2a7

edoapra commented 3 years ago

> OK, time for a make clean && make then. 😄 I'll let you know how it goes.

I am assuming you are using the master branch and you have executed the command git pull, right?

glowthrower commented 3 years ago

No, I'm still working in a tree that I've extracted from the 7.0.2 tarball, with a couple of careful hotfix patches relating to some of the previous bugs that I've reported and that you've fixed. I would use master, but I've had some problems with what I guess is a newer version of GA that gets downloaded. I don't really have the time to troubleshoot that, so I'm sticking with the stable release for now...

I thought, from looking at the version of src/NWints/simint/libsimint_source/build_simint.sh that I have here, that it would just pull master from the simint-generator repo and use that.

Or is it worth carefully porting the entire src/NWints/simint folder from master into my "stable" (hah) tree?

edoapra commented 3 years ago

I am having second thoughts about the much reduced memory requirement for Simint. This "optimistic" behavior is likely due to some incorrect compilation settings. I need to recompile. Later

glowthrower commented 3 years ago

Yeah, something doesn't quite add up on my side.

I reran my build (based on the scripts etc. in the 7.0.2 tarball, as described above). It definitely uses the newest commit on master from your simint-generator fork (005a087), so has the commit that you mentioned. It builds and links fine, but the Simint memory usage is unchanged.

I'm currently building a fresh copy of everything from NWChem master, to see if I can reproduce the memory dump you posted above.

glowthrower commented 3 years ago

Out of curiosity, what's the purpose of the patch lines in the build_simint.sh script? Are they still needed?

The first patch line (adding #include "generator/Types.hpp" to CommandLine.hpp) seems to be redundant. CommandLine.hpp is included by six other files:

./generator/CommandLine.cpp:#include "generator/CommandLine.hpp"
./generator/ostei/ostei_et_generator.cpp:#include "generator/CommandLine.hpp"
./generator/ostei/ostei_vrr_generator.cpp:#include "generator/CommandLine.hpp"
./generator/ostei/ostei_hrr_generator.cpp:#include "generator/CommandLine.hpp"
./generator/ostei/ostei_generator.cpp:#include "generator/CommandLine.hpp"
./generator/ostei/ostei_deriv1_generator.cpp:#include "generator/CommandLine.hpp"

Of these, the first (CommandLine.cpp) doesn't seem to use anything from Types.hpp. The other five files do, but they pick up Types.hpp transitively by including, e.g., generator/ostei/OSTEI_GeneratorInfo.hpp.

The other two patch lines are a bit less clear to me. It looks like they're intended to disable some code that's emitted e.g. here...?

glowthrower commented 3 years ago

OK, I've poked around for a while and maybe I have some ideas. What follows is a stream-of-consciousness dive into where these values come from...

The memory usage of simint seems to be specified by the two functions simint_ostei_worksize(int derorder, int maxam) and simint_ostei_workmem(int derorder, int maxam) in ostei_config.h. The latter just returns the former, multiplied by sizeof double.

simint_ostei_worksize(int derorder, int maxam) references into a set of values that are effectively static and determined during execution of create.py. When I build using build_simint.sh, using AVX512 and therefore with SIMINT_SIMD_LEN=8, they come out for 0 <= maxam <= 6 as:

Worksize (0D): 0: 24
Worksize (0D): 1: 2576
Worksize (0D): 2: 38952
Worksize (0D): 3: 283312
Worksize (0D): 4: 1395032
Worksize (0D): 5: 5382752
Worksize (0D): 6: 17543008

Worksize (1D): 0: 232
Worksize (1D): 1: 13640
Worksize (1D): 2: 169704
Worksize (1D): 3: 1141536
Worksize (1D): 4: 5511488
Worksize (1D): 5: 21481664
Worksize (1D): 6: 71657904

Workmem (0D): 0: 192
Workmem (0D): 1: 20608
Workmem (0D): 2: 311616
Workmem (0D): 3: 2266496
Workmem (0D): 4: 11160256
Workmem (0D): 5: 43062016
Workmem (0D): 6: 140344064

Workmem (1D): 0: 1856
Workmem (1D): 1: 109120
Workmem (1D): 2: 1357632
Workmem (1D): 3: 9132288
Workmem (1D): 4: 44091904
Workmem (1D): 5: 171853312
Workmem (1D): 6: 573263232
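A quick sanity check on the dumps above confirms the stated relationship between the two functions, i.e. that workmem is just worksize times sizeof(double):

```python
# Values copied from the 0th-derivative worksize/workmem dumps above,
# indexed by maxam.
worksize_0d = {0: 24, 1: 2576, 2: 38952, 3: 283312,
               4: 1395032, 5: 5382752, 6: 17543008}
workmem_0d = {0: 192, 1: 20608, 2: 311616, 3: 2266496,
              4: 11160256, 5: 43062016, 6: 140344064}

# simint_ostei_workmem(d, am) == simint_ostei_worksize(d, am) * 8
for am in range(7):
    assert workmem_0d[am] == worksize_0d[am] * 8
```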

Looking at nwcsim_facef90.F, the two values isz_2e4c and mem_2e4c are set as

         isz_2e4c = max(isz_2e4c,
     S        simint_eri_worksize(num_der, max_ang))
         mem_2e4c = max(mem_2e4c,
     S        simint_eri_workmem(num_der, max_ang))

(In this case, the values being used are 17543008 and 140344064 respectively.) They're then passed through util_align, which presumably tries to make sure they're padded for aligned allocation:

      call util_align(isz_2e4c,SIMINT_SIMD_LEN)
      call util_align(mem_2e4c,SIMINT_SIMD_LEN)
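I haven't traced util_align itself, but the following hypothetical reconstruction (add one extra vector length, then round up to a multiple of it) reproduces the before/after values seen in the logs, where 17543008 becomes 17543016 and 140344064 becomes 140344072:

```python
SIMINT_SIMD_LEN = 8  # AVX512 build, as in the runs above

def util_align_guess(n, alignval):
    # Hypothetical sketch of util_align, NOT the actual NWChem code:
    # pad by one extra alignval, then round up to a multiple of it.
    n = n + alignval
    return ((n + alignval - 1) // alignval) * alignval

# Reproduces the observed isz_2e4c and mem_2e4c values from the logs.
assert util_align_guess(17543008, SIMINT_SIMD_LEN) == 17543016
assert util_align_guess(140344064, SIMINT_SIMD_LEN) == 140344072
```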

Once "aligned", these values are read by int_mem_2e4c, which copies them into out variables maxg and mscratch_2e.

      subroutine int_mem_2e4c(maxg, mscratch_2e)
...
      integer maxg        !< [Output] max 2e4c buffer size
      integer mscratch_2e !< [Output] max scr for 2e ints
...
      if (int_chk_init('int_mem_2e4c')) then
        maxg        = isz_2e4c
        mscratch_2e = mem_2e4c
      else
        call errquit('int_mem_2e4c: int_init was not called',0, INT_ERR)
      endif
      end

And int_mem_2e4c is called in schwarz_init as call int_mem_2e4c(max2e, mem2). The values max2e (== maxg == isz_2e4c == simint_eri_worksize(num_der, max_ang)) and mem2 (== mscratch_2e == mem_2e4c == simint_eri_workmem(num_der, max_ang)) are then finally used to construct arrays from local memory:

      status = status .and. 
     $     ma_push_get(MT_DBL, max2e, 'fock_2e: buf', l_g, k_g)
      status = status .and.
     $     ma_push_get(MT_DBL, mem2, 'fock_2e: scr', l_scr, k_scr)
      if (.not. status)
     $     call errquit('schwarz_init: insufficient memory', mem2,
     &       MA_ERR)

Some points:

1) I can't reproduce the value you get for mem_2e4c (1609432), either on 7.0.2 or master... As you say, different compile settings?

2) I don't know, but I would guess that ma_push_get(MT_DBL, n, ...) is allocating space for n doubles. Doesn't this mean that the original setting of mem_2e4c should come from simint_ostei_worksize rather than simint_ostei_workmem?

3) I'm not sure that util_align is doing the right thing. In particular, it's not clear to me what the line in=in+alignval is supposed to achieve. In this particular case, since the inputs (17543008 and 140344064) are already divisible by 8, I wouldn't expect that they need to be modified, but they both are..?

4) Does the MA allocator ensure that the start of an array is properly aligned for SIMD access? Or is this the reason for in=in+alignval?

5) Does it make sense that isz_2e4c is set to simint_ostei_worksize? Maybe a better way of asking this: does the screening step really need an entirely separate workspace? I don't know how the integral API is structured or used, so perhaps I'm missing something here.

edoapra commented 3 years ago

Your comment no. 5 was in the right direction. isz_2e4c (the number of integrals) was not correctly computed. Commit https://github.com/nwchemgit/nwchem/commit/56875e03d10ae8b48c554a567dbc7cf13d89bcef fixes this issue.

mem_2e4c uses simint_ostei_worksize, which depends on the SIMD vectorization length SIMINT_SIMD_LEN. This was the other variable causing different memory requirements on AVX2 vs AVX512 hardware (SIMINT_SIMD_LEN 4 vs 8).

edoapra commented 3 years ago

To summarize the output changes for the reproducer of this issue, isz_2e4c is now 614664 (was 17543016). This should bring the total memory requirement to 500 mb

memory total 500 mb global 300 mb
glowthrower commented 3 years ago

Thanks very much for this, and for the quick turnaround!

I hope you won't take it amiss if I poke at your solution a bit. :smiley: Since integrals are such a, well, integral part of all this, it seems prudent to double-check that everything here works correctly.

It's clear now that mem_2e4c should be set to the number of doubles needed for scratch space. I say "doubles" since mem_2e4c is used as, e.g.

      status = status .and.
     $     ma_push_get(MT_DBL, mem2 <== mem_2e4c, 'fock_2e: scr', l_scr, k_scr)

where I've written mem2 <== mem_2e4c to indicate the dependence of mem2 on mem_2e4c through a call to int_mem_2e4c. Similarly, isz_2e4c should be the number of integrals in the largest shell block.

I agree that these values are now being set correctly for the case under consideration here, but I'm not quite sure that this is happening correctly in the general case.

1) I think the bracketing in your revised mem_2e4c = max(mem_2e4c, ...) line might be subtly broken. If the intention is to take the maximum value over all basis sets, then I think the division term / MA_sizeof(MT_INT,1,MT_BYTE) should be inside the outermost brackets. This is easier to see if we write the call to simint_eri_workmem(...) as func, and the divisor as div. After your change, the relevant line is

       mem_2e4c = max(mem_2e4c, func) / div

But I think it should probably be

       mem_2e4c = max(mem_2e4c, func / div)

2) I don't think you actually need an explicit division term at all, and I'm not sure it works properly in all cases. As noted in the comment in your revision, simint_eri_workmem returns the required scratch space in bytes. You'll only get the correct value in doubles here if sizeof(MT_INT) == sizeof(MT_DBL) -- which is true for 64-bit integers, of course, but I'm not sure if that can or should be relied on. As I mentioned, you can avoid the issue entirely by just calling simint_eri_worksize, which returns precisely the number of doubles needed for scratch. Indeed, the implementation of simint_eri_workmem returns just the result of this function, multiplied by sizeof double.
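To make point 1 concrete, here's a small model of the accumulation loop (the loop structure is my assumption; the values are taken from the workmem dump above, with div = 8 as on a 64-bit-integer build):

```python
# Model of accumulating mem_2e4c over basis-set blocks with
# decreasing max_ang; values are workmem in bytes for maxam 6 and 1.
func_vals = [140344064, 20608]
div = 8  # MA_sizeof(MT_INT,1,MT_BYTE) on a 64-bit-integer build

mem_wrong = 0
mem_right = 0
for f in func_vals:
    mem_wrong = max(mem_wrong, f) // div   # re-divides the running max
    mem_right = max(mem_right, f // div)   # divides each term once

assert mem_wrong == 2192876    # 17543008 // 8 -- divided a second time
assert mem_right == 17543008   # the correct number of doubles
```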

3) It looks like isz_2e4c is now not set anywhere in the simint-specific code, but is rather assumed to have been set in a previous routine. Certainly it works in the case here, but is that safe in general?

4) I'm still a bit concerned about memory alignment. (EDIT: this may not be a concern -- see next comment.) As I'm sure you know, allocating memory for explicitly SIMDised code usually requires considering two things: alignment and padding. The util_align function seems to be misnamed, since it actually handles padding, and as far as I can tell, only padding... Alignment itself is, again as far as I can tell, never explicitly treated anywhere in the NWChem code relevant for calls into the simint library. That's a potential issue, because all the autogenerated routines in simint start as, e.g. (here for the case of a (pp|sd) quartet):

int ostei_p_p_s_d(struct simint_multi_shellpair const * P,
                  struct simint_multi_shellpair const * Q,
                  double screen_tol,
                  double * const restrict work,
                  double * const restrict INT__p_p_s_d)
{

    SIMINT_ASSUME_ALIGN_DBL(work);
    SIMINT_ASSUME_ALIGN_DBL(INT__p_p_s_d);

The SIMINT_ASSUME_ALIGN_DBL macro is defined in simint's vectorization/vectorization.hpp as:

#define SIMINT_SIMD_ALIGN_DBL (SIMINT_SIMD_LEN*8)
#define SIMINT_SIMD_ALIGN_INT (SIMINT_SIMD_LEN*sizeof(int))
...
#if defined __INTEL_COMPILER
    #define SIMINT_ASSUME_ALIGN_DBL(x)  __assume_aligned((x), SIMINT_SIMD_ALIGN_DBL)
    #define SIMINT_ASSUME_ALIGN_INT(x)  __assume_aligned((x), SIMINT_SIMD_ALIGN_INT)
#elif defined __clang__
    // TODO - find these
    #define SIMINT_ASSUME_ALIGN_DBL(x)
    #define SIMINT_ASSUME_ALIGN_INT(x)
#elif defined __GNUC__
    // TODO - find these
    #define SIMINT_ASSUME_ALIGN_DBL(x)
    #define SIMINT_ASSUME_ALIGN_INT(x)
...

That means that the code in simint is compiled (with icc at least) under the assumption that the pointers passed into simint_compute_eri(...) and simint_compute_eri_deriv(...) as double * work and double * integrals are, in the AVX512 case, aligned on 64-byte boundaries. I don't see that this contract is enforced anywhere in NWChem, since these memory blocks are all allocated by calls to e.g. MA_push_get(...). There is a function MA_set_numalign(...), but it doesn't seem to be called anywhere explicitly.

The worst-case outcome would be a segfault if/when the simint code issues an aligned load on an unaligned address. I haven't gotten my hands dirty with the low-level details since the early Xeon Phi days, when exactly this problem regularly broke my code, but I think that modern Xeon architectures are more forgiving (as evidenced by the fact that this hasn't already broken things). However, this could perhaps still cause performance problems such as pipeline stalls. @jeffhammond, you seem to know lots about the Intel side of things -- do you think this would be an issue?

5) As an aside to the previous, I'm still unclear why the in=in+alignval is necessary in util_align... I think it just leads to over-padding by alignval entries. But this is minor, since we're only talking about a few bytes. :smiley:
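On point 5, one plausible (though unconfirmed) reading is that over-padding by alignval leaves room to shift the start of a buffer forward to the next alignment boundary. A minimal sketch of that shift, checked on an arbitrary allocation:

```python
import ctypes

ALIGN = 64  # bytes; the AVX512 alignment simint assumes (SIMINT_SIMD_LEN * 8)

# Over-allocate by ALIGN extra bytes, then shift the start of the
# usable region to the next 64-byte boundary.  This is a sketch of
# why extra padding might be needed, not the actual NWChem/MA code.
raw = (ctypes.c_char * (1024 * 8 + ALIGN))()
base = ctypes.addressof(raw)
offset = (-base) % ALIGN          # bytes to skip to reach alignment
aligned_addr = base + offset

assert aligned_addr % ALIGN == 0  # usable region is 64-byte aligned
assert offset < ALIGN             # the extra ALIGN bytes always suffice
```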

glowthrower commented 3 years ago

Hmm, I've checked the last, and it seems that the pointers as received by the simint routines are indeed 64-byte aligned, at least in my particular build. I still don't see how this is happening, but maybe that means that this is totally nothing to worry about. :smiley:

edoapra commented 3 years ago

Closing this issue, since the commits above have reduced the Simint memory requirements.