theochem / gbasis

Python library for analytical evaluation and integration of Gaussian-type basis functions and related quantities.
http://gbasis.qcdevs.org/
GNU Lesser General Public License v3.0

[BUG] Big memory usage for density evaluation #121

Open leila-pujal opened 3 years ago

leila-pujal commented 3 years ago

Describe the bug

Big memory usage for density evaluation.

**To Reproduce**

I attach the system I used to test: [iron_hexacarbonyl.tar.gz](https://github.com/theochem/gbasis/files/6792948/iron_hexacarbonyl.tar.gz)

```
mol = load_one("iron_hexacarbonyl.fchk")
dm = mol.one_rdms['scf']
basis, coord_types = from_iodata(mol)
becke = BeckeWeights(order=3)
oned = GaussChebyshevType2(100)
rgrid = BeckeTF(1e-4, 1.5).transform_1d_grid(oned)
grid = MolGrid.horton_molgrid(mol.atcoords, mol.atnums, rgrid, 1030, becke)
evaluate_density(dm, basis, grid.points, coord_type=coord_types)
```

**Expected behaviour**

I don't know if this could be improved, I am just reporting the behaviour as an issue.


Additional context

For computing the density, evaluate_density() first evaluates the basis functions on the grid points and then obtains the total density by contracting those evaluations with the density matrix. The evaluation of the basis functions is done in the _deriv.py module, inside _eval_deriv_contractions(). To monitor the memory I used several prints with nbytes for arrays, plus the @profile decorator from memory_profiler. To monitor the memory of each contraction I printed the memory associated with matrix_contraction in base_one.py, https://github.com/theochem/gbasis/blob/c57de8f1cfbe1dbaac8941c7cb9993797c76fca6/gbasis/base_one.py#L239. Then I profiled the memory in _eval_deriv_contractions(), evaluate_density() and evaluate_density_using_evaluated_orbs().

With the information I gathered, I think the memory dependency in the code is as follows. Through the evaluation of all the generalized contractions, the memory scales with the number of AOs generated for each angular momentum. S-type generalized contractions have the lowest memory cost; to obtain the memory cost of a higher-angular-momentum generalized contraction, you multiply the memory cost of an S-type generalized contraction by the number of atomic orbitals it generates (e.g. for l=1, three p orbitals, so the increase in memory is the S-type contraction memory × 3). The base memory corresponding to an S-type generalized contraction scales linearly with the number of points of the molecular grid. I fitted some values for not-so-big molecular grids and got the following regression: S-type contraction memory (MB) = 0.000008150229943 × n_points − 0.24118548. I checked the prediction for some bigger grids and the predicted value was around 2 MB off. To show the agreement of the predicted memory with the monitored memory, I used iron hexacarbonyl:

For a radial grid of 100 points and an angular grid of 1030: Total = 100 × 1202 × 13 atoms = 1,562,600 grid points.

Iron hexacarbonyl: 13 atoms, 108 electrons, 113 generalized contractions in total.

| | Fe | O | C |
|---|---|---|---|
| number of atoms | 1 | 6 | 6 |
| s shells | 5 | 4 | 4 |
| p shells | 4 | 2 | 3 |
| d shells | 2 | 2 | 2 |
| f shells | 1 | 0 | 0 |

Base memory for S-type generalized contractions = 12.49 MB

Memory = base × AOs generated × number of shells

| | Fe | O | C |
|---|---|---|---|
| s shells | 60 MB | 48 MB | 48 MB |
| p shells | 144 MB | 72 MB | 108 MB |
| d shells | 144 MB | 144 MB | 144 MB |
| f shells | 120 MB | 0 | 0 |
| total (1 atom) | 468 MB | 264 MB | 300 MB |
| total (all atoms) | 468 MB | 1584 MB | 1800 MB |

Total memory, all atoms: 3852 MB
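As a cross-check, here is a small sketch reproducing the arithmetic in the table above. The shell counts and the fitted base-memory formula are taken from this issue; the AO counts per shell (s=1, p=3, d=6, f=10, i.e. Cartesian shells) are an assumption of the model.

```python
n_points = 100 * 1202 * 13                               # radial x angular x atoms
base = 0.000008150229943 * n_points - 0.24118548         # ~12.49 MB per s-type contraction

ao_per_shell = {"s": 1, "p": 3, "d": 6, "f": 10}         # Cartesian AOs generated per shell
shells = {                                               # shells per atom, from the table above
    "Fe": {"s": 5, "p": 4, "d": 2, "f": 1},
    "O": {"s": 4, "p": 2, "d": 2, "f": 0},
    "C": {"s": 4, "p": 3, "d": 2, "f": 0},
}
n_atoms = {"Fe": 1, "O": 6, "C": 6}

total = 0.0
for atom, counts in shells.items():
    per_atom = sum(base * ao_per_shell[l] * n for l, n in counts.items())
    total += per_atom * n_atoms[atom]
    print(f"{atom}: {per_atom:.0f} MB per atom")
print(f"total: {total:.0f} MB")   # ~4 GB; close to the table's 3852 MB (which rounds the base to 12 MB)
```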

The values I obtained profiling the code are shown in the following pictures:

This is the profiling for evaluate_density(): (screenshot: memory_profiler output)

You can see that evaluate_basis increases the memory by almost exactly the amount calculated above (line 78 in the profile).

There is also a memory spike, not shown here, that occurs in evaluate_density_using_evaluated_orbs. The profiling for this is shown in the following screenshot (memory_profiler output):

Here you can see, at line 63, where the memory is doubled.
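For reference, a minimal sketch of the memory_profiler setup described above; the wrapper function here is illustrative, and the gbasis internals were decorated the same way:

```python
from memory_profiler import profile

from gbasis.evals.density import evaluate_density


@profile  # prints a per-line memory report when the decorated function is called
def run_density(dm, basis, points, coord_types):
    return evaluate_density(dm, basis, points, coord_type=coord_types)
```

Running the reproduction script with the decorator in place prints per-line memory reports like the ones in the screenshots.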

PaulWAyers commented 3 years ago

The memory-doubling should be avoidable I think.

A bigger issue is that we should be using some screening to not evaluate all the orbital overlaps, because we don't need them all. That is going to take a bit of refactoring, though perhaps @BradenDKelly has already looked at this?

BradenDKelly commented 3 years ago

@leila-pujal @PaulWAyers yes, there are two different PRs for screening. Either one would save some time/calculations. I will investigate once home.

PaulWAyers commented 12 months ago

@leila-pujal @marco-2023 this is the issue Leila was talking about. If there is a pull request open, it may be worth reviewing and merging it.

leila-pujal commented 12 months ago

I checked, and the PRs related to this seem to be #110 and #111. I will look at them and see if they can help us fix this problem.

leila-pujal commented 7 months ago

I opened a PR #166 that fixes this bug. Below there is my testing code that uses the new grid (iron_hexacarbonyl.fchk is provided at the top of the issue):

```python
import numpy as np
from iodata import load_one

from grid.becke import BeckeWeights
from grid.molgrid import MolGrid
from grid.onedgrid import GaussChebyshev
from grid.rtransform import BeckeRTransform

from gbasis.evals.density import evaluate_density
from gbasis.wrappers import from_iodata

mol = load_one("iron_hexacarbonyl.fchk")
dm = mol.one_rdms['scf']
basis = from_iodata(mol)
becke = BeckeWeights(order=3)

aim_weights = BeckeWeights(radii=None, order=3)
oned = GaussChebyshev(100)
rgrid = BeckeRTransform(1e-4, 1.5).transform_1d_grid(oned)
grid = MolGrid.from_preset(mol.atnums, mol.atcoords, 'insane', rgrid, aim_weights=aim_weights)

d = evaluate_density(dm, basis, grid.points)
```

This gives a smaller grid than the one used when this issue was opened. With the old code, we get the doubling of memory at line 45 in evaluate_density_using_evaluated_orbs:

(screenshot: memory_profiler output for the old code)

With the new code we avoid creating the intermediate density array and the memory stays the same:

(screenshot: memory_profiler output for the new code)

leila-pujal commented 7 months ago

I commented on PR #166. I talked with @Ali-Tehrani earlier and he pointed out that condensing the lines of code does not solve the problem. I confirmed this earlier this week because I ran into a similar situation: the profiler only tracks memory increases/decreases before and after lines of code, but doing the operations inside np.sum still creates the intermediate array and deletes it within the same line. At this point, I am not sure this doubling in memory can be avoided. I don't remember if we talked about another alternative for this part of the code to prevent it.

PaulWAyers commented 7 months ago

There are external libraries to fix this, like numexpr. We used to use the linear algebra factories for this.

I think there is a way to do this with numpy "in place" operations but I am no expert.

Ali-Tehrani commented 7 months ago

Don't think "in-place" will work for this specific case. The other suggestions alongside using "in-place" would be:

  1. To use einsum, setting optimize="optimal" or "greedy". This will find intermediate optimizations, see here for how. Don't think it will help in this specific case.

  2. Knowing how much available memory there is, calculate the optimal number of points that fit in memory (minus a safety margin), then compute one slice of points at a time. In this example, assuming double precision, it amounts to solving for the number of points N in K = (M^2 + 2NM + N)*8, where K is the available memory in bytes, 8 is the number of bytes in a double, and M is the number of MOs. The M^2 term comes from the one-rdm, one NM term from orb_eval, the other NM term from the intermediate density array, and N is the size of the output.
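A sketch of point 2 under the stated assumptions (double precision and the K = (M^2 + 2NM + N)*8 byte model above); the function and parameter names are illustrative:

```python
import numpy as np


def points_per_chunk(available_bytes, n_mo, safety=0.8):
    """Largest number of grid points N with (M^2 + 2*N*M + N) * 8 within the memory budget."""
    budget = safety * available_bytes / 8          # how many doubles fit in the reduced budget
    n = (budget - n_mo**2) // (2 * n_mo + 1)       # solve M^2 + 2NM + N <= budget for N
    return max(int(n), 1)


# e.g. 4 GiB available and 113 contractions, as in the iron hexacarbonyl example above
print(points_per_chunk(4 * 1024**3, 113))
```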

msricher commented 2 months ago

@FarnazH @PaulWAyers

I think I was able to make some progress with this.

I changed line 60 of gbasis/evals/density.py and got this:

```
   112    993.4 MiB      0.0 MiB           1       if one_density_matrix.shape[0] != orb_eval.shape[0]:
   113                                                 raise ValueError(
   114                                                     "Number of rows (and columns) of the density matrix must be equal to the number of rows"
   115                                                     " of the orbital evaluations."
   116                                                 )
   117
   118   1204.3 MiB    210.9 MiB           1       density = np.einsum("pq,qn,pn->n", one_density_matrix, orb_eval, orb_eval, optimize=True)
   119   1204.3 MiB      0.0 MiB           1       return density
```

Compare with the original:

```
    55    993.3 MiB      0.0 MiB           1       if one_density_matrix.shape[0] != orb_eval.shape[0]:
    56                                                 raise ValueError(
    57                                                     "Number of rows (and columns) of the density matrix must be equal to the number of rows"
    58                                                     " of the orbital evaluations."
    59                                                 )
    60
    61   1815.7 MiB    822.4 MiB           1       density = one_density_matrix.dot(orb_eval)
    62   1815.7 MiB      0.0 MiB           1       density *= orb_eval
    63   1815.7 MiB      0.0 MiB           1       return np.sum(density, axis=0)
```

So, this is my solution, which uses about 130% of the memory instead of 200%:

```python
density = np.einsum("pq,qn,pn->n", one_density_matrix, orb_eval, orb_eval, optimize=True)
```
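To check which intermediates einsum actually forms for this contraction, np.einsum_path reports the chosen path and the largest temporary; the array shapes below are just illustrative:

```python
import numpy as np

P = np.random.rand(113, 113)          # density matrix (M x M); sizes here are illustrative
phi = np.random.rand(113, 10000)      # orbital evaluations (M x N grid points)

path, report = np.einsum_path("pq,qn,pn->n", P, phi, phi, optimize="optimal")
print(report)                         # lists the contraction order and the largest intermediate
```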

I looked at screening, but I'm really unsure of how it could be used here. This is my first contribution to the Python code proper and I've been a bit confused trying to understand it all.

Let me know if this suffices and I will open a PR. Cheers.

leila-pujal commented 2 months ago

Hi @msricher, thanks for working on this. Could you double check that this is not the same situation as here, where the profiler cannot monitor the memory used within a single line of code?

msricher commented 2 months ago

@leila-pujal I believe it's ok. einsum will try to avoid intermediates where possible.

leila-pujal commented 2 months ago

I did a quick check using the top command to monitor the memory usage in real time. I get the same demand with both codes, but I may be doing something wrong. I also tried the tracemalloc module and I think I see the same peak with both codes, but I don't understand tracemalloc that well. On my side, I would like to see a more detailed evaluation of the memory because, as happened before, the profiler only checks changes between lines, not what happens inside a line. But if the other reviewers are fine with it I will merge PR #191.
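For reference, a minimal tracemalloc sketch for checking the peak allocation across the whole call (rather than the per-line deltas that memory_profiler reports), reusing dm, basis and grid from the script above:

```python
import tracemalloc

from gbasis.evals.density import evaluate_density

tracemalloc.start()
d = evaluate_density(dm, basis, grid.points)       # dm, basis, grid as defined in the script above
current, peak = tracemalloc.get_traced_memory()    # bytes currently allocated and the peak so far
tracemalloc.stop()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```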

msricher commented 2 months ago

OLD: (mprof plot)

NEW: (mprof plot)

The memory is reduced somewhat.

I used the mprof executable that comes with memory_profiler for this, which tracks memory over time.

EDIT: I had these mixed up. It seems this takes even more memory at the peak. So this isn't a good solution.

leila-pujal commented 2 months ago

Yeah that is what I saw with the top command in some runs I did.

msricher commented 1 month ago

I have some more information here. I profiled this functionality while marking the boundaries between functions. It seems the biggest spike occurs in evaluate_basis(), while evaluating the contractions. Then, I was able to improve the memory usage in evaluate_density_using_evaluated_orbs by chunking the computation over grid points. You can see in the graph below that I used 4 chunks. I'm still looking at optimizing this chunking, but for now, does this seem like an OK strategy?

(mprof plot: memory over time, with function boundaries marked and 4 chunks)
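A minimal sketch of the chunking idea over grid points; the chunk count is fixed here for illustration, whereas in practice it would be derived from the available memory as discussed above:

```python
import numpy as np

from gbasis.evals.density import evaluate_density


def evaluate_density_chunked(dm, basis, points, n_chunks=4):
    """Evaluate the density on the grid in chunks of points to cap the peak memory."""
    out = np.empty(points.shape[0])
    for chunk in np.array_split(np.arange(points.shape[0]), n_chunks):
        # each call only ever holds an (n_AO x len(chunk)) block of orbital evaluations
        out[chunk] = evaluate_density(dm, basis, points[chunk])
    return out
```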

PaulWAyers commented 1 month ago

I think chunking has to be done; then screening gets us to having the optimal scaling. The issue is that the "chunk size" needs to be adapted to the memory available. Once the chunk size is large enough, I expect the performance hit to be minimal.

PaulWAyers commented 1 month ago

I think my notes (within the pull request) really belong here, so I'm moving them over.

==== With transform=None, you can screen. The key idea is to keep track of which atomic-orbitals are needed for a given grid point.

For a grid point that is $d$ units away from the center of the basis function, the value of the (contracted) basis function is greater than

$$ c_{\min} n_{\min} d^\ell e^{-\alpha_{\min} d^2} $$

$$ n_{\min} = \sqrt{\frac{\sqrt{2 \alpha_{\min}}\,(4\alpha_{\min})^\ell}{\sqrt{\pi}\,(2 \ell + 1)!!}} $$

where $c_{\min}$, $n_{\min}$, and $\alpha_{\min}$ denote the contraction coefficient, normalization constant, and exponent of the most diffuse Gaussian in a contracted basis function. If the cutoff tolerance is $\epsilon$, you do not need to include this basis function for grid points that are further away than the $d$ satisfying

$$ \epsilon = c_{\min} n_{\min} d^\ell e^{-\alpha_{\min} d^2} $$

This equation can be solved exactly for $\ell=0$ (in terms of logs) and $\ell=2$ (in terms of the Lambert W function). Otherwise one needs to solve it numerically. For $\ell = 0$,

$$ d = \left(-\frac{\ln \frac{\epsilon}{c_{\min} n_{\min}}}{\alpha_{\min}}\right)^{\frac{1}{2}} $$

For $\ell = 2$ one has the equation,

$$ \frac{-\alpha_{\min}\,\epsilon}{c_{\min} n_{\min}} = -\alpha_{\min} d^2\, e^{-\alpha_{\min} d^2} $$

There are two roots, one near the nucleus (where d^2 is small) and one far out. The second root is the one we want, and has the value,

$$ d = \sqrt{\frac{W_{-1}\!\left(\frac{-\alpha_{\min}\,\epsilon}{c_{\min} n_{\min}}\right)}{-\alpha_{\min}}} $$

(the same formula as before, but with the Lambert W function replacing the logarithm).

We could solve for other values of $\ell$ using a rootfinder, but I don't know that it is worth it. I'd use the logarithm for $\ell = 0$ and the Lambert W otherwise; as long as we aren't too aggressive with $\epsilon$ (and we shouldn't be) it will be fine.

The procedure then is to:

  1. Group grid points according to the memory available.
  2. For each group of grid points, compute a mask array where each shell is either True (included) or False (not included) based on whether it is further than $d$. This mask array has size $n_{\text{grid}} \times n_{\text{shells}}$.
  3. Evaluate the density but include only contributions from the included shells. This should be possible with a masked-linear-algebra operation.
  4. Continue to other groups of grid points.

In the case where no screening is done, this amounts to a "chunked" version of the old algorithm I think.
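A rough sketch of steps 2 and 3 under the assumptions above. Here `cutoffs`, `eval_shell`, and `ao_slices` are hypothetical placeholders for the per-shell cutoff radii, per-shell AO evaluation, and the shell-to-AO index bookkeeping, and a shell is kept if it is needed anywhere in the chunk (coarser than the per-point mask, but it keeps the linear algebra dense):

```python
import numpy as np


def density_screened_chunk(dm, basis, pts, cutoffs, eval_shell, ao_slices):
    """Hypothetical sketch of steps 2-3 for one chunk of grid points.

    cutoffs[i] is the screening radius d for shell i, eval_shell(shell, pts) returns the
    AO values of one shell on pts, and ao_slices[i] holds the AO indices of shell i in dm.
    """
    # step 2: keep a shell if any point in this chunk lies within its cutoff radius
    dists = np.array([np.linalg.norm(pts - shell.coord, axis=1) for shell in basis])
    keep = np.flatnonzero((dists <= np.asarray(cutoffs)[:, None]).any(axis=1))

    # step 3: evaluate only the kept shells and contract with the matching block of dm
    ao_idx = np.concatenate([ao_slices[i] for i in keep])
    phi = np.vstack([eval_shell(basis[i], pts) for i in keep])
    return np.einsum("pq,qn,pn->n", dm[np.ix_(ao_idx, ao_idx)], phi, phi, optimize=True)
```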

msricher commented 1 month ago

I have come up with a mask function:

```python
import numpy as np
from scipy.special import factorial2, lambertw


def _evaluate_orb_mask(basis, points, tol_screen):
    orb_mask = np.zeros((len(basis), points.shape[0]), dtype=bool)
    for mask, shell in zip(orb_mask, basis):
        # most diffuse primitive of the contracted shell and its normalization constant
        cmin = shell.coeffs.min()
        amin = shell.exps.min()
        nmin = (
            (2 * amin / np.pi) ** (3 / 4)
            * ((4 * amin) ** (shell.angmom / 2))
            / np.sqrt(factorial2(2 * shell.angmom + 1))
        )
        # cutoff radius: logarithm for l = 0, Lambert W_{-1} otherwise (see the formulas above);
        # lambertw returns a complex value, so take its real part
        if shell.angmom == 0:
            dist = np.sqrt(-np.log(tol_screen / (cmin * nmin)) / amin)
        else:
            dist = np.sqrt(lambertw(-amin * tol_screen / (cmin * nmin), k=-1).real / -amin)
        # flag grid points that lie at or beyond the cutoff radius from the shell center
        np.less_equal(dist**2, np.sum((points - shell.coord) ** 2, axis=1), out=mask)
    return orb_mask
```
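A possible usage check against the script earlier in the thread; the tolerance value is just an example:

```python
# mask[i, g] is True where shell i falls below tol_screen at grid point g
mask = _evaluate_orb_mask(basis, grid.points, tol_screen=1e-8)
print(f"fraction of (shell, point) pairs that could be screened: {mask.mean():.2f}")
```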

I'm working on putting it together with the existing code to evaluate the basis and then the density with it, but it will take me some time. I'm not sure how it should be done, so I'm working through the code. I think I will have to edit the BaseOneIndex class, the Eval class, and then the density functions.

leila-pujal commented 1 month ago

This looks good @msricher. Feel free to post your thoughts here on what you are planning to modify in the modules you mentioned.

msricher commented 1 month ago

OK, so I have the mask function, and the chunking is implemented. My issue now is that, using the mask, I am not sure of where I can save memory.

I need to pass the mask into evaluate_basis, https://github.com/theochem/gbasis/blob/3e063ad3a3ce40430ac1eafeffc6f92e79888ff1/gbasis/evals/eval.py#L114, and then in turn into Eval.construct_array_contraction, https://github.com/theochem/gbasis/blob/3e063ad3a3ce40430ac1eafeffc6f92e79888ff1/gbasis/evals/eval.py#L56, and possibly into the BaseOneIndex methods.

I haven't been able to figure out yet how to apply the mask in a way that will save memory. The chunking is fine, but I have doubts about this.

leila-pujal commented 1 month ago

@msricher, could you report how the memory performance improves with only the chunking? @PaulWAyers, I have a doubt about the algorithm: for the mask, is the goal to reduce the size of orb_eval here using the mask?

msricher commented 1 month ago

See this comment. It's dynamic, though, based on the amount of memory available. I just set it manually to 4 chunks to show this.

msricher commented 1 month ago

(mprof plot: memory over time with 8 chunks)

I was able to make the chunking better. This is with 8 chunks. Normally, the number of chunks will be determined dynamically so that no chunking occurs if there is enough memory, but I manually set it to 8 to show this.

This works very well now, and I think will suffice to fix the memory issue. We can still work on the screening, though.

Please see the PR #191 .

PaulWAyers commented 1 month ago

> @PaulWAyers, I have a doubt about the algorithm: for the mask, is the goal to reduce the size of orb_eval here using the mask?

The idea is to reduce the CPU time and memory by not evaluating every orbital at every point. This gets density evaluation down to linear-scaling in computation and memory, which is what it should be. (Right now it is quadratic scaling in computation and cubic in memory.)

leila-pujal commented 1 month ago

Okay, I see. I think we are talking about the same thing, but just to double check: when doing this in the code, $\rho(\mathbf{r}) = \sum_{\mu \nu} \phi^{*}_{\mu}(\mathbf{r})\, P_{\mu \nu}\, \phi_{\nu}(\mathbf{r})$ (evaluated on all $N_{\text{points}}$ grid points at once), are you expecting the mask to make $\mu$ and $\nu$ run over fewer than the total number of basis functions? I have this doubt because, if that is how the mask is applied, does $P_{\mu \nu}$ also have to change its dimension? I think this could potentially be avoided with a for loop, but I just wanted to double check with you that my thinking is correct; thanks for your help. In any case, I think @msricher has already improved the memory usage a lot with the chunking.

PaulWAyers commented 1 month ago

I had a chat with Michelle and there are potentially lots of ways to do this. We could store $\phi_k(\mathbf{r}_g)$ as a sparse array, do masked-array-operations, keep a list of the important indices and evaluate them on the fly, etc.. I'm not sure what the best implementation strategy would be, but whatever it is will have major implications as the same trick works for almost every evaluation in gbasis. I expect it requires playing around with things and doing some testing, but it also depends on technical details (the relative speed of sparse-array-operations and masked-array-operations, for example).

My guess is that it is simplest, fastest, and most direct to store AO evaluations on the grid as a sparse ($N_{\text{AO}} \times N_{\text{grid}}$) matrix, keeping in mind that one may be chunking, so perhaps not all the grid points are treated at once.
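A minimal sketch of that sparse-storage idea, with toy data standing in for a screened chunk of AO evaluations (the sizes, sparsity pattern, and thresholding are illustrative, and the density matrix is assumed symmetric):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_ao, n_grid = 113, 5000                       # illustrative sizes for one chunk of grid points
one_density_matrix = rng.random((n_ao, n_ao))
one_density_matrix += one_density_matrix.T     # make the toy density matrix symmetric

# AO values on one chunk of grid points, with screened-out entries set to zero
phi_dense = rng.random((n_ao, n_grid))
phi_dense[phi_dense < 0.8] = 0.0               # fake sparsity standing in for screening
phi = sparse.csr_matrix(phi_dense)

# rho_g = sum_{pq} phi_{pg} P_{pq} phi_{qg}, computed with the sparse AO block
tmp = phi.T @ one_density_matrix               # dense (n_grid x n_ao); stays small when chunked
rho = np.asarray(phi.T.multiply(tmp).sum(axis=1)).ravel()
```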