leila-pujal opened this issue 3 years ago
The memory-doubling should be avoidable I think.
A bigger issue is that we should be using some screening to not evaluate all the orbital overlaps, because we don't need them all. That is going to take a bit of refactoring, though perhaps @BradenDKelly has already looked at this?
@leila-pujal @PaulWAyers yes, there are two different PRs for screening. Either one would save some time/calculations. I will investigate once home.
@leila-pujal @marco-2023 this is the issue Leila was talking about. If there is a pull request open, it may be worth reviewing and merging it.
I checked, and the PRs related to this seem to be #110 and #111. I will look at them and see if they can help us fix this problem.
I opened PR #166, which fixes this bug. Below is my testing code using the new grid (`iron_hexacarbonyl.fchk` is provided at the top of the issue):

```python
import numpy as np
from iodata import load_one
from grid.becke import BeckeWeights
from grid.molgrid import MolGrid
from grid.onedgrid import GaussChebyshev
from grid.rtransform import BeckeRTransform
from gbasis.evals.density import evaluate_density
from gbasis.wrappers import from_iodata

mol = load_one("iron_hexacarbonyl.fchk")
dm = mol.one_rdms['scf']
basis = from_iodata(mol)
becke = BeckeWeights(order=3)
aim_weights = BeckeWeights(radii=None, order=3)
oned = GaussChebyshev(100)
rgrid = BeckeRTransform(1e-4, 1.5).transform_1d_grid(oned)
grid = MolGrid.from_preset(mol.atnums, mol.atcoords, 'insane', rgrid, aim_weights=aim_weights)
d = evaluate_density(dm, basis, grid.points)
```
This gives a smaller grid than the one used when this issue was opened. With the old code, we get the doubling in memory at line 45 in `evaluate_density_using_evaluated_orbs`:

*(profiler screenshot)*

With the new code we avoid creating the intermediate density array and the memory stays the same:

*(profiler screenshot)*
I commented on PR #166. I talked with @Ali-Tehrani earlier and he pointed out that condensing the lines of code does not solve the problem. I confirmed this earlier this week because I ran into a similar problem myself: the profiler only tracks memory increases/decreases before and after each line of code, but the operations inside `np.sum` still create the intermediate array and delete it within the same line. At this point, I am not sure this doubling in memory can be avoided. I don't remember if we talked about some other alternative for this part of the code to prevent it.
There are external libraries to fix this, like numexpr. We used to use the linear algebra factories for this.
I think there is a way to do this with numpy "in place" operations but I am no expert.
Don't think "in-place" will work for this specific case. The other suggestions, alongside using "in-place", would be:

- Use `einsum`, setting `optimize="optimal"` or `"greedy"`. This will find intermediate optimizations (see here for how), but I don't think it will help in this specific case.
- Knowing how much memory is available, calculate the optimal number of points that fit in memory (minus a safe margin), then compute one slice at a time. In this example, assuming double precision, this means solving for the number of points $N$ in

$$ K = (M^2 + 2NM + N) \times 8 $$

where $K$ is the available memory in bytes, 8 is the number of bytes in a double, and $M$ is the number of MOs. $M^2$ comes from the one-rdm, one $NM$ from `orb_eval`, the other $NM$ from the intermediate density array, and $N$ is the size of the output.
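Solving the budget equation $K = (M^2 + 2NM + N) \times 8$ for $N$ is a one-liner; here is a minimal sketch (the function name and the safety margin are my own, not anything from gbasis):

```python
def max_points_per_chunk(avail_bytes, n_mo, margin=0.9):
    """Largest N such that M^2 + 2*N*M + N doubles fit in margin * avail_bytes."""
    budget_doubles = margin * avail_bytes / 8        # how many doubles we can afford
    n = (budget_doubles - n_mo**2) / (2 * n_mo + 1)  # solve K/8 = M^2 + N*(2M + 1)
    return max(int(n), 0)
```

The grid would then be processed in `ceil(n_points_total / N)` slices.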
@FarnazH @PaulWAyers
I think I was able to make some progress with this.
I changed line 60 of gbasis/evals/density.py and got this:
```
112    993.4 MiB      0.0 MiB           1   if one_density_matrix.shape[0] != orb_eval.shape[0]:
113                                             raise ValueError(
114                                                 "Number of rows (and columns) of the density matrix must be equal to the number of rows"
115                                                 " of the orbital evaluations."
116                                             )
117
118   1204.3 MiB    210.9 MiB           1   density = np.einsum("pq,qn,pn->n", one_density_matrix, orb_eval, orb_eval, optimize=True)
119   1204.3 MiB      0.0 MiB           1   return density
```
Compare with the original:
```
55    993.3 MiB      0.0 MiB           1   if one_density_matrix.shape[0] != orb_eval.shape[0]:
56                                             raise ValueError(
57                                                 "Number of rows (and columns) of the density matrix must be equal to the number of rows"
58                                                 " of the orbital evaluations."
59                                             )
60
61   1815.7 MiB    822.4 MiB           1   density = one_density_matrix.dot(orb_eval)
62   1815.7 MiB      0.0 MiB           1   density *= orb_eval
63   1815.7 MiB      0.0 MiB           1   return np.sum(density, axis=0)
```
So this is my solution, which uses about 130% of the memory instead of 200%:

```python
density = np.einsum("pq,qn,pn->n", one_density_matrix, orb_eval, orb_eval, optimize=True)
```
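As a quick sanity check (with small random stand-ins rather than the real gbasis arrays), the einsum contraction produces the same densities as the original dot-based path:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 7                        # small stand-ins for n_MO and n_points
p = rng.standard_normal((m, m))
p = p + p.T                        # symmetric density-matrix stand-in
phi = rng.standard_normal((m, n))  # orbital evaluations on the grid

old = np.sum(p.dot(phi) * phi, axis=0)                      # original code path
new = np.einsum("pq,qn,pn->n", p, phi, phi, optimize=True)  # proposed code path
assert np.allclose(old, new)
```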
I looked at screening, but I'm really unsure of how it could be used here. This is my first contribution to the Python code proper and I've been a bit confused trying to understand it all.
Let me know if this suffices and I will open a PR. Cheers.
Hi @msricher, thanks for working on this. Could you double-check this is not the same situation as in here, where the profiler can't monitor the memory being used within a single line of code?
@leila-pujal I believe it's OK. `einsum` will try to avoid intermediates where possible.
I did a quick check using the `top` command to monitor memory usage in real time. I get the same demand with both codes, but I may be doing something wrong. I also tried the `tracemalloc` module and I think I see the same peak with both codes, but I don't understand tracemalloc that well. On my side, I would like to see a more detailed evaluation of the memory because, as happened before, the profiler only checks changes between lines, not what happens inside a line. But if other reviewers are fine with it, I will merge PR #191.
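For reference, a minimal way to capture the allocation high-water mark with `tracemalloc` (which, unlike a per-line profiler, records peaks that occur inside a single line; NumPy registers its allocations with tracemalloc, so array temporaries show up too):

```python
import tracemalloc

import numpy as np

tracemalloc.start()
a = np.ones((1000, 1000))  # ~8 MB of doubles
b = a * 2.0                # allocates another ~8 MB array
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current ~ {current / 1e6:.1f} MB, peak ~ {peak / 1e6:.1f} MB")
```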
*(mprof plots: OLD and NEW)*
The memory is reduced somewhat.
I used the `mprof` executable that comes with `memory_profiler` for this, which tracks memory over time.

EDIT: I had these mixed up. It seems this takes even more memory at the peak, so this isn't a good solution.
Yeah, that is what I saw with the `top` command in some runs I did.
I have some more information here. I profiled this functionality while marking the boundaries between functions. It seems the biggest spike occurs in `evaluate_basis()`, while evaluating contractions. Then I was able to improve the memory usage in `evaluate_density_using_evaluated_orbs` by chunking the computation over grid points. You can see in the graph that I used 4 chunks. I'm looking at optimizing this chunking, but for now, does this seem like an OK strategy?
I think chunking has to be done; then screening gets us to having the optimal scaling. The issue is that the "chunk size" needs to be adapted to the memory available. Once the chunk size is large enough, I expect the performance hit to be minimal.
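The chunking strategy being discussed can be sketched roughly as follows (a minimal stand-in for the gbasis internals; the function name and chunk-count argument are illustrative, and in practice the number of chunks would be derived from the available memory):

```python
import numpy as np


def density_in_chunks(one_density_matrix, orb_eval, n_chunks):
    """Evaluate rho(r_n) = sum_pq P_pq phi_p(r_n) phi_q(r_n), slicing the
    grid points so only chunk-sized intermediates are ever allocated."""
    n_points = orb_eval.shape[1]
    density = np.empty(n_points)
    for idx in np.array_split(np.arange(n_points), n_chunks):
        chunk = orb_eval[:, idx]  # orbital evaluations on this slice only
        density[idx] = np.einsum(
            "pq,qn,pn->n", one_density_matrix, chunk, chunk, optimize=True
        )
    return density
```

With a large enough chunk size the per-call overhead should be negligible, which matches the expectation that the performance hit is minimal.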
I think my notes (within the pull request) really belong here, so I'm moving them over.
====
With `transform=None`, you can screen. The key idea is to keep track of which atomic orbitals are needed for a given grid point.
For a grid point that is $d$ units away from the center of the basis function, the value of the (contracted) basis function is greater than

$$ c_{\min} n_{\min} d^\ell e^{-\alpha_{\min} d^2} $$

$$ n_{\min} = \sqrt{\frac{\sqrt{2 \alpha_{\min}}\,(4\alpha_{\min})^\ell}{\sqrt{\pi}\,(2 \ell + 1)!!}} $$

where $c_{\min}$, $n_{\min}$, and $\alpha_{\min}$ denote the contraction coefficient, normalization constant, and exponent of the most diffuse Gaussian in a contracted basis function. If the cutoff tolerance is $\epsilon$, you do not need to include this basis function for grid points farther away than the distance $d$ satisfying

$$ \epsilon = c_{\min} n_{\min} d^\ell e^{-\alpha_{\min} d^2} $$
This equation can be solved exactly for $\ell=0$ (in terms of logs) and $\ell=2$ (in terms of the Lambert W function). Otherwise one needs to solve it numerically. For $\ell = 0$,

$$ d = \left(-\frac{\ln \frac{\epsilon}{c_{\min} n_{\min}}}{\alpha_{\min}}\right)^{\frac{1}{2}} $$
For $\ell = 2$ one has the equation

$$ \frac{-\alpha_{\min} \epsilon}{c_{\min} n_{\min}} = -\alpha_{\min} d^2 e^{-\alpha_{\min} d^2} $$

There are two roots, one near the nucleus (where $d^2$ is small) and one far out. The second root is the one we want, and it has the value

$$ d = \sqrt{\frac{W_{-1}\left(\frac{-\alpha_{\min} \epsilon}{c_{\min} n_{\min}}\right)}{-\alpha_{\min}}} $$
(the same formula, but with the Lambert W function replacing the logarithm).

We could solve for other values of $\ell$ using a rootfinder, but I don't know that it is worth it. I'd use the logarithm for $\ell = 0$ and the Lambert W otherwise; as long as we aren't too aggressive with $\epsilon$ (and we shouldn't be) it will be fine.
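A sketch of the two closed forms above, using SciPy's `lambertw` on the $W_{-1}$ branch for the far root (the function name and the numbers in the check are illustrative, not from gbasis):

```python
import numpy as np
from scipy.special import lambertw


def cutoff_radius(eps, c, n, alpha, ell):
    """Distance beyond which c * n * d**ell * exp(-alpha * d**2) < eps
    (exact for ell = 0 and ell = 2)."""
    if ell == 0:
        return np.sqrt(-np.log(eps / (c * n)) / alpha)
    if ell == 2:
        # far root of x * e**x = -alpha*eps/(c*n) with x = -alpha*d**2,
        # taken on the W_{-1} branch
        x = lambertw(-alpha * eps / (c * n), k=-1).real
        return np.sqrt(-x / alpha)
    raise NotImplementedError("other ell values need a numerical root-finder")


# check: the basis-function bound equals eps exactly at the cutoff
eps, c, n, alpha = 1e-8, 0.5, 1.2, 0.3
for ell in (0, 2):
    d = cutoff_radius(eps, c, n, alpha, ell)
    assert np.isclose(c * n * d**ell * np.exp(-alpha * d**2), eps, rtol=1e-6, atol=0)
```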
The procedure then is to:
In the case where no screening is done, this amounts to a "chunked" version of the old algorithm I think.
I have come up with a mask function (imports included; the Lambert W is taken on the $W_{-1}$ branch for the far root):

```python
import numpy as np
from scipy.special import factorial2, lambertw


def _evaluate_orb_mask(basis, points, tol_screen):
    # mask[i, j] is True when shell i must be evaluated at point j
    orb_mask = np.zeros((len(basis), points.shape[0]), dtype=bool)
    for mask, shell in zip(orb_mask, basis):
        # parameters of the most diffuse primitive in the contraction
        cmin = shell.coeffs.min()
        amin = shell.exps.min()
        nmin = (
            (2 * amin / np.pi) ** (3 / 4)
            * ((4 * amin) ** (shell.angmom / 2))
            / np.sqrt(factorial2(2 * shell.angmom + 1))
        )
        # squared cutoff distance: logarithm for l = 0, Lambert W
        # (W_{-1} branch, the far root) otherwise
        if shell.angmom == 0:
            dist2 = -np.log(tol_screen / (cmin * nmin)) / amin
        else:
            dist2 = -lambertw(-amin * tol_screen / (cmin * nmin), k=-1).real / amin
        # keep the points lying inside the cutoff radius of the shell
        np.less_equal(np.sum((points - shell.coord) ** 2, axis=1), dist2, out=mask)
    return orb_mask
```
I'm working on putting it together with the existing code to evaluate the basis, and then the density, with it, but it will take me some time. I'm not sure how it should be done, so I'm working through the code. I think I will have to edit the `BaseOneIndex` class, the `Eval` class, and then the density functions.
This looks good @msricher. Feel free to post your thoughts here on what you are planning to modify in the modules you mentioned.
OK, so I have the mask function, and the chunking is implemented. My issue now is that, using the mask, I am not sure of where I can save memory.
I need to pass the mask into `evaluate_basis` (https://github.com/theochem/gbasis/blob/3e063ad3a3ce40430ac1eafeffc6f92e79888ff1/gbasis/evals/eval.py#L114), then in turn into `Eval.construct_array_contraction` (https://github.com/theochem/gbasis/blob/3e063ad3a3ce40430ac1eafeffc6f92e79888ff1/gbasis/evals/eval.py#L56), and possibly the `BaseOneIndex` methods.
I haven't been able to figure out yet how to apply the mask in a way that will save memory. The chunking is fine, but I have doubts about this.
@msricher, could you report how memory performance improves with only the chunking? @PaulWAyers, I have a doubt about the algorithm: for the mask, is the goal to reduce the size of `orb_eval` here using the mask?
See this comment. It's dynamic, though, based on the amount of memory available. I just set it manually to 4 chunks to show this.
I was able to make the chunking better. This is with 8 chunks. Normally, the chunks will be dynamically determined so that no chunking occurs if there is enough memory, but I manually set it to 8 to show this.
This works very well now, and I think will suffice to fix the memory issue. We can still work on the screening, though.
Please see the PR #191 .
> @PaulWAyers, I have a doubt about the algorithm: for the mask, is the goal to reduce the size of `orb_eval` here using the mask?
The idea is to reduce the CPU time and memory by not evaluating every orbital at every point. This gets density evaluation down to linear-scaling in computation and memory, which is what it should be. (Right now it is quadratic scaling in computation and cubic in memory.)
Okay, I see... I think we are talking about the same thing, but just to double-check: when doing this in the code,

$$ \rho(\mathbf{r}) = \sum_{\mu \nu} \phi^{*}_{\mu}(\mathbf{r})\, P_{\mu \nu}\, \phi_{\nu}(\mathbf{r}) $$

are you expecting, with the mask, to have $\mu$ and $\nu$ run over fewer than the total number of basis functions? I have this doubt because, if that is how the mask is applied, does $P_{\mu \nu}$ also have to change its dimension? I think this could potentially be avoided with a for loop, but I just wanted to double-check with you that my thinking is correct; thanks for your help. In any case, I think @msricher has already improved the memory usage a lot with the chunking.
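To illustrate the slicing question with a toy example (random stand-in arrays, not gbasis objects): if the screened AOs genuinely contribute nothing on a chunk of points, restricting both the orbital evaluations and the density matrix to the active indices, via `np.ix_`, reproduces the full contraction.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ao, n_pts = 6, 10
dm = rng.standard_normal((n_ao, n_ao))
dm = dm + dm.T                            # symmetric density-matrix stand-in
orb_eval = rng.standard_normal((n_ao, n_pts))

# pretend screening found that AOs 4 and 5 are negligible on this chunk
active = np.array([True, True, True, True, False, False])
orb_eval[~active] = 0.0                   # screened AOs contribute nothing

full = np.einsum("pq,qn,pn->n", dm, orb_eval, orb_eval)

# contract only the active block: slice both P and the evaluations
idx = np.flatnonzero(active)
small = np.einsum("pq,qn,pn->n", dm[np.ix_(idx, idx)], orb_eval[idx], orb_eval[idx])

assert np.allclose(full, small)
```

So yes, the density matrix would be sliced to the same reduced index set as the evaluations, which `np.ix_` handles in one step.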
I had a chat with Michelle and there are potentially lots of ways to do this. We could store $\phi_k(\mathbf{r}_g)$ as a sparse array, do masked-array operations, keep a list of the important indices and evaluate them on the fly, etc. I'm not sure what the best implementation strategy would be, but whatever it is will have major implications, as the same trick works for almost every evaluation in gbasis. I expect it requires playing around and doing some testing, and it also depends on technical details (the relative speed of sparse-array operations versus masked-array operations, for example).

My guess is that it is simplest, fastest, and most direct to store AO evaluations on the grid as a sparse ($N_{AO} \times N_{grid}$) matrix, keeping in mind that one may be chunking, so perhaps not all the grid points are treated at once.
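A toy sketch of that sparse-storage idea using `scipy.sparse.csr_array` (SciPy ≥ 1.8); the thresholding here is artificial, just to create zeros standing in for screened AO values:

```python
import numpy as np
from scipy.sparse import csr_array

rng = np.random.default_rng(2)
n_ao, n_pts = 8, 20
dm = rng.standard_normal((n_ao, n_ao))
dm = dm + dm.T                           # symmetric density-matrix stand-in

# AO evaluations with most entries screened to zero -> store sparse
phi = rng.standard_normal((n_ao, n_pts))
phi[np.abs(phi) < 1.0] = 0.0             # crude stand-in for distance screening
phi_sp = csr_array(phi)

dense_rho = np.einsum("pq,qn,pn->n", dm, phi, phi)

# sparse path: C = P @ Phi computed as (Phi^T @ P^T)^T, then an
# elementwise product with the sparse Phi and a column sum
c = (phi_sp.T @ dm.T).T                  # dense (n_ao, n_pts)
sparse_rho = np.asarray(phi_sp.multiply(c).sum(axis=0)).ravel()

assert np.allclose(sparse_rho, dense_rho)
```

Because the elementwise product is driven by the sparse operand, only the stored (unscreened) entries of $\phi$ ever enter the final sum.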
**Describe the bug**

Big memory usage for density evaluation.

**To Reproduce**

I attach the system I used to test: [iron_hexacarbonyl.tar.gz](https://github.com/theochem/gbasis/files/6792948/iron_hexacarbonyl.tar.gz)

```
mol=load_one("iron_hexacarbonyl.fchk")
dm = mol.one_rdms['scf']
basis, coord_types = from_iodata(mol)
becke = BeckeWeights(order=3)
oned = GaussChebyshevType2(100)
rgrid = BeckeTF(1e-4, 1.5).transform_1d_grid(oned)
grid = MolGrid.horton_molgrid(mol.atcoords, mol.atnums, rgrid, 1030, becke)
evaluate_density(dm, basis, grid.points, coord_type=coord_types)
```

**Expected behaviour**

I don't know if this could be improved; I am just reporting the behaviour as an issue.
**Screenshots**

**System Information:**

**Additional context**
For computing the density, it starts with `evaluate_density()`, which evaluates each basis function and then obtains the total density using the density matrix. The basis-function evaluation is done in the `_deriv.py` module, inside `_eval_deriv_contractions()`. To monitor the memory I used several prints with `nbytes` for arrays, and `@profile` from `memory_profiler`. To monitor the memory of each contraction, I printed the memory associated with `matrix_contraction` in base one: https://github.com/theochem/gbasis/blob/c57de8f1cfbe1dbaac8941c7cb9993797c76fca6/gbasis/base_one.py#L239. Then I profiled the memory in `_eval_deriv_contractions()`, `evaluate_density()` and `evaluate_density_using_evaluated_orbs`.

With the information I gathered, I think the memory dependency in the code is as follows. Through the evaluation of all the generalized contractions, the memory scales with the number of AOs generated for each angular momentum. S-type generalized contractions have the lowest memory cost; to obtain the cost for a higher angular momentum, multiply the S-contraction cost by the number of generated atomic orbitals (e.g. for l=1, the three p orbitals: increase in memory = S-contraction memory × 3). The base memory corresponding to S-type contractions scales linearly with the number of points in the molecular grid. I fitted some values for moderately sized molecules and got the regression `S-generalized contractions memory (MB) = (0.000008150229943 * n_points) - 0.24118548`. I checked the prediction for some bigger systems and it was about 2 MB off. To show the agreement of the predicted memory with the monitored one, I used iron hexacarbonyl:

For a radial grid of 100 points and an angular grid of 1030: total = 100 × 1202 × 13 atoms = 1,562,600 points.

Base memory for S-generalized contractions = 12.49 MB.
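For what it's worth, plugging the reported grid size into the fitted regression reproduces the reported base memory:

```python
n_points = 100 * 1202 * 13  # radial points x angular points x atoms = 1,562,600
mem_mb = 0.000008150229943 * n_points - 0.24118548
print(round(mem_mb, 2))  # ~12.49, matching the profiled 12.49 MB
```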
The values I obtained profiling the code are shown in the following pictures. This is the profiling for `evaluate_density()`: you can see `evaluate_basis` increases the memory by almost exactly the amount calculated at line 78. But there is also a spike of memory, not shown here, that occurs in `evaluate_density_using_evaluated_orbs`; its profiling is shown in the following image, where you can see at line 63 that the memory is doubled.