Accelerating SGX - Githubissues

kylebystrom commented 3 years ago

Hi PySCF Developers,

I am interested in doing some coding to accelerate the SGX module but wanted to ask for feedback and advice first since I'm new to molecular integral evaluation. I did some basic analysis and profiling of the code below.

I have been using the SGX module in my research and have found it quite useful. However, I have also noticed that it is significantly slower than ORCA's COSX, in spite of the fact that all the computationally expensive parts of the PySCF implementation are written in parallel in C. I have noticed several differences between the ORCA COSX and PySCF SGX, and I was wondering if you had any insights into which of these differences might result in the performance difference.

ORCA does P-junction screening, while PySCF does not. In addition, COSX computes the exchange potential for density matrix differences, rather than total density matrices. Therefore, even for small molecules COSX can get significant P-junction screening.
ORCA vectorizes their ESP integrals by putting the real-space coordinates on the inner loop.
ORCA uses McMurchie-Davidson integration and calculates the incomplete gamma function directly, and they store all the Hermite expansion coefficients and reuse them for different grid points. I am not sure if there is any way to do this for Rys quadrature, which is the approach used by libcint. Are there any quantities that can be precomputed and reused within the Rys quadrature approach that are not already? I saw that the pair data is precomputed for 3-center, 2-electron integrals.

I also did some tests and found that while PySCF's analytical exchange is only a bit slower than ORCA, its SGX is much slower. The table below shows seconds per SCF iteration for B3LYP/def2-QZVPPD. I did methane and ethane on 1 core and octane on 16 cores. I am not sure why ORCA is so slow and PySCF so fast for methane with analytical integration, but the other data illustrate the performance difference well. Also, I think PySCF's grids are a bit larger than ORCA's for COSX (PySCF has more radial grids), but that does not account for the performance difference.

	pyscf-analytical	pyscf-sgx	orca-analytical	orca-cosx
methane	2.0	6.8	6.4	2.3
ethane	39.1	27.8	35.4	9.2
octane	157.3	62.9	112.8	12.8

I have also attached a couple of valgrind-callgrind output files, one with inclusive timing (profile_inclusive.txt) and one with exclusive timing (profile_exclusive.txt). What stood out to me is the time taken by the recurrence relations, Cartesian to Spherical functions, and SGX contraction, which might be accelerated significantly by vectorization. profile_exclusive.txt profile_inclusive.txt

Do you have any feedback or suggestions for approaching this problem?

Thanks, Kyle

P.S. Here is the script used to generate the profile output:

from pyscf import gto, dft
from pyscf.sgx.sgx import sgx_fit
import time
atom = '''
H 0.0000000 0.0000000 0.0000000
H 0.0000000 0.0000000 0.7400000
'''
mol = gto.Mole(atom=atom)
mol.basis = 'def2-qzvppd'
mol.verbose = 4
mol.build()
start = time.monotonic()
ks = dft.RKS(mol)
ks.xc = 'B3LYP'
ks = sgx_fit(ks)
ks.with_df.dfj = True
ks.kernel()
stop = time.monotonic()
sgx_time = stop - start
print()
print('SEMINUMERICAL X TIME:', sgx_time)

gkc1000 commented 3 years ago

Hi, thanks for your interest in PySCF! Improving the SGX module would be great.

Here are my guesses for what you are seeing. The critical thing in exchange in larger molecules is almost always going to be screening

for analytical exchange, methane will likely be testing the pure integral speed, while octane will be testing the screening thresholds. The different scaling of the times between PySCF and ORCA suggests PySCF is computing integrals more quickly, but is using less aggressive screening. You should check if ORCA and PySCF are computing the same number of integrals.
The first point: screening based on the density matrix and exchange potential due to DM density matrix is probably the largest factor. @sunqm any thoughts?

kylebystrom commented 3 years ago

Hi, thanks for the guidance! I double-checked the grid sizes and realized that I did in fact make a mistake; the grid sizes are significantly larger for PySCF. I was assuming ORCA was using the grid sizes from the original COSX paper, but they shrunk the grids significantly after introducing overlap fitting (a technique also included in the PySCF SGX module). I tried increasing the COSX grid size to be about the same as PySCF's, and now they get similar performance for methane. That suggests that you are right that PySCF has worse scaling because of less aggressive screening and no P-junction screening.

Based on that, I think it makes sense that the first thing I should do is implement P-junction screening and test that. Then I can examine the structure of the integral evaluation more closely and see if anything can be improved on that end too.

sunqm commented 3 years ago

Exactly. For analytical exchange, the default integral screening cutoff and AO-direct SCF cutoff is quite conservative. It was set to 1e-13 in mf.direct_scf_tol. For organic molecules, setting to 1e-11 or even 1e-10 might be a better value that balances the speed and accuracy.

On Thu, Mar 25, 2021 at 07:29:05PM -0700, gkc1000 wrote:

Hi, thanks for your interest in PySCF! Improving the SGX module would be great.

Here are my guesses for what you are seeing. The critical thing in exchange in larger molecules is almost always going to be screening

for analytical exchange, methane will likely be testing the pure integral speed, while octane will be testing the screening thresholds. The different scaling of the times between PySCF and ORCA suggests PySCF is computing integrals more quickly, but is using less aggressive screening. You should check if ORCA and PySCF are computing the same number of integrals.

The first point: screening based on the density matrix and exchange potential due to DM density matrix is probably the largest factor. @sunqm any thoughts?

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/pyscf/pyscf/issues/865#issuecomment-807888444

--

sunqm commented 3 years ago

Integral screening and integration grids were not optimized in SGX. They are set very conservative to ensure accuracy.

For analytical integrals, more than the low level integral algorithm, there are many parameters to tune when optimizing integral code. The integral scheme in pyscf favors general contracted, high angular moment basis, e.g. ANO basis. Integrals are vectorized in a different library http://pyscf.org/pyscf/install.html#using-optimized-integral-library This library was optimized against Intel CPUs.

For small molecules like methane, pyscf caches all integrals in memory. This is probably the reason you see big difference. Profiling performance for small systems may show high memory accessing overhead. It does not tell accurately how the integrals code perform. For relatively large system, you should be able to observe ~90% time on integral library in which recurrence relations and Cartesian-spherical transformation may not spend that much time (depending on the choices of basis sets).

On Thu, Mar 25, 2021 at 03:36:08PM -0700, Kyle Bystrom wrote:

Hi PySCF Developers,

I am interested in doing some coding to accelerate the SGX module but wanted to ask for feedback and advice first since I'm new to molecular integral evaluation. I did some basic analysis and profiling of the code below.

I have been using the SGX module in my research and have found it quite useful. However, I have also noticed that it is significantly slower than ORCA's COSX, in spite of the fact that all the computationally expensive parts of the PySCF implementation are written in parallel in C. I have noticed several differences between the ORCA COSX and PySCF SGX, and I was wondering if you had any insights into which of these differences might result in the performance difference.

ORCA does P-junction screening, while PySCF does not. In addition, COSX computes the exchange potential for density matrix differences, rather than total density matrices. Therefore, even for small molecules COSX can get significant P-junction screening.

ORCA vectorizes their ESP integrals by putting the real-space coordinates on the inner loop.

ORCA uses McMurchie-Davidson integration and calculates the incomplete gamma function directly, and they store all the Hermite expansion coefficients and reuse them for different grid points. I am not sure if there is any way to do this for Rys quadrature, which is the approach used by libcint. Are there any quantities that can be precomputed and reused within the Rys quadrature approach that are not already? I saw that the pair data is precomputed for 3-center, 2-electron integrals.

I also did some tests and found that while PySCF's analytical exchange is only a bit slower than ORCA, its SGX is much slower. The table below shows seconds per SCF iteration for B3LYP/def2-QZVPPD. I did methane and ethane on 1 core and octane on 16 cores. I am not sure why ORCA is so slow and PySCF so fast for methane with analytical integration, but the other data illustrate the performance difference well. Also, I think PySCF's grids are a bit larger than ORCA's for COSX (PySCF has more radial grids), but that does not account for the performance difference.

pyscf-analytical pyscf-sgx orca-analytical orca-cosx

methane 2.0 6.8 6.4 2.3

ethane 39.1 27.8 35.4 9.2

octane 157.3 62.9 112.8 12.8

I have also attached a couple of valgrind-callgrind output files, one with inclusive timing (profile_inclusive.txt) and one with exclusive timing (profile_exclusive.txt). What stood out to me is the time taken by the recurrence relations, Cartesian to Spherical functions, and SGX contraction, which might be accelerated significantly by vectorization. profile_exclusive.txt profile_inclusive.txt

Do you have any feedback or suggestions for approaching this problem?

Thanks, Kyle

P.S. Here is the script used to generate the profile output:
from pyscf import gto, dft
from pyscf.sgx.sgx import sgx_fit
import time
atom = '''
H 0.0000000 0.0000000 0.0000000
H 0.0000000 0.0000000 0.7400000
'''
mol = gto.Mole(atom=atom)
mol.basis = 'def2-qzvppd'
mol.verbose = 4
mol.build()
start = time.monotonic()
ks = dft.RKS(mol)
ks.xc = 'B3LYP'
ks = sgx_fit(ks)
ks.with_df.dfj = True
ks.kernel()
stop = time.monotonic()
sgx_time = stop - start
print()
print('SEMINUMERICAL X TIME:', sgx_time)
-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/pyscf/pyscf/issues/865

--

kylebystrom commented 3 years ago

Thank you for the explanation. I am working on implementing the P-junction screening as an optional screening step. My initial implementation only saves a few seconds (out of 1-2 minutes) on small molecules like methane and ethane, but I should be able to improve that by experimenting with direct_scf_cutoff like you suggested. I will also test to see if the P-junction screening brings the scaling in line with ORCA.

When you say that for large systems there may not be much time spent on recurrence relations and Cartesian-spherical transformation, what should be the bottleneck for large systems?

gkc1000 commented 3 years ago

I think the point is that improving scaling dwarfs the gains from constant prefactors (e.g. recurrence/Cartesian-spherical) etc.

BTW, since you have been working on this module recently, would you be interested in contributing a brief userdoc section? We are trying to cover all the modules before the 2.0 release.

kylebystrom commented 3 years ago

Sure! What is the timeline on the 2.0 release, just so I know when I need to have that done by? It might also make sense to have some more flexible options for the SGX module, as opposed to simply setting the initial and final level. Maybe we could keep the conservative grids used currently as the default and add some options for coarser grids and more aggressive screening.

Also, after spending more time studying the code, I see what you mean by the importance of prefactors compared to scaling, especially for these small and medium-size systems. I also noticed that qcint leverages x86 SIMD instructions by waiting to perform recurrence relations until a certain number of primitives have been looped over; would it be useful to have an integral evaluator optimized for real-space grid evaluation by computing the integrals for small batches of grid points using SIMD? This might make the SIMD performance gains even bigger by leveraging it even for uncontracted basis functions, which are a fairly large fraction of the larger def2 basis sets.

gkc1000 commented 3 years ago

2.0alpha is April, so next week! We only need something very short, e.g. like in pyscf-doc/source/user/mp.rst -- this is just so a user knows how to call it and what the most important options are. We can always add more later. Intent is to have developer docs as well for each module, but it will be some time before we get there. Let me know if you are interested, and I can add you to the pyscf slack.

Re: SIMD, it sounds like a plausible strategy, but this would be working at the libqcint level ...

In general, we don't have enough developers willing to optimize on that level so that is great. Academically you may get more out of thinking of the higher level algos though.

On Tue, Mar 30, 2021 at 5:03 PM Kyle Bystrom @.***> wrote:

Sure! What is the timeline on the 2.0 release, just so I know when I need to have that done by? It might also make sense to have some more flexible options for the SGX module, as opposed to simply setting the initial and final level. Maybe we could keep the conservative grids used currently as the default and add some options for coarser grids and more aggressive screening.

Also, after spending more time studying the code, I see what you mean by the importance of prefactors compared to scaling, especially for these small and medium-size systems. I also noticed that qcint leverages x86 SIMD instructions by waiting to perform recurrence relations until a certain number of primitives have been looped over; would it be useful to have an integral evaluator optimized for real-space grid evaluation by computing the integrals for small batches of grid points using SIMD? This might make the SIMD performance gains even bigger by leveraging it even for uncontracted basis functions, which are a fairly large fraction of the larger def2 basis sets.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pyscf/pyscf/issues/865#issuecomment-810655328, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN5NRFRSMUAZYM3CCZAC23TGJRGPANCNFSM4Z2IDLVA .

kylebystrom commented 3 years ago

I would be happy to add a short userdoc section for the sgx module this week. I definitely won't have any changes to the code itself ready by then, though.

I am interested in learning how to do the low-level optimization at the qcint level because I think having really efficient real-space two-center integrals combined with the autocode utilities of qcint would be very practical due to the popularity of COSX. I think the tricky thing with implementing SIMD to group grid points together is that it effectively requires vectorization from the top-level calls to qcint all the way down to the Rys roots, recurrence, cartesian-to-spherical, etc. A lot of that would be simple but tedious changes to those functions (changing many of the double pointers to SIMD vectors and rearranging how some of the data is stored). If such an approach does give worthwhile speedups, it might make more sense to have it as an optional add-on package.

gkc1000 commented 3 years ago

Great!

Better SIMD utilization would be great -- let us know how we can help. I'll add you to slack and we can also continue the dev discussion there with @sunqm.

On Tue, Mar 30, 2021 at 11:47 PM Kyle Bystrom @.***> wrote:

I would be happy to add a short userdoc section for the sgx module this week. I definitely won't have any changes to the code itself ready by then, though.

I am interested in learning how to do the low-level optimization at the qcint level because I think having really efficient real-space two-center integrals combined with the autocode utilities of qcint would be very practical due to the popularity of COSX. I think the tricky thing with implementing SIMD to group grid points together is that it effectively requires vectorization from the top-level calls to qcint all the way down to the Rys roots, recurrence, cartesian-to-spherical, etc. A lot of that would be simple but tedious changes to those functions (changing many of the double pointers to SIMD vectors and rearranging how some of the data is stored). If such an approach does give worthwhile speedups, it might make more sense to have it as an optional add-on package.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pyscf/pyscf/issues/865#issuecomment-810817688, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN5NRG2UUPVA3VRDLTT273TGLAPVANCNFSM4Z2IDLVA .

baopengbp commented 3 years ago

Hi Kyle，

I am glad you are interested in the sgX. I already have a linear scaling sgX in https://github.com/baopengbp/sgX/tree/master/LinsgX I think it is useful for you to develop sgX.

Peng

kylebystrom commented 2 years ago

Addressed by #945

pyscf / pyscf

Accelerating SGX #865