qzhu2017 closed this issue 3 years ago.
@yanxon Can you reply to this message? If you are doing it now, I will wait. Otherwise, I will have a go today.
Will use multiprocessing for the kff calculation.
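A minimal sketch of what that could look like, assuming each kff entry can be computed independently (compute_one and pairs are hypothetical placeholders, not the actual RBF_mb.py API):

import numpy as np
from multiprocessing import Pool

def compute_one(pair):
    # placeholder for the real per-pair kff kernel
    x1, x2 = pair
    return np.dot(x1, x2)

if __name__ == "__main__":
    pairs = [(np.random.random(3), np.random.random(3)) for _ in range(8)]
    with Pool(processes=4) as pool:  # one worker per core
        results = pool.map(compute_one, pairs)
    print(results)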
Add a test script for d2d_dx1dx2. See if there is a way to speed up that loop only.
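For reference, a profile like the one below can be generated with cProfile (a sketch; train() is a placeholder for the real entry point):

import cProfile
import pstats

cProfile.run("train()", "kff.prof")
stats = pstats.Stats("kff.prof")
stats.sort_stats("tottime").print_stats(30)  # the "Ordered by: internal time" view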
19246690 function calls (19069682 primitive calls) in 343.335 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
599491 146.753 0.000 146.753 0.000 {built-in method numpy.core._multiarray_umath.c_einsum}
79 48.006 0.608 48.006 0.608 SO3.py:660(get_power_spectrum_components)
5 0.008 0.002 217.959 43.592 RBF_mb.py:324(k_total_with_stress)
3 0.002 0.001 57.388 19.129 RBF_mb.py:73(k_total)
5 0.215 0.043 207.679 41.536 RBF_mb.py:391(kff_many_with_stress)
2 0.076 0.038 51.672 25.836 RBF_mb.py:190(kff_many)
36367 0.191 0.000 258.473 0.007 RBF_mb.py:485(kff_single)
36367 20.359 0.001 258.282 0.007 derivatives.py:76(K_ff)
36367 10.278 0.000 111.655 0.003 derivatives.py:230(fun_d2D_dx1dx2)
36367 67.505 0.002 68.913 0.002 derivatives.py:158(fun_d2d_dx1dx2)
87294 7.762 0.000 11.020 0.000 derivatives.py:151(fun_dd_dx2)
72734 5.519 0.000 8.328 0.000 derivatives.py:141(fun_dd_dx1)
104004 2.472 0.000 3.668 0.000 derivatives.py:13(fun_D)
50927 1.002 0.000 2.529 0.000 derivatives.py:18(fun_dk_dD)
50927 0.414 0.000 9.513 0.000 derivatives.py:225(fun_dD_dx2)
14560 0.301 0.000 15.222 0.001 derivatives.py:114(K_ef)
36367 0.298 0.000 5.681 0.000 derivatives.py:220(fun_dD_dx1)
1075 0.071 0.000 0.322 0.000 derivatives.py:40(K_ee)
52002 0.585 0.000 0.893 0.000 RBF_mb.py:567(get_mask)
5 0.049 0.010 9.818 1.964 RBF_mb.py:341(kef_many_with_stress)
8 0.033 0.004 5.810 0.726 RBF_mb.py:265(kef_many)
14560 0.032 0.000 15.254 0.001 RBF_mb.py:465(kef_single)
7 0.003 0.000 0.356 0.051 RBF_mb.py:123(kee_many)
1075 0.002 0.000 0.323 0.000 RBF_mb.py:444(kee_single)
@luckychen I just managed to reduce the number of function calls to speed up the Kff function in 45737b2. The run time for each K_ff call has now been reduced from 7 ms to 5 ms. It would be good if we can push this further. Note that we don't need to recompute the einsum contraction path every time.
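For example (a minimal sketch with illustrative shapes, not the actual RBF_mb.py code), the contraction path can be computed once with np.einsum_path and then passed back through the optimize argument on every call:

import numpy as np

x1 = np.random.random((50, 30))
x2 = np.random.random((50, 30))

# search for a good contraction order once...
path, info = np.einsum_path("ik,jl->ijkl", x1, x2, optimize="optimal")

# ...and reuse it on every subsequent call instead of re-deriving it
out = np.einsum("ik,jl->ijkl", x1, x2, optimize=path)

Reusing the path would remove the repeated einsum_path calls visible in the profile below.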
37892451 function calls (37402408 primitive calls) in 262.315 seconds
36367 68.387 0.002 69.879 0.002 derivatives.py:155(fun_d2d_dx1dx2)
36367 19.821 0.001 186.563 0.005 derivatives.py:68(K_ff)
36367 10.782 0.000 114.682 0.003 derivatives.py:227(fun_d2D_dx1dx2)
50927 5.182 0.000 7.416 0.000 derivatives.py:148(fun_dd_dx2)
36367 2.835 0.000 4.347 0.000 derivatives.py:138(fun_dd_dx1)
104004 2.536 0.000 3.778 0.000 derivatives.py:8(fun_D)
50927 0.878 0.000 2.980 0.000 derivatives.py:13(fun_k)
14560 0.360 0.000 15.388 0.001 derivatives.py:111(K_ef)
50927 0.260 0.000 3.240 0.000 derivatives.py:20(fun_dk_dD)
14560 0.147 0.000 4.510 0.000 derivatives.py:222(fun_dD_dx2)
1075 0.074 0.000 0.320 0.000 derivatives.py:32(K_ee)
1 0.000 0.000 0.000 0.000 derivatives.py:5(<module>)
464214 43.115 0.000 43.115 0.000 {built-in method numpy.core._multiarray_umath.c_einsum}
62607 2.392 0.000 9.947 0.000 einsumfunc.py:706(einsum_path)
464150 1.636 0.000 82.479 0.000 einsumfunc.py:997(einsum)
import numpy as np
from time import time

def fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm):
    x1_norm3 = x1_norm**3
    x2_norm3 = x2_norm**3
    x2_norm2 = x2_norm**2
    tmp0 = np.ones(x2.shape)
    x1x1 = np.einsum("ijl,ik->ijkl", x1[:,None,:]*tmp0[None,:,:], x1)
    tmp1 = x1x1/x2_norm[None,:,None,None]
    x1x2 = np.einsum("ik,jl->ijkl", x1, x2/x2_norm3[:,None])
    tmp2 = np.einsum("ijkl,ij->ijkl", x1x2, x1@x2.T)
    out1 = (tmp1-tmp2)/(x1_norm3[:,None,None,None])
    tmp3 = np.eye(x2.shape[1])[None,:,:] - np.einsum('ij,ik->ijk',x2,x2)/x2_norm2[:,None,None] # n*d1*d2
    out2 = tmp3[None,:,:,:]/x1_norm[:,None,None,None]/x2_norm[None,:,None,None]
    return out2 - out1

t0 = time()
m, n, k, sigma2, l2, zeta = 50, 50, 50, 0.81, 0.01, 2
x1 = np.random.random([m, k])
x2 = np.random.random([n, k])
x1_norm = np.linalg.norm(x1, axis=1)
x2_norm = np.linalg.norm(x2, axis=1)
out = fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm)
print(time()-t0)
Added parallel execution in 8205efb.
@qzhu2017 Is K_ff quick enabled in training mode?
@yanxon K_ff quick is actually much slower than the regular Kff.
@qzhu2017 Here is the optimized code. I just rearranged all the terms. I checked the results by summing the 4D array; I think your code and this optimized code are equivalent.
import numpy as np
from time import time

np.random.seed(seed=13)

def fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm):
    x1_norm3 = x1_norm**3
    x2_norm3 = x2_norm**3
    x2_norm2 = x2_norm**2
    # For tmp1
    tmp0 = np.ones(x2.shape)
    x1x1 = (x1[:,None,:]*tmp0[None,:,:])[:,:,None,:]*(x1 / x1_norm3[:,None])[:,None,:,None]
    tmp1 = x1x1/x2_norm[None,:,None,None]
    # For tmp2
    x1x2 = (x1/x1_norm3[:,None])[:,None,:,None]*(x2/x2_norm3[:,None])[None,:,None,:]
    tmp2 = x1x2 * (x1@x2.T)[:,:,None,None]
    out1 = tmp1-tmp2
    tmp3 = np.eye(x2.shape[1])[None,:,:] - np.einsum('ij,ik->ijk',x2,x2)/x2_norm2[:,None,None] # n*d1*d2
    out2 = tmp3[None,:,:,:]/(x1_norm[:,None]*x2_norm[None,:])[:,:,None,None]
    out = out2 - out1
    return out

m, n, k, sigma2, l2, zeta = 50, 50, 50, 0.81, 0.01, 2
x1 = np.random.random([m, k])
x2 = np.random.random([n, k])
x1_norm = np.linalg.norm(x1, axis=1)
x2_norm = np.linalg.norm(x2, axis=1)
t0 = time()
out = fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm)
print(time()-t0)
print(np.sum(out))
On my computer, your original code runs in 0.07942 s, while the optimized code runs in 0.05811 s.
import numpy as np
from time import time
#np.random.seed(seed=13)

def fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm):
    x1_norm3 = x1_norm**3
    x2_norm3 = x2_norm**3
    x2_norm2 = x2_norm**2
    # For tmp1
    tmp0 = np.ones(x2.shape)
    x1x1 = (x1[:,None,:]*tmp0[None,:,:])[:,:,None,:]*(x1 / x1_norm3[:,None])[:,None,:,None]
    tmp1 = x1x1/x2_norm[None,:,None,None]
    # For tmp2
    x1x2 = (x1/x1_norm3[:,None])[:,None,:,None]*(x2/x2_norm3[:,None])[None,:,None,:]
    tmp2 = x1x2 * (x1@x2.T)[:,:,None,None]
    out1 = tmp1-tmp2
    tmp3 = np.eye(x2.shape[1])[None,:,:] - np.einsum('ij,ik->ijk',x2,x2)/x2_norm2[:,None,None] # n*d1*d2
    out2 = tmp3[None,:,:,:]/(x1_norm[:,None]*x2_norm[None,:])[:,:,None,None]
    out = out2 - out1
    return out

def fun_d2d_dx1dx2_old(x1, x2, x1_norm, x2_norm):
    x1_norm3 = x1_norm**3
    x2_norm3 = x2_norm**3
    x2_norm2 = x2_norm**2
    tmp0 = np.ones(x2.shape)
    x1x1 = np.einsum("ijl,ik->ijkl", x1[:,None,:]*tmp0[None,:,:], x1)
    tmp1 = x1x1/x2_norm[None,:,None,None]
    x1x2 = np.einsum("ik,jl->ijkl", x1, x2/x2_norm3[:,None])
    tmp2 = np.einsum("ijkl,ij->ijkl", x1x2, x1@x2.T)
    out1 = (tmp1-tmp2)/(x1_norm3[:,None,None,None])
    tmp3 = np.eye(x2.shape[1])[None,:,:] - np.einsum('ij,ik->ijk',x2,x2)/x2_norm2[:,None,None] # n*d1*d2
    out2 = tmp3[None,:,:,:]/x1_norm[:,None,None,None]/x2_norm[None,:,None,None]
    return out2 - out1

m, n, k, sigma2, l2, zeta = 50, 50, 50, 0.81, 0.01, 2
for i in range(10):
    x1 = np.random.random([m, k])
    x2 = np.random.random([n, k])
    x1_norm = np.linalg.norm(x1, axis=1)
    x2_norm = np.linalg.norm(x2, axis=1)
    t0 = time()
    out1 = fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm)
    print("1:", time()-t0)
    t0 = time()
    out2 = fun_d2d_dx1dx2_old(x1, x2, x1_norm, x2_norm)
    print("2:", time()-t0)
    print(np.sum(np.abs(out1-out2))<1e-4)
    #print(np.sum(out))
1: 0.18059730529785156
2: 0.19556140899658203
True
1: 0.1440131664276123
2: 0.19613170623779297
True
1: 0.14458227157592773
2: 0.19655108451843262
True
1: 0.14286065101623535
2: 0.19550585746765137
True
1: 0.14302682876586914
2: 0.19610929489135742
True
1: 0.14320039749145508
2: 0.19067859649658203
True
1: 0.1417226791381836
2: 0.19315719604492188
True
1: 0.14258503913879395
2: 0.19188952445983887
True
1: 0.14194321632385254
2: 0.19233441352844238
True
1: 0.14217805862426758
2: 0.19250869750976562
True
On average, the timings are about 0.142 s vs. 0.192 s.
@qzhu2017 This one is even faster. I believe this is about the best it can do. The idea is to multiply the lower-dimensional quantities first and only then broadcast to the higher dimensions. For simplicity, say a = 10, b = 2, and A is a 2x2 matrix; computing (ab)A is always faster than aAb.
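A toy illustration of that point (my example; timings are machine-dependent): multiplying the scalars together first touches the large array only once, while interleaving them creates two full-size temporaries.

import numpy as np
from timeit import timeit

a, b = 10.0, 2.0
A = np.random.random((2000, 2000))

t1 = timeit(lambda: (a * b) * A, number=100)  # one pass over A
t2 = timeit(lambda: a * A * b, number=100)    # two passes over A
print(t1, t2)  # t1 is typically smaller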
import numpy as np
from time import time

np.random.seed(seed=13)

def fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm):
    x1_norm3 = x1_norm**3
    x2_norm3 = x2_norm**3
    x2_norm2 = x2_norm**2
    # For tmp1
    tmp0 = np.ones(x2.shape)
    tmp1 = (x1[:,None,:]*(tmp0/x2_norm[:,None])[None,:,:])[:,:,None,:]*(x1/x1_norm3[:,None])[:,None,:,None]
    # For tmp2
    x1x2 = (x1/x1_norm3[:,None])[:,None,:,None]*(x2/x2_norm3[:,None])[None,:,None,:]
    tmp2 = x1x2 * (x1@x2.T)[:,:,None,None]
    out1 = tmp1-tmp2
    tmp3 = np.eye(x2.shape[1])[None,:,:] - np.einsum('ij,ik->ijk',x2,x2/x2_norm2[:,None]) # n*d1*d2
    out2 = tmp3[None,:,:,:]/(x1_norm[:,None]*x2_norm[None,:])[:,:,None,None]
    out = out2 - out1
    return out

m, n, k, sigma2, l2, zeta = 50, 50, 50, 0.81, 0.01, 2
x1 = np.random.random([m, k])
x2 = np.random.random([n, k])
x1_norm = np.linalg.norm(x1, axis=1)
x2_norm = np.linalg.norm(x2, axis=1)
t0 = time()
out = fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm)
print(time()-t0)
print(np.sum(out))
@yanxon Great. I will update the code later.
@luckychen
import numpy as np
from time import time

def fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm):
    x1_norm3 = x1_norm**3
    x2_norm3 = x2_norm**3
    x2_norm2 = x2_norm**2
    tmp0 = np.ones(x2.shape)
    t0 = time()  # 0.011465072631835938
    x1x1 = np.einsum("ijl,ik->ijkl", x1[:,None,:]*tmp0[None,:,:], x1)
    print(time()-t0)
    t0 = time()  # 0.008262157440185547
    tmp1 = x1x1/x2_norm[None,:,None,None]
    print(time()-t0)
    t0 = time()  # 0.011075496673583984
    x1x2 = np.einsum("ik,jl->ijkl", x1, x2/x2_norm3[:,None])
    print(time()-t0)
    t0 = time()  # 0.01009225845336914
    tmp2 = np.einsum("ijkl,ij->ijkl", x1x2, x1@x2.T)
    print(time()-t0)
    t0 = time()  # 0.015599489212036133
    out1 = (tmp1-tmp2)/(x1_norm3[:,None,None,None])
    print(time()-t0)
    t0 = time()  # 0.0006268024444580078
    tmp3 = np.eye(x2.shape[1])[None,:,:] - np.einsum('ij,ik->ijk',x2,x2)/x2_norm2[:,None,None] # n*d1*d2
    print(time()-t0)
    t0 = time()  # 0.014022350311279297
    out2 = tmp3[None,:,:,:]/x1_norm[:,None,None,None]/x2_norm[None,:,None,None]
    print(time()-t0)
    t0 = time()  # 0.00873422622680664
    out = out2-out1
    print(time()-t0)
    return out

m, n, k, sigma2, l2, zeta = 50, 50, 50, 0.81, 0.01, 2
x1 = np.random.random([m, k])
x2 = np.random.random([n, k])
x1_norm = np.linalg.norm(x1, axis=1)
x2_norm = np.linalg.norm(x2, axis=1)
t0 = time()  # 0.08103704452514648
out = fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm)
print(time()-t0)
@luckychen @yanxon I just tried to reorganize the code and compared it with the previous one. The most expensive steps are the lines that assemble out1 and out2 (see the per-line timings recorded in the comments above). Probably this is close to the upper limit within the Python framework.
import numpy as np
from time import time

def fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm):
    x1_norm3 = x1_norm**3
    x2_norm3 = x2_norm**3
    x2_norm2 = x2_norm**2
    # For tmp1
    tmp0 = np.ones(x2.shape)
    x1x1 = (x1[:,None,:]*tmp0[None,:,:])[:,:,None,:]*(x1 / x1_norm3[:,None])[:,None,:,None]
    tmp1 = x1x1/x2_norm[None,:,None,None]
    # For tmp2
    x1x2 = (x1/x1_norm3[:,None])[:,None,:,None]*(x2/x2_norm3[:,None])[None,:,None,:]
    tmp2 = x1x2 * (x1@x2.T)[:,:,None,None]
    out1 = tmp1-tmp2
    # For tmp3 / out2
    tmp3 = np.eye(x2.shape[1])[None,:,:] - np.einsum('ij,ik->ijk',x2,x2)/x2_norm2[:,None,None] # n*d1*d2
    out2 = tmp3[None,:,:,:]/(x1_norm[:,None]*x2_norm[None,:])[:,:,None,None]
    out = out2 - out1
    return out

def fun_d2d_dx1dx2_4(x1, x2, x1_norm, x2_norm):  # total 0.14 s
    x1_norm2 = x1_norm**2
    x2_norm2 = x2_norm**2
    x1x2_norm3 = (x1_norm[:,None] * x2_norm[None,:])**3
    # first term
    #x2x2 = np.einsum('ij,ik->ijk',x2,x2)
    #term1 = np.einsum('i,jkl->ijkl', x1_norm2, identy-x2x2)
    #identy = np.einsum("i,jk->ijk", x2_norm2, np.eye(x2.shape[1]))
    x2x2 = x2[:,:,None] * x2[:,None,:]
    identy = x2_norm2[:,None,None] * np.eye(x2.shape[1])[None,:,:]
    term1 = x1_norm2[:,None,None,None] * (identy-x2x2)[None,:,:,:]
    # second term
    #x1x2 = np.einsum("ij,ik,jl->ijkl", x1@x2.T, x1, x2) #, optimize='greedy')
    #x1x1 = np.einsum("jl,ikl->ijkl", x2_norm2[:,None]*np.ones(x2.shape), x1[:,:,None]*x1[:,None,:])
    x1x2 = x1[:,None,:,None] * x2[None,:,None,:] * (x1@x2.T)[:,:,None,None]  # 0.04 s
    x1x1 = (x1[:,None,:]*x1[:,:,None])[:,None,:,:] * (np.ones(x2.shape)*(x2_norm2[:,None]))[None,:,None,:]  # 0.02 s
    term2 = x1x2 - x1x1  # 0.02 s
    out = (term1+term2)/(x1x2_norm3[:,:,None,None])  # 0.04 s
    return out

m, n, k, sigma2, l2, zeta = 50, 50, 50, 0.81, 0.01, 2
for i in range(10):
    x1 = np.random.random([m, k])
    x2 = np.random.random([n, k])
    x1_norm = np.linalg.norm(x1, axis=1)
    x2_norm = np.linalg.norm(x2, axis=1)
    t0 = time()
    out1 = fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm)
    print("1:", time()-t0)
    t0 = time()
    out2 = fun_d2d_dx1dx2_4(x1, x2, x1_norm, x2_norm)
    print("2:", time()-t0)
    print(np.sum(np.abs(out1-out2))<1e-4)
@qzhu2017 @yanxon Just checking: does that mean the function with the reorganized code (fun_d2d_dx1dx2_4) runs slower than the original one? I see the einsum lines cost 0.010 and 0.011 seconds while the new code costs 0.04 and 0.02 seconds. I am trying to understand the einsum function in numpy. Based on my understanding, it calls a fully optimized library (maybe optimized with SIMD) for 2-D and 3-D inputs/outputs, but not for 4-D. So converting the 4-D operations into multiple 3-D operations and using einsum may be a good way to improve performance, because einsum would then call the optimized library for the 3-D operations.
@luckychen The results are from different CPUs. In my case, both codes need about 0.14 s; the reorganized version is faster by perhaps 0.003 s. I will take your suggestion and see whether I can reshape the matrices from 4D to 3D arrays.
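One way that reshaping could work (a sketch of the idea, not tested in the actual code): the 4D outer product "ik,jl->ijkl" is just a 2D outer product of the flattened inputs, reshaped back at the end.

import numpy as np

m, n, k = 50, 50, 50
x1 = np.random.random((m, k))
x2 = np.random.random((n, k))

out4d = np.einsum("ik,jl->ijkl", x1, x2)

# same result via one 2D outer product + reshape + axis reordering
out2d = np.outer(x1.ravel(), x2.ravel()).reshape(m, k, n, k).transpose(0, 2, 1, 3)
print(np.allclose(out4d, out2d))  # True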
@luckychen @yanxon I just tried to run it with cupy (a GPU version of numpy).
import cupy as cp
from time import time

def fun_d2d_dx1dx2_4(x1, x2, x1_norm, x2_norm):
    x1_norm2 = x1_norm**2
    x2_norm2 = x2_norm**2
    x1x2_nor3 = None  # placeholder removed; see next line
    x1x2_norm3 = (x1_norm[:,None] * x2_norm[None,:])**3
    x2x2 = x2[:,:,None] * x2[:,None,:]
    identy = x2_norm2[:,None,None] * cp.eye(x2.shape[1])[None,:,:]
    term1 = x1_norm2[:,None,None,None] * (identy-x2x2)[None,:,:,:]
    x1x2 = x1[:,None,:,None] * x2[None,:,None,:] * (x1@x2.T)[:,:,None,None]
    x1x1 = (x1[:,None,:]*x1[:,:,None])[:,None,:,:] * (cp.ones(x2.shape)*(x2_norm2[:,None]))[None,:,None,:]
    term2 = x1x2 - x1x1
    out = (term1+term2)/(x1x2_norm3[:,:,None,None])
    return out

m, n, k, sigma2, l2, zeta = 50, 50, 50, 0.81, 0.01, 2
for i in range(10):
    x1 = cp.random.random([m, k])
    x2 = cp.random.random([n, k])
    x1_norm = cp.linalg.norm(x1, axis=1)
    x2_norm = cp.linalg.norm(x2, axis=1)
    t0 = time()
    out2 = fun_d2d_dx1dx2_4(x1, x2, x1_norm, x2_norm)
    print("2:", time()-t0)
This code is significantly faster: from 0.14 s down to about 0.6 ms (the first call in the output below is slower because CuPy compiles the kernel on first use). I think the easiest way is to run it on a GPU.
qzhu@cms CSP_BO (master) $ python 2.py
2: 0.2059469223022461
2: 0.0006415843963623047
2: 0.0004477500915527344
2: 0.0004792213439941406
2: 0.00045228004455566406
2: 0.00043511390686035156
2: 0.0004382133483886719
2: 0.00045180320739746094
2: 0.0004372596740722656
2: 0.0004565715789794922
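One caveat about these numbers (my note, not raised in the thread): CuPy kernel launches are asynchronous, so time() can return before the GPU work actually finishes. A fair measurement should synchronize around the timed region, e.g.:

import cupy as cp
from time import time

x1 = cp.random.random((50, 50))
x2 = cp.random.random((50, 50))

cp.cuda.Stream.null.synchronize()  # finish any pending GPU work first
t0 = time()
out = x1[:, None, :, None] * x2[None, :, None, :]
cp.cuda.Stream.null.synchronize()  # wait for the kernel before stopping the clock
print(time() - t0)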
@luckychen Do you have time to have a quick chat tomorrow? I want to ask you some general questions regarding the code design.
@yanxon Have you tried to optimize the code for force prediction?