qzhu2017 closed this issue 3 years ago.
@yanxon Can you reply to this message? If you are doing it now, I will wait. Otherwise, I will have a go today.
Will use multiprocessing for the kff calculation.
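A minimal sketch of what that could look like, assuming each kff entry can be computed independently (compute_one and pairs are hypothetical placeholders, not the actual RBF_mb.py API):

import numpy as np
from multiprocessing import Pool

def compute_one(pair):
    # placeholder for the real per-pair kff kernel
    x1, x2 = pair
    return np.dot(x1, x2)

if __name__ == "__main__":
    pairs = [(np.random.random(3), np.random.random(3)) for _ in range(8)]
    with Pool(processes=4) as pool:  # one worker per core
        results = pool.map(compute_one, pairs)
    print(results)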
Add a test script for d2d_dx1dx2. See if there is a way to speed up that loop only.
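For reference, a profile like the one below can be generated with cProfile (a sketch; train() is a placeholder for the real entry point):

import cProfile
import pstats

cProfile.run("train()", "kff.prof")
stats = pstats.Stats("kff.prof")
stats.sort_stats("tottime").print_stats(30)  # the "Ordered by: internal time" view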
19246690 function calls (19069682 primitive calls) in 343.335 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
599491 146.753 0.000 146.753 0.000 {built-in method numpy.core._multiarray_umath.c_einsum}
79 48.006 0.608 48.006 0.608 SO3.py:660(get_power_spectrum_components)
5 0.008 0.002 217.959 43.592 RBF_mb.py:324(k_total_with_stress)
3 0.002 0.001 57.388 19.129 RBF_mb.py:73(k_total)
5 0.215 0.043 207.679 41.536 RBF_mb.py:391(kff_many_with_stress)
2 0.076 0.038 51.672 25.836 RBF_mb.py:190(kff_many)
36367 0.191 0.000 258.473 0.007 RBF_mb.py:485(kff_single)
36367 20.359 0.001 258.282 0.007 derivatives.py:76(K_ff)
36367 10.278 0.000 111.655 0.003 derivatives.py:230(fun_d2D_dx1dx2)
36367 67.505 0.002 68.913 0.002 derivatives.py:158(fun_d2d_dx1dx2)
87294 7.762 0.000 11.020 0.000 derivatives.py:151(fun_dd_dx2)
72734 5.519 0.000 8.328 0.000 derivatives.py:141(fun_dd_dx1)
104004 2.472 0.000 3.668 0.000 derivatives.py:13(fun_D)
50927 1.002 0.000 2.529 0.000 derivatives.py:18(fun_dk_dD)
50927 0.414 0.000 9.513 0.000 derivatives.py:225(fun_dD_dx2)
14560 0.301 0.000 15.222 0.001 derivatives.py:114(K_ef)
36367 0.298 0.000 5.681 0.000 derivatives.py:220(fun_dD_dx1)
1075 0.071 0.000 0.322 0.000 derivatives.py:40(K_ee)
52002 0.585 0.000 0.893 0.000 RBF_mb.py:567(get_mask)
5 0.049 0.010 9.818 1.964 RBF_mb.py:341(kef_many_with_stress)
8 0.033 0.004 5.810 0.726 RBF_mb.py:265(kef_many)
14560 0.032 0.000 15.254 0.001 RBF_mb.py:465(kef_single)
7 0.003 0.000 0.356 0.051 RBF_mb.py:123(kee_many)
1075 0.002 0.000 0.323 0.000 RBF_mb.py:444(kee_single)
@luckychen I just managed to reduce the number of function calls to speed up the Kff function in 45737b2. The run time for each K_ff call has now been reduced from 7 ms to 5 ms. It would be good if we can push this further. Note that we don't need to recompute the einsum contraction path every time.
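For example (a minimal sketch with illustrative shapes, not the actual RBF_mb.py code), the contraction path can be computed once with np.einsum_path and then passed back through the optimize argument on every call:

import numpy as np

x1 = np.random.random((50, 30))
x2 = np.random.random((50, 30))

# search for a good contraction order once...
path, info = np.einsum_path("ik,jl->ijkl", x1, x2, optimize="optimal")

# ...and reuse it on every subsequent call instead of re-deriving it
out = np.einsum("ik,jl->ijkl", x1, x2, optimize=path)

Reusing the path would remove the repeated einsum_path calls visible in the profile below.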
37892451 function calls (37402408 primitive calls) in 262.315 seconds
36367 68.387 0.002 69.879 0.002 derivatives.py:155(fun_d2d_dx1dx2)
36367 19.821 0.001 186.563 0.005 derivatives.py:68(K_ff)
36367 10.782 0.000 114.682 0.003 derivatives.py:227(fun_d2D_dx1dx2)
50927 5.182 0.000 7.416 0.000 derivatives.py:148(fun_dd_dx2)
36367 2.835 0.000 4.347 0.000 derivatives.py:138(fun_dd_dx1)
104004 2.536 0.000 3.778 0.000 derivatives.py:8(fun_D)
50927 0.878 0.000 2.980 0.000 derivatives.py:13(fun_k)
14560 0.360 0.000 15.388 0.001 derivatives.py:111(K_ef)
50927 0.260 0.000 3.240 0.000 derivatives.py:20(fun_dk_dD)
14560 0.147 0.000 4.510 0.000 derivatives.py:222(fun_dD_dx2)
1075 0.074 0.000 0.320 0.000 derivatives.py:32(K_ee)
1 0.000 0.000 0.000 0.000 derivatives.py:5(<module>)
464214 43.115 0.000 43.115 0.000 {built-in method numpy.core._multiarray_umath.c_einsum}
62607 2.392 0.000 9.947 0.000 einsumfunc.py:706(einsum_path)
464150 1.636 0.000 82.479 0.000 einsumfunc.py:997(einsum)
import numpy as np
from time import time

def fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm):
    x1_norm3 = x1_norm**3
    x2_norm3 = x2_norm**3
    x2_norm2 = x2_norm**2
    tmp0 = np.ones(x2.shape)
    x1x1 = np.einsum("ijl,ik->ijkl", x1[:,None,:]*tmp0[None,:,:], x1)
    tmp1 = x1x1/x2_norm[None,:,None,None]
    x1x2 = np.einsum("ik,jl->ijkl", x1, x2/x2_norm3[:,None])
    tmp2 = np.einsum("ijkl,ij->ijkl", x1x2, x1@x2.T)
    out1 = (tmp1-tmp2)/(x1_norm3[:,None,None,None])
    tmp3 = np.eye(x2.shape[1])[None,:,:] - np.einsum('ij,ik->ijk',x2,x2)/x2_norm2[:,None,None] # n*d1*d2
    out2 = tmp3[None,:,:,:]/x1_norm[:,None,None,None]/x2_norm[None,:,None,None]
    return out2 - out1

t0 = time()
m, n, k, sigma2, l2, zeta = 50, 50, 50, 0.81, 0.01, 2
x1 = np.random.random([m, k])
x2 = np.random.random([n, k])
x1_norm = np.linalg.norm(x1, axis=1)
x2_norm = np.linalg.norm(x2, axis=1)
out = fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm)
print(time()-t0)
Added parallel execution in 8205efb.
@qzhu2017 Is K_ff quick enabled in training mode?
@yanxon K_ff quick is actually much slower than the regular Kff.
@qzhu2017 Here is the optimized code. I just rearranged all the terms. I checked the results by summing the 4D array; I think your code and this optimized code are equivalent.
import numpy as np
from time import time

np.random.seed(seed=13)

def fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm):
    x1_norm3 = x1_norm**3
    x2_norm3 = x2_norm**3
    x2_norm2 = x2_norm**2
    # For tmp1
    tmp0 = np.ones(x2.shape)
    x1x1 = (x1[:,None,:]*tmp0[None,:,:])[:,:,None,:]*(x1 / x1_norm3[:,None])[:,None,:,None]
    tmp1 = x1x1/x2_norm[None,:,None,None]
    # For tmp2
    x1x2 = (x1/x1_norm3[:,None])[:,None,:,None]*(x2/x2_norm3[:,None])[None,:,None,:]
    tmp2 = x1x2 * (x1@x2.T)[:,:,None,None]
    out1 = tmp1-tmp2
    tmp3 = np.eye(x2.shape[1])[None,:,:] - np.einsum('ij,ik->ijk',x2,x2)/x2_norm2[:,None,None] # n*d1*d2
    out2 = tmp3[None,:,:,:]/(x1_norm[:,None]*x2_norm[None,:])[:,:,None,None]
    out = out2 - out1
    return out

m, n, k, sigma2, l2, zeta = 50, 50, 50, 0.81, 0.01, 2
x1 = np.random.random([m, k])
x2 = np.random.random([n, k])
x1_norm = np.linalg.norm(x1, axis=1)
x2_norm = np.linalg.norm(x2, axis=1)
t0 = time()
out = fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm)
print(time()-t0)
print(np.sum(out))
On my computer, your original code runs in 0.07942 s, while the optimized code runs in 0.05811 s.
import numpy as np
from time import time
#np.random.seed(seed=13)

def fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm):
    x1_norm3 = x1_norm**3
    x2_norm3 = x2_norm**3
    x2_norm2 = x2_norm**2
    # For tmp1
    tmp0 = np.ones(x2.shape)
    x1x1 = (x1[:,None,:]*tmp0[None,:,:])[:,:,None,:]*(x1 / x1_norm3[:,None])[:,None,:,None]
    tmp1 = x1x1/x2_norm[None,:,None,None]
    # For tmp2
    x1x2 = (x1/x1_norm3[:,None])[:,None,:,None]*(x2/x2_norm3[:,None])[None,:,None,:]
    tmp2 = x1x2 * (x1@x2.T)[:,:,None,None]
    out1 = tmp1-tmp2
    tmp3 = np.eye(x2.shape[1])[None,:,:] - np.einsum('ij,ik->ijk',x2,x2)/x2_norm2[:,None,None] # n*d1*d2
    out2 = tmp3[None,:,:,:]/(x1_norm[:,None]*x2_norm[None,:])[:,:,None,None]
    out = out2 - out1
    return out

def fun_d2d_dx1dx2_old(x1, x2, x1_norm, x2_norm):
    x1_norm3 = x1_norm**3
    x2_norm3 = x2_norm**3
    x2_norm2 = x2_norm**2
    tmp0 = np.ones(x2.shape)
    x1x1 = np.einsum("ijl,ik->ijkl", x1[:,None,:]*tmp0[None,:,:], x1)
    tmp1 = x1x1/x2_norm[None,:,None,None]
    x1x2 = np.einsum("ik,jl->ijkl", x1, x2/x2_norm3[:,None])
    tmp2 = np.einsum("ijkl,ij->ijkl", x1x2, x1@x2.T)
    out1 = (tmp1-tmp2)/(x1_norm3[:,None,None,None])
    tmp3 = np.eye(x2.shape[1])[None,:,:] - np.einsum('ij,ik->ijk',x2,x2)/x2_norm2[:,None,None] # n*d1*d2
    out2 = tmp3[None,:,:,:]/x1_norm[:,None,None,None]/x2_norm[None,:,None,None]
    return out2 - out1

m, n, k, sigma2, l2, zeta = 50, 50, 50, 0.81, 0.01, 2
for i in range(10):
    x1 = np.random.random([m, k])
    x2 = np.random.random([n, k])
    x1_norm = np.linalg.norm(x1, axis=1)
    x2_norm = np.linalg.norm(x2, axis=1)
    t0 = time()
    out1 = fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm)
    print("1:", time()-t0)
    t0 = time()
    out2 = fun_d2d_dx1dx2_old(x1, x2, x1_norm, x2_norm)
    print("2:", time()-t0)
    print(np.sum(np.abs(out1-out2))<1e-4)
    #print(np.sum(out))
1: 0.18059730529785156
2: 0.19556140899658203
True
1: 0.1440131664276123
2: 0.19613170623779297
True
1: 0.14458227157592773
2: 0.19655108451843262
True
1: 0.14286065101623535
2: 0.19550585746765137
True
1: 0.14302682876586914
2: 0.19610929489135742
True
1: 0.14320039749145508
2: 0.19067859649658203
True
1: 0.1417226791381836
2: 0.19315719604492188
True
1: 0.14258503913879395
2: 0.19188952445983887
True
1: 0.14194321632385254
2: 0.19233441352844238
True
1: 0.14217805862426758
2: 0.19250869750976562
True
On average, the timings are about 0.142 s vs. 0.192 s.
@qzhu2017 This one is even faster. I believe this is about the best it can do. The idea is to multiply the lower-dimensional quantities first and only then broadcast to the higher dimensions. For simplicity, say a = 10, b = 2, and A is a 2x2 matrix; computing (ab)A is always faster than aAb.
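A toy illustration of that point (my example; timings are machine-dependent): multiplying the scalars together first touches the large array only once, while interleaving them creates two full-size temporaries.

import numpy as np
from timeit import timeit

a, b = 10.0, 2.0
A = np.random.random((2000, 2000))

t1 = timeit(lambda: (a * b) * A, number=100)  # one pass over A
t2 = timeit(lambda: a * A * b, number=100)    # two passes over A
print(t1, t2)  # t1 is typically smaller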
import numpy as np
from time import time

np.random.seed(seed=13)

def fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm):
    x1_norm3 = x1_norm**3
    x2_norm3 = x2_norm**3
    x2_norm2 = x2_norm**2
    # For tmp1
    tmp0 = np.ones(x2.shape)
    tmp1 = (x1[:,None,:]*(tmp0/x2_norm[:,None])[None,:,:])[:,:,None,:]*(x1/x1_norm3[:,None])[:,None,:,None]
    # For tmp2
    x1x2 = (x1/x1_norm3[:,None])[:,None,:,None]*(x2/x2_norm3[:,None])[None,:,None,:]
    tmp2 = x1x2 * (x1@x2.T)[:,:,None,None]
    out1 = tmp1-tmp2
    tmp3 = np.eye(x2.shape[1])[None,:,:] - np.einsum('ij,ik->ijk',x2,x2/x2_norm2[:,None]) # n*d1*d2
    out2 = tmp3[None,:,:,:]/(x1_norm[:,None]*x2_norm[None,:])[:,:,None,None]
    out = out2 - out1
    return out

m, n, k, sigma2, l2, zeta = 50, 50, 50, 0.81, 0.01, 2
x1 = np.random.random([m, k])
x2 = np.random.random([n, k])
x1_norm = np.linalg.norm(x1, axis=1)
x2_norm = np.linalg.norm(x2, axis=1)
t0 = time()
out = fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm)
print(time()-t0)
print(np.sum(out))
@yanxon Great. I will update the code later.
@luckychen
import numpy as np
from time import time

def fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm):
    x1_norm3 = x1_norm**3
    x2_norm3 = x2_norm**3
    x2_norm2 = x2_norm**2
    tmp0 = np.ones(x2.shape)
    t0 = time()  # 0.011465072631835938
    x1x1 = np.einsum("ijl,ik->ijkl", x1[:,None,:]*tmp0[None,:,:], x1)
    print(time()-t0)
    t0 = time()  # 0.008262157440185547
    tmp1 = x1x1/x2_norm[None,:,None,None]
    print(time()-t0)
    t0 = time()  # 0.011075496673583984
    x1x2 = np.einsum("ik,jl->ijkl", x1, x2/x2_norm3[:,None])
    print(time()-t0)
    t0 = time()  # 0.01009225845336914
    tmp2 = np.einsum("ijkl,ij->ijkl", x1x2, x1@x2.T)
    print(time()-t0)
    t0 = time()  # 0.015599489212036133
    out1 = (tmp1-tmp2)/(x1_norm3[:,None,None,None])
    print(time()-t0)
    t0 = time()  # 0.0006268024444580078
    tmp3 = np.eye(x2.shape[1])[None,:,:] - np.einsum('ij,ik->ijk',x2,x2)/x2_norm2[:,None,None] # n*d1*d2
    print(time()-t0)
    t0 = time()  # 0.014022350311279297
    out2 = tmp3[None,:,:,:]/x1_norm[:,None,None,None]/x2_norm[None,:,None,None]
    print(time()-t0)
    t0 = time()  # 0.00873422622680664
    out = out2-out1
    print(time()-t0)
    return out

m, n, k, sigma2, l2, zeta = 50, 50, 50, 0.81, 0.01, 2
x1 = np.random.random([m, k])
x2 = np.random.random([n, k])
x1_norm = np.linalg.norm(x1, axis=1)
x2_norm = np.linalg.norm(x2, axis=1)
t0 = time()  # 0.08103704452514648
out = fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm)
print(time()-t0)
@luckychen @yanxon I just tried to reorganize the code and compared it with the previous one. The most expensive steps are the lines that assemble out1 and out2 (see the per-line timings recorded in the comments above). Probably this is close to the upper limit within the Python framework.
import numpy as np
from time import time

def fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm):
    x1_norm3 = x1_norm**3
    x2_norm3 = x2_norm**3
    x2_norm2 = x2_norm**2
    # For tmp1
    tmp0 = np.ones(x2.shape)
    x1x1 = (x1[:,None,:]*tmp0[None,:,:])[:,:,None,:]*(x1 / x1_norm3[:,None])[:,None,:,None]
    tmp1 = x1x1/x2_norm[None,:,None,None]
    # For tmp2
    x1x2 = (x1/x1_norm3[:,None])[:,None,:,None]*(x2/x2_norm3[:,None])[None,:,None,:]
    tmp2 = x1x2 * (x1@x2.T)[:,:,None,None]
    out1 = tmp1-tmp2
    # For tmp3 / out2
    tmp3 = np.eye(x2.shape[1])[None,:,:] - np.einsum('ij,ik->ijk',x2,x2)/x2_norm2[:,None,None] # n*d1*d2
    out2 = tmp3[None,:,:,:]/(x1_norm[:,None]*x2_norm[None,:])[:,:,None,None]
    out = out2 - out1
    return out

def fun_d2d_dx1dx2_4(x1, x2, x1_norm, x2_norm):  # total 0.14 s
    x1_norm2 = x1_norm**2
    x2_norm2 = x2_norm**2
    x1x2_norm3 = (x1_norm[:,None] * x2_norm[None,:])**3
    # first term
    #x2x2 = np.einsum('ij,ik->ijk',x2,x2)
    #term1 = np.einsum('i,jkl->ijkl', x1_norm2, identy-x2x2)
    #identy = np.einsum("i,jk->ijk", x2_norm2, np.eye(x2.shape[1]))
    x2x2 = x2[:,:,None] * x2[:,None,:]
    identy = x2_norm2[:,None,None] * np.eye(x2.shape[1])[None,:,:]
    term1 = x1_norm2[:,None,None,None] * (identy-x2x2)[None,:,:,:]
    # second term
    #x1x2 = np.einsum("ij,ik,jl->ijkl", x1@x2.T, x1, x2) #, optimize='greedy')
    #x1x1 = np.einsum("jl,ikl->ijkl", x2_norm2[:,None]*np.ones(x2.shape), x1[:,:,None]*x1[:,None,:])
    x1x2 = x1[:,None,:,None] * x2[None,:,None,:] * (x1@x2.T)[:,:,None,None]  # 0.04 s
    x1x1 = (x1[:,None,:]*x1[:,:,None])[:,None,:,:] * (np.ones(x2.shape)*(x2_norm2[:,None]))[None,:,None,:]  # 0.02 s
    term2 = x1x2 - x1x1  # 0.02 s
    out = (term1+term2)/(x1x2_norm3[:,:,None,None])  # 0.04 s
    return out

m, n, k, sigma2, l2, zeta = 50, 50, 50, 0.81, 0.01, 2
for i in range(10):
    x1 = np.random.random([m, k])
    x2 = np.random.random([n, k])
    x1_norm = np.linalg.norm(x1, axis=1)
    x2_norm = np.linalg.norm(x2, axis=1)
    t0 = time()
    out1 = fun_d2d_dx1dx2(x1, x2, x1_norm, x2_norm)
    print("1:", time()-t0)
    t0 = time()
    out2 = fun_d2d_dx1dx2_4(x1, x2, x1_norm, x2_norm)
    print("2:", time()-t0)
    print(np.sum(np.abs(out1-out2))<1e-4)
@qzhu2017 @yanxon Just checking: does that mean the function with the reorganized code (fun_d2d_dx1dx2_4) runs slower than the original one? I see the einsum lines cost 0.010 and 0.011 seconds while the new code costs 0.04 and 0.02 seconds. I am trying to understand the einsum function in numpy. Based on my understanding, it calls a fully optimized library (maybe optimized with SIMD) for 2-D and 3-D inputs/outputs, but not for 4-D. So converting the 4-D operations into multiple 3-D operations and using einsum may be a good way to improve performance, because einsum would then call the optimized library for the 3-D operations.
@luckychen The results are from different CPUs. In my case, both codes need about 0.14 s; the reorganized version is faster by perhaps 0.003 s. I will take your suggestion and see whether I can reshape the matrices from 4D to 3D arrays.
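One way that reshaping could work (a sketch of the idea, not tested in the actual code): the 4D outer product "ik,jl->ijkl" is just a 2D outer product of the flattened inputs, reshaped back at the end.

import numpy as np

m, n, k = 50, 50, 50
x1 = np.random.random((m, k))
x2 = np.random.random((n, k))

out4d = np.einsum("ik,jl->ijkl", x1, x2)

# same result via one 2D outer product + reshape + axis reordering
out2d = np.outer(x1.ravel(), x2.ravel()).reshape(m, k, n, k).transpose(0, 2, 1, 3)
print(np.allclose(out4d, out2d))  # True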
@luckychen @yanxon I just tried to run it with cupy (a GPU version of numpy).
import cupy as cp
from time import time

def fun_d2d_dx1dx2_4(x1, x2, x1_norm, x2_norm):
    x1_norm2 = x1_norm**2
    x2_norm2 = x2_norm**2
    x1x2_nor3 = None  # placeholder removed; see next line
    x1x2_norm3 = (x1_norm[:,None] * x2_norm[None,:])**3
    x2x2 = x2[:,:,None] * x2[:,None,:]
    identy = x2_norm2[:,None,None] * cp.eye(x2.shape[1])[None,:,:]
    term1 = x1_norm2[:,None,None,None] * (identy-x2x2)[None,:,:,:]
    x1x2 = x1[:,None,:,None] * x2[None,:,None,:] * (x1@x2.T)[:,:,None,None]
    x1x1 = (x1[:,None,:]*x1[:,:,None])[:,None,:,:] * (cp.ones(x2.shape)*(x2_norm2[:,None]))[None,:,None,:]
    term2 = x1x2 - x1x1
    out = (term1+term2)/(x1x2_norm3[:,:,None,None])
    return out

m, n, k, sigma2, l2, zeta = 50, 50, 50, 0.81, 0.01, 2
for i in range(10):
    x1 = cp.random.random([m, k])
    x2 = cp.random.random([n, k])
    x1_norm = cp.linalg.norm(x1, axis=1)
    x2_norm = cp.linalg.norm(x2, axis=1)
    t0 = time()
    out2 = fun_d2d_dx1dx2_4(x1, x2, x1_norm, x2_norm)
    print("2:", time()-t0)
This code is significantly faster: from 0.14 s down to about 0.6 ms (the first call in the output below is slower because CuPy compiles the kernel on first use). I think the easiest way is to run it on a GPU.
qzhu@cms CSP_BO (master) $ python 2.py
2: 0.2059469223022461
2: 0.0006415843963623047
2: 0.0004477500915527344
2: 0.0004792213439941406
2: 0.00045228004455566406
2: 0.00043511390686035156
2: 0.0004382133483886719
2: 0.00045180320739746094
2: 0.0004372596740722656
2: 0.0004565715789794922
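One caveat about these numbers (my note, not raised in the thread): CuPy kernel launches are asynchronous, so time() can return before the GPU work actually finishes. A fair measurement should synchronize around the timed region, e.g.:

import cupy as cp
from time import time

x1 = cp.random.random((50, 50))
x2 = cp.random.random((50, 50))

cp.cuda.Stream.null.synchronize()  # finish any pending GPU work first
t0 = time()
out = x1[:, None, :, None] * x2[None, :, None, :]
cp.cuda.Stream.null.synchronize()  # wait for the kernel before stopping the clock
print(time() - t0)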
@luckychen Do you have time to have a quick chat tomorrow? I want to ask you some general questions regarding the code design.
@yanxon Have you tried to optimize the code for force prediction?