madeleineudell / ParallelSparseRegression.jl


Parallel execution may be slower than serial #3

Open · ingenieroariel opened this issue 10 years ago

ingenieroariel commented 10 years ago

@madeleineudell Thanks for this package. Here is the output of my initial testing:

Sample program:

using ParallelSparseRegression

# Problem data: a 2048 x 1024 sparse matrix with 10% nonzero entries
m, n, p = 2048, 1024, .1
A = sprand(m, n, p)
x0 = Base.shmem_randn(n)   # coefficients used to generate b, stored in shared memory
b = A * x0                 # right-hand side consistent with x0

# Solver parameters
rho = 1
lambda = 1
quiet = false
maxiters = 100

params = Params(rho, quiet, maxiters)

# Lasso
@time z_lasso = lasso(A, b, lambda; params=params)

Running the program above with different numbers of worker processes (via addprocs) gives the following timings:

Output without addprocs:

1000 : 1.76e+00 1.27e-01 5.54e-03 4.09e+01 elapsed time: 24.422318823 seconds (6440755392 bytes allocated)

Output with addprocs(3):

1000 : 2.15e+00 1.12e-01 6.07e-03 4.65e+01 elapsed time: 90.979009048 seconds (12805856436 bytes allocated)

Output with addprocs(7):

1000 : 1.75e+00 1.47e-01 5.74e-03 4.21e+01 elapsed time: 228.324713722 seconds (28927210844 bytes allocated)

Full output with values for every iteration:

https://gist.github.com/ingenieroariel/9095001
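
For reference, a minimal sketch of how the worker counts were presumably varied between runs (the filename lasso_test.jl is hypothetical, and this assumes the standard addprocs workflow in which workers are added before the package is loaded):

addprocs(3)                      # spawn 3 additional worker processes
using ParallelSparseRegression   # make the package available on all processes
include("lasso_test.jl")         # hypothetical file holding the script above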

madeleineudell commented 10 years ago

@ingenieroariel, thanks for running those tests. My guess is that the slowdown is caused by repeatedly allocating shared memory, which is a relatively slow operation. The code in prox.jl, admm.jl, and possibly IterativeSolvers' lsqr.jl will have to be modified to overwrite previously allocated memory rather than allocating new memory. For example:

- right now we call lsqr, when we should call the in-place lsqr!;
- lsqr may call A_mul_B instead of the in-place A_mul_B!;
- the various prox functions don't yet overwrite their inputs.

We may even want to do in-place summation: if a and b are shared arrays, a + b is an ordinary (non-shared) array, so a new shared array will need to be allocated if we multiply a + b by a shared matrix. We can do better by accumulating b into a's backing array (a.s += b) and then multiplying a by the shared matrix.
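
To make the in-place idea concrete, here is a minimal sketch of the two patterns, assuming the Julia-0.3-era A_mul_B! method for sparse matrix-vector products and the .s field that exposes a SharedArray's local backing Array (all variable names are illustrative):

m, n = 2048, 1024
A = sprand(m, n, 0.1)
x = Base.shmem_randn(n)

# Allocating pattern: every call creates a fresh output array, and even
# when x lives in shared memory the result is an ordinary Array.
y = A * x

# In-place pattern: allocate a shared output buffer once, then overwrite
# it on every iteration with no new allocation.
y = Base.shmem_fill(0.0, m)
A_mul_B!(y.s, A, x.s)            # overwrites y's backing array with A*x

# In-place summation: a + b would allocate an ordinary Array, so instead
# accumulate b into a's backing array before the multiply.
a = Base.shmem_randn(n)
b = Base.shmem_randn(n)
for i in 1:length(a)
    a.s[i] += b.s[i]             # elementwise, no intermediate array
end
A_mul_B!(y.s, A, a.s)            # matrix-vector product reusing y's buffer

The same pattern would apply inside prox.jl and admm.jl: preallocate each iterate and residual once, then thread those buffers through lsqr! and the prox functions instead of returning fresh arrays on every iteration.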