non-correct sparse matrix vector result

jaak-s commented 9 years ago

Following code gives different results from standard matrix product:

addprocs(8)
using ParallelSparseMatMul

m,n,p = 100000,300000,.001
A  = sprandn(m,n,p)
b1 = randn(n)

S = share(A)
x1 = share(b1)

u1 = A*b1;
v1 = S*x1;
norm(u1 - v1)
max_error = findmax(abs(u1 - v1))
u1[max_error[2]], v1[max_error[2]]

Where I get

julia> norm(u1 - v1)
742.9402897105464

julia> max_error = findmax(abs(u1 - v1))
(26.820280859004203,67035)

julia> u1[max_error[2]], v1[max_error[2]]
(19.608438476148034,-7.21184238285617)

Julia version is v0.3.3. Maybe it is connected to following warning that is printed when the package is loaded:

julia> using ParallelSparseMatMul
Warning: could not import Base.A_mul_B into ParallelSparseMatMul

Also the problem only occurs if there are multiple cores used. If single core is used then the error is zero.

madeleineudell commented 9 years ago

The problem here is (almost certainly) that while shared arrays lock on writes, there is no atomic read-write lock, so this line is not thread safe:

y[rv[k]] += nzv[k]*alphax

madeleineudell commented 9 years ago

The problem with += is if

Proc 1 reads y[1] = 1
Proc 2 reads y[1] = 1
Proc 1 wants to add 1 to y[1], so sets y[1] = 2
Proc 2 wants to add 1 to y[1], so sets y[1] = 2

...whereas the correct answer (1+1+1) is 3.

This means multiplication (but not transpose multiplication, which doesn't use +=) of shared sparse matrices will not be thread safe.

The way to fix it is to introduce a separate result vector y for each core and to add them up in serial at the end.

sbromberger commented 9 years ago

See https://github.com/madeleineudell/ParallelSparseMatMul.jl/pull/7#issuecomment-138123176

madeleineudell / ParallelSparseMatMul.jl

non-correct sparse matrix vector result #6