Parallelization of fast_csr_mv

openopt / copt

A Python library for mathematical optimization

http://openo.pt/copt

Other

135 stars 35 forks source link

Parallelization of fast_csr_mv #67

Open GeoffNN opened 4 years ago

GeoffNN commented 4 years ago

Parallelization for the sparse matrix multiplication, row-wise. Cf this thread, which announces nice speed ups.

fabianp commented 4 years ago

I'm running the examples code with and without this patch, and the 12 CPUs I have are all constantly at 100% , and don't see any speed improvements after using this patch. It seems that numba (or llvm) is by default already deciding to parallelize some parts of the code

fabianp commented 4 years ago

I'm going to run a couple more benchmarks, but if the performance is the same I would be more inclined to let numba decide which parts to parallelize, as he'll likely do a better job than us at that (thinking for example at nested parallelizable loops) ;-)

fabianp commented 4 years ago

and the parallelization I observe definitely comes from numba, as I can get it down to use just one CPU with the environment variable NUMBA_NUM_THREADS=1

GeoffNN commented 4 years ago

For now, the only parallelization that's done in this part of the code base is for sampling the batches, once per epoch. Is it possible that the 100% on the CPUs (that I also observe on my machines) is because there's no deallocation between two calls to that function?

What happens if we remove parallel=True for sampling the batches, and put it for the matrix multiplication?

fabianp commented 4 years ago

OK I think I found why I was seeing all CPUs being used. It's because the bottleneck of the algorithm is not in the algorithm itself but in computing the fw_gap used for reporting. This uses the full gradient, and so Numpy's matrix-vector routines that fire up all CPUs

GeoffNN commented 3 years ago

Does that mean that it would be nice in practice ? Since for applications, you won't be computing the gap very often.