Open GeoffNN opened 4 years ago
I'm running the example code with and without this patch. All 12 of my CPUs are constantly at 100% in both cases, and I don't see any speed improvement from the patch. It seems that numba (or LLVM) already decides by default to parallelize some parts of the code.
I'm going to run a couple more benchmarks, but if the performance is the same I would be more inclined to let numba decide which parts to parallelize, as it will likely do a better job at that than we would (think, for example, of nested parallelizable loops) ;-)
And the parallelization I observe definitely comes from numba, since I can get it down to a single CPU by setting the environment variable NUMBA_NUM_THREADS=1.
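For anyone who wants to reproduce the single-CPU run from inside a script rather than from the shell, here is a minimal sketch. It assumes the variable is set before numba is imported anywhere in the process, since numba reads it at import time; the `n_threads` name and the fallback branch for machines without numba are only there for illustration.

```python
import os

# Assumption: nothing has imported numba yet in this process.
# numba reads NUMBA_NUM_THREADS once, when it is first imported.
os.environ["NUMBA_NUM_THREADS"] = "1"

try:
    import numba

    # numba.config mirrors the environment variable read at import time.
    n_threads = numba.config.NUMBA_NUM_THREADS
except ImportError:
    # numba is not installed here; just report the requested value.
    n_threads = int(os.environ["NUMBA_NUM_THREADS"])

print("numba thread limit:", n_threads)
```

Note that this only limits numba's own thread pool. Threads started by other libraries are not affected by this variable.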
For now, the only parallelization that's done in this part of the code base is for sampling the batches, once per epoch. Is it possible that the 100% on the CPUs (that I also observe on my machines) is because there's no deallocation between two calls to that function?
What happens if we remove parallel=True from the batch sampling and use it for the matrix multiplication instead?
OK, I think I found why I was seeing all CPUs being used. The bottleneck is not in the algorithm itself but in computing the fw_gap used for reporting. This uses the full gradient, and therefore NumPy's matrix-vector routines, which fire up all CPUs.
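To make the cost of the gap concrete, here is a rough sketch of a Frank-Wolfe gap computation. It is not the code from this repository: the least-squares objective, the L1-ball constraint of radius `alpha`, and the name `fw_gap_l1` are all assumptions chosen for illustration. The point is that the gap needs the full gradient, which means two dense matrix-vector products that NumPy typically hands off to a multithreaded BLAS.

```python
import numpy as np


def fw_gap_l1(A, b, x, alpha):
    """Frank-Wolfe gap for 0.5 * ||A x - b||^2 over the L1 ball of radius alpha.

    Hypothetical helper for illustration only.
    """
    # Full gradient: two matrix-vector products over the whole dataset.
    grad = A.T @ (A @ x - b)
    # gap = max_{||s||_1 <= alpha} <grad, x - s>
    #     = <grad, x> + alpha * ||grad||_inf
    return grad @ x + alpha * np.max(np.abs(grad))


A = np.eye(2)
b = np.array([1.0, 0.0])

gap_start = fw_gap_l1(A, b, np.zeros(2), alpha=1.0)           # far from the optimum
gap_opt = fw_gap_l1(A, b, np.array([1.0, 0.0]), alpha=1.0)    # at the optimum
print(gap_start, gap_opt)
```

If the gap is only evaluated once every few epochs for reporting, this cost largely disappears from the timings.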
Does that mean the patch would still be useful in practice? In applications, you won't be computing the gap very often.
This parallelizes the sparse matrix multiplication row-wise. Cf. this thread, which reports nice speed-ups.
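As an illustration of what row-wise parallelization of a sparse product can look like, here is a minimal sketch of a CSR matrix-vector product using numba's `prange`. This is not the code from the patch: the function name `csr_matvec` and the tiny test matrix are made up for the example, and the fallback branch exists only so the snippet still runs (serially) where numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback so the sketch still runs without numba (no parallelism then).
    prange = range

    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


@njit(parallel=True)
def csr_matvec(data, indices, indptr, x):
    """Compute A @ x for a CSR matrix, one row per parallel iteration."""
    n_rows = indptr.shape[0] - 1
    out = np.zeros(n_rows, dtype=data.dtype)
    # Rows are independent, so the outer loop can be split across threads.
    for i in prange(n_rows):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        out[i] = acc
    return out


# CSR representation of [[1, 0, 2], [0, 0, 3], [4, 5, 0]]
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
indices = np.array([0, 2, 2, 0, 1])
indptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 2.0, 3.0])

result = csr_matvec(data, indices, indptr, x)  # mathematically [7, 9, 14]
print(result)
```

Each row writes to its own entry of `out`, so no synchronization between threads is needed. Whether this beats the serial version depends on the number of rows and on how evenly the nonzeros are spread across them.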