Open claudio-tw opened 1 year ago
I should also add that I have tried to reproduce the issue "synthetically". In the script profiling/lar_comparison.py, I profile calls to the functions implementing Least Angle Regression in three different ways:

- a plain call to lar.solve passing NumPy arrays;
- a call to lar.solve passing NumPy arrays that have just been converted from CuPy arrays;
- a call to the equivalent function in pyHSICLasso.

This did not reproduce the issue that I am facing with profiling/select_profile.py. The three profiles below correspond to these three calls.
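For reference, a minimal sketch of such a profiling harness. The `solve` below is a stand-in least-squares call, not hisel's actual `lar.solve` (which is not reproduced here); it only illustrates how the three scenarios can be timed with `cProfile`:

```python
import cProfile
import io
import pstats

import numpy as np

def solve(x, y):
    # Stand-in for hisel's lar.solve (hypothetical here): a single
    # least-squares solve, just enough to make the harness runnable.
    return np.linalg.lstsq(x, y, rcond=None)[0]

def profile_call(label, fn, *args):
    # Profile one call and print the most expensive functions by
    # cumulative time, mirroring the tables pasted in this issue.
    prof = cProfile.Profile()
    prof.enable()
    fn(*args)
    prof.disable()
    stream = io.StringIO()
    pstats.Stats(prof, stream=stream).sort_stats("cumulative").print_stats(10)
    print(label)
    print(stream.getvalue())

x = np.random.default_rng(0).standard_normal((200, 50))
y = x @ np.ones(50)

# 1) plain call with NumPy arrays
profile_call("numpy arrays", solve, x, y)
# 2) arrays freshly converted from CuPy would come through cupy.asnumpy();
#    np.ascontiguousarray stands in for that host-side copy here
profile_call("converted arrays", solve, np.ascontiguousarray(x), y)
```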
```
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 15.181 15.181 {built-in method builtins.exec}
1 0.000 0.000 15.181 15.181 <string>:1(<module>)
1 0.000 0.000 15.181 15.181 /home/tw.com/claudio.bellani/projects/hisel/profiling/lar_comparison.py:28(run_hisel)
1 15.045 15.045 15.181 15.181 /home/tw.com/claudio.bellani/projects/hisel/hisel/lar/lar.py:9(solve)
1413 0.005 0.000 0.106 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
100 0.000 0.000 0.083 0.001 <__array_function__ internals>:177(lstsq)
100 0.078 0.001 0.083 0.001 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/numpy/linalg/linalg.py:2150(lstsq)
100 0.001 0.000 0.031 0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_lil.py:321(__setitem__)
100 0.002 0.000 0.030 0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_index.py:96(__setitem__)
100 0.001 0.000 0.009 0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_index.py:13(_broadcast_arrays)
100 0.001 0.000 0.008 0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_lil.py:301(_set_arrayXarray)
100 0.000 0.000 0.008 0.000 <__array_function__ internals>:177(broadcast_arrays)
```
```
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 15.746 15.746 {built-in method builtins.exec}
1 0.001 0.001 15.746 15.746 <string>:1(<module>)
1 0.000 0.000 15.745 15.745 /home/tw.com/claudio.bellani/projects/hisel/profiling/lar_comparison.py:31(run_hisel_from_cupy_arrays)
1 15.492 15.492 15.644 15.644 /home/tw.com/claudio.bellani/projects/hisel/hisel/lar/lar.py:9(solve)
1413 0.005 0.000 0.122 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
2 0.000 0.000 0.101 0.050 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/cupy/__init__.py:795(asnumpy)
2 0.101 0.050 0.101 0.050 {method 'get' of 'cupy._core.core._ndarray_base' objects}
100 0.000 0.000 0.099 0.001 <__array_function__ internals>:177(lstsq)
100 0.094 0.001 0.098 0.001 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/numpy/linalg/linalg.py:2150(lstsq)
100 0.001 0.000 0.031 0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_lil.py:321(__setitem__)
100 0.003 0.000 0.030 0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_index.py:96(__setitem__)
100 0.001 0.000 0.009 0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_index.py:13(_broadcast_arrays)
100 0.000 0.000 0.009 0.000 <__array_function__ internals>:177(broadcast_arrays)
```
```
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 16.587 16.587 {built-in method builtins.exec}
1 0.000 0.000 16.587 16.587 <string>:1(<module>)
1 0.023 0.023 16.587 16.587 /home/tw.com/claudio.bellani/projects/hisel/profiling/lar_comparison.py:36(run_pyhsiclasso)
1 9.619 9.619 16.564 16.564 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/pyHSICLasso/nlars.py:17(nlars)
1713 6.475 0.004 6.770 0.004 {built-in method numpy.core._multiarray_umath.implement_array_function}
503 0.003 0.000 6.475 0.013 <__array_function__ internals>:177(dot)
100 0.000 0.000 0.284 0.003 <__array_function__ internals>:177(solve)
100 0.279 0.003 0.283 0.003 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/numpy/linalg/linalg.py:306(solve)
100 0.049 0.000 0.049 0.000 {built-in method builtins.min}
101 0.029 0.000 0.045 0.000 {built-in method builtins.sorted}
100 0.042 0.000 0.042 0.000 {built-in method builtins.max}
100 0.001 0.000 0.031 0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_lil.py:321(__setitem__)
100 0.002 0.000 0.030 0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_index.py:96(__setitem__)
50000 0.016 0.000 0.016 0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/pyHSICLasso/nlars.py:132(<lambda>)
100 0.001 0.000 0.008 0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_lil.py:301(_set_arrayXarray)
```
hisel implements the HSIC Lasso algorithm of Yamada, M. et al. (2012). The algorithm consists of two parts: computation of Gram matrices and Least Angle Regression. I implemented the computation of the Gram matrices in a more vectorized way than other existing implementations of the same algorithm (such as pyHSICLasso), which yields some performance benefit. On top of that, I added GPU acceleration via CuPy. This gives a nice speedup: roughly 3x on the computation of the Gram matrices (see the screenshot below). However, when I assess the performance of the overall algorithm (Gram + LAR), this speedup seems to be lost, and I do not understand why.

These are the top most expensive function calls of the CPU run, followed by the most expensive function calls of the GPU run (both profiler listings were attached as screenshots):
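To make the two paths concrete, here is a minimal sketch of a vectorized Gram computation and the GPU-to-host move that precedes the LAR step. It assumes a Gaussian kernel and falls back to NumPy when CuPy or a GPU is unavailable; it is an illustration, not hisel's actual `apply_feature_map`:

```python
import numpy as np

try:  # use CuPy when a GPU is available, otherwise fall back to NumPy
    import cupy as xp
    xp.zeros(1)  # fails early if CuPy is installed but no device is usable
except Exception:
    xp = np

def gaussian_gram(x, sigma=1.0):
    # Vectorized Gaussian (RBF) Gram matrix: one broadcasted pairwise
    # squared-distance computation instead of a Python double loop.
    sq = xp.sum(x * x, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return xp.exp(-d2 / (2.0 * sigma ** 2))

x = xp.asarray(np.random.default_rng(0).standard_normal((100, 5)))
g = gaussian_gram(x)

# On the GPU path, the Gram matrices are moved back to host memory
# (cupy.asnumpy) before being handed to the LAR solver.
g_host = xp.asnumpy(g) if hasattr(xp, "asnumpy") else np.asarray(g)
```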
The computation of the Gram matrices is done via the call to apply_feature_map. You can see that the CPU run took 15.37 seconds, whereas the GPU run took 5.15 seconds. However, the overall algorithm (Gram + LAR) took 63 seconds on CPU and 100 seconds on GPU.

The GPU run has to move tensors back from the GPU to the CPU, and this causes some overhead: you can see it in the 48 calls to CuPy's asnumpy. But those 48 calls collectively took 2.57 seconds, which does not explain the loss in performance. What does explain it is the call to lar.solve: it took 47.17 seconds after computing the Gram matrices on CPU, and 94.14 seconds after computing them on GPU. I do not understand this: the function should be doing exactly the same thing in either run, irrespective of whether the Gram matrices passed to it were computed on CPU or on GPU (the move from GPU to CPU happens before the call to lar.solve).

Do you have an idea of why this issue occurs? Can you help resolve the bottleneck in lar.solve? The profiling reported here was obtained using the script profiling/select_profile.py.
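One thing that may be worth ruling out (this is an assumption, not a diagnosis): arrays that have passed through cupy.asnumpy can differ from natively computed ones in dtype or memory layout, and such differences can silently change the performance of downstream NumPy/SciPy calls inside lar.solve even when the values are identical. A hypothetical diagnostic to compare the two paths just before the lar.solve call:

```python
import numpy as np

def describe(label, a):
    # Print the properties that can change performance of downstream
    # NumPy/SciPy calls even when the array values are identical.
    print(label, a.dtype,
          "C:", a.flags["C_CONTIGUOUS"],
          "F:", a.flags["F_CONTIGUOUS"],
          "owndata:", a.flags["OWNDATA"],
          "strides:", a.strides)

cpu = np.random.default_rng(0).standard_normal((100, 100))
# Stand-in for an array converted from CuPy via cupy.asnumpy();
# in the real script, inspect the actual Gram matrices instead.
gpu_converted = np.ascontiguousarray(cpu.astype(np.float64))

describe("cpu path:", cpu)
describe("gpu path:", gpu_converted)
# If dtypes or layouts differ (e.g. float32 vs float64, or a
# Fortran-ordered buffer), lar.solve can take a much slower code path.
assert cpu.dtype == gpu_converted.dtype
```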