Performance bottleneck in Least Angle Regression

hisel implements the HSIC Lasso algorithm of Yamada, M. et al. (2012). The algorithm consists of two parts: computation of Gram matrices and Least Angle Regression. I implemented the computation of Gram matrices in a more vectorized way than other existing implementations of the same algorithm (like pyHSICLasso). This yields some performance benefits. On top of that, I supported GPU acceleration using CuPy. This gives nice speedups - see the screenshot below. GPU-acceleration gives roughly 3x speedup on the computation of Gram matrices. However, when I assess the performance of the overall algo (Gram + LAR), it seems that this speedup is lost. I do not understand why. These are the top most expensive function calls of the CPU run:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   63.916   63.916 {built-in method builtins.exec}
        1    0.000    0.000   63.916   63.916 <string>:1(<module>)
        1    0.000    0.000   63.916   63.916 /home/tw.com/claudio.bellani/projects/hisel/profiling/select_profile.py:89(run_on_cpu)
        1    0.000    0.000   63.916   63.916 /home/tw.com/claudio.bellani/projects/hisel/profiling/select_profile.py:95(_run)
        1    0.001    0.001   63.914   63.914 /home/tw.com/claudio.bellani/projects/hisel/hisel/select.py:191(select)
        1    0.033    0.033   63.913   63.913 /home/tw.com/claudio.bellani/projects/hisel/hisel/select.py:138(projection_matrix)
        1    0.675    0.675   63.861   63.861 /home/tw.com/claudio.bellani/projects/hisel/hisel/select.py:410(_run)
        1   47.714   47.714   47.741   47.741 /home/tw.com/claudio.bellani/projects/hisel/hisel/lar/lar.py:9(solve)
        2    0.000    0.000   15.373    7.686 /home/tw.com/claudio.bellani/projects/hisel/hisel/kernels.py:238(apply_feature_map)
        2    0.000    0.000   14.045    7.023 /home/tw.com/claudio.bellani/projects/hisel/hisel/kernels.py:253(<listcomp>)
       48    0.001    0.000   14.037    0.292 /home/tw.com/claudio.bellani/projects/hisel/hisel/kernels.py:205(_run_batch)
       24    0.034    0.001   11.178    0.466 /home/tw.com/claudio.bellani/projects/hisel/hisel/kernels.py:20(featwise)
       24   10.094    0.421   11.144    0.464 /home/tw.com/claudio.bellani/projects/hisel/hisel/kernels.py:60(_rbf_featwise)
  833/763    1.333    0.002    3.692    0.005 {built-in method numpy.core._multiarray_umath.implement_array_function}
       48    1.576    0.033    2.791    0.058 /home/tw.com/claudio.bellani/projects/hisel/hisel/kernels.py:197(_center_gram)
        6    0.000    0.000    1.331    0.222 <__array_function__ internals>:177(vstack)

And these are the most expensive function calls of the GPU run:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  100.149  100.149 {built-in method builtins.exec}
        1    0.000    0.000  100.149  100.149 <string>:1(<module>)
        1    0.000    0.000  100.149  100.149 /home/tw.com/claudio.bellani/projects/hisel/profiling/select_profile.py:92(run_on_gpu)
        1    0.000    0.000  100.149  100.149 /home/tw.com/claudio.bellani/projects/hisel/profiling/select_profile.py:95(_run)
        1    0.000    0.000  100.145  100.145 /home/tw.com/claudio.bellani/projects/hisel/hisel/select.py:191(select)
        1    0.031    0.031  100.145  100.145 /home/tw.com/claudio.bellani/projects/hisel/hisel/select.py:138(projection_matrix)
        1    0.659    0.659  100.068  100.068 /home/tw.com/claudio.bellani/projects/hisel/hisel/select.py:410(_run)
        1   94.147   94.147   94.175   94.175 /home/tw.com/claudio.bellani/projects/hisel/hisel/lar/lar.py:9(solve)
        2    0.000    0.000    5.154    2.577 /home/tw.com/claudio.bellani/projects/hisel/hisel/cudakernels.py:233(apply_feature_map)
        2    0.000    0.000    3.634    1.817 /home/tw.com/claudio.bellani/projects/hisel/hisel/cudakernels.py:247(<listcomp>)
       48    0.000    0.000    3.627    0.076 /home/tw.com/claudio.bellani/projects/hisel/hisel/cudakernels.py:202(_run_batch)
       48    0.000    0.000    2.572    0.054 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/cupy/__init__.py:795(asnumpy)
       48    2.572    0.054    2.572    0.054 {method 'get' of 'cupy._core.core._ndarray_base' objects}
  769/751    1.326    0.002    1.429    0.002 {built-in method numpy.core._multiarray_umath.implement_array_function}
        6    0.000    0.000    1.324    0.221 <__array_function__ internals>:177(vstack)
        6    0.000    0.000    1.324    0.221 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/numpy/core/shape_base.py:223(vstack)
        6    0.000    0.000    1.323    0.221 <__array_function__ internals>:177(concatenate)
       24    0.000    0.000    1.024    0.043 /home/tw.com/claudio.bellani/projects/hisel/hisel/cudakernels.py:17(featwise)

The computation of the Gram matrices is done via the call to apply_feature_map. You can see that the CPU run took 15.37 seconds, whereas the GPU run took 5.15 seconds. However, the overall algo (Gram + LAR) took 63 seconds on CPU and 100 seconds on GPU. The GPU run has to move tensors back from GPU to the CPU, and this causes some overhead: you can see it from the 48 calls to the CuPy function asnumpy, but these 48 calls collectively took 2.57 seconds, which does not explain the loss in performance. What explains the performance loss is the call to lar.solve: it took 47.17 seconds after computing the Gram matrices on CPU, and it took 94.14 seconds after computing the Gram matrices on GPU. I do not understand this. The function should be doing exactly the same thing in either of the runs, irrespectively of whether the Gram matrices passed to it were computed on CPU or on GPU (the move from GPU to CPU happens before the call to lar.solve). Do you have an idea of why this issue occurs? Can you help resolve the bottleneck with lar.solve? The profiling that I am reporting here was obtained using the script profiling/select_profile.py.

I should also add that I have tried to reproduce the issue "synthetically". In the script profiling/lar_comparison.py, I profile calls to the functions implementing the Least Angle Regression. I do so in three different ways: plain call to lar.solve passing numpy arrays; call to lar.solve passing numpy arrays that have just been converted from CuPy's arrays; call to the equivalent function in pyHSICLasso. This did not reproduce the issue that I am facing with profiling/select_profile.py.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   15.181   15.181 {built-in method builtins.exec}
        1    0.000    0.000   15.181   15.181 <string>:1(<module>)
        1    0.000    0.000   15.181   15.181 /home/tw.com/claudio.bellani/projects/hisel/profiling/lar_comparison.py:28(run_hisel)
        1   15.045   15.045   15.181   15.181 /home/tw.com/claudio.bellani/projects/hisel/hisel/lar/lar.py:9(solve)
     1413    0.005    0.000    0.106    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
      100    0.000    0.000    0.083    0.001 <__array_function__ internals>:177(lstsq)
      100    0.078    0.001    0.083    0.001 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/numpy/linalg/linalg.py:2150(lstsq)
      100    0.001    0.000    0.031    0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_lil.py:321(__setitem__)
      100    0.002    0.000    0.030    0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_index.py:96(__setitem__)
      100    0.001    0.000    0.009    0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_index.py:13(_broadcast_arrays)
      100    0.001    0.000    0.008    0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_lil.py:301(_set_arrayXarray)
      100    0.000    0.000    0.008    0.000 <__array_function__ internals>:177(broadcast_arrays)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   15.746   15.746 {built-in method builtins.exec}
        1    0.001    0.001   15.746   15.746 <string>:1(<module>)
        1    0.000    0.000   15.745   15.745 /home/tw.com/claudio.bellani/projects/hisel/profiling/lar_comparison.py:31(run_hisel_from_cupy_arrays)
        1   15.492   15.492   15.644   15.644 /home/tw.com/claudio.bellani/projects/hisel/hisel/lar/lar.py:9(solve)
     1413    0.005    0.000    0.122    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
        2    0.000    0.000    0.101    0.050 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/cupy/__init__.py:795(asnumpy)
        2    0.101    0.050    0.101    0.050 {method 'get' of 'cupy._core.core._ndarray_base' objects}
      100    0.000    0.000    0.099    0.001 <__array_function__ internals>:177(lstsq)
      100    0.094    0.001    0.098    0.001 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/numpy/linalg/linalg.py:2150(lstsq)
      100    0.001    0.000    0.031    0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_lil.py:321(__setitem__)
      100    0.003    0.000    0.030    0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_index.py:96(__setitem__)
      100    0.001    0.000    0.009    0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_index.py:13(_broadcast_arrays)
      100    0.000    0.000    0.009    0.000 <__array_function__ internals>:177(broadcast_arrays)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   16.587   16.587 {built-in method builtins.exec}
        1    0.000    0.000   16.587   16.587 <string>:1(<module>)
        1    0.023    0.023   16.587   16.587 /home/tw.com/claudio.bellani/projects/hisel/profiling/lar_comparison.py:36(run_pyhsiclasso)
        1    9.619    9.619   16.564   16.564 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/pyHSICLasso/nlars.py:17(nlars)
     1713    6.475    0.004    6.770    0.004 {built-in method numpy.core._multiarray_umath.implement_array_function}
      503    0.003    0.000    6.475    0.013 <__array_function__ internals>:177(dot)
      100    0.000    0.000    0.284    0.003 <__array_function__ internals>:177(solve)
      100    0.279    0.003    0.283    0.003 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/numpy/linalg/linalg.py:306(solve)
      100    0.049    0.000    0.049    0.000 {built-in method builtins.min}
      101    0.029    0.000    0.045    0.000 {built-in method builtins.sorted}
      100    0.042    0.000    0.042    0.000 {built-in method builtins.max}
      100    0.001    0.000    0.031    0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_lil.py:321(__setitem__)
      100    0.002    0.000    0.030    0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_index.py:96(__setitem__)
    50000    0.016    0.000    0.016    0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/pyHSICLasso/nlars.py:132(<lambda>)
      100    0.001    0.000    0.008    0.000 /home/tw.com/claudio.bellani/anaconda3/envs/hiselcuda/lib/python3.11/site-packages/scipy/sparse/_lil.py:301(_set_arrayXarray)

transferwise / hisel

Performance bottleneck in Least Angle Regression #44