inducer / loopy

A code generator for array-based code on CPUs and GPUs
http://mathema.tician.de/software/loopy
MIT License
588 stars 73 forks source link

Codegen slow due to excessive islpy calls #148

Open kaushikcfd opened 4 years ago

kaushikcfd commented 4 years ago

View the loopy kernel on Gist.

Profile log, as reported by cProfile, links 90% of the codegen time to islpy routines:

         3024863 function calls (2382326 primitive calls) in 605.951 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
790411/302110  477.026    0.001  477.870    0.002 islpy/islpy/__init__.py:861(wrapper)
      265   83.808    0.316   83.884    0.317 islpy/islpy/__init__.py:1014(obj_eliminate_except)
18792/17160   11.141    0.001   11.482    0.001 loopy/loopy/isl_helpers.py:527(_find_aff_dims)
       68    8.268    0.122   19.432    0.286 loopy/loopy/symbolic.py:1884(get_access_range)
       74    6.624    0.090  233.594    3.157 loopy/loopy/isl_helpers.py:614(dim_max_with_elimination)
       74    6.144    0.083  233.079    3.150 loopy/loopy/isl_helpers.py:609(dim_min_with_elimination)
     4608    2.248    0.000    8.858    0.002 islpy/islpy/__init__.py:1100(_align_dim_type)
     1070    1.455    0.001    2.364    0.002 loopy/loopy/symbolic.py:1590(__init__)
  973/897    0.917    0.001    1.262    0.001 loopy/loopy/symbolic.py:1525(aff_to_expr)
    35845    0.646    0.000    0.646    0.000 {built-in method islpy._isl.from_aff}
    49830    0.596    0.000    1.861    0.000 islpy/islpy/__init__.py:1096(_set_dim_id)
inducer commented 4 years ago

Thanks for timing this! A few thoughts:

inducer commented 4 years ago

I'll further assume that it's time genuinely spent in isl and not in the wrapper. (Do you have any indication otherwise?)

Yes: https://github.com/inducer/islpy/issues/28

kaushikcfd commented 4 years ago

From the discussing at inducer/islpy#28, it was established that ISL's performance depends on whether it was linked against GMP or imath (shipped with islpy). Running the same kernel with islpy linked with GMP gets a 3x speed-up! :)

Here's the corresponding log:

         2897406 function calls (2299915 primitive calls) in 220.275 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
729855/286691  164.491    0.000  165.192    0.001 islpy/islpy/__init__.py:861(wrapper)
      265   25.103    0.095   25.174    0.095 islpy/islpy/__init__.py:1014(obj_eliminate_except)
17736/16152   10.725    0.001   11.068    0.001 loopy/loopy/isl_helpers.py:527(_find_aff_dims)
       68    3.318    0.049    8.032    0.118 loopy/loopy/symbolic.py:1884(get_access_range)
       74    2.561    0.035   82.331    1.113 loopy/loopy/isl_helpers.py:614(dim_max_with_elimination)
       74    2.357    0.032   82.178    1.111 loopy/loopy/isl_helpers.py:609(dim_min_with_elimination)
     4632    1.929    0.000    7.909    0.002 islpy/islpy/__init__.py:1100(_align_dim_type)
     1070    1.403    0.001    2.342    0.002 loopy/loopy/symbolic.py:1590(__init__)
  973/897    0.937    0.001    1.296    0.001 loopy/loopy/symbolic.py:1525(aff_to_expr)
    33697    0.732    0.000    0.732    0.000 {built-in method islpy._isl.from_aff}
    46890    0.593    0.000    1.867    0.000 islpy/islpy/__init__.py:1096(_set_dim_id)
    27169    0.508    0.000    0.871    0.000 islpy/islpy/__init__.py:850(wrapper)
     3347    0.500    0.000    0.834    0.000 islpy/islpy/__init__.py:690(_number_to_expr_like)
    22345    0.419    0.000    0.419    0.000 islpy/islpy/__init__.py:548(set_get_basic_sets)
    10367    0.380    0.000    2.568    0.000 islpy/islpy/__init__.py:315(space_get_var_dict)
      148    0.353    0.002   11.606    0.078 loopy/loopy/isl_helpers.py:550(_find_noninteracting_dims)
    21703    0.311    0.000    0.311    0.000 islpy/islpy/__init__.py:805(val_to_python)

There's clearly more stuff to be fixed. (exploring if we are unnecessarily invoking ISL routines anywhere.)

inducer commented 4 years ago

cc @wence- (who may be interested in ways to speed up loopy, but see the license concerns at https://github.com/inducer/islpy/issues/28)

kaushikcfd commented 4 years ago

Post #149, #150 and taking suggestions from inducer/islpy#28, the codegen profile looks better:

         3501511 function calls (2873484 primitive calls) in 71.862 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    30039   26.195    0.001   26.610    0.001 islpy/islpy/__init__.py:850(wrapper)
735171/248679   19.323    0.000   20.050    0.000 islpy/islpy/__init__.py:861(wrapper)
       48    7.116    0.148    7.129    0.149 islpy/islpy/__init__.py:1014(obj_eliminate_except)
       94    4.831    0.051   11.872    0.126 loopy/loopy/symbolic.py:1652(get_access_range)
     21/2    2.394    0.114   54.000   27.000 loopy/loopy/codegen/loop.py:347(generate_sequential_loop_dim_code)
     5073    2.009    0.000    8.444    0.002 islpy/islpy/__init__.py:1100(_align_dim_type)
     1200    1.560    0.001    2.617    0.002 loopy/loopy/symbolic.py:1354(__init__)
 1005/925    1.071    0.001    1.460    0.002 loopy/loopy/symbolic.py:1289(aff_to_expr)

We are still doing around 0.3M ISL calls which is not ideal, but I'm fine with closing this issue.

inducer commented 4 years ago

Let's leave it open for a bit. I think there's more we can do.

nchristensen commented 11 months ago

Profile guided optimization while compiling isl might modestly improve performance.