inducer / loopy

A code generator for array-based code on CPUs and GPUs
http://mathema.tician.de/software/loopy
MIT License

Loopy is slow in make_kernel, preprocess_kernel and codegen #247

Open isuruf opened 3 years ago

isuruf commented 3 years ago

Thanks to @kaushikcfd, scheduling is now much faster than the other parts of loopy. Still, make_kernel, preprocess_kernel, and codegen take so long that some sumpy kernels are unusable.

Here's a small example that uses the derivtaker branch of sumpy (https://github.com/isuruf/sumpy/tree/derivtaker):

import numpy as np
import sys
import loopy as lp

import pyopencl as cl

from sumpy.expansion.multipole import LaplaceConformingVolumeTaylorMultipoleExpansion
from sumpy.expansion.local import LaplaceConformingVolumeTaylorLocalExpansion
from sumpy.kernel import LaplaceKernel
import sumpy.symbolic as sym

import logging

logger = logging.getLogger(__name__)

try:
    import faulthandler
except ImportError:
    pass
else:
    faulthandler.enable()

knl = LaplaceKernel(3)
local_expn_class = LaplaceConformingVolumeTaylorLocalExpansion
mpole_expn_class = LaplaceConformingVolumeTaylorMultipoleExpansion
order = 12
ctx_factory = cl.create_some_context  # public name for the private alias cl._csc

logging.basicConfig(level=logging.INFO)

from sympy.core.cache import clear_cache

clear_cache()

ctx = ctx_factory()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

np.random.seed(17)

target_kernels = [knl]

m_expn = mpole_expn_class(knl, order=order)
l_expn = local_expn_class(knl, order=order)

from sumpy import P2EFromSingleBox, E2PFromSingleBox, P2P, E2EFromCSR

m2l = E2EFromCSR(ctx, m_expn, l_expn)

loopy_knl = m2l.get_optimized_kernel()
loopy_knl = lp.add_and_infer_dtypes(
    loopy_knl,
    dict(
        tgt_ibox=np.int32,
        centers=np.float64,
        tgt_center=np.float64,
        target_boxes=np.int32,
        src_ibox=np.int32,
        src_expansions=np.float64,
        tgt_rscale=np.float64,
        src_rscale=np.float64,
        src_box_starts=np.int32,
        src_box_lists=np.int32,
    ),
)
lp.generate_code_v2(loopy_knl)

kaushikcfd commented 3 years ago

This is the cProfile log, sorted by cumulative time. There doesn't seem to be any obvious low-hanging fruit in this case:

         146881842 function calls (140188954 primitive calls) in 97.556 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.019    0.019   96.991   96.991 loopy/loopy/codegen/__init__.py:404(generate_code_v2)
  1588528    1.643    0.000   41.446    0.000 pymbolic/pymbolic/mapper/__init__.py:109(__call__)
        1    0.000    0.000   36.843   36.843 loopy/loopy/schedule/__init__.py:2134(get_one_scheduled_kernel)
        1    0.000    0.000   36.843   36.843 loopy/loopy/schedule/__init__.py:2143(get_one_linearized_kernel)
        1    0.000    0.000   36.842   36.842 loopy/loopy/schedule/__init__.py:2121(_get_one_scheduled_kernel_inner)
        2    0.003    0.001   36.842   18.421 loopy/loopy/schedule/__init__.py:1945(generate_loop_schedules_inner)
        1    0.012    0.012   36.569   36.569 loopy/loopy/preprocess.py:2030(preprocess_kernel)
   107846    0.113    0.000   36.025    0.001 {built-in method builtins.next}
        2    0.000    0.000   35.952   17.976 loopy/loopy/schedule/__init__.py:1929(generate_loop_schedules)
  9335052    2.563    0.000   32.603    0.000 pytools/__init__.py:675(wrapper)
        1    0.000    0.000   30.326   30.326 loopy/loopy/transform/iname.py:1218(wrapper)
        1    0.084    0.084   28.915   28.915 loopy/loopy/preprocess.py:881(realize_reduction)
      507    0.001    0.000   23.352    0.046 loopy/loopy/symbolic.py:1815(map_reduction)
      169    0.002    0.000   23.348    0.138 loopy/loopy/preprocess.py:1690(map_reduction)
      169    0.006    0.000   23.318    0.138 loopy/loopy/preprocess.py:1004(map_reduction_seq)
     7868   14.844    0.002   23.292   11.646 loopy/loopy/schedule/__init__.py:807(generate_loop_schedules_internal)
      169    0.000    0.000   23.278    0.138 loopy/loopy/kernel/tools.py:1655(find_most_recent_global_barrier)
      169    3.350    0.020   23.087    0.137 loopy/loopy/kernel/tools.py:1590(get_global_barrier_order)

If anyone wishes to reproduce this, here is the script.

isuruf commented 3 years ago

Here's the pyinstrument profile:

136.566 <module>  loopy_reproduce.py:1
├─ 73.198 generate_code_v2  loopy/codegen/__init__.py:404
│  ├─ 32.329 preprocess_kernel  loopy/preprocess.py:2030
│  │  ├─ 26.890 wrapper  loopy/transform/iname.py:1218
│  │  │  └─ 25.818 realize_reduction  loopy/preprocess.py:881
│  │  │     ├─ 22.571 __call__  pymbolic/mapper/__init__.py:114
│  │  │     │     [162 frames hidden]  pymbolic
│  │  │     │        21.620 map_reduction  loopy/symbolic.py:1815
│  │  │     │        └─ 21.620 map_reduction  loopy/preprocess.py:1690
│  │  │     │           └─ 21.579 map_reduction_seq  loopy/preprocess.py:1004
│  │  │     │              └─ 21.565 wrapper  pytools/__init__.py:675
│  │  │     │                 └─ 21.563 find_most_recent_global_barrier  loopy/kernel/tools.py:1655
│  │  │     │                    └─ 21.562 wrapper  pytools/__init__.py:675
│  │  │     │                       └─ 21.314 get_global_barrier_order  loopy/kernel/tools.py:1590
│  │  │     │                          ├─ 11.304 compute_topological_order  pytools/graph.py:210
│  │  │     │                          │  ├─ 7.439 [self]  
│  │  │     │                          │  ├─ 1.473 __lt__  pytools/graph.py:206
│  │  │     │                          │  └─ 1.424 dict.get  <built-in>:0
│  │  │     │                          ├─ 3.834 <listcomp>  loopy/kernel/tools.py:1606
│  │  │     │                          │  └─ 3.571 _is_global_barrier  loopy/kernel/tools.py:1583
│  │  │     │                          ├─ 2.835 [self]  
│  │  │     │                          └─ 2.336 <dictcomp>  loopy/kernel/tools.py:1597
│  │  │     └─ 3.122 replace_instruction_ids  loopy/transform/instruction.py:172
│  │  │        └─ 2.455 [self]  
│  │  └─ 1.811 realize_ilp  loopy/preprocess.py:1965
│  │     └─ 1.811 privatize_temporaries_with_inames  loopy/transform/privatize.py:72
│  ├─ 24.813 generate_host_or_device_program  loopy/codegen/result.py:286
│  │  └─ 24.804 build_loop_nest  loopy/codegen/control.py:218
│  │     └─ 24.702 build_insn_group  loopy/codegen/control.py:330
│  │        └─ 24.702 gen_code  loopy/codegen/control.py:456
│  │           └─ 24.702 generate_code_for_sched_index  loopy/codegen/control.py:67
│  │              └─ 24.675 generate_host_or_device_program  loopy/codegen/result.py:286
│  │                 └─ 23.941 set_up_hw_parallel_loops  loopy/codegen/loop.py:231
│  │                    └─ 23.894 set_up_hw_parallel_loops  loopy/codegen/loop.py:231
│  │                       └─ 23.883 build_loop_nest  loopy/codegen/control.py:218
│  │                          └─ 23.807 build_insn_group  loopy/codegen/control.py:330
│  │                             └─ 23.786 gen_code  loopy/codegen/control.py:456
│  │                                └─ 23.786 generate_code_for_sched_index  loopy/codegen/control.py:67
│  │                                   └─ 23.786 generate_sequential_loop_dim_code  loopy/codegen/loop.py:347
│  │                                      └─ 23.766 build_loop_nest  loopy/codegen/control.py:218
│  │                                         └─ 23.702 build_insn_group  loopy/codegen/control.py:330
│  │                                            └─ 23.447 build_insn_group  loopy/codegen/control.py:330
│  │                                               └─ 23.415 build_insn_group  loopy/codegen/control.py:330
│  │                                                  └─ 22.938 gen_code  loopy/codegen/control.py:456
│  │                                                     └─ 22.938 generate_code_for_sched_index  loopy/codegen/control.py:67
│  │                                                        └─ 22.938 generate_sequential_loop_dim_code  loopy/codegen/loop.py:347
│  │                                                           └─ 22.911 build_loop_nest  loopy/codegen/control.py:218
│  │                                                              └─ 22.466 build_insn_group  loopy/codegen/control.py:330
│  │                                                                 ├─ 13.254 gen_code  loopy/codegen/control.py:456
│  │                                                                 │  └─ 13.243 generate_code_for_sched_index  loopy/codegen/control.py:67
│  │                                                                 │     └─ 13.147 try_vectorized  loopy/codegen/__init__.py:336
│  │                                                                 │        └─ 13.139 <lambda>  loopy/codegen/control.py:170
│  │                                                                 │           └─ 13.138 generate_instruction_code  loopy/codegen/instruction.py:74
│  │                                                                 │              ├─ 11.017 to_codegen_result  loopy/codegen/instruction.py:34
│  │                                                                 │              │  ├─ 6.939 align_two  islpy/__init__.py:1224
│  │                                                                 │              │  │     [220 frames hidden]  islpy
│  │                                                                 │              │  └─ 3.207 wrapper  islpy/__init__.py:911
│  │                                                                 │              │        [68 frames hidden]  islpy
│  │                                                                 │              │           3.078 gist  islpy/_isl.py:59605
│  │                                                                 │              │           └─ 2.801 Lib.isl_set_gist  <built-in>:0
│  │                                                                 │              └─ 2.024 generate_assignment_instruction_code  loopy/codegen/instruction.py:102
│  │                                                                 │                 └─ 1.813 emit_assignment  loopy/target/c/__init__.py:868
│  │                                                                 │                    └─ 1.575 __call__  loopy/target/c/codegen/expression.py:118
│  │                                                                 │                       └─ 1.506 rec  loopy/target/c/codegen/expression.py:110
│  │                                                                 └─ 9.191 build_insn_group  loopy/codegen/control.py:330
│  │                                                                    └─ 9.151 build_insn_group  loopy/codegen/control.py:330
│  │                                                                       └─ 9.110 build_insn_group  loopy/codegen/control.py:330
│  │                                                                          └─ 9.108 gen_code  loopy/codegen/control.py:456
│  │                                                                             └─ 9.106 generate_code_for_sched_index  loopy/codegen/control.py:67
│  │                                                                                └─ 9.034 try_vectorized  loopy/codegen/__init__.py:336
│  │                                                                                   └─ 9.030 <lambda>  loopy/codegen/control.py:170
│  │                                                                                      └─ 9.028 generate_instruction_code  loopy/codegen/instruction.py:74
│  │                                                                                         ├─ 5.083 to_codegen_result  loopy/codegen/instruction.py:34
│  │                                                                                         │  └─ 3.875 align_two  islpy/__init__.py:1224
│  │                                                                                         │        [221 frames hidden]  islpy
│  │                                                                                         └─ 3.887 generate_assignment_instruction_code  loopy/codegen/instruction.py:102
│  │                                                                                            └─ 3.723 emit_assignment  loopy/target/c/__init__.py:868
│  │                                                                                               └─ 3.603 __call__  loopy/target/c/codegen/expression.py:118
│  │                                                                                                  └─ 3.572 rec  loopy/target/c/codegen/expression.py:110
│  │                                                                                                     ├─ 2.043 __call__  pymbolic/mapper/__init__.py:114
│  │                                                                                                     │     [2 frames hidden]  pymbolic
│  │                                                                                                     │        1.668 map_sum  loopy/target/c/codegen/expression.py:561
│  │                                                                                                     │        └─ 1.667 base_impl  loopy/target/c/codegen/expression.py:562
│  │                                                                                                     │           └─ 1.667 map_sum  pymbolic/mapper/__init__.py:398
│  │                                                                                                     │                 [16 frames hidden]  pymbolic
│  │                                                                                                     │                    1.599 <genexpr>  pymbolic/mapper/__init__.py:401
│  │                                                                                                     │                    └─ 1.578 rec  loopy/target/c/codegen/expression.py:110
│  │                                                                                                     │                       └─ 1.564 __call__  pymbolic/mapper/__init__.py:114
│  │                                                                                                     │                             [2 frames hidden]  pymbolic
│  │                                                                                                     │                                1.526 map_product  loopy/target/c/codegen/expression.py:610
│  │                                                                                                     │                                └─ 1.518 base_impl  loopy/target/c/codegen/expression.py:611
│  │                                                                                                     │                                   └─ 1.494 map_product  pymbolic/mapper/__init__.py:403
│  │                                                                                                     │                                         [32 frames hidden]  pymbolic
│  │                                                                                                     └─ 1.514 infer_type  loopy/target/c/codegen/expression.py:78
│  │                                                                                                        └─ 1.483 __call__  loopy/type_inference.py:60
│  │                                                                                                           └─ 1.472 __call__  pymbolic/mapper/__init__.py:114
│  │                                                                                                                 [2 frames hidden]  pymbolic
│  │                                                                                                                    1.468 map_sum  loopy/type_inference.py:170
│  └─ 14.247 get_one_scheduled_kernel  loopy/schedule/__init__.py:2134
│     └─ 14.247 get_one_linearized_kernel  loopy/schedule/__init__.py:2143
│        └─ 14.246 _get_one_scheduled_kernel_inner  loopy/schedule/__init__.py:2121
│           └─ 14.206 generate_loop_schedules  loopy/schedule/__init__.py:1929
│              └─ 14.206 generate_loop_schedules_inner  loopy/schedule/__init__.py:1945
│                 ├─ 10.221 pre_schedule_checks  loopy/check.py:799
│                 │  ├─ 5.407 check_variable_access_ordered  loopy/check.py:762
│                 │  │  └─ 5.407 _check_variable_access_ordered_inner  loopy/check.py:604
│                 │  │     └─ 3.656 do_access_ranges_overlap_conservative  loopy/symbolic.py:2194
│                 │  │        └─ 2.114 _get_access_range_for_var  loopy/symbolic.py:2179
│                 │  │           └─ 1.982 wrapper  pytools/__init__.py:675
│                 │  │              └─ 1.930 _get_access_ranges  loopy/symbolic.py:2154
│                 │  │                 └─ 1.819 __call__  pymbolic/mapper/__init__.py:114
│                 │  │                       [2 frames hidden]  pymbolic
│                 │  │                          1.817 map_subscript  loopy/symbolic.py:2049
│                 │  │                          └─ 1.765 get_access_map  loopy/symbolic.py:1906
│                 │  ├─ 1.720 check_for_integer_subscript_indices  loopy/check.py:114
│                 │  │  └─ 1.679 __call__  loopy/type_inference.py:60
│                 │  │     └─ 1.671 __call__  pymbolic/mapper/__init__.py:114
│                 │  │           [3 frames hidden]  pymbolic
│                 │  │              1.650 map_sum  loopy/type_inference.py:170
│                 │  │              └─ 1.448 __call__  pymbolic/mapper/__init__.py:114
│                 │  │                    [2 frames hidden]  pymbolic
│                 │  └─ 1.610 check_bounds  loopy/check.py:460
│                 └─ 2.507 insert_barriers  loopy/schedule/__init__.py:1776
│                    └─ 2.125 insert_barriers  loopy/schedule/__init__.py:1776
│                       └─ 1.438 insert_barriers_at_outer_level  loopy/schedule/__init__.py:1789
├─ 52.686 get_optimized_kernel  sumpy/e2e.py:127
│  ├─ 47.370 get_kernel  sumpy/e2e.py:146
│  │  ├─ 25.594 make_kernel  loopy/kernel/creation.py:1821
│  │  │  ├─ 7.046 duplicate_inames  loopy/transform/iname.py:818
│  │  │  │  ├─ 3.649 map_kernel  loopy/symbolic.py:995
│  │  │  │  │  └─ 3.645 <listcomp>  loopy/symbolic.py:1000
│  │  │  │  │     └─ 3.568 with_transformed_expressions  loopy/kernel/instruction.py:872
│  │  │  │  │        └─ 2.990 <lambda>  loopy/symbolic.py:1002
│  │  │  │  │           └─ 2.975 __call__  loopy/symbolic.py:981
│  │  │  │  │              └─ 2.742 __call__  pymbolic/mapper/__init__.py:114
│  │  │  │  │                    [184 frames hidden]  pymbolic
│  │  │  │  └─ 3.377 finish_kernel  loopy/symbolic.py:899
│  │  │  │     └─ 3.376 rename_subst_rules_in_instructions  loopy/symbolic.py:788
│  │  │  │        └─ 3.376 <listcomp>  loopy/symbolic.py:792
│  │  │  │           └─ 3.361 with_transformed_expressions  loopy/kernel/instruction.py:872
│  │  │  │              └─ 2.770 __call__  pymbolic/mapper/__init__.py:114
│  │  │  │                    [171 frames hidden]  pymbolic
│  │  │  ├─ 4.475 fix_parameters  loopy/transform/parameter.py:134
│  │  │  │  └─ 4.475 _fix_parameter  loopy/transform/parameter.py:67
│  │  │  │     ├─ 2.478 map_kernel  loopy/symbolic.py:995
│  │  │  │     │  └─ 2.203 <listcomp>  loopy/symbolic.py:1000
│  │  │  │     │     └─ 2.192 with_transformed_expressions  loopy/kernel/instruction.py:872
│  │  │  │     │        └─ 1.891 <lambda>  loopy/symbolic.py:1002
│  │  │  │     │           └─ 1.882 __call__  loopy/symbolic.py:981
│  │  │  │     │              └─ 1.780 __call__  pymbolic/mapper/__init__.py:114
│  │  │  │     │                    [163 frames hidden]  pymbolic
│  │  │  │     └─ 1.703 finish_kernel  loopy/symbolic.py:899
│  │  │  │        └─ 1.703 rename_subst_rules_in_instructions  loopy/symbolic.py:788
│  │  │  │           └─ 1.703 <listcomp>  loopy/symbolic.py:792
│  │  │  │              └─ 1.693 with_transformed_expressions  loopy/kernel/instruction.py:872
│  │  │  │                 └─ 1.394 __call__  pymbolic/mapper/__init__.py:114
│  │  │  │                       [168 frames hidden]  pymbolic
│  │  │  ├─ 2.530 determine_shapes_of_temporaries  loopy/kernel/creation.py:1512
│  │  │  │  └─ 1.912 find_shapes_of_vars  loopy/kernel/creation.py:1463
│  │  │  │     └─ 1.880 feed_all_expressions  loopy/kernel/creation.py:1523
│  │  │  │        └─ 1.871 with_transformed_expressions  loopy/kernel/instruction.py:872
│  │  │  │           └─ 1.587 <lambda>  loopy/kernel/creation.py:1526
│  │  │  │              └─ 1.584 run_through_armap  loopy/kernel/creation.py:1469
│  │  │  │                 └─ 1.564 __call__  pymbolic/mapper/__init__.py:114
│  │  │  │                       [218 frames hidden]  pymbolic
│  │  │  ├─ 2.095 __init__  loopy/kernel/creation.py:1080
│  │  │  ├─ 1.769 guess_arg_shape_if_requested  loopy/kernel/creation.py:1610
│  │  │  │  └─ 1.769 guess_var_shape  loopy/kernel/tools.py:985
│  │  │  │     └─ 1.758 with_transformed_expressions  loopy/kernel/instruction.py:872
│  │  │  │        └─ 1.453 run_through_armap  loopy/kernel/tools.py:992
│  │  │  ├─ 1.690 guess_kernel_args_if_requested  loopy/kernel/creation.py:1170
│  │  │  │  └─ 1.670 make_new_arg  loopy/kernel/creation.py:1132
│  │  │  │     └─ 1.670 find_index_rank  loopy/kernel/creation.py:1116
│  │  │  │        └─ 1.660 with_transformed_expressions  loopy/kernel/instruction.py:872
│  │  │  │           └─ 1.392 run_irf  loopy/kernel/creation.py:1119
│  │  │  │              └─ 1.368 __call__  pymbolic/mapper/__init__.py:114
│  │  │  │                    [220 frames hidden]  pymbolic
│  │  │  └─ 1.464 expand_cses  loopy/kernel/creation.py:1321
│  │  └─ 21.709 get_translation_loopy_insns  sumpy/e2e.py:91
│  │     ├─ 16.169 to_loopy_insns  sumpy/codegen.py:679
│  │     │  ├─ 8.331 <listcomp>  sumpy/codegen.py:731
│  │     │  │  └─ 7.319 convert_expr  sumpy/codegen.py:712
│  │     │  │     └─ 7.236 __call__  pymbolic/mapper/__init__.py:114
│  │     │  │           [187 frames hidden]  pymbolic
│  │     │  └─ 5.620 kill_trivial_assignments  sumpy/codegen.py:161
│  │     │     ├─ 2.872 substitute  pymbolic/mapper/substitutor.py:72
│  │     │     │     [212 frames hidden]  pymbolic
│  │     │     │        1.436 dict.copy  <built-in>:0
│  │     │     └─ 1.480 make_one_step_subst  sumpy/codegen.py:78
│  │     └─ 4.305 run_global_cse  sumpy/assignment_collection.py:164
│  │        └─ 4.291 cse  sumpy/cse.py:550
│  │           └─ 3.400 opt_cse  sumpy/cse.py:357
│  │              └─ 2.921 match_common_args  sumpy/cse.py:266
│  └─ 5.299 split_iname  loopy/transform/iname.py:334
│     └─ 5.294 _split_iname_backend  loopy/transform/iname.py:211
│        ├─ 2.243 map_kernel  loopy/symbolic.py:995
│        │  └─ 1.868 <listcomp>  loopy/symbolic.py:1000
│        │     └─ 1.853 with_transformed_expressions  loopy/kernel/instruction.py:872
│        │        └─ 1.560 <lambda>  loopy/symbolic.py:1002
│        │           └─ 1.552 __call__  loopy/symbolic.py:981
│        │              └─ 1.442 __call__  pymbolic/mapper/__init__.py:114
│        │                    [161 frames hidden]  pymbolic
│        └─ 1.687 finish_kernel  loopy/symbolic.py:899
│           └─ 1.687 rename_subst_rules_in_instructions  loopy/symbolic.py:788
│              └─ 1.687 <listcomp>  loopy/symbolic.py:792
│                 └─ 1.673 with_transformed_expressions  loopy/kernel/instruction.py:872
│                    └─ 1.375 __call__  pymbolic/mapper/__init__.py:114
│                          [173 frames hidden]  pymbolic
└─ 8.944 add_and_infer_dtypes  loopy/kernel/tools.py:106
   └─ 8.937 infer_unknown_types  loopy/type_inference.py:485
      ├─ 4.969 <dictcomp>  loopy/type_inference.py:527
      │  └─ 4.954 <setcomp>  loopy/type_inference.py:528
      │     └─ 4.709 [self]  
      └─ 3.164 _infer_var_type  loopy/type_inference.py:407
         └─ 1.823 __call__  loopy/type_inference.py:60
            └─ 1.812 __call__  pymbolic/mapper/__init__.py:114
                  [2 frames hidden]  pymbolic
                     1.734 map_sum  loopy/type_inference.py:170
                     └─ 1.523 __call__  pymbolic/mapper/__init__.py:114
                           [2 frames hidden]  pymbolic
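
To make the tree above easier to read, the top-level entries can be converted into shares of the total run time. The sketch below only does arithmetic on numbers copied from this profile; the helper name `shares` and the dictionary layout are illustrative choices, not part of loopy or pyinstrument.

```python
# Inclusive times in seconds, copied from the pyinstrument tree above.
TOTAL = 136.566  # <module> loopy_reproduce.py

top_level = {
    "generate_code_v2": 73.198,
    "get_optimized_kernel": 52.686,
    "add_and_infer_dtypes": 8.944,
}

# Direct children of generate_code_v2 in the same tree.
codegen_children = {
    "preprocess_kernel": 32.329,
    "generate_host_or_device_program": 24.813,
    "get_one_scheduled_kernel": 14.247,
}


def shares(times, total):
    """Return each entry's percentage of the given total."""
    return {name: 100.0 * seconds / total for name, seconds in times.items()}


if __name__ == "__main__":
    for name, pct in shares(top_level, TOTAL).items():
        print(f"{name:35s} {pct:5.1f}% of total")
    for name, pct in shares(codegen_children, top_level["generate_code_v2"]).items():
        print(f"{name:35s} {pct:5.1f}% of generate_code_v2")
```

The listed entries do not add up to the full total because pyinstrument hides frames below its display threshold, so the percentages are lower bounds on what each stage accounts for.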

isuruf commented 3 years ago

After a couple of improvements to loopy and sumpy (derivtaker branch), the pyinstrument output is now:

84.866 <module>  loopy_reproduce.py:1
├─ 39.403 generate_code_v2  loopy/codegen/__init__.py:404
│  ├─ 16.501 generate_host_or_device_program  loopy/codegen/result.py:286
│  │  └─ 16.494 build_loop_nest  loopy/codegen/control.py:218
│  │     └─ 16.419 build_insn_group  loopy/codegen/control.py:330
│  │        └─ 16.419 gen_code  loopy/codegen/control.py:456
│  │           └─ 16.418 generate_code_for_sched_index  loopy/codegen/control.py:67
│  │              └─ 16.401 generate_host_or_device_program  loopy/codegen/result.py:286
│  │                 └─ 15.954 set_up_hw_parallel_loops  loopy/codegen/loop.py:231
│  │                    └─ 15.923 set_up_hw_parallel_loops  loopy/codegen/loop.py:231
│  │                       └─ 15.916 build_loop_nest  loopy/codegen/control.py:218
│  │                          └─ 15.869 build_insn_group  loopy/codegen/control.py:330
│  │                             └─ 15.859 gen_code  loopy/codegen/control.py:456
│  │                                └─ 15.859 generate_code_for_sched_index  loopy/codegen/control.py:67
│  │                                   └─ 15.859 generate_sequential_loop_dim_code  loopy/codegen/loop.py:347
│  │                                      └─ 15.841 build_loop_nest  loopy/codegen/control.py:218
│  │                                         └─ 15.797 build_insn_group  loopy/codegen/control.py:330
│  │                                            └─ 15.188 build_insn_group  loopy/codegen/control.py:330
│  │                                               └─ 15.155 build_insn_group  loopy/codegen/control.py:330
│  │                                                  └─ 14.651 gen_code  loopy/codegen/control.py:456
│  │                                                     └─ 14.651 generate_code_for_sched_index  loopy/codegen/control.py:67
│  │                                                        └─ 14.650 generate_sequential_loop_dim_code  loopy/codegen/loop.py:347
│  │                                                           └─ 14.629 build_loop_nest  loopy/codegen/control.py:218
│  │                                                              └─ 14.391 build_insn_group  loopy/codegen/control.py:330
│  │                                                                 ├─ 8.649 build_insn_group  loopy/codegen/control.py:330
│  │                                                                 │  └─ 8.607 build_insn_group  loopy/codegen/control.py:330
│  │                                                                 │     └─ 8.563 build_insn_group  loopy/codegen/control.py:330
│  │                                                                 │        └─ 8.561 gen_code  loopy/codegen/control.py:456
│  │                                                                 │           └─ 8.559 generate_code_for_sched_index  loopy/codegen/control.py:67
│  │                                                                 │              └─ 8.538 try_vectorized  loopy/codegen/__init__.py:336
│  │                                                                 │                 └─ 8.537 <lambda>  loopy/codegen/control.py:170
│  │                                                                 │                    └─ 8.537 generate_instruction_code  loopy/codegen/instruction.py:74
│  │                                                                 │                       ├─ 5.304 to_codegen_result  loopy/codegen/instruction.py:34
│  │                                                                 │                       │  ├─ 3.620 align_two  islpy/__init__.py:1224
│  │                                                                 │                       │  │     [218 frames hidden]  islpy
│  │                                                                 │                       │  └─ 1.243 wrapper  islpy/__init__.py:911
│  │                                                                 │                       │        [63 frames hidden]  islpy
│  │                                                                 │                       │           1.194 gist  islpy/_isl.py:59605
│  │                                                                 │                       │           └─ 1.140 Lib.isl_set_gist  <built-in>:0
│  │                                                                 │                       └─ 3.211 generate_assignment_instruction_code  loopy/codegen/instruction.py:102
│  │                                                                 │                          └─ 3.165 emit_assignment  loopy/target/c/__init__.py:868
│  │                                                                 │                             └─ 3.120 __call__  loopy/target/c/codegen/expression.py:118
│  │                                                                 │                                └─ 3.103 rec  loopy/target/c/codegen/expression.py:110
│  │                                                                 │                                   ├─ 1.642 infer_type  loopy/target/c/codegen/expression.py:78
│  │                                                                 │                                   │  └─ 1.633 __call__  loopy/type_inference.py:60
│  │                                                                 │                                   │     └─ 1.627 __call__  pymbolic/mapper/__init__.py:114
│  │                                                                 │                                   │           [2 frames hidden]  pymbolic
│  │                                                                 │                                   │              1.622 map_sum  loopy/type_inference.py:170
│  │                                                                 │                                   │              └─ 1.511 __call__  pymbolic/mapper/__init__.py:114
│  │                                                                 │                                   │                    [3 frames hidden]  pymbolic
│  │                                                                 │                                   │                       1.432 map_sum  loopy/type_inference.py:170
│  │                                                                 │                                   └─ 1.458 __call__  pymbolic/mapper/__init__.py:114
│  │                                                                 │                                         [2 frames hidden]  pymbolic
│  │                                                                 │                                            1.228 map_sum  loopy/target/c/codegen/expression.py:561
│  │                                                                 │                                            └─ 1.226 base_impl  loopy/target/c/codegen/expression.py:562
│  │                                                                 │                                               └─ 1.226 map_sum  pymbolic/mapper/__init__.py:398
│  │                                                                 │                                                     [17 frames hidden]  pymbolic
│  │                                                                 │                                                        1.143 <genexpr>  pymbolic/mapper/__init__.py:401
│  │                                                                 │                                                        └─ 1.126 rec  loopy/target/c/codegen/expression.py:110
│  │                                                                 │                                                           └─ 1.111 __call__  pymbolic/mapper/__init__.py:114
│  │                                                                 │                                                                 [2 frames hidden]  pymbolic
│  │                                                                 │                                                                    1.072 map_product  loopy/target/c/codegen/expression.py:610
│  │                                                                 │                                                                    └─ 1.058 base_impl  loopy/target/c/codegen/expression.py:611
│  │                                                                 │                                                                       └─ 1.045 map_product  pymbolic/mapper/__init__.py:403
│  │                                                                 │                                                                             [32 frames hidden]  pymbolic
│  │                                                                 └─ 5.729 gen_code  loopy/codegen/control.py:456
│  │                                                                    └─ 5.725 generate_code_for_sched_index  loopy/codegen/control.py:67
│  │                                                                       └─ 5.707 try_vectorized  loopy/codegen/__init__.py:336
│  │                                                                          └─ 5.707 <lambda>  loopy/codegen/control.py:170
│  │                                                                             └─ 5.707 generate_instruction_code  loopy/codegen/instruction.py:74
│  │                                                                                └─ 5.170 to_codegen_result  loopy/codegen/instruction.py:34
│  │                                                                                   ├─ 3.086 align_two  islpy/__init__.py:1224
│  │                                                                                   │     [219 frames hidden]  islpy
│  │                                                                                   └─ 1.756 wrapper  islpy/__init__.py:911
│  │                                                                                         [50 frames hidden]  islpy
│  │                                                                                            1.716 gist  islpy/_isl.py:59605
│  │                                                                                            └─ 1.686 Lib.isl_set_gist  <built-in>:0
│  ├─ 11.540 get_one_scheduled_kernel  loopy/schedule/__init__.py:2134
│  │  └─ 11.540 get_one_linearized_kernel  loopy/schedule/__init__.py:2143
│  │     └─ 11.539 _get_one_scheduled_kernel_inner  loopy/schedule/__init__.py:2121
│  │        └─ 11.503 generate_loop_schedules  loopy/schedule/__init__.py:1929
│  │           └─ 11.503 generate_loop_schedules_inner  loopy/schedule/__init__.py:1945
│  │              ├─ 9.040 pre_schedule_checks  loopy/check.py:799
│  │              │  ├─ 5.438 check_variable_access_ordered  loopy/check.py:762
│  │              │  │  └─ 5.438 _check_variable_access_ordered_inner  loopy/check.py:604
│  │              │  │     ├─ 3.505 do_access_ranges_overlap_conservative  loopy/symbolic.py:2194
│  │              │  │     │  ├─ 2.047 _get_access_range_for_var  loopy/symbolic.py:2179
│  │              │  │     │  │  └─ 1.882 wrapper  pytools/__init__.py:675
│  │              │  │     │  │     └─ 1.824 _get_access_ranges  loopy/symbolic.py:2154
│  │              │  │     │  │        └─ 1.725 __call__  pymbolic/mapper/__init__.py:114
│  │              │  │     │  │              [4 frames hidden]  pymbolic
│  │              │  │     │  │                 1.723 map_subscript  loopy/symbolic.py:2049
│  │              │  │     │  │                 └─ 1.695 get_access_map  loopy/symbolic.py:1906
│  │              │  │     │  │                    └─ 0.969 guarded_aff_from_expr  loopy/symbolic.py:1514
│  │              │  │     │  │                       └─ 0.965 with_aff_conversion_guard  loopy/symbolic.py:1492
│  │              │  │     │  │                          └─ 0.891 aff_from_expr  loopy/symbolic.py:1473
│  │              │  │     │  │                             └─ 0.870 pwaff_from_expr  loopy/symbolic.py:1488
│  │              │  │     │  └─ 1.257 obj_and  islpy/__init__.py:295
│  │              │  │     │        [38 frames hidden]  islpy
│  │              │  │     └─ 0.968 discard_dep_reqs_in_order  loopy/check.py:663
│  │              │  ├─ 1.354 check_for_integer_subscript_indices  loopy/check.py:114
│  │              │  │  └─ 1.321 __call__  loopy/type_inference.py:60
│  │              │  │     └─ 1.315 __call__  pymbolic/mapper/__init__.py:114
│  │              │  │           [3 frames hidden]  pymbolic
│  │              │  │              1.289 map_sum  loopy/type_inference.py:170
│  │              │  │              └─ 1.128 __call__  pymbolic/mapper/__init__.py:114
│  │              │  │                    [3 frames hidden]  pymbolic
│  │              │  │                       1.038 map_sum  loopy/type_inference.py:170
│  │              │  └─ 1.156 check_bounds  loopy/check.py:460
│  │              └─ 1.758 insert_barriers  loopy/schedule/__init__.py:1776
│  │                 └─ 1.550 insert_barriers  loopy/schedule/__init__.py:1776
│  │                    └─ 1.078 insert_barriers_at_outer_level  loopy/schedule/__init__.py:1789
│  └─ 10.188 preprocess_kernel  loopy/preprocess.py:2030
│     ├─ 6.725 wrapper  loopy/transform/iname.py:1218
│     │  └─ 5.887 realize_reduction  loopy/preprocess.py:881
│     │     ├─ 3.210 __call__  pymbolic/mapper/__init__.py:114
│     │     │     [154 frames hidden]  pymbolic
│     │     │        2.462 map_reduction  loopy/symbolic.py:1815
│     │     │        └─ 2.462 map_reduction  loopy/preprocess.py:1690
│     │     │           └─ 2.445 map_reduction_seq  loopy/preprocess.py:1004
│     │     │              └─ 2.427 wrapper  pytools/__init__.py:675
│     │     │                 └─ 2.426 find_most_recent_global_barrier  loopy/kernel/tools.py:1655
│     │     │                    └─ 2.171 <genexpr>  loopy/kernel/tools.py:1670
│     │     │                       └─ 1.903 _is_global_barrier  loopy/kernel/tools.py:1583
│     │     └─ 2.525 replace_instruction_ids  loopy/transform/instruction.py:172
│     │        └─ 1.875 [self]  
│     ├─ 1.383 realize_ilp  loopy/preprocess.py:1965
│     │  └─ 1.383 privatize_temporaries_with_inames  loopy/transform/privatize.py:72
│     │     └─ 1.258 with_transformed_expressions  loopy/kernel/instruction.py:872
│     │        └─ 1.077 __call__  pymbolic/mapper/__init__.py:114
│     │              [136 frames hidden]  pymbolic
│     └─ 0.937 check_reduction_iname_uniqueness  loopy/preprocess.py:95
│        └─ 0.933 with_transformed_expressions  loopy/kernel/instruction.py:872
├─ 35.939 get_optimized_kernel  sumpy/e2e.py:127
│  ├─ 32.290 get_kernel  sumpy/e2e.py:146
│  │  ├─ 18.293 make_kernel  loopy/kernel/creation.py:1821
│  │  │  ├─ 4.981 duplicate_inames  loopy/transform/iname.py:818
│  │  │  │  ├─ 2.807 map_kernel  loopy/symbolic.py:995
│  │  │  │  │  └─ 2.803 <listcomp>  loopy/symbolic.py:1000
│  │  │  │  │     └─ 2.763 with_transformed_expressions  loopy/kernel/instruction.py:872
│  │  │  │  │        └─ 2.414 <lambda>  loopy/symbolic.py:1002
│  │  │  │  │           └─ 2.400 __call__  loopy/symbolic.py:981
│  │  │  │  │              └─ 2.252 __call__  pymbolic/mapper/__init__.py:114
│  │  │  │  │                    [181 frames hidden]  pymbolic
│  │  │  │  └─ 2.158 finish_kernel  loopy/symbolic.py:899
│  │  │  │     └─ 2.157 rename_subst_rules_in_instructions  loopy/symbolic.py:788
│  │  │  │        └─ 2.157 <listcomp>  loopy/symbolic.py:792
│  │  │  │           └─ 2.152 with_transformed_expressions  loopy/kernel/instruction.py:872
│  │  │  │              └─ 1.817 __call__  pymbolic/mapper/__init__.py:114
│  │  │  │                    [181 frames hidden]  pymbolic
│  │  │  ├─ 3.111 fix_parameters  loopy/transform/parameter.py:134
│  │  │  │  └─ 3.111 _fix_parameter  loopy/transform/parameter.py:67
│  │  │  │     ├─ 1.871 map_kernel  loopy/symbolic.py:995
│  │  │  │     │  └─ 1.707 <listcomp>  loopy/symbolic.py:1000
│  │  │  │     │     └─ 1.703 with_transformed_expressions  loopy/kernel/instruction.py:872
│  │  │  │     │        └─ 1.503 <lambda>  loopy/symbolic.py:1002
│  │  │  │     │           └─ 1.496 __call__  loopy/symbolic.py:981
│  │  │  │     │              └─ 1.422 __call__  pymbolic/mapper/__init__.py:114
│  │  │  │     │                    [158 frames hidden]  pymbolic
│  │  │  │     └─ 1.070 finish_kernel  loopy/symbolic.py:899
│  │  │  │        └─ 1.070 rename_subst_rules_in_instructions  loopy/symbolic.py:788
│  │  │  │           └─ 1.070 <listcomp>  loopy/symbolic.py:792
│  │  │  │              └─ 1.064 with_transformed_expressions  loopy/kernel/instruction.py:872
│  │  │  │                 └─ 0.877 __call__  pymbolic/mapper/__init__.py:114
│  │  │  │                       [158 frames hidden]  pymbolic
│  │  │  ├─ 1.760 determine_shapes_of_temporaries  loopy/kernel/creation.py:1512
│  │  │  │  └─ 1.408 find_shapes_of_vars  loopy/kernel/creation.py:1463
│  │  │  │     └─ 1.387 feed_all_expressions  loopy/kernel/creation.py:1523
│  │  │  │        └─ 1.384 with_transformed_expressions  loopy/kernel/instruction.py:872
│  │  │  │           └─ 1.204 <lambda>  loopy/kernel/creation.py:1526
│  │  │  │              └─ 1.203 run_through_armap  loopy/kernel/creation.py:1469
│  │  │  │                 └─ 1.196 __call__  pymbolic/mapper/__init__.py:114
│  │  │  │                       [195 frames hidden]  pymbolic
│  │  │  ├─ 1.663 __init__  loopy/kernel/creation.py:1080
│  │  │  │  └─ 0.874 __call__  pymbolic/mapper/__init__.py:114
│  │  │  │        [165 frames hidden]  pymbolic
│  │  │  ├─ 1.254 guess_arg_shape_if_requested  loopy/kernel/creation.py:1610
│  │  │  │  └─ 1.254 guess_var_shape  loopy/kernel/tools.py:985
│  │  │  │     └─ 1.248 with_transformed_expressions  loopy/kernel/instruction.py:872
│  │  │  │        └─ 1.087 run_through_armap  loopy/kernel/tools.py:992
│  │  │  ├─ 1.226 guess_kernel_args_if_requested  loopy/kernel/creation.py:1170
│  │  │  │  └─ 1.215 make_new_arg  loopy/kernel/creation.py:1132
│  │  │  │     └─ 1.215 find_index_rank  loopy/kernel/creation.py:1116
│  │  │  │        └─ 1.212 with_transformed_expressions  loopy/kernel/instruction.py:872
│  │  │  │           └─ 1.059 run_irf  loopy/kernel/creation.py:1119
│  │  │  │              └─ 1.045 __call__  pymbolic/mapper/__init__.py:114
│  │  │  │                    [202 frames hidden]  pymbolic
│  │  │  └─ 1.146 expand_cses  loopy/kernel/creation.py:1321
│  │  │     └─ 0.979 __call__  pymbolic/mapper/__init__.py:114
│  │  │           [154 frames hidden]  pymbolic
│  │  └─ 13.927 get_translation_loopy_insns  sumpy/e2e.py:91
│  │     ├─ 9.054 to_loopy_insns  sumpy/codegen.py:672
│  │     │  ├─ 6.263 <listcomp>  sumpy/codegen.py:724
│  │     │  │  └─ 5.732 convert_expr  sumpy/codegen.py:705
│  │     │  │     └─ 5.685 __call__  pymbolic/mapper/__init__.py:114
│  │     │  │           [175 frames hidden]  pymbolic
│  │     │  └─ 1.098 kill_trivial_assignments  sumpy/codegen.py:154
│  │     │     └─ 1.074 substitute  pymbolic/mapper/substitutor.py:72
│  │     │           [168 frames hidden]  pymbolic
│  │     ├─ 3.728 run_global_cse  sumpy/assignment_collection.py:177
│  │     │  └─ 3.720 cse  sumpy/cse.py:550
│  │     │     └─ 2.980 opt_cse  sumpy/cse.py:357
│  │     │        └─ 2.582 match_common_args  sumpy/cse.py:266
│  │     │           └─ 0.898 get_subset_candidates  sumpy/cse.py:218
│  │     └─ 1.122 translate_from  sumpy/expansion/local.py:182
│  └─ 3.638 split_iname  loopy/transform/iname.py:334
│     └─ 3.635 _split_iname_backend  loopy/transform/iname.py:211
│        ├─ 1.407 map_kernel  loopy/symbolic.py:995
│        │  └─ 1.192 <listcomp>  loopy/symbolic.py:1000
│        │     └─ 1.189 with_transformed_expressions  loopy/kernel/instruction.py:872
│        │        └─ 1.023 <lambda>  loopy/symbolic.py:1002
│        │           └─ 1.019 __call__  loopy/symbolic.py:981
│        │              └─ 0.955 __call__  pymbolic/mapper/__init__.py:114
│        │                    [148 frames hidden]  pymbolic
│        └─ 1.227 finish_kernel  loopy/symbolic.py:899
│           └─ 1.227 rename_subst_rules_in_instructions  loopy/symbolic.py:788
│              └─ 1.227 <listcomp>  loopy/symbolic.py:792
│                 └─ 1.219 with_transformed_expressions  loopy/kernel/instruction.py:872
│                    └─ 1.046 __call__  pymbolic/mapper/__init__.py:114
│                          [151 frames hidden]  pymbolic
└─ 7.943 add_and_infer_dtypes  loopy/kernel/tools.py:106
   └─ 7.939 infer_unknown_types  loopy/type_inference.py:485
      ├─ 5.006 <dictcomp>  loopy/type_inference.py:527
      │  └─ 4.999 <setcomp>  loopy/type_inference.py:528
      │     └─ 4.866 [self]  
      └─ 2.537 _infer_var_type  loopy/type_inference.py:407
         ├─ 1.570 __call__  loopy/type_inference.py:60
         │  └─ 1.561 __call__  pymbolic/mapper/__init__.py:114
         │        [2 frames hidden]  pymbolic
         │           1.289 map_sum  loopy/type_inference.py:170
         │           └─ 1.123 __call__  pymbolic/mapper/__init__.py:114
         │                 [2 frames hidden]  pymbolic
         │                    0.993 map_sum  loopy/type_inference.py:170
         └─ 0.853 __call__  pymbolic/mapper/__init__.py:114
               [153 frames hidden]  pymbolic

isuruf commented 3 years ago

@inducer, https://github.com/inducer/pymbolic/pull/37 didn't help. Any other suggestions?

isuruf commented 3 years ago

The align_two call at https://github.com/inducer/loopy/blob/186f5095a54982b7eb2fda5e4b995d7c047fde1e/loopy/codegen/instruction.py#L43 takes a long time: in the profile above it accounts for about 3.1 s of the 5.2 s spent in to_codegen_result.

That's fixed by https://github.com/inducer/loopy/pull/280
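
For anyone wanting to see which frames recur across passes in a profile like the one earlier in this thread (for example, with_transformed_expressions shows up under duplicate_inames, fix_parameters, split_iname and others), here is a minimal sketch that tallies the reported seconds per (function, location) pair. It assumes the profile has been pasted into a string in the same tree layout as above. The names `FRAME_RE` and `tally_frames` are made up for illustration and are not part of loopy, pymbolic, or the profiler. The reported times are cumulative, so a sum is only meaningful for frames that do not nest inside one another.

```python
import re
from collections import defaultdict

# Hypothetical helper, not part of any library: matches the
# "<seconds> <function>  <path>:<line>" tail of a profile tree line.
# "[self]" and "[N frames hidden]" lines have no location and are skipped.
FRAME_RE = re.compile(r"(\d+\.\d+)\s+(\S+)\s+(\S+):(\d+)\s*$")


def tally_frames(profile_text):
    """Sum the reported seconds for each (function, "path:line") pair."""
    totals = defaultdict(float)
    for line in profile_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue
        seconds, func, path, lineno = match.groups()
        totals[(func, f"{path}:{lineno}")] += float(seconds)
    return dict(totals)


if __name__ == "__main__":
    # Two lines copied from the profile above, plus two that are skipped.
    sample = "\n".join([
        "│  │  │  │  │     └─ 2.763 with_transformed_expressions  loopy/kernel/instruction.py:872",
        "│  │  │  │           └─ 2.152 with_transformed_expressions  loopy/kernel/instruction.py:872",
        "│     │        └─ 1.875 [self]  ",
        "│  │  │  │                    [181 frames hidden]  pymbolic",
    ])
    ranked = sorted(tally_frames(sample).items(), key=lambda kv: -kv[1])
    for (func, where), seconds in ranked:
        print(f"{seconds:8.3f}  {func}  {where}")
```

Sorting the result by total gives a rough view of which traversal entry points are worth caching or merging, but it does not replace reading the tree itself.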