Open isuruf opened 3 years ago
This is the cProfile log, sorted by cumulative time spent. There doesn't seem to be any obvious low-hanging fruit in this case:
146881842 function calls (140188954 primitive calls) in 97.556 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.019 0.019 96.991 96.991 loopy/loopy/codegen/__init__.py:404(generate_code_v2)
1588528 1.643 0.000 41.446 0.000 pymbolic/pymbolic/mapper/__init__.py:109(__call__)
1 0.000 0.000 36.843 36.843 loopy/loopy/schedule/__init__.py:2134(get_one_scheduled_kernel)
1 0.000 0.000 36.843 36.843 loopy/loopy/schedule/__init__.py:2143(get_one_linearized_kernel)
1 0.000 0.000 36.842 36.842 loopy/loopy/schedule/__init__.py:2121(_get_one_scheduled_kernel_inner)
2 0.003 0.001 36.842 18.421 loopy/loopy/schedule/__init__.py:1945(generate_loop_schedules_inner)
1 0.012 0.012 36.569 36.569 loopy/loopy/preprocess.py:2030(preprocess_kernel)
107846 0.113 0.000 36.025 0.001 {built-in method builtins.next}
2 0.000 0.000 35.952 17.976 loopy/loopy/schedule/__init__.py:1929(generate_loop_schedules)
9335052 2.563 0.000 32.603 0.000 pytools/__init__.py:675(wrapper)
1 0.000 0.000 30.326 30.326 loopy/loopy/transform/iname.py:1218(wrapper)
1 0.084 0.084 28.915 28.915 loopy/loopy/preprocess.py:881(realize_reduction)
507 0.001 0.000 23.352 0.046 loopy/loopy/symbolic.py:1815(map_reduction)
169 0.002 0.000 23.348 0.138 loopy/loopy/preprocess.py:1690(map_reduction)
169 0.006 0.000 23.318 0.138 loopy/loopy/preprocess.py:1004(map_reduction_seq)
7868 14.844 0.002 23.292 11.646 loopy/loopy/schedule/__init__.py:807(generate_loop_schedules_internal)
169 0.000 0.000 23.278 0.138 loopy/loopy/kernel/tools.py:1655(find_most_recent_global_barrier)
169 3.350 0.020 23.087 0.137 loopy/loopy/kernel/tools.py:1590(get_global_barrier_order)
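For reference, a log in this format (sorted by cumulative time) can be produced with the standard-library `cProfile` and `pstats` modules. This is only a minimal sketch: `workload` is a hypothetical stand-in for the real call being profiled (for example the code generation step), not the actual reproduction script.

```python
import cProfile
import io
import pstats


def workload(n=200):
    """Hypothetical stand-in for the expensive call being profiled."""
    return sum(i * j for i in range(n) for j in range(n))


# Collect profiling data only around the call of interest.
profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Render the statistics ordered by cumulative time, keeping the top 25 rows.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
stats.sort_stats("cumulative").print_stats(25)
report = buf.getvalue()
print(report)
```

Sorting by `"tottime"` instead of `"cumulative"` lists the functions that spend the most time in their own body, which can be easier to act on when the cumulative view is dominated by wrappers and recursive mappers.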
If anyone wishes to reproduce this, here is the script.
Here's the pyinstrument profile:
136.566 <module> loopy_reproduce.py:1
├─ 73.198 generate_code_v2 loopy/codegen/__init__.py:404
│ ├─ 32.329 preprocess_kernel loopy/preprocess.py:2030
│ │ ├─ 26.890 wrapper loopy/transform/iname.py:1218
│ │ │ └─ 25.818 realize_reduction loopy/preprocess.py:881
│ │ │ ├─ 22.571 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [162 frames hidden] pymbolic
│ │ │ │ 21.620 map_reduction loopy/symbolic.py:1815
│ │ │ │ └─ 21.620 map_reduction loopy/preprocess.py:1690
│ │ │ │ └─ 21.579 map_reduction_seq loopy/preprocess.py:1004
│ │ │ │ └─ 21.565 wrapper pytools/__init__.py:675
│ │ │ │ └─ 21.563 find_most_recent_global_barrier loopy/kernel/tools.py:1655
│ │ │ │ └─ 21.562 wrapper pytools/__init__.py:675
│ │ │ │ └─ 21.314 get_global_barrier_order loopy/kernel/tools.py:1590
│ │ │ │ ├─ 11.304 compute_topological_order pytools/graph.py:210
│ │ │ │ │ ├─ 7.439 [self]
│ │ │ │ │ ├─ 1.473 __lt__ pytools/graph.py:206
│ │ │ │ │ └─ 1.424 dict.get <built-in>:0
│ │ │ │ ├─ 3.834 <listcomp> loopy/kernel/tools.py:1606
│ │ │ │ │ └─ 3.571 _is_global_barrier loopy/kernel/tools.py:1583
│ │ │ │ ├─ 2.835 [self]
│ │ │ │ └─ 2.336 <dictcomp> loopy/kernel/tools.py:1597
│ │ │ └─ 3.122 replace_instruction_ids loopy/transform/instruction.py:172
│ │ │ └─ 2.455 [self]
│ │ └─ 1.811 realize_ilp loopy/preprocess.py:1965
│ │ └─ 1.811 privatize_temporaries_with_inames loopy/transform/privatize.py:72
│ ├─ 24.813 generate_host_or_device_program loopy/codegen/result.py:286
│ │ └─ 24.804 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 24.702 build_insn_group loopy/codegen/control.py:330
│ │ └─ 24.702 gen_code loopy/codegen/control.py:456
│ │ └─ 24.702 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 24.675 generate_host_or_device_program loopy/codegen/result.py:286
│ │ └─ 23.941 set_up_hw_parallel_loops loopy/codegen/loop.py:231
│ │ └─ 23.894 set_up_hw_parallel_loops loopy/codegen/loop.py:231
│ │ └─ 23.883 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 23.807 build_insn_group loopy/codegen/control.py:330
│ │ └─ 23.786 gen_code loopy/codegen/control.py:456
│ │ └─ 23.786 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 23.786 generate_sequential_loop_dim_code loopy/codegen/loop.py:347
│ │ └─ 23.766 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 23.702 build_insn_group loopy/codegen/control.py:330
│ │ └─ 23.447 build_insn_group loopy/codegen/control.py:330
│ │ └─ 23.415 build_insn_group loopy/codegen/control.py:330
│ │ └─ 22.938 gen_code loopy/codegen/control.py:456
│ │ └─ 22.938 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 22.938 generate_sequential_loop_dim_code loopy/codegen/loop.py:347
│ │ └─ 22.911 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 22.466 build_insn_group loopy/codegen/control.py:330
│ │ ├─ 13.254 gen_code loopy/codegen/control.py:456
│ │ │ └─ 13.243 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ │ └─ 13.147 try_vectorized loopy/codegen/__init__.py:336
│ │ │ └─ 13.139 <lambda> loopy/codegen/control.py:170
│ │ │ └─ 13.138 generate_instruction_code loopy/codegen/instruction.py:74
│ │ │ ├─ 11.017 to_codegen_result loopy/codegen/instruction.py:34
│ │ │ │ ├─ 6.939 align_two islpy/__init__.py:1224
│ │ │ │ │ [220 frames hidden] islpy
│ │ │ │ └─ 3.207 wrapper islpy/__init__.py:911
│ │ │ │ [68 frames hidden] islpy
│ │ │ │ 3.078 gist islpy/_isl.py:59605
│ │ │ │ └─ 2.801 Lib.isl_set_gist <built-in>:0
│ │ │ └─ 2.024 generate_assignment_instruction_code loopy/codegen/instruction.py:102
│ │ │ └─ 1.813 emit_assignment loopy/target/c/__init__.py:868
│ │ │ └─ 1.575 __call__ loopy/target/c/codegen/expression.py:118
│ │ │ └─ 1.506 rec loopy/target/c/codegen/expression.py:110
│ │ └─ 9.191 build_insn_group loopy/codegen/control.py:330
│ │ └─ 9.151 build_insn_group loopy/codegen/control.py:330
│ │ └─ 9.110 build_insn_group loopy/codegen/control.py:330
│ │ └─ 9.108 gen_code loopy/codegen/control.py:456
│ │ └─ 9.106 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 9.034 try_vectorized loopy/codegen/__init__.py:336
│ │ └─ 9.030 <lambda> loopy/codegen/control.py:170
│ │ └─ 9.028 generate_instruction_code loopy/codegen/instruction.py:74
│ │ ├─ 5.083 to_codegen_result loopy/codegen/instruction.py:34
│ │ │ └─ 3.875 align_two islpy/__init__.py:1224
│ │ │ [221 frames hidden] islpy
│ │ └─ 3.887 generate_assignment_instruction_code loopy/codegen/instruction.py:102
│ │ └─ 3.723 emit_assignment loopy/target/c/__init__.py:868
│ │ └─ 3.603 __call__ loopy/target/c/codegen/expression.py:118
│ │ └─ 3.572 rec loopy/target/c/codegen/expression.py:110
│ │ ├─ 2.043 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [2 frames hidden] pymbolic
│ │ │ 1.668 map_sum loopy/target/c/codegen/expression.py:561
│ │ │ └─ 1.667 base_impl loopy/target/c/codegen/expression.py:562
│ │ │ └─ 1.667 map_sum pymbolic/mapper/__init__.py:398
│ │ │ [16 frames hidden] pymbolic
│ │ │ 1.599 <genexpr> pymbolic/mapper/__init__.py:401
│ │ │ └─ 1.578 rec loopy/target/c/codegen/expression.py:110
│ │ │ └─ 1.564 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [2 frames hidden] pymbolic
│ │ │ 1.526 map_product loopy/target/c/codegen/expression.py:610
│ │ │ └─ 1.518 base_impl loopy/target/c/codegen/expression.py:611
│ │ │ └─ 1.494 map_product pymbolic/mapper/__init__.py:403
│ │ │ [32 frames hidden] pymbolic
│ │ └─ 1.514 infer_type loopy/target/c/codegen/expression.py:78
│ │ └─ 1.483 __call__ loopy/type_inference.py:60
│ │ └─ 1.472 __call__ pymbolic/mapper/__init__.py:114
│ │ [2 frames hidden] pymbolic
│ │ 1.468 map_sum loopy/type_inference.py:170
│ └─ 14.247 get_one_scheduled_kernel loopy/schedule/__init__.py:2134
│ └─ 14.247 get_one_linearized_kernel loopy/schedule/__init__.py:2143
│ └─ 14.246 _get_one_scheduled_kernel_inner loopy/schedule/__init__.py:2121
│ └─ 14.206 generate_loop_schedules loopy/schedule/__init__.py:1929
│ └─ 14.206 generate_loop_schedules_inner loopy/schedule/__init__.py:1945
│ ├─ 10.221 pre_schedule_checks loopy/check.py:799
│ │ ├─ 5.407 check_variable_access_ordered loopy/check.py:762
│ │ │ └─ 5.407 _check_variable_access_ordered_inner loopy/check.py:604
│ │ │ └─ 3.656 do_access_ranges_overlap_conservative loopy/symbolic.py:2194
│ │ │ └─ 2.114 _get_access_range_for_var loopy/symbolic.py:2179
│ │ │ └─ 1.982 wrapper pytools/__init__.py:675
│ │ │ └─ 1.930 _get_access_ranges loopy/symbolic.py:2154
│ │ │ └─ 1.819 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [2 frames hidden] pymbolic
│ │ │ 1.817 map_subscript loopy/symbolic.py:2049
│ │ │ └─ 1.765 get_access_map loopy/symbolic.py:1906
│ │ ├─ 1.720 check_for_integer_subscript_indices loopy/check.py:114
│ │ │ └─ 1.679 __call__ loopy/type_inference.py:60
│ │ │ └─ 1.671 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [3 frames hidden] pymbolic
│ │ │ 1.650 map_sum loopy/type_inference.py:170
│ │ │ └─ 1.448 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [2 frames hidden] pymbolic
│ │ └─ 1.610 check_bounds loopy/check.py:460
│ └─ 2.507 insert_barriers loopy/schedule/__init__.py:1776
│ └─ 2.125 insert_barriers loopy/schedule/__init__.py:1776
│ └─ 1.438 insert_barriers_at_outer_level loopy/schedule/__init__.py:1789
├─ 52.686 get_optimized_kernel sumpy/e2e.py:127
│ ├─ 47.370 get_kernel sumpy/e2e.py:146
│ │ ├─ 25.594 make_kernel loopy/kernel/creation.py:1821
│ │ │ ├─ 7.046 duplicate_inames loopy/transform/iname.py:818
│ │ │ │ ├─ 3.649 map_kernel loopy/symbolic.py:995
│ │ │ │ │ └─ 3.645 <listcomp> loopy/symbolic.py:1000
│ │ │ │ │ └─ 3.568 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ │ └─ 2.990 <lambda> loopy/symbolic.py:1002
│ │ │ │ │ └─ 2.975 __call__ loopy/symbolic.py:981
│ │ │ │ │ └─ 2.742 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ │ [184 frames hidden] pymbolic
│ │ │ │ └─ 3.377 finish_kernel loopy/symbolic.py:899
│ │ │ │ └─ 3.376 rename_subst_rules_in_instructions loopy/symbolic.py:788
│ │ │ │ └─ 3.376 <listcomp> loopy/symbolic.py:792
│ │ │ │ └─ 3.361 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 2.770 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [171 frames hidden] pymbolic
│ │ │ ├─ 4.475 fix_parameters loopy/transform/parameter.py:134
│ │ │ │ └─ 4.475 _fix_parameter loopy/transform/parameter.py:67
│ │ │ │ ├─ 2.478 map_kernel loopy/symbolic.py:995
│ │ │ │ │ └─ 2.203 <listcomp> loopy/symbolic.py:1000
│ │ │ │ │ └─ 2.192 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ │ └─ 1.891 <lambda> loopy/symbolic.py:1002
│ │ │ │ │ └─ 1.882 __call__ loopy/symbolic.py:981
│ │ │ │ │ └─ 1.780 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ │ [163 frames hidden] pymbolic
│ │ │ │ └─ 1.703 finish_kernel loopy/symbolic.py:899
│ │ │ │ └─ 1.703 rename_subst_rules_in_instructions loopy/symbolic.py:788
│ │ │ │ └─ 1.703 <listcomp> loopy/symbolic.py:792
│ │ │ │ └─ 1.693 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.394 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [168 frames hidden] pymbolic
│ │ │ ├─ 2.530 determine_shapes_of_temporaries loopy/kernel/creation.py:1512
│ │ │ │ └─ 1.912 find_shapes_of_vars loopy/kernel/creation.py:1463
│ │ │ │ └─ 1.880 feed_all_expressions loopy/kernel/creation.py:1523
│ │ │ │ └─ 1.871 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.587 <lambda> loopy/kernel/creation.py:1526
│ │ │ │ └─ 1.584 run_through_armap loopy/kernel/creation.py:1469
│ │ │ │ └─ 1.564 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [218 frames hidden] pymbolic
│ │ │ ├─ 2.095 __init__ loopy/kernel/creation.py:1080
│ │ │ ├─ 1.769 guess_arg_shape_if_requested loopy/kernel/creation.py:1610
│ │ │ │ └─ 1.769 guess_var_shape loopy/kernel/tools.py:985
│ │ │ │ └─ 1.758 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.453 run_through_armap loopy/kernel/tools.py:992
│ │ │ ├─ 1.690 guess_kernel_args_if_requested loopy/kernel/creation.py:1170
│ │ │ │ └─ 1.670 make_new_arg loopy/kernel/creation.py:1132
│ │ │ │ └─ 1.670 find_index_rank loopy/kernel/creation.py:1116
│ │ │ │ └─ 1.660 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.392 run_irf loopy/kernel/creation.py:1119
│ │ │ │ └─ 1.368 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [220 frames hidden] pymbolic
│ │ │ └─ 1.464 expand_cses loopy/kernel/creation.py:1321
│ │ └─ 21.709 get_translation_loopy_insns sumpy/e2e.py:91
│ │ ├─ 16.169 to_loopy_insns sumpy/codegen.py:679
│ │ │ ├─ 8.331 <listcomp> sumpy/codegen.py:731
│ │ │ │ └─ 7.319 convert_expr sumpy/codegen.py:712
│ │ │ │ └─ 7.236 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [187 frames hidden] pymbolic
│ │ │ └─ 5.620 kill_trivial_assignments sumpy/codegen.py:161
│ │ │ ├─ 2.872 substitute pymbolic/mapper/substitutor.py:72
│ │ │ │ [212 frames hidden] pymbolic
│ │ │ │ 1.436 dict.copy <built-in>:0
│ │ │ └─ 1.480 make_one_step_subst sumpy/codegen.py:78
│ │ └─ 4.305 run_global_cse sumpy/assignment_collection.py:164
│ │ └─ 4.291 cse sumpy/cse.py:550
│ │ └─ 3.400 opt_cse sumpy/cse.py:357
│ │ └─ 2.921 match_common_args sumpy/cse.py:266
│ └─ 5.299 split_iname loopy/transform/iname.py:334
│ └─ 5.294 _split_iname_backend loopy/transform/iname.py:211
│ ├─ 2.243 map_kernel loopy/symbolic.py:995
│ │ └─ 1.868 <listcomp> loopy/symbolic.py:1000
│ │ └─ 1.853 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ └─ 1.560 <lambda> loopy/symbolic.py:1002
│ │ └─ 1.552 __call__ loopy/symbolic.py:981
│ │ └─ 1.442 __call__ pymbolic/mapper/__init__.py:114
│ │ [161 frames hidden] pymbolic
│ └─ 1.687 finish_kernel loopy/symbolic.py:899
│ └─ 1.687 rename_subst_rules_in_instructions loopy/symbolic.py:788
│ └─ 1.687 <listcomp> loopy/symbolic.py:792
│ └─ 1.673 with_transformed_expressions loopy/kernel/instruction.py:872
│ └─ 1.375 __call__ pymbolic/mapper/__init__.py:114
│ [173 frames hidden] pymbolic
└─ 8.944 add_and_infer_dtypes loopy/kernel/tools.py:106
└─ 8.937 infer_unknown_types loopy/type_inference.py:485
├─ 4.969 <dictcomp> loopy/type_inference.py:527
│ └─ 4.954 <setcomp> loopy/type_inference.py:528
│ └─ 4.709 [self]
└─ 3.164 _infer_var_type loopy/type_inference.py:407
└─ 1.823 __call__ loopy/type_inference.py:60
└─ 1.812 __call__ pymbolic/mapper/__init__.py:114
[2 frames hidden] pymbolic
1.734 map_sum loopy/type_inference.py:170
└─ 1.523 __call__ pymbolic/mapper/__init__.py:114
[2 frames hidden] pymbolic
After a couple of improvements to loopy and sumpy (derivtaker branch), the pyinstrument output is now:
84.866 <module> loopy_reproduce.py:1
├─ 39.403 generate_code_v2 loopy/codegen/__init__.py:404
│ ├─ 16.501 generate_host_or_device_program loopy/codegen/result.py:286
│ │ └─ 16.494 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 16.419 build_insn_group loopy/codegen/control.py:330
│ │ └─ 16.419 gen_code loopy/codegen/control.py:456
│ │ └─ 16.418 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 16.401 generate_host_or_device_program loopy/codegen/result.py:286
│ │ └─ 15.954 set_up_hw_parallel_loops loopy/codegen/loop.py:231
│ │ └─ 15.923 set_up_hw_parallel_loops loopy/codegen/loop.py:231
│ │ └─ 15.916 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 15.869 build_insn_group loopy/codegen/control.py:330
│ │ └─ 15.859 gen_code loopy/codegen/control.py:456
│ │ └─ 15.859 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 15.859 generate_sequential_loop_dim_code loopy/codegen/loop.py:347
│ │ └─ 15.841 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 15.797 build_insn_group loopy/codegen/control.py:330
│ │ └─ 15.188 build_insn_group loopy/codegen/control.py:330
│ │ └─ 15.155 build_insn_group loopy/codegen/control.py:330
│ │ └─ 14.651 gen_code loopy/codegen/control.py:456
│ │ └─ 14.651 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 14.650 generate_sequential_loop_dim_code loopy/codegen/loop.py:347
│ │ └─ 14.629 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 14.391 build_insn_group loopy/codegen/control.py:330
│ │ ├─ 8.649 build_insn_group loopy/codegen/control.py:330
│ │ │ └─ 8.607 build_insn_group loopy/codegen/control.py:330
│ │ │ └─ 8.563 build_insn_group loopy/codegen/control.py:330
│ │ │ └─ 8.561 gen_code loopy/codegen/control.py:456
│ │ │ └─ 8.559 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ │ └─ 8.538 try_vectorized loopy/codegen/__init__.py:336
│ │ │ └─ 8.537 <lambda> loopy/codegen/control.py:170
│ │ │ └─ 8.537 generate_instruction_code loopy/codegen/instruction.py:74
│ │ │ ├─ 5.304 to_codegen_result loopy/codegen/instruction.py:34
│ │ │ │ ├─ 3.620 align_two islpy/__init__.py:1224
│ │ │ │ │ [218 frames hidden] islpy
│ │ │ │ └─ 1.243 wrapper islpy/__init__.py:911
│ │ │ │ [63 frames hidden] islpy
│ │ │ │ 1.194 gist islpy/_isl.py:59605
│ │ │ │ └─ 1.140 Lib.isl_set_gist <built-in>:0
│ │ │ └─ 3.211 generate_assignment_instruction_code loopy/codegen/instruction.py:102
│ │ │ └─ 3.165 emit_assignment loopy/target/c/__init__.py:868
│ │ │ └─ 3.120 __call__ loopy/target/c/codegen/expression.py:118
│ │ │ └─ 3.103 rec loopy/target/c/codegen/expression.py:110
│ │ │ ├─ 1.642 infer_type loopy/target/c/codegen/expression.py:78
│ │ │ │ └─ 1.633 __call__ loopy/type_inference.py:60
│ │ │ │ └─ 1.627 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [2 frames hidden] pymbolic
│ │ │ │ 1.622 map_sum loopy/type_inference.py:170
│ │ │ │ └─ 1.511 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [3 frames hidden] pymbolic
│ │ │ │ 1.432 map_sum loopy/type_inference.py:170
│ │ │ └─ 1.458 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [2 frames hidden] pymbolic
│ │ │ 1.228 map_sum loopy/target/c/codegen/expression.py:561
│ │ │ └─ 1.226 base_impl loopy/target/c/codegen/expression.py:562
│ │ │ └─ 1.226 map_sum pymbolic/mapper/__init__.py:398
│ │ │ [17 frames hidden] pymbolic
│ │ │ 1.143 <genexpr> pymbolic/mapper/__init__.py:401
│ │ │ └─ 1.126 rec loopy/target/c/codegen/expression.py:110
│ │ │ └─ 1.111 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [2 frames hidden] pymbolic
│ │ │ 1.072 map_product loopy/target/c/codegen/expression.py:610
│ │ │ └─ 1.058 base_impl loopy/target/c/codegen/expression.py:611
│ │ │ └─ 1.045 map_product pymbolic/mapper/__init__.py:403
│ │ │ [32 frames hidden] pymbolic
│ │ └─ 5.729 gen_code loopy/codegen/control.py:456
│ │ └─ 5.725 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 5.707 try_vectorized loopy/codegen/__init__.py:336
│ │ └─ 5.707 <lambda> loopy/codegen/control.py:170
│ │ └─ 5.707 generate_instruction_code loopy/codegen/instruction.py:74
│ │ └─ 5.170 to_codegen_result loopy/codegen/instruction.py:34
│ │ ├─ 3.086 align_two islpy/__init__.py:1224
│ │ │ [219 frames hidden] islpy
│ │ └─ 1.756 wrapper islpy/__init__.py:911
│ │ [50 frames hidden] islpy
│ │ 1.716 gist islpy/_isl.py:59605
│ │ └─ 1.686 Lib.isl_set_gist <built-in>:0
│ ├─ 11.540 get_one_scheduled_kernel loopy/schedule/__init__.py:2134
│ │ └─ 11.540 get_one_linearized_kernel loopy/schedule/__init__.py:2143
│ │ └─ 11.539 _get_one_scheduled_kernel_inner loopy/schedule/__init__.py:2121
│ │ └─ 11.503 generate_loop_schedules loopy/schedule/__init__.py:1929
│ │ └─ 11.503 generate_loop_schedules_inner loopy/schedule/__init__.py:1945
│ │ ├─ 9.040 pre_schedule_checks loopy/check.py:799
│ │ │ ├─ 5.438 check_variable_access_ordered loopy/check.py:762
│ │ │ │ └─ 5.438 _check_variable_access_ordered_inner loopy/check.py:604
│ │ │ │ ├─ 3.505 do_access_ranges_overlap_conservative loopy/symbolic.py:2194
│ │ │ │ │ ├─ 2.047 _get_access_range_for_var loopy/symbolic.py:2179
│ │ │ │ │ │ └─ 1.882 wrapper pytools/__init__.py:675
│ │ │ │ │ │ └─ 1.824 _get_access_ranges loopy/symbolic.py:2154
│ │ │ │ │ │ └─ 1.725 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ │ │ [4 frames hidden] pymbolic
│ │ │ │ │ │ 1.723 map_subscript loopy/symbolic.py:2049
│ │ │ │ │ │ └─ 1.695 get_access_map loopy/symbolic.py:1906
│ │ │ │ │ │ └─ 0.969 guarded_aff_from_expr loopy/symbolic.py:1514
│ │ │ │ │ │ └─ 0.965 with_aff_conversion_guard loopy/symbolic.py:1492
│ │ │ │ │ │ └─ 0.891 aff_from_expr loopy/symbolic.py:1473
│ │ │ │ │ │ └─ 0.870 pwaff_from_expr loopy/symbolic.py:1488
│ │ │ │ │ └─ 1.257 obj_and islpy/__init__.py:295
│ │ │ │ │ [38 frames hidden] islpy
│ │ │ │ └─ 0.968 discard_dep_reqs_in_order loopy/check.py:663
│ │ │ ├─ 1.354 check_for_integer_subscript_indices loopy/check.py:114
│ │ │ │ └─ 1.321 __call__ loopy/type_inference.py:60
│ │ │ │ └─ 1.315 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [3 frames hidden] pymbolic
│ │ │ │ 1.289 map_sum loopy/type_inference.py:170
│ │ │ │ └─ 1.128 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [3 frames hidden] pymbolic
│ │ │ │ 1.038 map_sum loopy/type_inference.py:170
│ │ │ └─ 1.156 check_bounds loopy/check.py:460
│ │ └─ 1.758 insert_barriers loopy/schedule/__init__.py:1776
│ │ └─ 1.550 insert_barriers loopy/schedule/__init__.py:1776
│ │ └─ 1.078 insert_barriers_at_outer_level loopy/schedule/__init__.py:1789
│ └─ 10.188 preprocess_kernel loopy/preprocess.py:2030
│ ├─ 6.725 wrapper loopy/transform/iname.py:1218
│ │ └─ 5.887 realize_reduction loopy/preprocess.py:881
│ │ ├─ 3.210 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [154 frames hidden] pymbolic
│ │ │ 2.462 map_reduction loopy/symbolic.py:1815
│ │ │ └─ 2.462 map_reduction loopy/preprocess.py:1690
│ │ │ └─ 2.445 map_reduction_seq loopy/preprocess.py:1004
│ │ │ └─ 2.427 wrapper pytools/__init__.py:675
│ │ │ └─ 2.426 find_most_recent_global_barrier loopy/kernel/tools.py:1655
│ │ │ └─ 2.171 <genexpr> loopy/kernel/tools.py:1670
│ │ │ └─ 1.903 _is_global_barrier loopy/kernel/tools.py:1583
│ │ └─ 2.525 replace_instruction_ids loopy/transform/instruction.py:172
│ │ └─ 1.875 [self]
│ ├─ 1.383 realize_ilp loopy/preprocess.py:1965
│ │ └─ 1.383 privatize_temporaries_with_inames loopy/transform/privatize.py:72
│ │ └─ 1.258 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ └─ 1.077 __call__ pymbolic/mapper/__init__.py:114
│ │ [136 frames hidden] pymbolic
│ └─ 0.937 check_reduction_iname_uniqueness loopy/preprocess.py:95
│ └─ 0.933 with_transformed_expressions loopy/kernel/instruction.py:872
├─ 35.939 get_optimized_kernel sumpy/e2e.py:127
│ ├─ 32.290 get_kernel sumpy/e2e.py:146
│ │ ├─ 18.293 make_kernel loopy/kernel/creation.py:1821
│ │ │ ├─ 4.981 duplicate_inames loopy/transform/iname.py:818
│ │ │ │ ├─ 2.807 map_kernel loopy/symbolic.py:995
│ │ │ │ │ └─ 2.803 <listcomp> loopy/symbolic.py:1000
│ │ │ │ │ └─ 2.763 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ │ └─ 2.414 <lambda> loopy/symbolic.py:1002
│ │ │ │ │ └─ 2.400 __call__ loopy/symbolic.py:981
│ │ │ │ │ └─ 2.252 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ │ [181 frames hidden] pymbolic
│ │ │ │ └─ 2.158 finish_kernel loopy/symbolic.py:899
│ │ │ │ └─ 2.157 rename_subst_rules_in_instructions loopy/symbolic.py:788
│ │ │ │ └─ 2.157 <listcomp> loopy/symbolic.py:792
│ │ │ │ └─ 2.152 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.817 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [181 frames hidden] pymbolic
│ │ │ ├─ 3.111 fix_parameters loopy/transform/parameter.py:134
│ │ │ │ └─ 3.111 _fix_parameter loopy/transform/parameter.py:67
│ │ │ │ ├─ 1.871 map_kernel loopy/symbolic.py:995
│ │ │ │ │ └─ 1.707 <listcomp> loopy/symbolic.py:1000
│ │ │ │ │ └─ 1.703 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ │ └─ 1.503 <lambda> loopy/symbolic.py:1002
│ │ │ │ │ └─ 1.496 __call__ loopy/symbolic.py:981
│ │ │ │ │ └─ 1.422 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ │ [158 frames hidden] pymbolic
│ │ │ │ └─ 1.070 finish_kernel loopy/symbolic.py:899
│ │ │ │ └─ 1.070 rename_subst_rules_in_instructions loopy/symbolic.py:788
│ │ │ │ └─ 1.070 <listcomp> loopy/symbolic.py:792
│ │ │ │ └─ 1.064 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 0.877 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [158 frames hidden] pymbolic
│ │ │ ├─ 1.760 determine_shapes_of_temporaries loopy/kernel/creation.py:1512
│ │ │ │ └─ 1.408 find_shapes_of_vars loopy/kernel/creation.py:1463
│ │ │ │ └─ 1.387 feed_all_expressions loopy/kernel/creation.py:1523
│ │ │ │ └─ 1.384 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.204 <lambda> loopy/kernel/creation.py:1526
│ │ │ │ └─ 1.203 run_through_armap loopy/kernel/creation.py:1469
│ │ │ │ └─ 1.196 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [195 frames hidden] pymbolic
│ │ │ ├─ 1.663 __init__ loopy/kernel/creation.py:1080
│ │ │ │ └─ 0.874 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [165 frames hidden] pymbolic
│ │ │ ├─ 1.254 guess_arg_shape_if_requested loopy/kernel/creation.py:1610
│ │ │ │ └─ 1.254 guess_var_shape loopy/kernel/tools.py:985
│ │ │ │ └─ 1.248 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.087 run_through_armap loopy/kernel/tools.py:992
│ │ │ ├─ 1.226 guess_kernel_args_if_requested loopy/kernel/creation.py:1170
│ │ │ │ └─ 1.215 make_new_arg loopy/kernel/creation.py:1132
│ │ │ │ └─ 1.215 find_index_rank loopy/kernel/creation.py:1116
│ │ │ │ └─ 1.212 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.059 run_irf loopy/kernel/creation.py:1119
│ │ │ │ └─ 1.045 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [202 frames hidden] pymbolic
│ │ │ └─ 1.146 expand_cses loopy/kernel/creation.py:1321
│ │ │ └─ 0.979 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [154 frames hidden] pymbolic
│ │ └─ 13.927 get_translation_loopy_insns sumpy/e2e.py:91
│ │ ├─ 9.054 to_loopy_insns sumpy/codegen.py:672
│ │ │ ├─ 6.263 <listcomp> sumpy/codegen.py:724
│ │ │ │ └─ 5.732 convert_expr sumpy/codegen.py:705
│ │ │ │ └─ 5.685 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [175 frames hidden] pymbolic
│ │ │ └─ 1.098 kill_trivial_assignments sumpy/codegen.py:154
│ │ │ └─ 1.074 substitute pymbolic/mapper/substitutor.py:72
│ │ │ [168 frames hidden] pymbolic
│ │ ├─ 3.728 run_global_cse sumpy/assignment_collection.py:177
│ │ │ └─ 3.720 cse sumpy/cse.py:550
│ │ │ └─ 2.980 opt_cse sumpy/cse.py:357
│ │ │ └─ 2.582 match_common_args sumpy/cse.py:266
│ │ │ └─ 0.898 get_subset_candidates sumpy/cse.py:218
│ │ └─ 1.122 translate_from sumpy/expansion/local.py:182
│ └─ 3.638 split_iname loopy/transform/iname.py:334
│ └─ 3.635 _split_iname_backend loopy/transform/iname.py:211
│ ├─ 1.407 map_kernel loopy/symbolic.py:995
│ │ └─ 1.192 <listcomp> loopy/symbolic.py:1000
│ │ └─ 1.189 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ └─ 1.023 <lambda> loopy/symbolic.py:1002
│ │ └─ 1.019 __call__ loopy/symbolic.py:981
│ │ └─ 0.955 __call__ pymbolic/mapper/__init__.py:114
│ │ [148 frames hidden] pymbolic
│ └─ 1.227 finish_kernel loopy/symbolic.py:899
│ └─ 1.227 rename_subst_rules_in_instructions loopy/symbolic.py:788
│ └─ 1.227 <listcomp> loopy/symbolic.py:792
│ └─ 1.219 with_transformed_expressions loopy/kernel/instruction.py:872
│ └─ 1.046 __call__ pymbolic/mapper/__init__.py:114
│ [151 frames hidden] pymbolic
└─ 7.943 add_and_infer_dtypes loopy/kernel/tools.py:106
└─ 7.939 infer_unknown_types loopy/type_inference.py:485
├─ 5.006 <dictcomp> loopy/type_inference.py:527
│ └─ 4.999 <setcomp> loopy/type_inference.py:528
│ └─ 4.866 [self]
└─ 2.537 _infer_var_type loopy/type_inference.py:407
├─ 1.570 __call__ loopy/type_inference.py:60
│ └─ 1.561 __call__ pymbolic/mapper/__init__.py:114
│ [2 frames hidden] pymbolic
│ 1.289 map_sum loopy/type_inference.py:170
│ └─ 1.123 __call__ pymbolic/mapper/__init__.py:114
│ [2 frames hidden] pymbolic
│ 0.993 map_sum loopy/type_inference.py:170
└─ 0.853 __call__ pymbolic/mapper/__init__.py:114
[153 frames hidden] pymbolic
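To make the two pyinstrument runs easier to compare, here is a small sketch that computes the before/after ratio for the top-level entries. The numbers are copied from the two profiles above; the selection of entries and the `speedup` helper are just illustrative choices.

```python
# Cumulative seconds taken from the two pyinstrument profiles: (before, after).
timings = {
    "<module> (total)": (136.566, 84.866),
    "generate_code_v2": (73.198, 39.403),
    "preprocess_kernel": (32.329, 10.188),
    "generate_host_or_device_program": (24.813, 16.501),
    "get_one_scheduled_kernel": (14.247, 11.540),
    "get_optimized_kernel": (52.686, 35.939),
    "make_kernel": (25.594, 18.293),
    "get_translation_loopy_insns": (21.709, 13.927),
    "add_and_infer_dtypes": (8.944, 7.943),
}


def speedup(before, after):
    """Ratio of the old time to the new time (larger means more improvement)."""
    return before / after


# Order entries from most improved to least improved.
rows = sorted(timings.items(), key=lambda kv: speedup(*kv[1]), reverse=True)

for name, (before, after) in rows:
    print(f"{name:34s} {before:8.3f} -> {after:8.3f}  x{speedup(before, after):.2f}")
```

By this measure `preprocess_kernel` improved the most, while `add_and_infer_dtypes` and scheduling barely moved between the two runs.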
@inducer, https://github.com/inducer/pymbolic/pull/37 didn't help. Any other suggestions?
The align_two call at https://github.com/inducer/loopy/blob/186f5095a54982b7eb2fda5e4b995d7c047fde1e/loopy/codegen/instruction.py#L43 takes a long time.
That's fixed by https://github.com/inducer/loopy/pull/280
Thanks to @kaushikcfd, scheduling is now super fast compared to other parts of loopy. Still, make_kernel, preprocess_kernel, and codegen take so much time that some sumpy kernels are unusable. Here's a small example with https://github.com/isuruf/sumpy/tree/derivtaker