What will you do?
Implement a pass in LLVM that automatically flatten the nested for loop [1] which provides an opportunity for better applying other loop optimizing techniques.
How will you do it?
Based on the llvm-pass-Skeleton framework, add the pass to automatically flatten all the nested for loops.
Milestone 1: flatten the perfect nested loop with a constant loop boundary
Milestone 2: flatten the perfect nested loop with a variable loop boundary
Milestone 3: flatten arbitrary nested loop
How will you empirically measure success?
Apply this auto-flattening pass and other available loop optimizing techniques [3] like loop vectorization and unrolling. With the help of loop flattening, we can possibly apply a larger vectorization factor and gain a better speedup on the program runtime.
(1) Create simple benchmarks that correspond to the milestones.
(2) Pick realistic benchmarks (the ones with nested for loop) from embench [4].
We compare the runtime on the mentioned benchmarks between the one with LLVM auto-vectorization pass [5] and the one with auto-flattening pass + LLVM auto-vectorization. The one with the auto-flattening pass should provide better performance under several specific types of benchmarks.
Team members:
[@mention their GitHub usernames, if it's not just you]
@hj424
Collaborators are very welcome if you are also interested in this topic!
What will you do? Implement a pass in LLVM that automatically flatten the nested for loop [1] which provides an opportunity for better applying other loop optimizing techniques.
How will you do it? Based on the llvm-pass-Skeleton framework, add the pass to automatically flatten all the nested for loops. Milestone 1: flatten the perfect nested loop with a constant loop boundary Milestone 2: flatten the perfect nested loop with a variable loop boundary Milestone 3: flatten arbitrary nested loop
How will you empirically measure success? Apply this auto-flattening pass and other available loop optimizing techniques [3] like loop vectorization and unrolling. With the help of loop flattening, we can possibly apply a larger vectorization factor and gain a better speedup on the program runtime.
(1) Create simple benchmarks that correspond to the milestones. (2) Pick realistic benchmarks (the ones with nested for loop) from embench [4].
We compare the runtime on the mentioned benchmarks between the one with LLVM auto-vectorization pass [5] and the one with auto-flattening pass + LLVM auto-vectorization. The one with the auto-flattening pass should provide better performance under several specific types of benchmarks.
Team members: [@mention their GitHub usernames, if it's not just you] @hj424 Collaborators are very welcome if you are also interested in this topic!
References: [1] A. Ghuloum, A. Fisher, “Flattening and Parallelizing Irregular, Recurrent Loop Nests”, PPOPP, 1995. [2] Loop optimizations in LLVM: https://llvm.org/devmtg/2018-10/slides/Kruse-LoopTransforms.pdf [3] LLVM code transformation metadata on loop optimization: https://llvm.org/docs/TransformMetadata.html#loop-vectorization-and-interleaving [4] Embench: Open Benchmarks for Embedded Platforms: https://github.com/embench/embench-iot [5] LLVM auto-vectorization in LLVM: https://llvm.org/docs/Vectorizers.html [6] Loop flatten techniques applied in software pipelining (converting loop level parallelism into instruction parallelism): https://people.ece.uw.edu/hauck/publications/LoopFlattening.pdf