llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.06k stars 11.59k forks source link

Connect the BBVectorizer with Polly #12776

Closed tobiasgrosser closed 8 years ago

tobiasgrosser commented 12 years ago
Bugzilla Link 12404
Resolution FIXED
Resolved on Jan 18, 2016 17:00
Version unspecified
OS Linux
Depends On llvm/llvm-project#12770
CC @etherzhhb,@sunfishcode

Extended Description

At the moment we, have our own code for the generation of vector instructions. This works OK, but, especially if we want to improve the vectorization, a lot of intelligence would be needed directly in Polly. This may cause a lot of work and maintenence overhead.

One option to get around this is the basic block vectorizer. The idea here is performthe high level transformations and the strip mining for to create trivially vectorizable loops within Polly, but, instead of replacing the trivially vectorizable loops with vector instructions, we just do 'grouped unrolling'. This means we unroll the loop and reorder the instructions such that different iterations of the same instruction come right after each other. We then use the basic block vectorizer to do the vector instruction generation. This allows us share the vector instruction generation logic with the basic block vectorizer and to take advantage of the heuristics it uses.

tlattner commented 8 years ago

Move to Polly Product.

tobiasgrosser commented 11 years ago

We can now use the bb vectorizer after Polly by using the flag -polly-vectorizer=bb.

fecdd795-9db7-406f-9081-05c31e939c0d commented 12 years ago

To the best of my knowledge, the current vectorization approach in Polly cannot apply to loop body with nolinear CFG, if we want to vectroize loop body with nolinear CFG, some approach translate control dependence to data dependence like whats described in WFV is needed.

True. Several of the transformations from the WFV may be useful (do they have If-Conversion?).

I need to look deep in to the code to answer this, i feel they perfrom some kind of "If-Conversion" with vector select or shuffle, and i doubt we cannot perform If-Conversion on the LLVM-IR level, even we can mask away the result of some computation, we cannot disable load/store with predicate at LLVM-IR level.

In fact, I think one drawback of the WFV is that the speed performance of the WFVed function is limited by the execution trace with longest delay (critical trace). I doubt the WFVed function may have worse perfromance than the origianl function if the the execution frequency of the critical trace is extreme low.

When we trying to vectorize loop body with nolinear CFG, we should be careful of this situation.

tobiasgrosser commented 12 years ago

Use Whole-Function Vectorization[1, 2] is interesting as well.

[1]http://www.cdl.uni-saarland.de/projects/wfv/ [2]https://github.com/karrenberg/Whole-Function-Vectorization

Sure, but I am not sure how this is related to Polly.

Whole function vectorization is specific to data parallel code where we can execute several iterations of a function in parallel. Polly instead works on sequential code that we may want to vectorize.

How would you combine whole function vectorization and Polly? (There may be ways, but being explicit about them may help more than just the keyword) To the best of my knowledge, the current vectorization approach in Polly cannot apply to loop body with nolinear CFG, if we want to vectroize loop body with nolinear CFG, some approach translate control dependence to data dependence like whats described in WFV is needed.

True. Several of the transformations from the WFV may be useful (do they have If-Conversion?).

fecdd795-9db7-406f-9081-05c31e939c0d commented 12 years ago

Use Whole-Function Vectorization[1, 2] is interesting as well.

[1]http://www.cdl.uni-saarland.de/projects/wfv/ [2]https://github.com/karrenberg/Whole-Function-Vectorization

Sure, but I am not sure how this is related to Polly.

Whole function vectorization is specific to data parallel code where we can execute several iterations of a function in parallel. Polly instead works on sequential code that we may want to vectorize.

How would you combine whole function vectorization and Polly? (There may be ways, but being explicit about them may help more than just the keyword) To the best of my knowledge, the current vectorization approach in Polly cannot apply to loop body with nolinear CFG, if we want to vectroize loop body with nolinear CFG, some approach translate control dependence to data dependence like whats described in WFV is needed.

fecdd795-9db7-406f-9081-05c31e939c0d commented 12 years ago

A possible solution is:

  1. Refactor BBVectorize so we can call the "fuseChosenPairs" alone. This function performs the actual vectorization.
  2. Build AllPairableInsts and AllChosenPairs in our codegeneration pass with I thinkt the scalar mess that Polly generates may make complicate the code generation. As a first step I would try to run polly-codegen, schedule some scalar cleanup passes and than run the bbvectorizer. For this aproach we would run Polly with vector code generation, but without actually issuing vector instructions. Instead we do some kind of 'grouped unrolling'. This means we fully unroll the vector loop, but in a way that subsequent iterations of the same statement are issued right after each other[1]. This should allow vectorization without a lot of instruction reordering such that we can use -bb-vectorize-fast-dep to speed up the code. Also, it avoids problems cause by the alias analysis not understanding that instruction reordering may be possible I think we should keep in mind that we should try to vectorize the unrolled BB only, vectorizing other BBs will not generates incorrect code, but may increase compile time. Instead of scheduling scalar cleanup passes, performing cleanups right after the loop is unrolled is possible. There is an example in the UnrollLoop[1], after the loop is unrolled, the function perform several cleanups including:
  3. Simplify the cfg of the loop body by trying to merge adjacent basic blocks.
  4. Perform constant propagation and dead code elimination.

If we can perform these cleanups for the grouped unrolled loop in the codegen pass (hopefully reusing the code in the UnrollLoop function), we can call the BBVectorize to vectrorize the unrolled loop in the codegen pass. As a result, the vectorization is only applied to the unrolled loop.

[1]http://llvm.org/doxygen/LoopUnroll_8cpp_source.html

fecdd795-9db7-406f-9081-05c31e939c0d commented 12 years ago

Instead we do some kind of 'grouped unrolling'. This means we fully unroll the vector loop, but in a way that subsequent iterations of the same statement are issued right after each other[1]. This should allow vectorization without a lot of instruction reordering such that we can use -bb-vectorize-fast-dep to speed up the code. Also, it avoids problems cause by the alias analysis not understanding that instruction reordering may be possible

1) Grouped unrolling is a three line patch to Polly. You just need to disable the creation of vector loads. I can point people to the right place

Looks like we can first emit the code of the first iteration, then we can reuse some code in the UnrollLoop utility function.

tobiasgrosser commented 12 years ago

A possible solution is:

  1. Refactor BBVectorize so we can call the "fuseChosenPairs" alone. This function performs the actual vectorization.
  2. Build AllPairableInsts and AllChosenPairs in our codegeneration pass with parallelism information, instead of letting the BBVectorize pass to build these things from scratch.
  3. Call fuseChosenPairs with AllPairableInsts and AllChosenPairs built by the codegeneration pass.

This fuses Polly and the BBVectorizer closely. It may be a good final solution, but at the moment, I thinkt the scalar mess that Polly generates may make complicate the code generation. As a first step I would try to run polly-codegen, schedule some scalar cleanup passes and than run the bbvectorizer. For this aproach we would run Polly with vector code generation, but without actually issuing vector instructions. Instead we do some kind of 'grouped unrolling'. This means we fully unroll the vector loop, but in a way that subsequent iterations of the same statement are issued right after each other[1]. This should allow vectorization without a lot of instruction reordering such that we can use -bb-vectorize-fast-dep to speed up the code. Also, it avoids problems cause by the alias analysis not understanding that instruction reordering may be possible

1) Grouped unrolling is a three line patch to Polly. You just need to disable the creation of vector loads. I can point people to the right place

tobiasgrosser commented 12 years ago

Use Whole-Function Vectorization[1, 2] is interesting as well.

[1]http://www.cdl.uni-saarland.de/projects/wfv/ [2]https://github.com/karrenberg/Whole-Function-Vectorization

Sure, but I am not sure how this is related to Polly.

Whole function vectorization is specific to data parallel code where we can execute several iterations of a function in parallel. Polly instead works on sequential code that we may want to vectorize.

How would you combine whole function vectorization and Polly? (There may be ways, but being explicit about them may help more than just the keyword)

fecdd795-9db7-406f-9081-05c31e939c0d commented 12 years ago

A possible solution is:

  1. Refactor BBVectorize so we can call the "fuseChosenPairs" alone. This function performs the actual vectorization.
  2. Build AllPairableInsts and AllChosenPairs in our codegeneration pass with parallelism information, instead of letting the BBVectorize pass to build these things from scratch.
  3. Call fuseChosenPairs with AllPairableInsts and AllChosenPairs built by the codegeneration pass.
fecdd795-9db7-406f-9081-05c31e939c0d commented 12 years ago

Use Whole-Function Vectorization[1, 2] is interesting as well.

[1]http://www.cdl.uni-saarland.de/projects/wfv/ [2]https://github.com/karrenberg/Whole-Function-Vectorization