plaidml / tpp-mlir

TPP experimentation on MLIR for linear algebra
https://arxiv.org/abs/2404.15204

Modularize backend passes #388

Closed adam-smnk closed 4 months ago

adam-smnk commented 1 year ago

Similarly to the tpp-opt pass modularization (#280), the second half of the lowering pipeline, which lives in tpp-run and is executed after the default TPP pipeline, should be cleaned up, split into sub-passes, and potentially improved, e.g., by running extra LICM after convert-linalg-to-loops.

The backend split into sub-passes should follow a similar approach to the split done in DefaultTppPasses. The newly created passes should have matching tests that check their overall functionality whenever it is non-trivial, i.e., anything beyond a basic cleanup or simply invoking general upstream passes.
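
As a rough sketch of how such a sub-pass bundle could be made testable, each bundle could be registered as a named pipeline so that tpp-opt and lit tests can invoke it directly. Everything below is illustrative: the pipeline name, description, and pass list are placeholders, not existing tpp-mlir code.

  // Hypothetical registration of a backend bundle as a standalone pipeline,
  // so it can be driven from tpp-opt and checked with FileCheck tests.
  void registerBackendBundles() {
    PassPipelineRegistration<>(
        "postprocess-backend",
        "Backend post-processing split out of the tpp-run lowering",
        [](OpPassManager &pm) {
          pm.addPass(createCanonicalizerPass());
          pm.addPass(createCSEPass());
        });
  }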

rengolin commented 1 year ago

Default Pipeline

  // Pre processing
  PreProcessPasses()

  // Tensor Passes  
  if (pack)
    PackPasses()
  if (tileAndFuse)
    TileAndFusePasses()

  // Bufferization
  Bufferize()

  // MemRef Passes
  if (linalgToLoops)
    LinalgToLoops()
  else if (linalgToXSMM)
    LinalgToXSMM()

  // Post processing
  PostProcessing()

  // Lowering
  PartialLowering()
  FinalLowering()

Equivalent command lines:

| Tool | Tensor Level Passes | Bufferize | MemRef Level Passes | Lowering |
|---|---|---|---|---|
| tpp-opt | --pre-process --pack --tile-and-fuse | --bufferize | --linalg-to-xsmm --post-processing | --partial-lowering --final-lowering |
| tpp-opt | --tensor-passes="pack=1,tile=1" | --bufferize | --memref-passes="linalg-to-xsmm" | --lowering |
| tpp-opt | --default-pass-pipeline | (included) | (included) | --lowering |
| tpp-run | (built in) | (built in) | (built in) | (built in) |

Note: the default pass pipeline shouldn't lower to LLVM, so that it doesn't conflict with tpp-run, though we could make that work too.

Note: tpp-run runs the entire pipeline, including the lowering to LLVM.
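
Reading the two notes together, one hedged way to structure this (the builder names buildDefaultTppPipeline and buildLoweringPipeline are illustrative, not existing functions) is for tpp-run to reuse exactly the builder behind tpp-opt's --default-pass-pipeline and only append the lowering on top:

  // Sketch only: tpp-run as "default pipeline + lowering".
  void buildTppRunPipeline(OpPassManager &pm) {
    buildDefaultTppPipeline(pm); // same as tpp-opt --default-pass-pipeline
    buildLoweringPipeline(pm);   // partial + final lowering down to LLVM
  }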

Bundles

CleanUp

  pm.addPass(createTransformDropSchedulePass());
  pm.addPass(createCanonicalizerPass());
  pm.addPass(createCSEPass());
  pm.addPass(createSymbolDCEPass());
  pm.addPass(createReconcileUnrealizedCastsPass());

PreProcessPasses

  // Start with a good clean
  CleanUp()
  // Preprocess convolutions.
  pm.addPass(createConvInitSimplifyPass());
  pm.addPass(createCleanupPass());
  pm.addPass(createRewriteConvToMatmulOrBrgemmPass());
  // Convert linalg.batch_matmul to linalg.matmul.
  pm.addPass(createRewriteBatchMatmulToMatmulPass());
  CleanUp()

PackPasses

  // Convert ops to packed layouts.
  pm.addPass(createPackConv2DNhwcHwcfPass({32, 32}));
  pm.addPass(createPackConv2DNchwFchwPass({32, 32}));
  pm.addPass(createPackMatmulPass({32, 32, 32}));
  pm.addPass(createPackVNNIPass());
  // Postprocess packing.
  // Run only canonicalizer at this stage as full cleanup (mostly CSE) can
  // mess up tensor producer-consumer chains used for analysis in the
  // following passes.
  pm.addPass(createPropagatePackUnPackPass());
  pm.addPass(createConstantFoldPackPass());
  pm.addPass(createSimplifyAndCanonicalizePackPass());
  CleanUp()

TileAndFusePasses

  pm.addPass(createTileConsumerAndFuseProducersPass());
  pm.addPass(createSimplifyAndCanonicalizePackPass());
  CleanUp()

Bufferize

  pm.addPass(createLowerPacksAndUnPacks());
  pm.addNestedPass<func::FuncOp>(createDecomposeAggregatedOpsPass());
  pm.addPass(createBufferizePass());
  CleanUp()

LinalgToLoops

  pm.addNestedPass<func::FuncOp>(createConvertLinalgToLoopsPass());
  CleanUp()

LinalgToXSMM

  pm.addPass(createConvertMemRefToXsmmPass());
  pm.addPass(createConvertLinalgToXsmmPass());
  CleanUp()

PostProcessing

  // Converting forall to parallel loops should run after bufferization
  // as scf.parallel does not handle tensors.
  pm.addPass(createConvertForAllToParallelOpPass());
  // Convert all local TPP-related dialects.
  pm.addPass(createLocalDialectsLoweringPass());
  // Postprocess buffers.
  pm.addPass(bufferization::createBufferHoistingPass());
  CleanUp()

PartialLowering

  pm.addPass(memref::createExpandStridedMetadataPass());
  pm.addNestedPass<func::FuncOp>(tpp::createConvertPerfToLoopsPass());
  pm.addPass(tpp::createConvertPerfToFuncPass());
  pm.addPass(createConvertTensorToLinalgPass());
  pm.addNestedPass<func::FuncOp>(createConvertLinalgToLoopsPass());
  if (defParallel)
    pm.addPass(createConvertSCFToOpenMPPass());
  pm.addPass(createConvertVectorToSCFPass());
  pm.addPass(arith::createArithExpandOpsPass());
  pm.addPass(createLowerAffinePass());
  CleanUp()

FinalLowering

  // ALL
  pm.addPass(createConvertVectorToLLVMPass());
  pm.addPass(createFinalizeMemRefToLLVMConversionPass());
  pm.addPass(createConvertSCFToCFPass());
  if (defParallel)
    pm.addPass(createConvertOpenMPToLLVMPass());
  pm.addPass(createConvertMathToLLVMPass());

  // GPU
  pm.addNestedPass<func::FuncOp>(createGpuAsyncRegionPass());
  pm.addPass(createGpuToLLVMConversionPass());
  GpuModuleToBinaryPassOptions gpuModuleToBinaryPassOptions;
  gpuModuleToBinaryPassOptions.compilationTarget = "fatbin";
  pm.addPass(createGpuModuleToBinaryPass(gpuModuleToBinaryPassOptions));

  // CPU
  pm.addPass(createAsyncToAsyncRuntimePass());
  pm.addPass(createAsyncRuntimeRefCountingPass());
  pm.addPass(createConvertAsyncToLLVMPass());
  pm.addPass(createConvertFuncToLLVMPass());
  pm.addNestedPass<func::FuncOp>(createArithToLLVMConversionPass());

  // GPU
  pm.addPass(createConvertVulkanLaunchFuncToVulkanCallsPass());
  CleanUp()

chelini commented 1 year ago

Thanks, some comments:

Cleanup: I would not add pm.addPass(createReconcileUnrealizedCastsPass()); it is related to the final LLVM lowering and there is no need to run it at this stage. Same for pm.addPass(createTransformDropSchedulePass()); when we add transform schedules, we will need to see where to properly plug them in.
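
For concreteness, the slimmed-down CleanUp bundle suggested here would look roughly like:

  // CleanUp without the LLVM-specific cast reconciliation and without the
  // transform-schedule drop.
  pm.addPass(createCanonicalizerPass());
  pm.addPass(createCSEPass());
  pm.addPass(createSymbolDCEPass());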

PreProcessPasses: it is probably time to retire pm.addPass(createRewriteBatchMatmulToMatmulPass()); we expose BRGEMM using tile-and-fuse, and we should really materialize loops in only one place. I also don't think we need pm.addPass(createCleanupPass()); after pm.addPass(createConvInitSimplifyPass());.

Bufferize: should we maybe move pm.addPass(createLowerPacksAndUnPacks()); elsewhere?

The rest looks good. Adam has more context here.

adam-smnk commented 1 year ago

Looking at the past and the current state of the pipeline, I’d say that general or fine-grained bundling does not contribute to either maintainability or ease of use.

The most stable and useful bundles are those that define concrete one-shot transformations and often act as watershed points, e.g., bufferize (tensor to memref), outline GPU kernel (generic IR to GPU-specific), convert to CUDA/Vulkan (generic GPU to target-specific), all to LLVM (final lowering).

I doubt bundles like preprocess or even cleanup (see the destructive conflict in the pack bundle) will compose well across various workloads. In the current pipeline, individual passes sit in some places precisely because they do not fit any specific bundle; this remains unaddressed.

Current bundling focuses too much on functional grouping. I’d suggest we try to determine useful stages of IR and try bundling such that we get to these stages. Example split (IR life stages?): Tensor input, pre-tiling transforms, tiled (and fused), transforms on tiles, bufferize, transforms on memref, outlining to backend/devices. This way it might be easier to position and inject our pipeline with respect to external tools. Of course, I’d prefer to avoid defining too many stages but the split should be driven by at least some use cases.
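
One way to picture this, purely as an illustration (every identifier below is a placeholder, not existing tpp-mlir code): bundles are defined by the IR stage they produce, and a pipeline is assembled by walking the stages in order up to a target.

  // Sketch of stage-driven bundling: each helper takes the IR from one life
  // stage to the next, and a pipeline stops at the requested stage.
  enum class IRStage { TensorInput, PreTiled, Tiled, Bufferized, MemRef, Outlined };

  void buildUpTo(OpPassManager &pm, IRStage target) {
    if (target == IRStage::TensorInput) return;
    addPreTilingTransforms(pm);   // tensor input -> pre-tiling transforms
    if (target == IRStage::PreTiled) return;
    addTileAndFuse(pm);           // -> tiled (and fused), transforms on tiles
    if (target == IRStage::Tiled) return;
    addBufferize(pm);             // -> bufferized
    if (target == IRStage::Bufferized) return;
    addMemRefTransforms(pm);      // -> transforms on memref
    if (target == IRStage::MemRef) return;
    addBackendOutlining(pm);      // -> outlined to backend/devices
  }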

adam-smnk commented 1 year ago

Linalg to loops is hardly a bundle on its own; it's an alternative testing path that, imo, should be part of whatever pipeline wants that route. Lowering to loops has a different intention within the XSMM pipeline compared to the GPU pipeline.

rengolin commented 1 year ago

> Current bundling focuses too much on functional grouping. I'd suggest we try to determine useful stages of IR and try bundling such that we get to these stages. Example split (IR life stages?): Tensor input, pre-tiling transforms, tiled (and fused), transforms on tiles, bufferize, transforms on memref, outlining to backend/devices.

This is what I tried to do. Feel free to adjust to whatever is best in that line of thought, but with the hard constraint that we cannot have passes outside of bundles.

Cleanups are a problem. We want to take canonical IR into our bundles, like other tools do. We can either add a cleanup before or after each bundle; I don't mind which, but perhaps before is more stable.
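
A minimal way to encode the "cleanup before each bundle" option (a sketch; addCleanUp and the helper are placeholders, not existing code):

  // Every bundle starts from canonicalized IR instead of trusting the
  // previous bundle to have cleaned up after itself.
  void addBundle(OpPassManager &pm,
                 llvm::function_ref<void(OpPassManager &)> bundle) {
    addCleanUp(pm); // canonicalizer + CSE before the bundle runs
    bundle(pm);
  }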

Also, don't focus too much on other tooling. The main reason now to do this is to be able to write tests, check intermediate IR and monitor changes on our own project. Extra tooling will happen much later.

rengolin commented 1 year ago

> Linalg to loops is hardly a bundle on its own; it's an alternative testing path that, imo, should be part of whatever pipeline wants that route. Lowering to loops has a different intention within the XSMM pipeline compared to the GPU pipeline.

A simpler approach would be to "always" lower to loops just before the final lowering. If we have already lowered to XSMM, the pass is a no-op; if not, the lowering to loops happens. It also helps with Linalg ops that we don't lower to XSMM, which must be lowered to loops eventually anyway.
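
In pass-manager terms this is just an unconditional convert-linalg-to-loops right before the final lowering; if everything has already been rewritten to XSMM calls, no Linalg is left and the pass does nothing (sketch; addFinalLowering stands in for the FinalLowering bundle above):

  // Materialize any remaining linalg ops as loops, then lower as usual.
  pm.addNestedPass<func::FuncOp>(createConvertLinalgToLoopsPass());
  addFinalLowering(pm); // placeholder for the FinalLowering bundle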

Modulo GPU, of course. I don't understand the dependencies between the passes; I expected the GPU pipeline to be completely different, not intermixed with the CPU pipeline. With bundles we can have two separate pipelines calling different bundles without duplicating logic.
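
For example, the two pipelines could share the tensor-level and bufferization bundles and only diverge afterwards; all builder names here are illustrative, not existing code:

  // Sketch: two pipelines, no duplicated logic, no CPU/GPU mix inside a bundle.
  void buildCpuPipeline(OpPassManager &pm) {
    addTensorBundles(pm);  // pre-process, pack, tile-and-fuse
    addBufferize(pm);
    addLinalgToXsmm(pm);
    addCpuLowering(pm);    // CPU-only partial + final lowering
  }

  void buildGpuPipeline(OpPassManager &pm) {
    addTensorBundles(pm);  // shared with the CPU pipeline
    addBufferize(pm);      // shared with the CPU pipeline
    addGpuOutlining(pm);
    addGpuLowering(pm);    // GPU-only lowering (CUDA/Vulkan)
  }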

adam-smnk commented 1 year ago

> The main reason now to do this is to be able to write tests, check intermediate IR and monitor changes on our own project.

That is much better achieved by injecting printing stages as needed. Personally, print-mlir is my go-to for IR examination, and we could add more useful printing stages.
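
For what it's worth, MLIR's built-in pass-manager instrumentation can serve as such an injectable printing stage; opt-style tools usually also expose it via flags like --mlir-print-ir-after-all:

  // Dump IR around each pass using MLIR's built-in instrumentation; the
  // filter arguments can narrow this to specific passes if full dumps are
  // too noisy.
  pm.enableIRPrinting();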

I'd lean more toward just constructing and maintaining pipelines as we go. Some general-purpose bundles like bufferize are useful, but otherwise it is hardly ever useful to compose multiple bundles manually. Another example of why I think bundles fail to add value: I cannot, or do not want to, reuse most bundles from the default (CPU) pipeline in the GPU pipeline. These lowerings often need slight tweaks between individual passes.

rengolin commented 1 year ago

The set of constraints we have:

  1. Do not have analysis/transform passes outside of selectable bundles (perhaps not even cleanups).
  2. Do not have tpp-run with a different pipeline than tpp-opt with the same flags.
  3. Have tpp-run's "default pipeline" be the same as tpp-opt plus lowering.
  4. Have a composable set of passes that we can mix & match via command line or pass managers.
  5. Be able to consume and produce IR in between stages (opt 1 | opt 2 == opt 1 2)

We may not be able to construct a single pipeline for both CPU and GPU, but we should also not have CPU and GPU passes in the same bundle (as we currently have in default-tpp-passes). This needs a thorough cleanup.

Rationale: