llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.7k stars 11.39k forks source link

[mlir][sparse] Calling mlir-opt with sparsifier's dumped pass pipeline for GPU codegen does not emit GPU code #91774

Open ggeorgakoudis opened 3 months ago

ggeorgakoudis commented 3 months ago

First call mlir-opt for the sparsifier to generate GPU code and dump the pass pipeline (I'm omitting the mlir input file, reproducers contain it):

mlir-opt --sparsifier="enable-runtime-library=false parallelization-strategy=dense-outer-loop gpu-triple=nvptx64-nvidia-cuda gpu-chip=sm_80 gpu-features=+ptx71 gpu-format=llvm" --dump-pass-pipeline

Reproducer: https://godbolt.org/z/7cP8z1qcY

Then use the dumped pipeline directly in mlir-opt:

mlir-opt -pass-pipeline="builtin.module(func.func(linalg-generalize-named-ops),func.func(linalg-fuse-elementwise-ops),sparsification-and-bufferization,sparse-storage-specifier-to-llvm,func.func(canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=true test-convergence=false top-down=true}),func.func(finalizing-bufferize),sparse-gpu-codegen{enable-runtime-library=true num-threads=1024},gpu.module(strip-debuginfo),gpu.module(convert-scf-to-cf),gpu.module(convert-gpu-to-nvvm{has-redux=false index-bitwidth=0 use-bare-ptr-memref-call-conv=false}),func.func(convert-linalg-to-loops),func.func(convert-vector-to-scf{full-unroll=false lower-tensors=false target-rank=1}),func.func(expand-realloc{emit-deallocs=true}),func.func(convert-scf-to-cf),expand-strided-metadata,lower-affine,convert-vector-to-llvm{enable-amx=false enable-arm-neon=false enable-arm-sve=false enable-x86vector=false force-32bit-vector-indices=true reassociate-fp-reductions=false},finalize-memref-to-llvm{index-bitwidth=0 use-aligned-alloc=false use-generic-functions=false},func.func(convert-complex-to-standard),func.func(arith-expand{include-bf16=false}),func.func(convert-math-to-llvm{approximate-log1p=true}),convert-math-to-libm,convert-complex-to-libm,convert-vector-to-llvm{enable-amx=false enable-arm-neon=false enable-arm-sve=false enable-x86vector=false force-32bit-vector-indices=true reassociate-fp-reductions=false},convert-complex-to-llvm,convert-vector-to-llvm{enable-amx=false enable-arm-neon=false enable-arm-sve=false enable-x86vector=false force-32bit-vector-indices=true reassociate-fp-reductions=false},convert-func-to-llvm{index-bitwidth=0 use-bare-ptr-memref-call-conv=false},nvvm-attach-target{O=2 chip=sm_80 fast=false features=+ptx71 ftz=false  module= triple=nvptx64-nvidia-cuda},gpu-to-llvm{gpu-binary-annotation=gpu.binary use-bare-pointers-for-host=false use-bare-pointers-for-kernels=false},gpu-module-to-binary{format=llvm  opts= toolkit=},reconcile-unrealized-casts)"

Reproducer: https://godbolt.org/z/1zz64j895

Expected to see to same the same GPU codegen as with callng the sparsifier but output does not contain GPU code

llvmbot commented 3 months ago

@llvm/issue-subscribers-mlir-sparse

Author: Giorgis Georgakoudis (ggeorgakoudis)

First call `mlir-opt` for the sparsifier to generate GPU code and dump the pass pipeline (I'm omitting the mlir input file, reproducers contain it): ``` mlir-opt --sparsifier="enable-runtime-library=false parallelization-strategy=dense-outer-loop gpu-triple=nvptx64-nvidia-cuda gpu-chip=sm_80 gpu-features=+ptx71 gpu-format=llvm" --dump-pass-pipeline ``` Reproducer: https://godbolt.org/z/7cP8z1qcY Then use the dumped pipeline directly in `mlir-opt`: ``` mlir-opt -pass-pipeline="builtin.module(func.func(linalg-generalize-named-ops),func.func(linalg-fuse-elementwise-ops),sparsification-and-bufferization,sparse-storage-specifier-to-llvm,func.func(canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=true test-convergence=false top-down=true}),func.func(finalizing-bufferize),sparse-gpu-codegen{enable-runtime-library=true num-threads=1024},gpu.module(strip-debuginfo),gpu.module(convert-scf-to-cf),gpu.module(convert-gpu-to-nvvm{has-redux=false index-bitwidth=0 use-bare-ptr-memref-call-conv=false}),func.func(convert-linalg-to-loops),func.func(convert-vector-to-scf{full-unroll=false lower-tensors=false target-rank=1}),func.func(expand-realloc{emit-deallocs=true}),func.func(convert-scf-to-cf),expand-strided-metadata,lower-affine,convert-vector-to-llvm{enable-amx=false enable-arm-neon=false enable-arm-sve=false enable-x86vector=false force-32bit-vector-indices=true reassociate-fp-reductions=false},finalize-memref-to-llvm{index-bitwidth=0 use-aligned-alloc=false use-generic-functions=false},func.func(convert-complex-to-standard),func.func(arith-expand{include-bf16=false}),func.func(convert-math-to-llvm{approximate-log1p=true}),convert-math-to-libm,convert-complex-to-libm,convert-vector-to-llvm{enable-amx=false enable-arm-neon=false enable-arm-sve=false enable-x86vector=false force-32bit-vector-indices=true reassociate-fp-reductions=false},convert-complex-to-llvm,convert-vector-to-llvm{enable-amx=false enable-arm-neon=false enable-arm-sve=false enable-x86vector=false force-32bit-vector-indices=true reassociate-fp-reductions=false},convert-func-to-llvm{index-bitwidth=0 use-bare-ptr-memref-call-conv=false},nvvm-attach-target{O=2 chip=sm_80 fast=false features=+ptx71 ftz=false module= triple=nvptx64-nvidia-cuda},gpu-to-llvm{gpu-binary-annotation=gpu.binary use-bare-pointers-for-host=false use-bare-pointers-for-kernels=false},gpu-module-to-binary{format=llvm opts= toolkit=},reconcile-unrealized-casts)" ``` Reproducer: https://godbolt.org/z/1zz64j895 Expected to see to same the same GPU codegen as with callng the sparsifier but output does not contain GPU code
llvmbot commented 3 months ago

@llvm/issue-subscribers-mlir-gpu

Author: Giorgis Georgakoudis (ggeorgakoudis)

First call `mlir-opt` for the sparsifier to generate GPU code and dump the pass pipeline (I'm omitting the mlir input file, reproducers contain it): ``` mlir-opt --sparsifier="enable-runtime-library=false parallelization-strategy=dense-outer-loop gpu-triple=nvptx64-nvidia-cuda gpu-chip=sm_80 gpu-features=+ptx71 gpu-format=llvm" --dump-pass-pipeline ``` Reproducer: https://godbolt.org/z/7cP8z1qcY Then use the dumped pipeline directly in `mlir-opt`: ``` mlir-opt -pass-pipeline="builtin.module(func.func(linalg-generalize-named-ops),func.func(linalg-fuse-elementwise-ops),sparsification-and-bufferization,sparse-storage-specifier-to-llvm,func.func(canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=true test-convergence=false top-down=true}),func.func(finalizing-bufferize),sparse-gpu-codegen{enable-runtime-library=true num-threads=1024},gpu.module(strip-debuginfo),gpu.module(convert-scf-to-cf),gpu.module(convert-gpu-to-nvvm{has-redux=false index-bitwidth=0 use-bare-ptr-memref-call-conv=false}),func.func(convert-linalg-to-loops),func.func(convert-vector-to-scf{full-unroll=false lower-tensors=false target-rank=1}),func.func(expand-realloc{emit-deallocs=true}),func.func(convert-scf-to-cf),expand-strided-metadata,lower-affine,convert-vector-to-llvm{enable-amx=false enable-arm-neon=false enable-arm-sve=false enable-x86vector=false force-32bit-vector-indices=true reassociate-fp-reductions=false},finalize-memref-to-llvm{index-bitwidth=0 use-aligned-alloc=false use-generic-functions=false},func.func(convert-complex-to-standard),func.func(arith-expand{include-bf16=false}),func.func(convert-math-to-llvm{approximate-log1p=true}),convert-math-to-libm,convert-complex-to-libm,convert-vector-to-llvm{enable-amx=false enable-arm-neon=false enable-arm-sve=false enable-x86vector=false force-32bit-vector-indices=true reassociate-fp-reductions=false},convert-complex-to-llvm,convert-vector-to-llvm{enable-amx=false enable-arm-neon=false enable-arm-sve=false enable-x86vector=false force-32bit-vector-indices=true reassociate-fp-reductions=false},convert-func-to-llvm{index-bitwidth=0 use-bare-ptr-memref-call-conv=false},nvvm-attach-target{O=2 chip=sm_80 fast=false features=+ptx71 ftz=false module= triple=nvptx64-nvidia-cuda},gpu-to-llvm{gpu-binary-annotation=gpu.binary use-bare-pointers-for-host=false use-bare-pointers-for-kernels=false},gpu-module-to-binary{format=llvm opts= toolkit=},reconcile-unrealized-casts)" ``` Reproducer: https://godbolt.org/z/1zz64j895 Expected to see to same the same GPU codegen as with callng the sparsifier but output does not contain GPU code