iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

Tensor cores not utilised when using `iree-run-module --device=cuda` #11887

navdeepkk opened this issue 1 year ago (status: Open)

navdeepkk commented 1 year ago

Hi all,

Following the instructions here (https://iree-org.github.io/iree/deployment-configurations/gpu-cuda-rocm/), I am trying to run ResNet50 using the IREE command-line tools installed via pip. However, when profiling the model with Nsight Compute, I see that it is not using tensor cores.
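
For reference, my compile/run/profile workflow is roughly along these lines (the module path, entry function, and input are placeholders, and the exact runtime flag names may differ between IREE releases):

```shell
# Compile the imported model for the CUDA backend (file names are placeholders).
iree-compile \
  --iree-input-type=mhlo \
  --iree-hal-target-backends=cuda \
  resnet50.mlir -o resnet50.vmfb

# Run on the GPU, then profile the same invocation with Nsight Compute.
# The module/function/input flag names are illustrative and have changed
# across IREE releases.
iree-run-module --device=cuda --module=resnet50.vmfb \
  --function=predict --input=1x224x224x3xf32
ncu --set full -o resnet50_profile iree-run-module --device=cuda \
  --module=resnet50.vmfb --function=predict --input=1x224x224x3xf32
```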

Is there a flag/env_var that needs to be set to enable tensor cores? Any suggestion would be appreciated.

Thanks

Package versions:

iree-compiler                 20220930.282
iree-runtime                  20220930.282
iree-tools-tf                 20220930.282
iree-tools-tflite             20220930.282
iree-tools-xla                20220930.282
ThomasRaoux commented 1 year ago

To use tensor cores you need to pass the right target, since IREE is meant for cross-compilation and doesn't query the target device. On A100 you need to set --iree-hal-cuda-llvm-target-arch=sm_80 for matmul ops.

Currently convolutions don't have a codegen path that uses tensor cores by default; to get one, you need to set these flags to convert convolutions to matmuls:

--iree-flow-enable-padding-linalg-ops --iree-flow-linalg-ops-padding-size=32
--iree-flow-enable-conv-img2col-transform

With all of those flags set, tensor cores will be used.
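
Put together, a compile command would look roughly like this (the model file name is a placeholder; the flags are the ones listed above):

```shell
# Sketch of a compile command targeting A100 tensor cores. The model file
# name is a placeholder; the three --iree-flow-* flags convert convolutions
# into padded matmuls so they can use the tensor core matmul pipeline.
iree-compile \
  --iree-hal-target-backends=cuda \
  --iree-hal-cuda-llvm-target-arch=sm_80 \
  --iree-flow-enable-padding-linalg-ops \
  --iree-flow-linalg-ops-padding-size=32 \
  --iree-flow-enable-conv-img2col-transform \
  resnet50.mlir -o resnet50_cuda.vmfb
```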

rsuderman commented 1 year ago

I would also recommend updating your IREE installation or building from HEAD. Your current version is from September 30th, which is quite out of date by this point.
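
If you installed via pip, something along these lines should pull newer builds (the --find-links URL is an assumption based on the page referenced in the IREE install docs; use whatever the current docs list):

```shell
# Upgrade the IREE pip packages. Newer (nightly/candidate) builds are published
# via a find-links page in the IREE docs; the URL below is an assumption.
python -m pip install --upgrade \
  --find-links https://iree-org.github.io/iree/pip-release-links.html \
  iree-compiler iree-runtime iree-tools-tf iree-tools-tflite
```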

navdeepkk-polymagelabs commented 1 year ago

> I would also recommend updating your IREE installation or building from HEAD. Your current version is from September 30th, which is quite out of date by this point.

Thanks. We are doing a performance comparison, and using git HEAD isn't ideal. Is there a recommended recent stable, well-performing commit to use?

Also, could the Python packages at https://pypi.org/project/iree-tools-tf/ please be updated? Although their release date is shown as November, the package version itself says it is from 2022-09-30.

navdeepkk-polymagelabs commented 1 year ago

> Currently convolutions don't have a codegen path that uses tensor cores by default; to get one, you need to set these flags to convert convolutions to matmuls: --iree-flow-enable-padding-linalg-ops --iree-flow-linalg-ops-padding-size=32 --iree-flow-enable-conv-img2col-transform
>
> With all of those flags set, tensor cores will be used.

Thanks. Is there a stability/performance reason these passes aren't enabled by default? The reason I'm asking is that we are doing a performance comparison and we'd like to use a uniform and standard set of flags across all models as much as possible.

stellaraccident commented 1 year ago

The target-arch level will always be required in some fashion to generate code which correctly exploits a hardware generation.

The others represent temporary passes that we added while implementing more generic/proper support for various features. Specifically:

  • CUDA implicit GEMM is moving forward now (in addition to some other projects focused on convolution performance specifically).
  • More holistic data layout and padding support is being developed now, which will get more things aligned to exploit the fast paths by default.

We don't like to enable options by default that are partial implementations that we are working to finish properly, and these each would be subsumed by active projects. There isn't anything wrong with them that we know of, and people who are using this for real work do set them. But they are not general.

navdeepkk-polymagelabs commented 1 year ago

> The target-arch level will always be required in some fashion to generate code which correctly exploits a hardware generation. The others represent temporary passes that we added while implementing more generic/proper support for various features. [...] There isn't anything wrong with them that we know of, and people who are using this for real work do set them. But they are not general.

Thanks for clarifying this. Sounds good.

navdeepkk-polymagelabs commented 1 year ago

> Currently convolutions don't have a codegen path that uses tensor cores by default; to get one, you need to set these flags to convert convolutions to matmuls: --iree-flow-enable-padding-linalg-ops --iree-flow-linalg-ops-padding-size=32 --iree-flow-enable-conv-img2col-transform

Hi @ThomasRaoux, I am at 6a59ff602 and these flags are absent in iree-compile. The only flags I found with the flow keyword are:


```
  --iree-flow-demote-f32-to-f16                                         - Converts all f32 ops and values into f16 counterparts unconditionally before main flow conversions.
  --iree-flow-demote-f64-to-f32                                         - Converts all f64 ops and values into f32 counterparts unconditionally before main flow conversions.
  --iree-flow-demote-i64-to-i32                                         - Converts all i64 ops and values into i32 counterparts unconditionally before main flow conversions.
  --iree-flow-dispatch-generate-workload-region                         - Generate the workload region
  --iree-flow-dispatch-use-transform-dialect=<string>                   - mlir file containing a top-level module that specifies the transformations to apply to form dispatch regions.
  --iree-flow-dump-dispatch-graph                                       - Dump a dot graph for dispatches
  --iree-flow-dump-dispatch-graph-output-file=<string>                  - Output file name for a dispatch graph dump
  --iree-flow-enable-aggressive-fusion                                  - Enable the aggressive fusion heuristic to fuse multiuse ops and ops with reduction loops
  --iree-flow-enable-data-tiling                                        - Enable data tiling path
  --iree-flow-enable-fuse-padding-into-linalg-consumer-ops              - Enable fusing tensor.pad ops into Linalg consumer ops
  --iree-flow-export-benchmark-funcs                                    - Exports one function per original module entry point and unique flow.executable that dispatches with dummy arguments.
  --iree-flow-inline-constants-max-byte-length=<int>                    - Maximum byte-length of constant that can be inlined into a dispatch region
  --iree-flow-normalize-input-indexing-map                              - Enable normalizing input indexing map to identity
  --iree-flow-promote-f16-to-f32                                        - Converts all f16 ops and values into f32 counterparts unconditionally before main flow conversions.
  --iree-flow-split-matmul-reduction=<long>                             - split ratio
  --iree-flow-topk-split-reduction=<long>                               - comma separated list of split ratios
  --iree-flow-trace-dispatch-tensors                                    - Trace runtime input/output tensors for each dispatch function.
  --iree-flow-zero-fill-empty-tensors
```
ThomasRaoux commented 1 year ago

> Hi @ThomasRaoux, I am at 6a59ff602 and these flags are absent in iree-compile.

Those flags changed as of yesterday's commit, sorry for the inconvenience. What you want to use on the latest IREE is: --iree-preprocessing-pass-pipeline="func.func(iree-convert-conv2d-to-img2col,iree-pad-linalg-ops{pad-size=16})"
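
Combined with the CUDA target flags from earlier, the compile command would look roughly like this (file names are placeholders):

```shell
# Sketch of the updated compile command; the preprocessing pipeline replaces
# the removed --iree-flow-* conv/padding flags. File names are placeholders.
iree-compile \
  --iree-hal-target-backends=cuda \
  --iree-hal-cuda-llvm-target-arch=sm_80 \
  --iree-preprocessing-pass-pipeline="func.func(iree-convert-conv2d-to-img2col,iree-pad-linalg-ops{pad-size=16})" \
  resnet50.mlir -o resnet50_cuda.vmfb
```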

navdeepkk-polymagelabs commented 1 year ago

> Those flags changed as of yesterday's commit, sorry for the inconvenience. What you want to use on the latest IREE is: --iree-preprocessing-pass-pipeline="func.func(iree-convert-conv2d-to-img2col,iree-pad-linalg-ops{pad-size=16})"

Hi @ThomasRaoux, thanks. This is executing now. However, tensor cores are still not being used, and there is no difference in performance with or without these flags. I am running ResNet50V2 from Keras. I suspect this is due to the data type of the operands to the mhlo.convolution ops: in the mhlo input all operands are fp32. Is there another flag I have missed that is required to cast the operands to fp16 and execute in mixed-precision mode on the tensor cores?
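
For what it's worth, this is how I am checking for tensor core activity (the metric name is what I believe Nsight Compute uses for the HMMA pipe; it may differ across GPU architectures and ncu versions, and the runtime flag names are again illustrative):

```shell
# Check whether any kernel in the run issues tensor-core (HMMA) instructions.
# The metric name and the iree-run-module flag names are assumptions and may
# vary with ncu version / GPU architecture / IREE release.
ncu --metrics sm__inst_executed_pipe_tensor_op_hmma.sum \
  iree-run-module --device=cuda --module=resnet50_cuda.vmfb \
  --function=predict --input=1x224x224x3xf32
```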

ThomasRaoux commented 1 year ago

> Is there another flag I have missed that is required to cast the operands to fp16 and execute in mixed-precision mode on the tensor cores?

We don't have a flag to automatically demote operands from fp32 to fp16. Could you share the mhlo IR?

navdeepkk-polymagelabs commented 1 year ago

Hi @ThomasRaoux, I am attaching two files, one with the default TensorFlow precision policy and the other with the mixed-float16 precision policy. The one with the mixed-float16 policy fails to lower with the following error:

iree_model.mlir:285:22: error: expected SSA operand
    %4 = mhlo.convert(%arg0) : (tensor<1x224x224x3xf32>) -> tensor<1x224x224x3xf16>

Attachments: Default policy, mixed-float16

navdeepkk-polymagelabs commented 1 year ago

> Hi @ThomasRaoux, I am attaching two files, one with the default TensorFlow precision policy and the other with the mixed-float16 precision policy.

Hi @ThomasRaoux, is there a way I can still use tensor cores for this IR?

allieculp commented 1 year ago

@ThomasRaoux Bumping this up, can you take a look?

ThomasRaoux commented 1 year ago

> @ThomasRaoux Bumping this up, can you take a look?

Sorry for missing this issue. We need implicit gemm support for this to happen.

allieculp commented 1 year ago

@ThomasRaoux @mattwalsh Setting as a P2 since we don't yet have the implicit gemm support for this - please bump up when needed.