navdeepkk opened this issue 1 year ago
To use tensor cores you need to pass the right target, as IREE is meant for cross-compilation and doesn't query the target. On A100 you need to set --iree-hal-cuda-llvm-target-arch=sm_80 for the matmul op.
Convolutions currently don't have a codegen path using tensor cores by default; to get one, you need to set these flags to convert the conv to a matmul:
--iree-flow-enable-padding-linalg-ops --iree-flow-linalg-ops-padding-size=32
--iree-flow-enable-conv-img2col-transform
With all those flags set, tensor cores will be used.
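For concreteness, here is a rough sketch of how these flags might be combined into a single iree-compile invocation; the file names and the --iree-hal-target-backends=cuda backend selection are assumptions added for illustration, not part of the original comment:

```
# Hedged sketch: compile a model for an A100, forcing the conv-to-matmul path
# so tensor cores can be used (file names are placeholders).
iree-compile model.mlir \
  --iree-hal-target-backends=cuda \
  --iree-hal-cuda-llvm-target-arch=sm_80 \
  --iree-flow-enable-padding-linalg-ops \
  --iree-flow-linalg-ops-padding-size=32 \
  --iree-flow-enable-conv-img2col-transform \
  -o model.vmfb
```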
I would recommend updating your iree installation or building from head as well. Your current version is from September 30th which would be quite out of date by this point.
Thanks. We are doing a performance comparison and using the git HEAD isn't ideal. Is there a recommended most recent stable/performing commit to use?
Also, can the Python packages at https://pypi.org/project/iree-tools-tf/ please be updated? Though the release date of these is shown as Nov, the package name itself says it is from 30/09/2022.
Thanks. Is there a stability/performance reason these passes aren't enabled by default? The reason I'm asking is that we are doing a performance comparison and we'd like to use a uniform and standard set of flags across all models as much as possible.
The target-arch level will always be required in some fashion to generate code which correctly exploits a hardware generation.
The others represent temporary passes that we added while implementing more generic/proper support for various features. Specifically:
- CUDA implicit GEMM is moving forward now (in addition to some other projects focused on convolution performance specifically).
- More holistic data layout and padding support is being developed now, which will get more things aligned to exploit the fast paths by default.
We don't like to enable options by default that are partial implementations that we are working to finish properly, and these each would be subsumed by active projects. There isn't anything wrong with them that we know of, and people who are using this for real work do set them. But they are not general.
Thanks for clarifying this. Sounds good.
Hi @ThomasRaoux, I am at 6a59ff602 and these flags are absent in `iree-compile`. The only flags that I found with the `flow` keyword were:
```
--iree-flow-demote-f32-to-f16 - Converts all f32 ops and values into f16 counterparts unconditionally before main flow conversions.
--iree-flow-demote-f64-to-f32 - Converts all f64 ops and values into f32 counterparts unconditionally before main flow conversions.
--iree-flow-demote-i64-to-i32 - Converts all i64 ops and values into i32 counterparts unconditionally before main flow conversions.
--iree-flow-dispatch-generate-workload-region - Generate the workload region
--iree-flow-dispatch-use-transform-dialect=<string> - mlir file containing a top-level module that specifies the transformations to apply to form dispatch regions.
--iree-flow-dump-dispatch-graph - Dump a dot graph for dispatches
--iree-flow-dump-dispatch-graph-output-file=<string> - Output file name for a dispatch graph dump
--iree-flow-enable-aggressive-fusion - Enable the aggressive fusion heuristic to fuse multiuse ops and ops with reduction loops
--iree-flow-enable-data-tiling - Enable data tiling path
--iree-flow-enable-fuse-padding-into-linalg-consumer-ops - Enable fusing tensor.pad ops into Linalg consumer ops
--iree-flow-export-benchmark-funcs - Exports one function per original module entry point and unique flow.executable that dispatches with dummy arguments.
--iree-flow-inline-constants-max-byte-length=<int> - Maximum byte-length of constant that can be inlined into a dispatch region
--iree-flow-normalize-input-indexing-map - Enable normalizing input indexing map to identity
--iree-flow-promote-f16-to-f32 - Converts all f16 ops and values into f32 counterparts unconditionally before main flow conversions.
--iree-flow-split-matmul-reduction=<long> - split ratio
--iree-flow-topk-split-reduction=<long> - comma separated list of split ratios
--iree-flow-trace-dispatch-tensors - Trace runtime input/output tensors for each dispatch function.
--iree-flow-zero-fill-empty-tensors
```
Those flags changed starting with yesterday's commit. Sorry for the inconvenience. What you want to use on the latest IREE is:
--iree-preprocessing-pass-pipeline="func.func(iree-convert-conv2d-to-img2col,iree-pad-linalg-ops{pad-size=16})"
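As a sketch, on a newer build the same intent would be expressed with this preprocessing pipeline instead of the three old flow flags; the backend selection and file names below are assumptions for illustration:

```
# Hedged sketch for a recent IREE build (file names are placeholders).
iree-compile model.mlir \
  --iree-hal-target-backends=cuda \
  --iree-hal-cuda-llvm-target-arch=sm_80 \
  --iree-preprocessing-pass-pipeline="func.func(iree-convert-conv2d-to-img2col,iree-pad-linalg-ops{pad-size=16})" \
  -o model.vmfb
```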
Hi @ThomasRaoux, thanks. This is executing now. However, tensor cores are not getting used, nor is there any difference in perf with or without these flags. I am running ResNet50V2 from Keras. I suspect this is due to the data type of the operands to the `mhlo.convolution` op; I see in the `mhlo` input that all operands are `fp32`. Is there another flag I missed that is required to cast the operands to `fp16` and execute in mixed-precision mode on the tensor cores?
We don't have a flag to automatically demote operands from fp32 to fp16. Could you share the mhlo IR?
Hi @ThomasRaoux, I am attaching two files: one with the default TensorFlow precision policy and the other with the mixed-float16 precision policy. The one with the mixed-float16 policy fails to lower with the following error:
iree_model.mlir:285:22: error: expected SSA operand
%4 = mhlo.convert(%arg0) : (tensor<1x224x224x3xf32>) -> tensor<1x224x224x3xf16>
Hi @ThomasRaoux, is there a way I can still use tensor cores for this IR?
@ThomasRaoux Bumping this up, can you take a look?
Sorry for missing this issue. We need implicit gemm support for this to happen.
@ThomasRaoux @mattwalsh Setting as a P2 since we don't yet have the implicit gemm support for this - please bump up when needed.
Hi all,
Following the instructions here (https://iree-org.github.io/iree/deployment-configurations/gpu-cuda-rocm/), I am trying to run ResNet50 using the IREE command-line tools downloaded via pip. However, upon profiling the model with Nsight Compute, I see that the model is not using tensor cores.
Is there a flag/env_var that needs to be set to enable tensor cores? Any suggestion would be appreciated.
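For reference, a rough sketch of the kind of Nsight Compute check I mean; the metric name and the iree-run-module flag names are assumptions on my part and may differ between releases:

```
# Sketch only: flag and metric names may differ between IREE / Nsight Compute
# versions; module, function, and input are placeholders.
ncu --metrics sm__inst_executed_pipe_tensor.sum \
    iree-run-module --device=cuda --module=resnet50.vmfb \
    --function=predict --input="1x224x224x3xf32=0"
# A nonzero value for the tensor pipe metric would indicate tensor-core use.
```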
Thanks
Package versions: