Open pifon2a opened 1 year ago
FWIW, as of 3693f68ceb32cf15ed1c1d2f5b7d88890fcd6af9 I still got a >10x slowdown when running BERT from MLPerf with `python run.py --backend=tf --scenario SingleStream`
using XLA-MLIR (`TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit" XLA_FLAGS="--xla_cpu_use_xla_runtime"`) vs. the default.
Looking at htop, the default uses all cores heavily, whereas all cores are quiet when using XLA. Is there some obvious flag I'm missing? Any suggested approach to narrow down the issue?
Thanks in advance, Thomas
> running BERT from MLPerf with `python run.py --backend=tf --scenario SingleStream` using XLA-MLIR
Would you mind providing a script/instructions to reproduce this? I'm guessing this issue only appears on ARM?
@RoboTux Is it a 10x slowdown compared to XLA:CPU Current, or just to TF? XLA:CPU Next/Current are single-threaded only; that might be the problem.
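(For reference, a minimal sketch of pinning TF to a single thread via the public `tf.config` API, so the no-XLA baseline is comparable to the single-threaded XLA runtime; the `TF_INTRA_OP_PARALLELISM_THREADS`/`TF_INTER_OP_PARALLELISM_THREADS` environment variables used below achieve the same thing.)

```python
# Sketch (not from the thread): force single-threaded TF execution so the
# no-XLA baseline matches the single-threaded XLA:CPU Next runtime.
# Must be called before any TensorFlow ops run.
import tensorflow as tf

tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(1)
```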
Hi there,
Sorry for the late reply. I was comparing default TF (no XLA) with XLA:CPU Next with auto partitioning on a 16-core Graviton 3 system (AWS c7g.4xlarge instance). As per your answer I tried a single-threaded setting, and the difference is then down to 3x.
To reproduce, I built the TF pip package locally, installed it, cloned the mlcommons/inference.git repository, and ran
`TF_INTRA_OP_PARALLELISM_THREADS=1 TF_INTER_OP_PARALLELISM_THREADS=1 TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit" XLA_FLAGS="--xla_cpu_use_xla_runtime" python run.py --backend=tf --scenario SingleStream`
vs.
`TF_INTRA_OP_PARALLELISM_THREADS=1 TF_INTER_OP_PARALLELISM_THREADS=1 python run.py --backend=tf --scenario SingleStream`
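Consolidated into one script (a sketch of the steps above: it assumes a locally built TF wheel is already installed, and the `language/bert` path inside mlcommons/inference is an assumption about where the BERT benchmark's `run.py` lives):

```sh
# Repro sketch based on the steps described above.
git clone https://github.com/mlcommons/inference.git
cd inference/language/bert  # assumed location of the BERT benchmark

# Baseline: default TF, single-threaded.
TF_INTRA_OP_PARALLELISM_THREADS=1 TF_INTER_OP_PARALLELISM_THREADS=1 \
  python run.py --backend=tf --scenario SingleStream

# XLA:CPU Next: same threading, plus auto-jit and the MLIR-based XLA runtime.
TF_INTRA_OP_PARALLELISM_THREADS=1 TF_INTER_OP_PARALLELISM_THREADS=1 \
  TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit" \
  XLA_FLAGS="--xla_cpu_use_xla_runtime" \
  python run.py --backend=tf --scenario SingleStream
```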
@sherhut @d0k @jreiffers
It would be interesting to benchmark XLA:CPU Next on ARM. I am starting this issue to track the progress and also to share information about the code location.
XLA:CPU uses MLIR tiling/fusion/vectorization transformations that exist in both OpenXLA and TF repos.
1. XLA:CPU compiler contains two important parts:
   - `HloXlaRuntimePipeline`: an MLIR pipeline that goes from HLO to Linalg + tHLO, then performs tiling/fusion and buffer allocation/optimizations, and emits structured control flow with scalars, vectors, and memrefs.
   - `XlaCpuCompilationPipeline`: lowers the result of `hlo-xla-runtime-pipeline` to LLVM.
2. Tiling, fusion and vectorization. `CpuTilingPipeline` finds fusion clusters, e.g. `map(matmul(transpose))` and `reduce(map)`; tiles the root, fuses all consumers in, and then vectorizes or scalarizes the loop bodies. There are many tests that fuse tHLO/Linalg ops in tests/Dialect/gml_st/cpu_tiling. This pipeline has options that affect tile sizes.
3. Vector optimizations and lowering to SCF. `LowerVectorsPass` is launched after bufferization. It rewrites higher-level vector ops, e.g. `vector.contract` and `vector.multi_reduction`, optimizes `vector.transfer_read`/`vector.transfer_write` ops, and then lowers the result to SCF by unrolling the vectors.
4. Enabling the MLIR pipeline for AOT compilation. The `tf_library` rule should have `mlir_components` set to `"HloLowering"`. Alternatively, depend on the implicitly defined MLIR library (name + `_mlir` suffix). A BUILD sketch follows below.