openxla / xla

A machine learning compiler for GPUs, CPUs, and ML accelerators
Apache License 2.0

Explore performance of XLA:CPU on ARM. #1667

Open pifon2a opened 1 year ago

pifon2a commented 1 year ago

@sherhut @d0k @jreiffers

It would be interesting to benchmark XLA:CPU Next on ARM. I am starting this issue to track the progress and also to share information about the code location.

XLA:CPU uses MLIR tiling/fusion/vectorization transformations that exist in both OpenXLA and TF repos.

The XLA:CPU compiler contains two important parts:

1. Tiling, fusion, and vectorization.

CpuTilingPipeline finds fusion clusters, e.g. map(matmul(transpose)) or reduce(map); it tiles the root of the cluster, fuses the remaining ops of the cluster into the tiled loop, and then vectorizes or scalarizes the loop bodies. There are many tests that fuse tHLO/Linalg ops in tests/Dialect/gml_st/cpu_tiling. The pipeline has options that affect tile sizes. (A schematic sketch of the tile-then-fuse idea follows at the end of this list.)

2. Vector optimizations and lowering to SCF.

LowerVectorsPass runs after bufferization. It rewrites higher-level vector ops, e.g. vector.contract and vector.multi_reduction, optimizes vector.transfer_read/write ops, and then lowers the result to SCF by unrolling the vectors.

Additionally, to enable the MLIR pipeline for AOT compilation:

The tf_library rule should have mlir_components set to "HloLowering" (see the BUILD sketch below).
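To make the tiling/fusion part concrete, here is a schematic, hand-written sketch of the tile-root-then-fuse idea in plain Python/NumPy. This is not XLA source code; the cluster (a relu map over matmul(transpose)), the tile size, and all names are illustrative assumptions.

```python
import numpy as np

# Schematic illustration (not XLA code) of what CpuTilingPipeline does for a
# cluster like map(matmul(transpose)): tile the root's iteration space, then
# compute the whole fused cluster inside each tile's loop body.
def fused_map_matmul_transpose(a: np.ndarray, b: np.ndarray, tile: int = 8) -> np.ndarray:
    """Computes relu(a @ b.T) one row-tile at a time."""
    m, n = a.shape[0], b.shape[0]
    out = np.empty((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        rows = a[i:i + tile]                  # a tile of the root's output rows
        acc = rows @ b.T                      # fused matmul(transpose) for this tile
        out[i:i + tile] = np.maximum(acc, 0)  # fused map (relu) on the same tile
    return out
```

The point of the transformation is that no full-size intermediate tensor is materialized: everything for one tile happens in a single loop body, which is what the pipeline then vectorizes or scalarizes.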
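And for the AOT note above, a minimal BUILD sketch of what setting mlir_components might look like; the target, graph, config, and class names are hypothetical, and only the mlir_components attribute and its value come from this issue.

```python
# BUILD (Bazel/Starlark) -- hypothetical AOT target with the MLIR pipeline enabled.
load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")

tf_library(
    name = "my_model_aot",               # hypothetical target name
    graph = "my_model.pbtxt",            # hypothetical frozen GraphDef
    config = "my_model.config.pbtxt",    # hypothetical feed/fetch config
    cpp_class = "mynamespace::MyModel",  # hypothetical generated class
    mlir_components = "HloLowering",     # enables the MLIR pipeline (per above)
)
```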

jreiffers commented 1 year ago

> The tf_library rule should have mlir_components set to "HloLowering".

Or alternatively, depend on the implicitly defined MLIR library (name + '_mlir' suffix).
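Sketching that alternative under the same hypothetical names as above: given the tf_library target my_model_aot, a dependent target can use the implicitly defined MLIR variant directly.

```python
# Hypothetical consumer of the implicitly defined "<name>_mlir" target.
cc_test(
    name = "my_model_mlir_test",       # hypothetical
    srcs = ["my_model_mlir_test.cc"],  # hypothetical
    deps = [":my_model_aot_mlir"],     # implicit MLIR library: name + '_mlir'
)
```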

RoboTux commented 1 year ago

FWIW, as of 3693f68ceb32cf15ed1c1d2f5b7d88890fcd6af9 I still get a >10x slowdown when running BERT from MLPerf with python run.py --backend=tf --scenario SingleStream using XLA-MLIR (TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit" XLA_FLAGS="--xla_cpu_use_xla_runtime") vs. the default.

Looking at htop, the default uses all cores heavily, whereas all cores stay almost idle when using XLA. Is there some obvious flag I'm missing? Any suggested approach to narrowing down the issue?

Thanks in advance, Thomas

jon-chuang commented 1 year ago

> running BERT from MLPerf with python run.py --backend=tf --scenario SingleStream using XLA-MLIR

Would you mind providing a script/instructions to reproduce this? I'm guessing this issue only appears on ARM?

pifon2a commented 1 year ago

@RoboTux is that a 10x slowdown compared to XLA:CPU Current, or just to TF? XLA:CPU Next/Current are single-threaded only; that might be the problem.

RoboTux commented 1 year ago

Hi there,

Sorry for the late reply. I was comparing default TF (no XLA) with XLA:CPU-Next with auto partitioning on a 16-core Graviton 3 system (AWS c7g.4xlarge instance). Following your answer, I retried in a single-threaded setting, and the difference drops to 3x.

To reproduce, I built the TF pip package locally, installed it, cloned the mlcommons/inference.git repository, and compared

TF_INTRA_OP_PARALLELISM_THREADS=1 TF_INTER_OP_PARALLELISM_THREADS=1 TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit" XLA_FLAGS="--xla_cpu_use_xla_runtime" python run.py --backend=tf --scenario SingleStream

against

TF_INTRA_OP_PARALLELISM_THREADS=1 TF_INTER_OP_PARALLELISM_THREADS=1 python run.py --backend=tf --scenario SingleStream