Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
binary
TensorFlow version
TF2.14
Custom code
No
OS platform and distribution
Linux Ubuntu 22.04
Mobile device
No response
Python version
3.10
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
On a machine with around 32 GB of memory (for example, an AWS c7g.4xl), MLPerf ResNet-50 offline inference fails with an OutOfMemory error on TF 2.14 and the nightly wheels. The same benchmark works fine on TF 2.13.
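An easy way to quantify the footprint on Linux, independent of the benchmark harness, is to read the process's peak RSS after a run; a minimal sketch (the 30 GB threshold below is illustrative, based on the c7g.4xl's ~32 GB):

```python
import resource

def peak_rss_gb():
    """Return this process's peak resident set size in GiB.

    On Linux, ru_maxrss is reported in kibibytes.
    """
    kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return kib / (1024 * 1024)

# After the inference run, compare against the machine's memory budget.
# Threshold is illustrative for a ~32 GB c7g.4xl host.
if peak_rss_gb() > 30:
    print("peak RSS exceeds the c7g.4xl budget")
```

Calling this at the end of the offline scenario makes the 25 GB vs. 67 GB difference between TF 2.13 and TF 2.14 directly visible.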
I've root-caused the issue to the commit below, which introduced an inter-op scheduler to improve performance for models with parallel ops. While this improves performance by 15% for MLPerf ResNet-50 batch mode on an r7g.16xl, it increases the memory footprint by 2.5x (from 25 GB to 67 GB).
commit d0cb12441747ef9fb14137cb99f0b6a17e22b5e4
Author: David Svantesson <david.svantesson@arm.com>
Date: Tue Jul 25 09:33:40 2023 -0700
PR #61235: Add inter scheduler support on AArch64
Imported from GitHub PR https://github.com/tensorflow/tensorflow/pull/61235
This PR adds support for inter op scheduler in the oneDNN + ACL build. It enables the creation of more than 1 scheduler inside ACL to increase performance of models with parallel ops.
For the benchmarked NLP models the average performance increase is 9%; for CV classification models it's around 2%.
The benchmarks below were done with the following PRs applied as patches:
#60026, #60723, #61110, #61114, #61093, #61123
We need to either reduce the memory footprint or allow the maximum limit to be set at runtime, similar to the LRU cache capacity.
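The runtime-settable limit being asked for is the same idea as capping an LRU cache by capacity; a sketch of the concept (the env var and function names here are hypothetical, not an existing TensorFlow knob):

```python
import os
from functools import lru_cache

# Hypothetical: read a capacity from the environment at startup, the way
# this report suggests the scheduler's max limit could be made tunable.
CAPACITY = int(os.environ.get("MAX_SCHEDULER_CACHE", "4"))

@lru_cache(maxsize=CAPACITY)
def make_scheduler(num_threads):
    # Stand-in for creating an ACL scheduler; only CAPACITY distinct
    # instances stay alive, older ones are evicted LRU-style.
    return {"threads": num_threads}

for n in [1, 2, 3, 4, 5, 1]:
    make_scheduler(n)
# cache_info().currsize never exceeds CAPACITY
```

With a cap like this, memory use stays bounded regardless of how many distinct scheduler configurations the workload touches.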
Standalone code to reproduce the issue
# clone the MLCommons inference repo
cd $HOME
git clone https://github.com/mlcommons/inference.git
cd inference
git checkout v2.0
cd loadgen
CFLAGS="-std=c++14" python3 setup.py bdist_wheel
pip3 install dist/*.whl
# download the resnet50 model and the dataset
wget https://zenodo.org/record/2535873/files/resnet50_v1.pb
ck pull repo:ck-env
echo 0 | ck install package --tags=image-classification,dataset,imagenet,aux
echo 1 | ck install package --tags=image-classification,dataset,imagenet,val
cp /CK-TOOLS/dataset-imagenet-ilsvrc2012-aux-from.berkeley/val.txt \
/CK-TOOLS/dataset-imagenet-ilsvrc2012-val-min/val_map.txt
# Run resnet50 inference in offline mode
export DATA_DIR=/CK-TOOLS/dataset-imagenet-ilsvrc2012-val-min
export MODEL_DIR=$HOME/
cd $HOME/inference/vision/classification_and_detection
./run_local.sh tf resnet50 cpu --scenario=Offline
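Since the regression starts at TF 2.14 and is also present in the nightlies, a small guard in the harness driver can flag affected wheels before a long run (purely illustrative; `tf_affected` is not part of the harness):

```python
# Warn if the installed TensorFlow is in the affected range (>= 2.14,
# which includes the nightly dev builds).
from importlib.metadata import version, PackageNotFoundError

def tf_affected(ver: str) -> bool:
    """True for TF versions where the ~2.5x memory regression is seen."""
    major, minor = (int(x) for x in ver.split(".")[:2])
    return (major, minor) >= (2, 14)

try:
    if tf_affected(version("tensorflow")):
        print("warning: TF >= 2.14, expect ~2.5x memory footprint")
except PackageNotFoundError:
    pass
```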
Issue type
Bug
Relevant log output