Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
binary
TensorFlow version
TF2.14
Custom code
No
OS platform and distribution
Linux Ubuntu 22.04
Mobile device
No response
Python version
3.10
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
On a machine with around 32 GB of memory (for example, an AWS c7g.4xl), MLPerf ResNet-50 offline inference fails with an OutOfMemory error on TF 2.14 and the nightly wheels. The same benchmark works fine on TF 2.13.
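An easy way to quantify the footprint on Linux, independent of the benchmark harness, is to read the process's peak RSS after a run; a minimal sketch (the 30 GB threshold below is illustrative, based on the c7g.4xl's ~32 GB):

```python
import resource

def peak_rss_gb():
    """Return this process's peak resident set size in GiB.

    On Linux, ru_maxrss is reported in kibibytes.
    """
    kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return kib / (1024 * 1024)

# After the inference run, compare against the machine's memory budget.
# Threshold is illustrative for a ~32 GB c7g.4xl host.
if peak_rss_gb() > 30:
    print("peak RSS exceeds the c7g.4xl budget")
```

Calling this at the end of the offline scenario makes the 25 GB vs. 67 GB difference between TF 2.13 and TF 2.14 directly visible.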
I've root-caused the issue to the commit below, which introduced an inter-op scheduler to improve performance for models with parallel ops. While this improves performance by 15% for MLPerf ResNet-50 batch mode on an r7g.16xl, it increases the memory footprint by 2.5x (from 25 GB to 67 GB).
commit d0cb12441747ef9fb14137cb99f0b6a17e22b5e4
Author: David Svantesson <david.svantesson@arm.com>
Date: Tue Jul 25 09:33:40 2023 -0700
PR #61235: Add inter scheduler support on AArch64
Imported from GitHub PR https://github.com/tensorflow/tensorflow/pull/61235
This PR adds support for inter op scheduler in the oneDNN + ACL build. It enables the creation of more than 1 scheduler inside ACL to increase performance of models with parallel ops.
For the benchmarked NLP models the average performance increase is 9%; for CV classification models it's around 2%.
The benchmarks below were done with the following PRs applied as patches:
#60026, #60723, #61110, #61114, #61093, #61123
We need to either reduce the memory footprint or allow the maximum limit to be set at runtime, similar to the LRU cache capacity.
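The runtime-settable limit being asked for is the same idea as capping an LRU cache by capacity; a sketch of the concept (the env var and function names here are hypothetical, not an existing TensorFlow knob):

```python
import os
from functools import lru_cache

# Hypothetical: read a capacity from the environment at startup, the way
# this report suggests the scheduler's max limit could be made tunable.
CAPACITY = int(os.environ.get("MAX_SCHEDULER_CACHE", "4"))

@lru_cache(maxsize=CAPACITY)
def make_scheduler(num_threads):
    # Stand-in for creating an ACL scheduler; only CAPACITY distinct
    # instances stay alive, older ones are evicted LRU-style.
    return {"threads": num_threads}

for n in [1, 2, 3, 4, 5, 1]:
    make_scheduler(n)
# cache_info().currsize never exceeds CAPACITY
```

With a cap like this, memory use stays bounded regardless of how many distinct scheduler configurations the workload touches.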
Standalone code to reproduce the issue
# clone the MLCommons inference repo
cd $HOME
git clone https://github.com/mlcommons/inference.git
cd inference
git checkout v2.0
cd loadgen
CFLAGS="-std=c++14" python3 setup.py bdist_wheel
pip3 install dist/*.whl
# download the resnet50 model and the dataset
wget https://zenodo.org/record/2535873/files/resnet50_v1.pb
ck pull repo:ck-env
echo 0 | ck install package --tags=image-classification,dataset,imagenet,aux
echo 1 | ck install package --tags=image-classification,dataset,imagenet,val
cp /CK-TOOLS/dataset-imagenet-ilsvrc2012-aux-from.berkeley/val.txt \
/CK-TOOLS/dataset-imagenet-ilsvrc2012-val-min/val_map.txt
# Run resnet50 inference in offline mode
export DATA_DIR=/CK-TOOLS/dataset-imagenet-ilsvrc2012-val-min
export MODEL_DIR=$HOME/
cd $HOME/inference/vision/classification_and_detection
./run_local.sh tf resnet50 cpu --scenario=Offline
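Since the regression starts at TF 2.14 and is also present in the nightlies, a small guard in the harness driver can flag affected wheels before a long run (purely illustrative; `tf_affected` is not part of the harness):

```python
# Warn if the installed TensorFlow is in the affected range (>= 2.14,
# which includes the nightly dev builds).
from importlib.metadata import version, PackageNotFoundError

def tf_affected(ver: str) -> bool:
    """True for TF versions where the ~2.5x memory regression is seen."""
    major, minor = (int(x) for x in ver.split(".")[:2])
    return (major, minor) >= (2, 14)

try:
    if tf_affected(version("tensorflow")):
        print("warning: TF >= 2.14, expect ~2.5x memory footprint")
except PackageNotFoundError:
    pass
```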
Issue type
Bug
Relevant log output