Compiling bidaf-9.onnx takes 170GB of memory

onnx / onnx-mlir

Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure

Apache License 2.0

Compiling bidaf-9.onnx takes 170GB of memory #2722

Closed by cjvolzka 6 months ago

cjvolzka commented 8 months ago

When I attempt to compile the bidaf-9 model from the onnx model zoo, compiling stops after about 7 minutes with no information.

Watching memory usage during compilation, it uses about 300 MB up to about the 5 min mark. After that, it starts to grow, reaching just short of 60 GB before it gets killed at 7 min, presumably by the Linux OOM killer as my system runs out of memory.
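For anyone reproducing this, here is a minimal sketch of one way to capture the compiler's peak memory and check whether it was killed by a signal, rather than watching a monitor by hand. It assumes a Linux system with `wait4` available; `RunResult` and `runAndMeasure` are hypothetical names made up for this example and are not part of onnx-mlir.

```cpp
#include <cassert>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Hypothetical result type for this sketch.
struct RunResult {
  int exitStatus; // exit code, or -1 if the child was signalled or could not be started
  int termSignal; // terminating signal (SIGKILL when the OOM killer strikes), or 0
  long peakRssKb; // peak resident set size of the child; kilobytes on Linux
};

// Runs the given command, waits for it, and reports how it ended and its peak RSS.
RunResult runAndMeasure(const std::vector<const char *> &args) {
  assert(!args.empty() && "need at least the program name");
  std::vector<char *> argv;
  for (const char *a : args)
    argv.push_back(const_cast<char *>(a));
  argv.push_back(nullptr); // execvp expects a null-terminated array

  pid_t pid = fork();
  if (pid < 0)
    return {-1, 0, 0};
  if (pid == 0) {
    execvp(argv[0], argv.data());
    _exit(127); // only reached if exec failed
  }

  int status = 0;
  struct rusage usage {};
  if (wait4(pid, &status, 0, &usage) < 0)
    return {-1, 0, 0};

  RunResult result{-1, 0, usage.ru_maxrss};
  if (WIFEXITED(status))
    result.exitStatus = WEXITSTATUS(status);
  else if (WIFSIGNALED(status))
    result.termSignal = WTERMSIG(status);
  return result;
}
```

Calling `runAndMeasure({"onnx-mlir", "bidaf-9.onnx"})` and finding `termSignal == SIGKILL` would be consistent with an OOM kill, although any other SIGKILL looks the same from here. For very small programs the reported peak can reflect the measuring process itself, so treat small values as approximate.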

cjvolzka commented 8 months ago

@negiyas reported that he was able to compile the model successfully, but it took 170 GB of memory.

@imaihal reported that if LLVM patch https://reviews.llvm.org/D148487 is applied, memory usage caps at 1 GB and compilation takes about 5 min.

@tungld do you have the bandwidth to see if we can get your LLVM patch merged into LLVM to fix the issue?

gongsu832 commented 8 months ago

I thought we fixed this problem, originally observed in https://github.com/onnx/onnx-mlir/issues/2084, last year, but I guess @tungld's LLVM patch was reverted due to some problem?

tungld commented 8 months ago

@gongsu832 yes, I did the LLVM patch, but it somehow caused flang in LLVM to fail, so it was reverted.

python3kgae commented 8 months ago

Hi all, I modified @tungld's LLVM patch so it doesn't crash the repro in https://github.com/llvm/llvm-project/issues/62802, which caused the patch to be reverted.

Could anyone help test whether it still helps the bidaf-9 model? (I don't know how to set up onnx-mlir to run an ONNX model :( )

Here's the modified LLVM patch: dangling-const.patch. I'll create a pull request to the LLVM repo once we can confirm the patch helps.

tungld commented 8 months ago

@python3kgae great, thanks for your patch! It looks like your patch is for old LLVM code. Do you have a patch for recent LLVM code?

tungld commented 8 months ago

@python3kgae I checked bidaf-9 with your patch, and memory consumption peaked at around 1.7 GB. So it does help bidaf-9. Thank you very much @python3kgae!

python3kgae commented 8 months ago

> @python3kgae great, thanks for your patch! It looks like your patch is for old LLVM code. Do you have a patch for recent LLVM code?

I'm using old LLVM code to test the old repro. I'll switch to recent LLVM code when I create the pull request to the LLVM repo.

python3kgae commented 8 months ago

Pull request created: https://github.com/llvm/llvm-project/pull/82708

python3kgae commented 8 months ago

I tried to run onnx-mlir bidaf-9.onnx but hit an error in https://github.com/onnx/onnx-mlir/blob/main/src/Conversion/ONNXToKrnl/Math/Reduction.cpp#L709 because estimatedSimdLoopTripCount is not initialized.

Is this expected for a Windows build of onnx-mlir?

gongsu832 commented 8 months ago

> I tried to run onnx-mlir bidaf-9.onnx but hit an error in https://github.com/onnx/onnx-mlir/blob/main/src/Conversion/ONNXToKrnl/Math/Reduction.cpp#L709 because estimatedSimdLoopTripCount is not initialized.
>
> Is this expected for a Windows build of onnx-mlir?

SIMD-related code typically only works on s390x Linux, so a failure on Windows isn't surprising. @AlexandreEichenberger should be able to provide a more definitive answer since he wrote most of the SIMD code.

python3kgae commented 8 months ago

Created a PR that creates only one globalOp for all strings in a string literal: https://github.com/onnx/onnx-mlir/pull/2727

This could save a lot of time when debugging the bidaf-9 model.
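As a rough illustration of the general idea (one storage object for many strings instead of one object per string), here is a small self-contained sketch. It is not the code from the PR: `PackedStrings`, `packStrings`, and `getString` are hypothetical names, and the NUL-terminated layout is an assumption made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical layout: every string lives in one buffer, and a side table
// records where each one starts.
struct PackedStrings {
  std::string data;                 // concatenated strings, each NUL-terminated
  std::vector<std::size_t> offsets; // offsets[i] = start of string i in data
};

// Packs all strings into a single buffer rather than one object per string.
PackedStrings packStrings(const std::vector<std::string> &strings) {
  PackedStrings packed;
  packed.offsets.reserve(strings.size());
  for (const std::string &s : strings) {
    packed.offsets.push_back(packed.data.size());
    packed.data.append(s);
    packed.data.push_back('\0');
  }
  return packed;
}

// Returns string i by reading from its offset up to the terminating NUL.
// Strings with embedded NUL characters would be truncated by this scheme.
std::string_view getString(const PackedStrings &packed, std::size_t i) {
  assert(i < packed.offsets.size() && "string index out of range");
  return std::string_view(packed.data.c_str() + packed.offsets[i]);
}
```

With this layout, a tensor of N strings needs one buffer and N offsets, so the number of top-level objects stays constant as N grows.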

AlexandreEichenberger commented 8 months ago

> I tried to run onnx-mlir bidaf-9.onnx but hit an error in https://github.com/onnx/onnx-mlir/blob/main/src/Conversion/ONNXToKrnl/Math/Reduction.cpp#L709 because estimatedSimdLoopTripCount is not initialized.
>
> Is this expected for a Windows build of onnx-mlir?

I believe this happens because on Windows we build with warnings treated as errors. If you don't mind, probably just adding = 0

 int64_t estimatedSimdLoopTripCount = 0;

here https://github.com/onnx/onnx-mlir/blob/01c5c9fb536a43cde36abccf562bb2f6cb594cb4/src/Conversion/ONNXToKrnl/Math/Reduction.cpp#L490 would fix the problem.

In general, SIMD works on x86 Linux, so I have to assume it does for Windows too.
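To show why initializing at the declaration helps, here is a minimal sketch of the pattern. `estimateTripCount` and its parameters are hypothetical and are not the actual code in Reduction.cpp; only the variable name is taken from the thread. When a variable is assigned on some paths only, compilers can report it as potentially uninitialized (for example MSVC's C4701 or GCC's -Wmaybe-uninitialized), and a warnings-as-errors build turns that into a failure.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration only.
int64_t estimateTripCount(bool useSimd, int64_t innerDim, int64_t vectorLength) {
  assert(vectorLength > 0 && "vector length must be positive");
  // Initializing here gives every path a defined value, so the compiler
  // no longer sees a read of a possibly uninitialized variable.
  int64_t estimatedSimdLoopTripCount = 0;
  if (useSimd)
    estimatedSimdLoopTripCount = innerDim / vectorLength;
  return estimatedSimdLoopTripCount;
}
```

Without the `= 0`, the non-SIMD path would return an indeterminate value, which is the kind of case such a warning is meant to catch.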

python3kgae commented 8 months ago

The fix is in https://github.com/llvm/llvm-project/commit/c11627c2f4d550613a3cb360c89a0cf52d2eb720

tungld commented 8 months ago

@python3kgae thanks so much!!!

python3kgae commented 8 months ago

> @python3kgae thanks so much!!!

Thank you for creating this project :)

cjvolzka commented 6 months ago

Closing, as this was fixed by a recent LLVM uplift.