iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

[Regression][CUDA] Increase in device peak memory usage in T5 Models #13801

Closed · mariecwhite closed this issue 4 months ago

mariecwhite commented 1 year ago

What happened?

Error when running TF T5-Large batch 512. The regression occurred somewhere between commits 6019731 and 75ea924.

work/runtime/src/iree/hal/drivers/cuda/cuda_allocator.c:346: INTERNAL; CUDA driver error 'CUDA_ERROR_OUT_OF_MEMORY' (2): out of memory; while invoking native function hal.device.queue.alloca; while calling import; 
[ 1]   native hal.device.queue.alloca:0 -
[ 0] bytecode module.forward:4374 [
    /work/build-e2e-test-artifacts/e2e_test_artifacts/model_587e595d-2adf-4e41-9617-43178a133725-batch-512_T5LargeTFBatch512.mlirbc:1042:12
      at /work/build-e2e-test-artifacts/e2e_test_artifacts/model_587e595d-2adf-4e41-9617-43178a133725-batch-512_T5LargeTFBatch512.mlirbc:511:3,
    /work/build-e2e-test-artifacts/e2e_test_artifacts/model_587e595d-2adf-4e41-9617-43178a133725-batch-512_T5LargeTFBatch512.mlirbc:1069:12

Steps to reproduce your issue

gsutil cp gs://iree-github-actions-postsubmit-artifacts/5089129233/1/e2e-test-artifacts/iree_T5LargeTFBatch512_module_374d3219ba3a0064fc5eccb0857f9ae7c37ad2c8183c930b59b33a6a6248d109/module.vmfb /tmp

iree-benchmark-module --module=/tmp/module.vmfb --function=forward --input=512x512xi32=0 --input=512x512xi32=0 --device_allocator=caching --device=cuda://0

What component(s) does this issue relate to?

Runtime

Version information

Regression occurred somewhere between 6019731 and 75ea924

Additional context

No response

allieculp commented 1 year ago

@MaheshRavishankar I believe this was assigned to you to investigate last week? Let us know if you have an update or need to delegate.

MaheshRavishankar commented 1 year ago

I missed it.... Let me take a look

MaheshRavishankar commented 1 year ago

I just tried this with main and I didn't get any OOM error.

MaheshRavishankar commented 1 year ago

A couple of points here to help with debugging:

1) Please include the compilation command as well (just the .vmfb is not enough to triage). I was able to figure it out from the linked Disable PR, so I got what I wanted here.
2) I didn't actually have a VM set up with an A100 (I went through the process of doing that to triage this error, so for the future I have what I need). It would be really useful to just run git bisect if possible to narrow down the failing commit, as sketched below.

I totally understand this might be a bit too much to ask, so I just went ahead and got set up to do this myself for the future... but any help in triage would be deeply appreciated.
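For reference, a bisection over the suspect range might look like the sketch below. This assumes a local IREE checkout, that 6019731 is the last known-good commit and 75ea924 the first known-bad one, and a hypothetical repro.sh script that rebuilds, compiles the model, and runs the benchmark, exiting non-zero on OOM:

    git bisect start
    git bisect bad 75ea924       # first commit known to hit the OOM
    git bisect good 6019731      # last commit known to pass
    git bisect run ./repro.sh    # hypothetical: build + iree-compile + iree-benchmark-module
    git bisect reset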

MaheshRavishankar commented 1 year ago

For now, moving this to @mariecwhite to verify it is still an issue. Please pass it back to me if it is still an issue for you, ideally with some more information about how to repro.

mariecwhite commented 1 year ago

Hi Mahesh, this is still happening: https://github.com/openxla/iree/actions/runs/5171496776/jobs/9315204607

Feel free to use my A100 machine for testing mariewhite-benchmark.

MaheshRavishankar commented 1 year ago

> Hi Mahesh, this is still happening: https://github.com/openxla/iree/actions/runs/5171496776/jobs/9315204607
>
> Feel free to use my A100 machine for testing mariewhite-benchmark.

It's not clear to me from the logs where to get the input model from. My VM had gotten into a bad state, but I have it working now... If I can get a pointer to where to download the input model (*.mlirbc file), that'd help. I am guessing I tried the wrong file last time.

mariecwhite commented 1 year ago

Sorry the repro steps aren't very clear. File to download: https://storage.googleapis.com/iree-model-artifacts/tensorflow/tf_models_2.12.0_1683544084/T5_LARGE/batch_512/hlo.mlirbc

iree-compile --iree-hal-target-backends=cuda --iree-input-type=stablehlo --iree-hal-cuda-llvm-target-arch=sm_80 </path/to/model.mlirbc> -o </path/to/model.vmfb>

iree-benchmark-module --module=</path/to/model.vmfb> --function=forward --input=512x512xi32=0 --input=512x512xi32=0 --device_allocator=caching --device=cuda://0

MaheshRavishankar commented 1 year ago

Ok, I can reproduce it now.... Running bisection.

MaheshRavishankar commented 1 year ago

Did this ever work? I tried https://github.com/openxla/iree/commit/60197311ff9c46ec886582ed0f0b1c0d9ab07503 and https://github.com/openxla/iree/commit/a8a70fb2d and still hit the OOM error.

mariecwhite commented 1 year ago

We were getting latencies for this workload until May 25:

[image: latency plot showing results for this workload up to May 25]

mariecwhite commented 1 year ago

Do you mind trying with a commit between May 12 and May 24?

MaheshRavishankar commented 1 year ago

Tried https://github.com/openxla/iree/commit/92f985916 . Still failing.

mariecwhite commented 1 year ago

I wonder if the runners have been updated. @pzread @GMNGeoffrey did the GPU runners get updated around May 25?

GMNGeoffrey commented 1 year ago

It sounds like you repro'ed on your own machine? How would the runners be implicated? Or is this still using artifacts from the runners? The GPU runners had their VM and Docker images updated this week for a CUDA upgrade, but that was after you encountered this. I updated the version of the GitHub Actions runner around that time as part of https://github.com/openxla/iree/issues/13350, but I can't think of a way that could be related... I can't think of any other asynchronous updates that happened here.

mariecwhite commented 1 year ago

Mahesh can repro the error on commits from both before and after the error was observed. That suggests an issue outside of IREE, e.g. a change in the execution environment, so I was wondering if the runners changed.

GMNGeoffrey commented 1 year ago

Ah so a latent bug that was triggered by a change to the runners? Yeah I think the only thing was the GitHub runner version. Irritatingly, we can't even test with the old runner version because the GitHub control plane will refuse to talk to old versions after the next version has been out for 30 days.

GMNGeoffrey commented 1 year ago

Maybe look at changes to the benchmark suite definitions?

mariecwhite commented 1 year ago

Digging into this a bit more, there looks to have been a regression in memory usage for most T5-Large models around that time, though strangely batch 512 got fixed:

[image: device peak memory plots for the T5-Large models around that time]

ResNet50 and BertLarge have been stable.

mariecwhite commented 1 year ago

To retrieve memory statistics, add --print_statistics=true to the iree-benchmark-module command.
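For example, appended to the benchmark command from the repro steps above:

    iree-benchmark-module --module=</path/to/model.vmfb> --function=forward --input=512x512xi32=0 --input=512x512xi32=0 --device_allocator=caching --device=cuda://0 --print_statistics=true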

MaheshRavishankar commented 1 year ago

(Slow iterating here because I am building on a slow VM across integrate commits.) I just ran a commit from before the move to StableHLO (still compiling, which is in itself a concern)... maybe that move has something to do with it.

MaheshRavishankar commented 1 year ago

Gave up on trying to see if this ever worked; I am assuming it never did. There is just one transient alloca here and it's 60 GB, so it obviously doesn't fit in memory. FYI @benvanik for visibility. I'll try to track down what can help reduce that (large model, need to dump the whole IR).

benvanik commented 1 year ago

If you're already dumping things, could you grab the output after --compile-to=flow with elided attrs? I could quickly triage to see if any of that is going to be fixed by my current work 🤞
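A sketch of how that dump could be produced, reusing the compile flags from the repro above. The --compile-to=flow flag is as requested; --mlir-elide-elementsattrs-if-larger is the standard MLIR printer option for eliding large constant attributes and is assumed here to be exposed by iree-compile:

    iree-compile --iree-hal-target-backends=cuda --iree-input-type=stablehlo \
      --iree-hal-cuda-llvm-target-arch=sm_80 \
      --compile-to=flow \
      --mlir-elide-elementsattrs-if-larger=16 \
      </path/to/model.mlirbc> -o /tmp/model.flow.mlir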

MaheshRavishankar commented 1 year ago

I have a 170 MB log...

MaheshRavishankar commented 1 year ago

I am getting the log for flow. I'll upload it as a gist. @benvanik want to hop on a call?

benvanik commented 1 year ago

yeah give me 1 min

MaheshRavishankar commented 1 year ago

Here is the Flow gist https://gist.github.com/MaheshRavishankar/ac98a8b1ba7033349ed70d2cadfdc119

MaheshRavishankar commented 1 year ago

Adding some more info on the issue here. This model seems to have the following pattern:

    %collapsed_81 = tensor.collapse_shape %883 [[0, 1], [2], [3]] : tensor<512x16x512x512xf32> into tensor<8192x512x512xf32>
    %collapsed_82 = tensor.collapse_shape %888 [[0, 1], [2], [3]] : tensor<512x16x512x64xf32> into tensor<8192x512x64xf32>
    %889 = tensor.empty() : tensor<8192x512x64xf32>
    %890 = linalg.fill ins(%cst_0 : f32) outs(%889 : tensor<8192x512x64xf32>) -> tensor<8192x512x64xf32>
    %891 = linalg.batch_matmul ins(%collapsed_81, %collapsed_82 : tensor<8192x512x512xf32>, tensor<8192x512x64xf32>) outs(%890 : tensor<8192x512x64xf32>) -> tensor<8192x512x64xf32>
    %expanded_83 = tensor.expand_shape %891 [[0, 1], [2], [3]] : tensor<8192x512x64xf32> into tensor<512x16x512x64xf32>
    %892 = tensor.empty() : tensor<512x512x16x64xf32>
    %893 = linalg.generic {indexing_maps = [#map11, #map3], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%expanded_83 : tensor<512x16x512x64xf32>) outs(%892 : tensor<512x512x16x64xf32>) {
    ^bb0(%in: f32, %out: f32):
      linalg.yield %in : f32
    } -> tensor<512x512x16x64xf32>

where the extent-16 dimension is first folded into the batch dimension of the linalg.batch_matmul and then transposed into an inner dimension. I initially thought it was the einsum lowering, but the input IR itself already has this pattern (which might well be an einsum lowered this way by the user). These are 8 GB tensors being materialized. It will take some work to reduce the memory overhead here... possibly a long-ish running task.
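For reference, a quick size check on the tensors in that snippet (f32 = 4 bytes):

    512 * 16 * 512 * 512 * 4 B = 8 GiB   (collapsed LHS, tensor<8192x512x512xf32>)
    512 * 16 * 512 * 64  * 4 B = 1 GiB   (RHS and result, tensor<8192x512x64xf32>)

so each collapsed LHS alone is an 8 GiB materialized tensor, consistent with the figure above.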

@allieculp I am still working on this, but is it P0?

aaron-schneider commented 1 year ago

ping @MaheshRavishankar @mariecwhite - do you think this is not P0 (and not a release-blocker)?

MaheshRavishankar commented 1 year ago

Yeah, I don't think it is a release blocker. This was happening on training graphs AFAIK and not on inference workloads; I will let @mariecwhite confirm. It's still P1 since we found the core issue, but it will need some work to fix.

mariecwhite commented 1 year ago

This is actually happening on inference T5 graphs, and looking at the latest plots, it still hasn't been resolved:

[image: latest device peak memory plots for the T5 models]

MaheshRavishankar commented 1 year ago

No, it hasn't been. The question is which batch size we need to focus on in the short term (there is a long-term issue here that is not as easy to resolve, so I will have to drop something else if this needs to be treated as a release blocker).

ScottTodd commented 4 months ago

Given no action in almost a year (despite the release-blocker label), I'm going to close this regression as stale/obsolete.