Closed: mariecwhite closed this issue 4 months ago.
@MaheshRavishankar I believe this was assigned to you to investigate last week? Let us know if you have an update or need to delegate.
I missed it.... Let me take a look
I just tried this with main and I didn't get any OOM error.
A couple of points here to help with debugging:
1) Please include the compilation command as well (just the .vmfb is not enough to triage). I was able to figure it out from the linked Disable PR, so I got what I wanted here.
2) I actually didn't have a VM set up with an A100 (I went through the process of doing that to triage this error, so for the future I have what I need). It would be really useful to run git bisect if possible to narrow down the failing commit. I totally understand this might be a bit too much to ask, so I just went ahead and got set up to do this myself for the future... but any help in triage would be deeply appreciated.
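Since each build on a slow VM is expensive, it's worth noting that git bisect needs only O(log n) build-and-benchmark runs: it is just a binary search for the first bad commit. A minimal model of that search in Python (the commit names and the `is_bad` predicate are hypothetical stand-ins for the real build-and-run-the-benchmark step):

```python
def first_bad(commits, is_bad):
    """Binary search for the first failing commit.

    Assumes commits are ordered oldest to newest and that every commit
    after the first bad one also fails (the same assumption git bisect
    makes).
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid      # failure is at mid or earlier
        else:
            lo = mid + 1  # failure is strictly after mid
    return commits[lo]

# Hypothetical 16-commit history where the regression lands at commit11:
history = [f"commit{i}" for i in range(16)]
print(first_bad(history, lambda c: int(c[6:]) >= 11))  # → commit11
```

Only 4 of the 16 commits get "built" here; the same logarithmic saving is why bisecting beats testing integrate commits one by one.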
For now, moving to @mariecwhite to verify this is still an issue. Please pass it back to me if it is still an issue for you, ideally with some more information about how to repro.
Hi Mahesh, this is still happening: https://github.com/openxla/iree/actions/runs/5171496776/jobs/9315204607
Feel free to use my A100 machine for testing: mariewhite-benchmark.
It's not clear to me from the logs where to get the input model. My VM got into a bad state, but I have it working again now... If I can get a pointer to where to download the input model (the *.mlirbc file), that would help. I am guessing I tried the wrong file last time.
Sorry, the repro steps aren't very clear. File to download: https://storage.googleapis.com/iree-model-artifacts/tensorflow/tf_models_2.12.0_1683544084/T5_LARGE/batch_512/hlo.mlirbc
iree-compile --iree-hal-target-backends=cuda --iree-input-type=stablehlo --iree-hal-cuda-llvm-target-arch=sm_80 </path/to/model.mlirbc> -o </path/to/model.vmfb>
iree-benchmark-module --module=</path/to/model.vmfb> --function=forward --input=512x512xi32=0 --input=512x512xi32=0 --device_allocator=caching --device=cuda://0
OK, I can reproduce it now... Running bisection.
Did this ever work? I tried https://github.com/openxla/iree/commit/60197311ff9c46ec886582ed0f0b1c0d9ab07503 and https://github.com/openxla/iree/commit/a8a70fb2d and still got the OOM error.
We were getting latencies for this workload until May 25:
Do you mind trying with a commit between May 12 and May 24?
Tried https://github.com/openxla/iree/commit/92f985916 . Still failing.
I wonder if the runners have been updated. @pzread @GMNGeoffrey did the GPU runners get updated around May 25?
It sounds like you repro'd it on your own machine? How would the runners be implicated? Or is this still using artifacts from the runners? The GPU runners had their VM and Docker images updated this week for a CUDA upgrade, but that was after you encountered this. I updated the version of the GitHub Actions runner around that time as part of https://github.com/openxla/iree/issues/13350, but I can't think of a way that could be related... I can't think of any other asynchronous updates that happened here.
Mahesh can repro the error on commits both before and after the error was observed. This signals an issue outside of IREE, e.g. the execution environment changed, so I was wondering if the runners changed.
Ah, so a latent bug that was triggered by a change to the runners? Yeah, I think the only thing was the GitHub runner version. Irritatingly, we can't even test with the old runner version, because the GitHub control plane will refuse to talk to old versions after the next version has been out for 30 days.
Maybe look at changes to the benchmark suite definitions?
Digging into this a bit more, there appears to have been a regression in memory usage for most T5-Large models around that time, though strangely batch 512 got fixed:
ResNet50 and BertLarge have been stable.
To retrieve memory statistics, add --print_statistics=true to the iree-benchmark-module command.
(Slow iterating here because I am building on a slow VM across integrate commits)... but I just ran a commit from before the move to stablehlo (it is still compiling, which is in itself a concern)... maybe that has something to do with it.
Gave up on trying to see if this ever worked; I am assuming it never did. There is just one transient alloca here and it's 60 GB, so it obviously doesn't fit in memory. FYI @benvanik for visibility... I'll try to track down what can help reduce that (large model, need to dump the whole IR).
If you're already dumping things, could you grab the output after --compile-to=flow with elided attrs? I could quickly triage to see if any of that is going to be fixed by my current work 🤞
I have a 170 MB log...
I am getting the log for flow. I'll upload it as a gist. @benvanik want to hop on a call?
yeah give me 1 min
Here is the Flow gist https://gist.github.com/MaheshRavishankar/ac98a8b1ba7033349ed70d2cadfdc119
Updating with some more info on the issue here. This model seems to have the following pattern:
%collapsed_81 = tensor.collapse_shape %883 [[0, 1], [2], [3]] : tensor<512x16x512x512xf32> into tensor<8192x512x512xf32>
%collapsed_82 = tensor.collapse_shape %888 [[0, 1], [2], [3]] : tensor<512x16x512x64xf32> into tensor<8192x512x64xf32>
%889 = tensor.empty() : tensor<8192x512x64xf32>
%890 = linalg.fill ins(%cst_0 : f32) outs(%889 : tensor<8192x512x64xf32>) -> tensor<8192x512x64xf32>
%891 = linalg.batch_matmul ins(%collapsed_81, %collapsed_82 : tensor<8192x512x512xf32>, tensor<8192x512x64xf32>) outs(%890 : tensor<8192x512x64xf32>) -> tensor<8192x512x64xf32>
%expanded_83 = tensor.expand_shape %891 [[0, 1], [2], [3]] : tensor<8192x512x64xf32> into tensor<512x16x512x64xf32>
%892 = tensor.empty() : tensor<512x512x16x64xf32>
%893 = linalg.generic {indexing_maps = [#map11, #map3], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%expanded_83 : tensor<512x16x512x64xf32>) outs(%892 : tensor<512x512x16x64xf32>) {
^bb0(%in: f32, %out: f32):
linalg.yield %in : f32
} -> tensor<512x512x16x64xf32>
where the 16-extent dimension is first folded into the batch dimension of the linalg.batch_matmul and then transposed into an inner dimension. I initially thought it was einsum lowering, but the input IR itself has this pattern (which might well be an einsum lowered this way by the user). These are 8 GB tensors being materialized. It will take some work to reduce the memory overhead here... possibly a long-ish running task...
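To make the shape pattern above concrete, here is a NumPy sketch at reduced sizes (B, H, S, D are stand-ins I chose for the real 512, 16, 512, 64; reshape/matmul/transpose stand in for the collapse_shape, batch_matmul, expand_shape, and transposing linalg.generic — this only illustrates the shapes, not IREE's lowering):

```python
import numpy as np

B, H, S, D = 4, 2, 8, 4  # scaled-down batch, heads, seq-len, head-dim

scores = np.zeros((B, H, S, S), dtype=np.float32)  # tensor<512x16x512x512xf32>
values = np.zeros((B, H, S, D), dtype=np.float32)  # tensor<512x16x512x64xf32>

# tensor.collapse_shape: fold the heads dimension into the batch dimension
lhs = scores.reshape(B * H, S, S)  # tensor<8192x512x512xf32>
rhs = values.reshape(B * H, S, D)  # tensor<8192x512x64xf32>

# linalg.batch_matmul
out = lhs @ rhs                    # tensor<8192x512x64xf32>

# tensor.expand_shape, then the linalg.generic that moves heads inward
result = out.reshape(B, H, S, D).transpose(0, 2, 1, 3)
assert result.shape == (B, S, H, D)  # tensor<512x512x16x64xf32>

# At the real sizes, the collapsed LHS alone is 8192*512*512 f32 elements:
print(8192 * 512 * 512 * 4)  # → 8589934592 bytes, i.e. 8 GiB
```

The byte arithmetic on the last line is why each materialized intermediate is ~8 GB; a handful of such live buffers readily exceeds an A100's device memory.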
@allieculp I am still working on this, but is it P0?
ping @MaheshRavishankar @mariecwhite - do you think this is not P0 (and not a release-blocker)?
Yeah, I don't think it is a release blocker. This was happening on training graphs AFAIK and not on inference workloads; I will let @mariecwhite confirm. It's still P1 since we found the core issue, but that will need some work to fix.
This is actually happening on inference T5 graphs and looking at the latest plots, it still hasn't been resolved:
No, it hasn't been. The question is which batch size we need to focus on in the short term (there is a long-term issue here which is not as easy to resolve, so I will have to drop something else if this is needed as a release blocker).
Given no action in almost a year (despite the release-blocker label), I'm going to close this regression as stale/obsolete.
What happened?
Error when running TF T5-Large Batch 512. Regression occurred somewhere between 6019731 and 75ea924.
Steps to reproduce your issue
What component(s) does this issue relate to?
Runtime
Version information
Regression occurred somewhere between 6019731 and 75ea924
Additional context
No response