aviator19941 opened this issue 4 months ago
lol I'm guessing something is multiplying by a dynamic dimension (sentinel -1) without checking :P
(to reproduce we'll need the batch_llama_3_8B.mlir file, or the entire contents of the @prefill_bs4$async_dispatch_1_generic_4xDx4096_i64xf32 flow.executable/hal.executable op prior to the error)
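For what it's worth, that guess is easy to demonstrate in isolation. Below is a minimal plain-C++ sketch (illustrative only, not IREE code; kDynamic here stands in for MLIR's dynamic-dimension sentinel) showing how folding the sentinel into a size product yields a negative byte count like the one in this report.

#include <cstdint>
#include <vector>

constexpr int64_t kDynamic = -1; // stand-in for the dynamic ('?') dim sentinel

// BUG: folds the -1 sentinel into the product as if it were a real extent.
int64_t allocBytesNaive(const std::vector<int64_t> &shape, int64_t elemBytes) {
  int64_t bytes = elemBytes;
  for (int64_t d : shape)
    bytes *= d;
  return bytes;
}

// Checked variant: bail out (or take a fallback path) on dynamic dims.
int64_t allocBytesChecked(const std::vector<int64_t> &shape, int64_t elemBytes) {
  int64_t bytes = elemBytes;
  for (int64_t d : shape) {
    if (d == kDynamic)
      return kDynamic; // caller must resolve the dynamic size first
    bytes *= d;
  }
  return bytes;
}

// allocBytesNaive({4, kDynamic, 4096}, /*elemBytes=*/4) == -65536: a negative
// "bytes of shared memory" figure, of the same flavor as the one reported here.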
Yeah, I'll upload it here; I accidentally submitted the issue before uploading it :)
@benvanik I uploaded a zip that has the batch_llama_3_8B.mlir file
It looks like it failed in SetEncoding (or related passes). @pashu123 given that you want to get more involved in these tasks, would you like to triage the issue when you're available?
@aviator19941 Do we need to cherry-pick a commit or check out a branch? On the main branch I am noticing this:
batch_llama_3_8B.mlir:1003:12: error: 'flow.tensor.reshape' op operand #2 must be variadic of index, but got 'i64'
%339 = torch.aten.view %333, %338 : !torch.vtensor<[4,?,32,128],f32>, !torch.list<int> -> !torch.vtensor<[4,?,32,64,2],f32>
^
batch_llama_3_8B.mlir:1003:12: note: see current operation: %352 = "flow.tensor.reshape"(%331, %305, %351) <{operandSegmentSizes = array<i32: 1, 1, 1>}> : (tensor<4x?x4096xf32>, index, i64) -> tensor<4x?x32x64x2xf32>
We need to cherry-pick https://github.com/iree-org/iree/pull/17182 for the first command to work.
Here's the minimal repro https://gist.github.com/pashu123/45fe64caa21cfdfa9890698660184a44
This is failing at // -----// IR Dump After GPUCheckResourceUsage Failed (iree-codegen-gpu-check-resource-usage) //----- //.
@aviator19941 The failure is due to https://gist.github.com/pashu123/020217a35f1c643ed03b169ce41f68d9 (embedding kernel). It has a cast from fp16 -> fp32. Please double-check that it's a full fp16 model. Also, could you post how to obtain the IRs?
A possible optimization: the sequence
%0 = torch.prims.convert_element_type %arg1, %int6 : !torch.vtensor<[128256,4096],f16>, !torch.int -> !torch.vtensor<[128256,4096],f32>
%1 = torch.aten.embedding %0, %arg0, %int-1, %false_0, %false : !torch.vtensor<[128256,4096],f32>, !torch.vtensor<[4,?],si64>, !torch.int, !torch.bool, !torch.bool -> !torch.vtensor<[4,?,4096],f32>
return %1 : !torch.vtensor<[4,?,4096],f32>
can be replaced by
%0 = torch.aten.embedding %arg1, %arg0, %int-1, %false_0, %false : !torch.vtensor<[128256,4096],f16>, !torch.vtensor<[4,?],si64>, !torch.int, !torch.bool, !torch.bool -> !torch.vtensor<[4,?,4096],f16>
%1 = torch.prims.convert_element_type %0, %int6 : !torch.vtensor<[4,?,4096],f16>, !torch.int -> !torch.vtensor<[4,?,4096],f32>
return %1 : !torch.vtensor<[4,?,4096],f32>
i.e., we don't need to cast the entire embedding matrix; we only cast the rows we actually gather out of it. The above repro takes forever to compile on the CPU backend. However, when we apply the optimization, we don't get the error: func.func op uses -46137344 bytes of shared memory; exceeded the limit of 65536 bytes.
It does not compile because vector.gather is lowered to a huge number of vector ops -- which should be fixed.
The other issue is that we have two generic ops that are not fused in TileAndFuse, because there is no operand dependency between them; the producer is only reachable through the tensor.extract inside the consumer's body (see the sketch after the IR below). It should be fixed before sending it to codegen. I don't have a good solution so far; perhaps we should just disable the fusion for this kind of case. @MaheshRavishankar do you have any suggestions?
func.func @decode_bs4$async_dispatch_0_generic_4xDx4096_i64xf32() {
%c0 = arith.constant 0 : index
%c32_i64 = arith.constant 32 : i64
%0 = hal.interface.constant.load[0] : i32
%1 = hal.interface.constant.load[1] : i32
%2 = arith.extui %0 : i32 to i64
%3 = arith.extui %1 : i32 to i64
%4 = arith.shli %3, %c32_i64 : i64
%5 = arith.ori %2, %4 : i64
%6 = arith.index_castui %5 : i64 to index
%7 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<128256x4096xf16>>
%8 = flow.dispatch.workload.ordinal %6, 0 : index
%9 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x?xi64>>{%8}
%10 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<4x?x4096xf32>>{%8}
%11 = flow.dispatch.tensor.load %7, offsets = [0, 0], sizes = [128256, 4096], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<128256x4096xf16>> -> tensor<128256x4096xf16>
%12 = flow.dispatch.tensor.load %9, offsets = [0, 0], sizes = [4, %8], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<4x?xi64>>{%8} -> tensor<4x?xi64>
%13 = tensor.empty(%8) : tensor<4x?x4096xf32>
%14 = tensor.empty() : tensor<128256x4096xf32>
%15 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%11 : tensor<128256x4096xf16>) outs(%14 : tensor<128256x4096xf32>) {
^bb0(%in: f16, %out: f32):
%17 = arith.extf %in : f16 to f32
linalg.yield %17 : f32
} -> tensor<128256x4096xf32>
%16 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%12 : tensor<4x?xi64>) outs(%13 : tensor<4x?x4096xf32>) {
^bb0(%in: i64, %out: f32):
%17 = arith.index_cast %in : i64 to index
%18 = linalg.index 2 : index
%extracted = tensor.extract %15[%17, %18] : tensor<128256x4096xf32>
linalg.yield %extracted : f32
} -> tensor<4x?x4096xf32>
flow.dispatch.tensor.store %16, %10, offsets = [0, 0, 0], sizes = [4, %8, 4096], strides = [1, 1, 1] : tensor<4x?x4096xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x?x4096xf32>>{%8}
return
}
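To make the missed-fusion point concrete: operand-driven tiling/fusion walks the consumer's SSA operands to find producers, but in the IR above %15 never appears among %16's operands; it is only reachable through the tensor.extract inside the body. A rough C++ sketch (a hypothetical helper, not actual IREE code) of the body-level check this would need:

#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "llvm/ADT/STLExtras.h"
using namespace mlir;

// Returns true if `consumer` depends on `producer` only through ops inside
// its body region (e.g. tensor.extract), not through its SSA operands.
// Purely operand-driven fusion never sees such a dependency.
static bool dependsOnlyViaBody(linalg::GenericOp consumer, Value producer) {
  if (llvm::is_contained(consumer->getOperands(), producer))
    return false; // ordinary operand dependency; fusion can see this
  bool usedInBody = false;
  consumer.getRegion().walk([&](Operation *op) {
    if (llvm::is_contained(op->getOperands(), producer))
      usedInBody = true;
  });
  return usedInBody;
}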
When I try to set the activation and attention dtypes to fp16 here, I run into
convertScalarToDtype should handle all the types
UNREACHABLE executed at iree/third_party/torch-mlir/lib/Conversion/Utils/Utils.cpp:355!
because it is trying to multiply complex<f16> and complex<f32> (repro). So I think it has to do with some dtype in the model that should be fp16, but is not.
In order to obtain the IRs:
I think I can add the fix for this. It is required to enable the full fp16 precision model.
@aviator19941 You can get the latest fp16 IR with wget https://huggingface.co/prashantk/test_files/resolve/main/batch_llama_v1.mlir?download=true
It's able to generate the .vmfb when targeting the llvm-cpu backend with the command
iree-compile -iree-input-type=torch --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host batch_llama_v1.mlir -iree-opt-demote-i64-to-i32 -o llama3.vmfb
You need to cherry-pick https://github.com/iree-org/iree/pull/17247
I think there are still action items in this issue; the look-up table fusion is scaring me. We should fix that at least. The tile sizes for vector.gather are problematic: they will be fully unrolled, which looks really bad.
I never intended to close the issue; I don't know if it got closed automatically. Yes, for the mixed precision case in which we have activations represented as f32, we still have action items to do.
Confirmed that the fusion is not expected. @MaheshRavishankar will fix it.
For the gather codegen issue, @pashu123 could you create an input case for the generic op and see what's happening? I'm expecting that some dimensions will be collapsed, and the next issue could be tile size selection. https://github.com/iree-org/iree/pull/17227 could help, but there could be other issues remaining on the table.
%16 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%12 : tensor<4x?xi64>) outs(%13 : tensor<4x?x4096xf32>) {
^bb0(%in: i64, %out: f32):
%17 = arith.index_cast %in : i64 to index
%18 = linalg.index 2 : index
%extracted = tensor.extract %15[%17, %18] : tensor<128256x4096xf32>
linalg.yield %extracted : f32
} -> tensor<4x?x4096xf32>
Well, I said the fusion is not expected because I wasn't looking at it properly. It is expected, and I think it is probably what you want at the dispatch level. If we don't fuse this we will materialize a tensor of 128256x4096x4 bytes, which is completely unnecessary.
The real issue, though, is that the op shouldn't be lowered this way. A better representation would be:
%8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%2 : tensor<4x?xi64>) outs(%3 : tensor<4x?x4096xf32>) {
^bb0(%in: i64, %out: f32):
%9 = arith.index_cast %in : i64 to index
%10 = linalg.index 2 : index
%extracted = tensor.extract %5[%9, %10] : tensor<128256x4096xf16>
%extracted_f32 = arith.extf %extracted : f16 to f32
linalg.yield %extracted_f32 : f32
} -> tensor<4x?x4096xf32>
That should fix one of the issues Hanhan mentioned. If we can fix the frontend to do this, that would be best. If not, then we should just write an ad-hoc pattern that does this kind of fusion; there is really nothing structured about this to generalize. It is just a specific pattern that works around a front-end lowering issue.
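For illustration, such an ad-hoc pattern could look roughly like the sketch below (untested, and not actual IREE code; the identity indexing-map and single-use checks are elided as a TODO). It folds a tensor.extract of an f16-to-f32 extension generic into an extract of the f16 source plus an inline arith.extf, which is exactly the rewrite shown above:

#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "mlir/IR/PatternMatch.h"
using namespace mlir;

// Sketch: fold `tensor.extract` of an elementwise f16->f32 extension
// linalg.generic into an extract from the f16 source plus an inline extf.
struct FoldExtractOfExtGeneric : OpRewritePattern<tensor::ExtractOp> {
  using OpRewritePattern::OpRewritePattern;
  LogicalResult matchAndRewrite(tensor::ExtractOp extractOp,
                                PatternRewriter &rewriter) const override {
    auto producer = extractOp.getTensor().getDefiningOp<linalg::GenericOp>();
    if (!producer || producer.getNumDpsInputs() != 1)
      return failure();
    // Match a body that is exactly `yield extf(block_arg0)`.
    Block &body = producer.getRegion().front();
    auto yieldOp = cast<linalg::YieldOp>(body.getTerminator());
    auto extOp = yieldOp.getOperand(0).getDefiningOp<arith::ExtFOp>();
    if (!extOp || extOp.getIn() != body.getArgument(0))
      return failure();
    // TODO: verify both indexing maps are identities before reusing indices.
    Value narrow = rewriter.create<tensor::ExtractOp>(
        extractOp.getLoc(), producer.getDpsInputOperand(0)->get(),
        extractOp.getIndices());
    rewriter.replaceOpWithNewOp<arith::ExtFOp>(extractOp, extractOp.getType(),
                                               narrow);
    return success();
  }
};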
A possible optimization: the sequence
%0 = torch.prims.convert_element_type %arg1, %int6 : !torch.vtensor<[128256,4096],f16>, !torch.int -> !torch.vtensor<[128256,4096],f32>
%1 = torch.aten.embedding %0, %arg0, %int-1, %false_0, %false : !torch.vtensor<[128256,4096],f32>, !torch.vtensor<[4,?],si64>, !torch.int, !torch.bool, !torch.bool -> !torch.vtensor<[4,?,4096],f32>
return %1 : !torch.vtensor<[4,?,4096],f32>
can be replaced by
%0 = torch.aten.embedding %arg1, %arg0, %int-1, %false_0, %false : !torch.vtensor<[128256,4096],f16>, !torch.vtensor<[4,?],si64>, !torch.int, !torch.bool, !torch.bool -> !torch.vtensor<[4,?,4096],f16>
%1 = torch.prims.convert_element_type %0, %int6 : !torch.vtensor<[4,?,4096],f16>, !torch.int -> !torch.vtensor<[4,?,4096],f32>
return %1 : !torch.vtensor<[4,?,4096],f32>
i.e., we don't need to cast the entire embedding matrix; we only cast the rows we actually gather out of it. The above repro takes forever to compile on the CPU backend. However, when we apply the optimization, we don't get the error: func.func op uses -46137344 bytes of shared memory; exceeded the limit of 65536 bytes.
@MaheshRavishankar, does this sound reasonable to add to torch-mlir?
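For illustration, such a torch-mlir rewrite might look roughly like the sketch below (hypothetical: the op and type accessor names are assumptions based on torch-mlir's generated op definitions, and this is not the contents of any actual patch):

#include "torch-mlir/Dialect/Torch/IR/TorchOps.h"
using namespace mlir;
using namespace mlir::torch;

// Sketch: rewrite convert_element_type(table) -> embedding into
// embedding(table) -> convert_element_type, so only the gathered rows are
// converted rather than the whole 128256x4096 table.
struct SinkConvertPastEmbedding
    : OpRewritePattern<Torch::AtenEmbeddingOp> {
  using OpRewritePattern::OpRewritePattern;
  LogicalResult matchAndRewrite(Torch::AtenEmbeddingOp op,
                                PatternRewriter &rewriter) const override {
    auto convert =
        op.getWeight().getDefiningOp<Torch::PrimsConvertElementTypeOp>();
    if (!convert)
      return failure();
    // The new embedding keeps the pre-conversion (f16) element type.
    auto resultTy = cast<Torch::ValueTensorType>(op.getType());
    auto weightTy = cast<Torch::ValueTensorType>(convert.getA().getType());
    auto narrowTy = resultTy.getWithSizesAndDtype(
        resultTy.getOptionalSizes(), weightTy.getOptionalDtype());
    Value narrow = rewriter.create<Torch::AtenEmbeddingOp>(
        op.getLoc(), narrowTy, convert.getA(), op.getIndices(),
        op.getPaddingIdx(), op.getScaleGradByFreq(), op.getSparse());
    rewriter.replaceOpWithNewOp<Torch::PrimsConvertElementTypeOp>(
        op, op.getType(), narrow, convert.getDtype());
    return success();
  }
};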
Added here: https://github.com/llvm/torch-mlir/pull/3277
Not sure why this keeps closing
FYI, if you can make the torch embedding lookup good, that is best. But I also carved this out as a potential special op: it would be trivial to write a custom op at the frontend that expands to whatever linalg you want.
@pashu123 put a "fixes" command in a commit message, and now anyone who has write access to the repo will close it when they merge that commit into their forks of whatever :P https://github.com/aartbik/torch-mlir/commit/8c48135a426b84fa412b031fc92e12826ff60b31
With that said, moving the cast across the embedding lookup is a common optimization.
I'm a bit worried that the default path on this generates basically unusable code, though.
That's fair, but we just don't represent gathers well. And if we clone the quantization into all the dispatches that use it (as we do now, under the current understanding of the best way to handle dequantization), none of the transformations can actually fuse and generate this code. The producer-consumer dependency only materializes from within the body of the consumer; nothing accounts for that, and it just falls off the cliff.
Why is GitHub unable to prevent actions on forks from spamming main repos... Seems like a big anti-feature.
@aviator19941 Do you have instructions on how to run llama3 for the IREE backend?
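(For reference, a compiled .vmfb is generally invoked with iree-run-module; the function name and inputs below are placeholders that depend on how this model was exported, so treat this as a shape of the command rather than working instructions:)
iree-run-module --module=llama3.vmfb --device=local-task --function=prefill_bs4 --input=4x128xi64=0 ...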
The more I think about this, the more it seems worth just doing the fusion of
%15 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%11 : tensor<128256x4096xf16>) outs(%14 : tensor<128256x4096xf32>) {
^bb0(%in: f16, %out: f32):
%17 = arith.extf %in : f16 to f32
linalg.yield %17 : f32
} -> tensor<128256x4096xf32>
%16 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%12 : tensor<4x?xi64>) outs(%13 : tensor<4x?x4096xf32>) {
^bb0(%in: i64, %out: f32):
%17 = arith.index_cast %in : i64 to index
%18 = linalg.index 2 : index
%extracted = tensor.extract %15[%17, %18] : tensor<128256x4096xf32>
linalg.yield %extracted : f32
} -> tensor<4x?x4096xf32>
to
%8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%2 : tensor<4x?xi64>) outs(%3 : tensor<4x?x4096xf32>) {
^bb0(%in: i64, %out: f32):
%9 = arith.index_cast %in : i64 to index
%10 = linalg.index 2 : index
%extracted = tensor.extract %5[%9, %10] : tensor<128256x4096xf16>
%extracted_f32 = arith.extf %extracted : f16 to f32
linalg.yield %extracted_f32 : f32
} -> tensor<4x?x4096xf32>
as a one-off canonicalization for now, so we don't fall off a cliff. It might be hard to make it future-proof, but more examples will help. @IanWood1, just FYI, this is something for us to discuss (and for you to pick up as a simple task). Please make sure we chat about this next time we sync.
Agreed on handling this even if it isn't generalized, as it's pretty catastrophic to clone embeddings.
I think the more durable fix may be proper propagation: we should sink any exts down / hoist truncs up across memcpy-like ops (such as this gather, or a scatter). We may be in a better situation with the current logic, but we still want to ensure we don't materialize ext/trunc dispatches unless absolutely required.
Noting that this issue also occurs with some other models. In the SHARK-TestSuite, onnx/models/RAFT_vaiq_int8 encounters a similar issue. To reproduce, set up the test suite and run
python run.py --cachedir=/path/to/.cache/ -t onnx/models/RAFT_vaiq_int8/ -m onnx -c /path/to/torch-mlir/build/ -i /path/to/iree-build/ --torchtolinalg
with an up-to-date torch-mlir and iree build.
batch_llama_3_8B.zip
What happened?
When trying to compile this mlir file, I get the shared memory error below:
Steps to reproduce your issue
What component(s) does this issue relate to?
No response
Version information
f2746b464fb056ddadef4315654d59f727e4c9b0
Additional context
No response