miniLM-Torch fails to run Vulkan due to failure in signless i64 casting to i32

raikonenfnu commented 2 years ago

miniLM-Torch fails to run on Vulkan backend. It breaks during benchmark-module due to an assertion of value throwing out:

iree-build-latest/tools/iree-compile --iree-input-type=none --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --iree-llvm-embedded-linker-path=/iree-build-latest/install/bin/lld --iree-mlir-to-vm-bytecode-module --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvm-target-cpu-features=host -iree-vulkan-target-triple=m1-moltenvk-macos --iree-flow-demote-i64-to-i32  -iree-flow-demote-f64-to-f32 --iree-vulkan-target-triple=ampere-rtx3080-linux --mlir-print-ir-after-all --mlir-print-ir-after-change -mlir-elide-elementsattrs-if-larger=4 minilm_torch.mlir -o /tmp/minilm_torch.vmfb 2> /tmp/minilm_pass.mlir

iree-benchmark-module --driver=vulkan --entry_function=forward --function_input=1x128xi64 --module_file=/tmp/minilm_torch.vmfb
I iree/runtime/src/iree/tools/utils/vm_util.cc:154] Creating driver and device for 'vulkan'...
2022-05-27T22:50:02+00:00
Running /nodclouddata/iree-build-host/tools/iree-benchmark-module
Run on (12 X 2200.15 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1024 KiB (x6)
  L3 Unified 39424 KiB (x1)
Load Average: 1.20, 0.84, 0.38
<vm>:0: FAILED_PRECONDITION; mismatched size for broadcast;
[ 0] bytecode module.forward:3798 /home/minilm_torch.mlir:512:12
      at /home/minilm_torch.mlir:23:3
[1]    6864 abort (core dumped)

For additional context: -IR source -This bug also happens in cuda backend if I add the --iree-flow-demote-i64-to-i32 pass. (passes otherwise) -This error do not happen if I do not add the --iree-flow-demote-i64-to-i32 on vulkan backend.(It does run and show gpu utility, but it throws off other Vulkan Validation Errors at the very start of the benchmark-module call).

Hypothesis This leads me to believe that the error happens due to the pass, and specifically due to the casting failure for signless. Here is a sample of before and after the demotion. Before:

After:

It seems like that the problem is the demotion pass only looks at the Source Type(i64-signless) and Target Type (i32-signless). Hence it does not know how to cast the maximum value of int64 into int32(can be 2.1billion for signed and 4.2 billion for unsigned).

Potential Solution I am thinking we should add op-aware casting to the demotion/promotion. a.k.a Pass should check if the parent op is a signed or unsigned Op and cast appropriately. If Op has not specified signedness, then cast default like how it is done right now. Any suggestions on how to add this or other ideas to fix this is very welcomed! :)

raikonenfnu commented 2 years ago

EDIT: -Started writing a pattern to add into ConvertTypesPass that target typeSenstivie cmpiOp. However, did find another pattern that we might need to add.

raikonenfnu commented 2 years ago

I was able to fix these issues with a pass that checks if an op is signed, if it is I will pre-demote it to the appropriate value while keeping the type to the sourceType. Then feed this into the demotion pass. Please let me know if this is the right place and way to fix? :)

https://github.com/raikonenfnu/iree/commit/22e21da996c61a36eb99b2b968c633128d2ae113

antiagainst commented 2 years ago

Great findings! Yes it's the operation itself decides the signedness interpretation of its operands---i32/i64 is just a bag of 32/64 bits. We need to be aware of this to decide how to demote them. Though instead of creating a new pass, I think we can directly enhance the current ConvertPrimitiveType pass to account for such cases. We have existing patterns to convert ConvertTypeSensitiveArithCastOp, I think you can add something similar there by defining a template class and iterate over all constant operands and convert them.

raikonenfnu commented 2 years ago

Hey @antiagainst thanks for the input! I did have that in previous iteration and added the pattern to ConvertTypePass, but for some reason if I put the pattern there, the inputs that it receive would’ve been already demoted. I tried adding the new pattern at the very start and also tried putting the insert pattern as the very last pattern, but same issue persist. Any guidance on this would be appreciated :)

antiagainst commented 2 years ago

Yeah that's expected given this is using dialect conversion which involves type conversion. For such cases, ops convert from the top to the bottom---defining ops are converted before user ops. With the current implementation, we have the constant op already converted when looking at its user op. Though you are not forced to use the converted/demoted constant op (coming from OpAdaptor), you can entirely create new constant ops by looking at the original operand and demote that separately. Let me know if this makes sense.

raikonenfnu commented 2 years ago

I see, so we can create new constant op from the original operand and use that for the op instead(I think this is similar to how the patch I have is doing it). Would you happen to know how to obtain the original operand Value using the rewriter/OpConversion pattern within ConversionTypePass.

antiagainst commented 2 years ago

For example, for the signature here, you can do op.getOperands() to get the original operand values; not adaptor.getOperands().

raikonenfnu commented 2 years ago

Ohhhhh, that's awesome! will try this out! thanks @antiagainst :)

raikonenfnu commented 2 years ago

@antiagainst I have a new design for the patch that is a bit more conservative (i.e only target CmpIOp and SelectOp) and is part of ConvertPassType. Let me know what you think :) https://github.com/google/iree/compare/main...raikonenfnu:UpDemote

Although the code does look more lean if we just target any ops, and replace the operand without replacing the op. as seen here: https://github.com/nod-ai/SHARK-Runtime/pull/6/files#diff-6569382d386fb30c2578b42eda70505cc19df9de73c862f21b7aeb57a0375bfbR309-R326

powderluv commented 2 years ago

Is the recommendation here to make sure the we can map to i64 on the M1 and not worry about the demote ?

antiagainst commented 2 years ago

@powderluv: I think we should pursue all of these solutions (as Vulkan is really about supporting hardwares with different capabilities). 1) I've landed necessary fixes to make the model compile if i64 is natively supported. That should help M1. 2) I'll later add i64 emulation via i32. That should help all Vulkan targets not natively supporting i64. 3) @raikonenfnu's fix is good to have given I think it's a bugfix to the demotion passes. Keep all possibilities open and see which one gives you want you want. ;-P

powderluv commented 2 years ago

sounds good. I agree it is good to have this and also fix whatever needs fixing further upstream

antiagainst commented 3 months ago

closing as obsolete -- no actions planned now

iree-org / iree

miniLM-Torch fails to run Vulkan due to failure in signless i64 casting to i32 #9245