iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

[VULKAN] Numerical error (zeros) on loading and operating on large int8 constants #14675

Open PhaneeshB opened 1 year ago

PhaneeshB commented 1 year ago

What happened?

Context: this is a minimal repro from compiling the llama2 model (7B params, int8) for Vulkan on an NVIDIA A100 40G GPU (Ubuntu 22.04). Compilation proceeds to creation of the vmfb without any errors, but executing the vmfb produces zeros and NaNs. This repro is the first occurrence of zeros in the MLIR.

the function:

func.func @first_vicuna_forward(%arg0: tensor<1x?xi64>) -> (tensor<4096x32x128xf32>, tensor<4096x32x128xi8>, tensor<4096x32x1xf32>) {
    %c1 = ml_program.global_load_const @c1 : index
    %cst_670 = ml_program.global_load_const @cst_670 : tensor<4096x4096xi8>
    %cst_668 = ml_program.global_load_const @cst_668 : tensor<4096x32x1xf32>
    %cst_669 = ml_program.global_load_const @cst_669 : tensor<4096x32x1xf32>
    %cst_741 = ml_program.global_load_const @cst_741 : f32
    %dim = tensor.dim %arg0, %c1 : tensor<1x?xi64>
    %expanded_752 = tensor.expand_shape %cst_670 [[0], [1, 2]] : tensor<4096x4096xi8> into tensor<4096x32x128xi8>
    %39 = tensor.empty() : tensor<4096x32x128xf32>  
    %40 = tensor.empty(%dim) : tensor<1x?x4096xf32>
    %41 = linalg.fill ins(%cst_741 : f32) outs(%40 : tensor<1x?x4096xf32>) -> tensor<1x?x4096xf32>
    %42 = linalg.generic {indexing_maps = [#map2_0, #map12_0, #map2_0], iterator_types = ["parallel", "parallel", "parallel"]} ins(%expanded_752, %cst_668 : tensor<4096x32x128xi8>, tensor<4096x32x1xf32>) outs(%39 : tensor<4096x32x128xf32>) {
    ^bb0(%in: i8, %in_1652: f32, %out: f32):
      %2150 = arith.extui %in : i8 to i32
      %2151 = arith.uitofp %2150 : i32 to f32
      %2152 = arith.subf %2151, %in_1652 : f32
      linalg.yield %2152 : f32
    } -> tensor<4096x32x128xf32>
    return %42, %expanded_752, %cst_668: tensor<4096x32x128xf32>, tensor<4096x32x128xi8>, tensor<4096x32x1xf32>
  }

The result in %42 is what comes out as zeros. I've validated that %2150 has non-zero values and that %2151 != %in_1652.
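For reference, the per-element computation in the `linalg.generic` body can be sketched in plain Python. This is only an illustration of the `arith.extui` / `arith.uitofp` / `arith.subf` semantics, useful for producing expected values to compare against the Vulkan output. The helper names (`to_f32`, `extui_i8_to_i32`, `dequant_element`) are hypothetical and not part of IREE or MLIR.

```python
import struct


def to_f32(value: float) -> float:
    """Round a Python float (f64) to the nearest f32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def extui_i8_to_i32(byte_value: int) -> int:
    """Zero-extend an 8-bit pattern, as arith.extui does.

    The input may be given as a signed (-128..127) or unsigned (0..255)
    integer; only the low 8 bits are kept.
    """
    return byte_value & 0xFF


def dequant_element(in_i8: int, in_f32: float) -> float:
    """Mirror the region body: extui -> uitofp -> subf, all in f32."""
    widened = extui_i8_to_i32(in_i8)      # %2150
    as_float = to_f32(float(widened))     # %2151
    return to_f32(as_float - to_f32(in_f32))  # %2152


if __name__ == "__main__":
    # The bit pattern 0xFF is 255 after zero extension, not -1.
    print(dequant_element(-1, 128.0))
```

With these semantics, an output element is zero only when the zero-extended byte equals the corresponding f32 operand, so an all-zero result tensor would not be expected from non-trivial inputs.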


With a debug build, the following Vulkan validation errors are shown (although compilation to vmfb is successful):

[VULKAN] ! Validation Error: [ VUID-VkShaderModuleCreateInfo-pCode-08740 ] 
Object 0: handle = 0x55836e675340, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0x6e224e9 | 
vkCreateShaderModule(): The SPIR-V Capability (StorageBuffer8BitAccess) was declared, but 
none of the requirements were met to use it. The Vulkan spec states: If pname:codeType is 
ename:VK_SHADER_CODE_TYPE_SPIRV_EXT, and pCode declares any of the capabilities listed in 
the SPIR-V Environment appendix, one of the corresponding requirements must be satisfied 
(https://vulkan.lunarg.com/doc/view/1.3.250.1/linux/1.3-extensions/vkspec.html#VUID-VkShaderModuleCreateInfo-pCode-08740)

[VULKAN] ! Validation Error: [ VUID-RuntimeSpirv-storageBuffer8BitAccess-06328 ] 
Object 0: handle = 0xcad092000000000d, type = VK_OBJECT_TYPE_SHADER_MODULE; |
 MessageID = 0xbbd18a7e | vkCreateShaderModule(): storageBuffer8BitAccess is not enabled, but shader contains an 8-bit OpVariable with StorageBuffer Storage Class.
%11 = OpVariable %6 12 The Vulkan spec states: If storageBuffer8BitAccess is VK_FALSE, then objects containing an 8-bit integer element must not have {StorageClass} of StorageBuffer, ShaderRecordBufferKHR, or PhysicalStorageBuffer (https://vulkan.lunarg.com/doc/view/1.3.250.1/linux/1.3-extensions/vkspec.html#VUID-RuntimeSpirv-storageBuffer8BitAccess-06328)

[VULKAN] ! Validation Error: [ VUID-vkCmdDispatch-groupCountY-00387 ] Object 0: handle = 0x55836e576a20, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0xfb655004 | vkCmdDispatch(): groupCountY (131072) exceeds device limit maxComputeWorkGroupCount[1] (65535). The Vulkan spec states: groupCountY must be less than or equal to VkPhysicalDeviceLimits::maxComputeWorkGroupCount[1] (https://vulkan.lunarg.com/doc/view/1.3.250.1/linux/1.3-extensions/vkspec.html#VUID-vkCmdDispatch-groupCountY-00387)

llama2-debug-a100-vulkaninfo.txt

Steps to reproduce your issue

mlir

compile command :

./iree-compile \
--iree-input-type=tm_tensor \
--iree-vm-bytecode-module-output-format=flatbuffer-binary \
--iree-hal-target-backends=vulkan --mlir-print-debuginfo \
--mlir-print-op-on-diagnostic=false \
--iree-llvmcpu-target-cpu-features=host \
--iree-vulkan-target-env="#vk.target_env<v1.3, r(120), [VK_KHR_16bit_storage, VK_KHR_8bit_storage, VK_KHR_shader_float16_int8, VK_KHR_spirv_1_4, VK_KHR_storage_buffer_storage_class, VK_KHR_variable_pointers, VK_EXT_subgroup_size_control, VK_NV_cooperative_matrix], NVIDIA:DiscreteGPU, #vk.caps< maxComputeSharedMemorySize = 49152, maxComputeWorkGroupInvocations = 1024, maxComputeWorkGroupSize = dense<[1024, 1024, 1024]>: vector<3xi32>, subgroupSize = 32, subgroupFeatures = 255: i32, minSubgroupSize = 32, maxSubgroupSize = 32, shaderFloat16 = unit, shaderFloat64 = unit, shaderInt8 = unit, shaderInt16 = unit, shaderInt64 = unit, storageBuffer16BitAccess = unit, storagePushConstant16 = unit, uniformAndStorageBuffer16BitAccess = unit, storageBuffer8BitAccess = unit, storagePushConstant8 = unit, uniformAndStorageBuffer8BitAccess = unit, variablePointers = unit, variablePointersStorageBuffer = unit, cooperativeMatrixPropertiesNV = [#vk.coop_matrix_props<mSize = 8, nSize = 8, kSize = 32, aType = i8, bType = i8, cType = i32, resultType = i32, scope = #vk.scope<Subgroup>>, #vk.coop_matrix_props<mSize = 16, nSize = 16, kSize = 16, aType = f16, bType = f16, cType = f16, resultType = f16, scope = #vk.scope<Subgroup>>, #vk.coop_matrix_props<mSize = 16, nSize = 16, kSize = 16, aType = f16, bType = f16, cType = f32, resultType = f32, scope = #vk.scope<Subgroup>>], shaderIntegerDotProduct = unit >>" \
--iree-stream-resource-index-bits=64 \
--iree-vm-target-index-bits=64 \
--iree-vm-bytecode-module-strip-source-map=true \
--iree-util-zero-fill-elided-attrs \
--iree-vm-target-truncate-unsupported-floats \
--iree-codegen-check-ir-before-llvm-conversion=false \
--iree-opt-const-expr-hoisting=False \
--iree-flow-dump-dispatch-graph=1 \
--iree-codegen-linalg-max-constant-fold-elements=9223372036854775807 ./llama2_7b_int8_new_cutfv_upgrade_sub.mlir -o ./op_llama2_7b_int8.vmfb

run command:

iree-run-module --module=./op_llama2_7b_int8.vmfb --device=vulkan --function=first_vicuna_forward --input="1x16xi64=1" > ./outputs/op_llama2_7b_int8.txt

What component(s) does this issue relate to?

No response

Version information

No response

Additional context

No response

PhaneeshB commented 1 year ago

From some info on maxComputeWorkGroupCount link

In an attempt to debug the Validation Error [ VUID-vkCmdDispatch-groupCountY-00387 ], I tried adding the capability maxComputeWorkGroupCount = dense<[2147483647, 65535, 65535]>: vector<3xi32> to the target env string. iree-compile then crashes with the following stack dump (possibly because maxComputeWorkGroupCount is not an expected entry in the target env string):

Stack dump:
0.      Program arguments: ../IREE/iree-build/tools/iree-compile --iree-input-type=tm_tensor --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvmcpu-target-cpu-features=host "--iree-vulkan-target-env=#vk.target_env<v1.3, r(120), [VK_KHR_16bit_storage, VK_KHR_8bit_storage, VK_KHR_shader_float16_int8, VK_KHR_spirv_1_4, VK_KHR_storage_buffer_storage_class, VK_KHR_variable_pointers, VK_EXT_subgroup_size_control, VK_NV_cooperative_matrix], NVIDIA:DiscreteGPU, #vk.caps< maxComputeSharedMemorySize = 49152, maxComputeWorkGroupInvocations = 1024, maxComputeWorkGroupSize = dense<[1024, 1024, 1024]>: vector<3xi32>, maxComputeWorkGroupCount = dense<[2147483647, 65535, 65535]>: vector<3xi32>, subgroupSize = 32, subgroupFeatures = 255: i32, minSubgroupSize = 32, maxSubgroupSize = 32, shaderFloat16 = unit, shaderFloat64 = unit, shaderInt8 = unit, shaderInt16 = unit, shaderInt64 = unit, storageBuffer16BitAccess = unit, storagePushConstant16 = unit, uniformAndStorageBuffer16BitAccess = unit, storageBuffer8BitAccess = unit, storagePushConstant8 = unit, uniformAndStorageBuffer8BitAccess = unit, variablePointers = unit, variablePointersStorageBuffer = unit, cooperativeMatrixPropertiesNV = [#vk.coop_matrix_props<mSize = 8, nSize = 8, kSize = 32, aType = i8, bType = i8, cType = i32, resultType = i32, scope = #vk.scope<Subgroup>>, #vk.coop_matrix_props<mSize = 16, nSize = 16, kSize = 16, aType = f16, bType = f16, cType = f16, resultType = f16, scope = #vk.scope<Subgroup>>, #vk.coop_matrix_props<mSize = 16, nSize = 16, kSize = 16, aType = f16, bType = f16, cType = f32, resultType = f32, scope = #vk.scope<Subgroup>>], shaderIntegerDotProduct = unit >>" --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-vm-bytecode-module-strip-source-map=true --iree-util-zero-fill-elided-attrs --iree-vm-target-truncate-unsupported-floats 
--iree-codegen-check-ir-before-llvm-conversion=false --iree-opt-const-expr-hoisting=False --iree-flow-dump-dispatch-graph=1 --iree-codegen-linalg-max-constant-fold-elements=9223372036854775807 /home/phaneesh/SHARK/llama2_7b_int8_new_cutfv_upgrade_sub.mlir -o ./vmfbs/cut_at42_upsub_maxworkgroup_llama2_7b_int8.vmfb
 #0 0x00007f902ea3d6f7 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/phaneesh/IREE/iree/third_party/llvm-project/llvm/lib/Support/Unix/Signals.inc:602:13
 #1 0x00007f902ea3bb00 llvm::sys::RunSignalHandlers() /home/phaneesh/IREE/iree/third_party/llvm-project/llvm/lib/Support/Signals.cpp:105:18
 #2 0x00007f902ea3dd8a SignalHandler(int) /home/phaneesh/IREE/iree/third_party/llvm-project/llvm/lib/Support/Unix/Signals.inc:413:1
 #3 0x00007f9029642520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x00007f9029696a7c pthread_kill (/lib/x86_64-linux-gnu/libc.so.6+0x96a7c)
 #5 0x00007f9029642476 gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x42476)
 #6 0x00007f90296287f3 abort (/lib/x86_64-linux-gnu/libc.so.6+0x287f3)
 #7 0x00007f902962871b (/lib/x86_64-linux-gnu/libc.so.6+0x2871b)
 #8 0x00007f9029639e96 (/lib/x86_64-linux-gnu/libc.so.6+0x39e96)
 #9 0x00007f902ea84cb4 mlir::NamedAttribute::NamedAttribute(mlir::StringAttr, mlir::Attribute) /home/phaneesh/IREE/iree/third_party/llvm-project/mlir/lib/IR/Attributes.cpp:46:3
#10 0x00007f902fbd355c mlir::NamedAttribute& llvm::SmallVectorImpl<mlir::NamedAttribute>::emplace_back<mlir::StringAttr, mlir::spirv::TargetEnvAttr&>(mlir::StringAttr&&, mlir::spirv::TargetEnvAttr&) /home/phaneesh/IREE/iree/third_party/llvm-project/llvm/include/llvm/ADT/SmallVector.h:0:0
#11 0x00007f902fbd355c mlir::iree_compiler::IREE::HAL::VulkanSPIRVTargetBackend::getExecutableTarget(mlir::MLIRContext*, mlir::spirv::TargetEnvAttr) const /home/phaneesh/IREE/iree/compiler/src/iree/compiler/Dialect/HAL/Target/VulkanSPIRV/VulkanSPIRVTarget.cpp:307:17
#12 0x00007f902fbd33a8 mlir::iree_compiler::IREE::HAL::VulkanSPIRVTargetBackend::getExecutableTargets(mlir::MLIRContext*) const /home/phaneesh/IREE/iree/compiler/src/iree/compiler/Dialect/HAL/Target/VulkanSPIRV/VulkanSPIRVTarget.cpp:295:27
#13 0x00007f902fbd240f llvm::SmallVectorBase<unsigned int>::size() const /home/phaneesh/IREE/iree/third_party/llvm-project/llvm/include/llvm/ADT/SmallVector.h:91:32
#14 0x00007f902fbd240f mlir::NamedAttribute& llvm::SmallVectorImpl<mlir::NamedAttribute>::emplace_back<mlir::StringAttr, mlir::ArrayAttr>(mlir::StringAttr&&, mlir::ArrayAttr&&) /home/phaneesh/IREE/iree/third_party/llvm-project/llvm/include/llvm/ADT/SmallVector.h:942:9
#15 0x00007f902fbd240f mlir::iree_compiler::IREE::HAL::VulkanSPIRVTargetBackend::getDefaultDeviceTarget(mlir::MLIRContext*) const /home/phaneesh/IREE/iree/compiler/src/iree/compiler/Dialect/HAL/Target/VulkanSPIRV/VulkanSPIRVTarget.cpp:110:17
#16 0x00007f902f9e5a2d mlir::iree_compiler::IREE::HAL::AssignTargetDevicesPass::runOnOperation() /home/phaneesh/IREE/iree/compiler/src/iree/compiler/Dialect/HAL/Transforms/AssignTargetDevices.cpp:101:26
#17 0x00007f902ebc95a5 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int)::$_7::operator()() const /home/phaneesh/IREE/iree/third_party/llvm-project/mlir/lib/Pass/Pass.cpp:0:17
#18 0x00007f902ebc95a5 void llvm::function_ref<void ()>::callback_fn<mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int)::$_7>(long) /home/phaneesh/IREE/iree/third_party/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:45:12
#19 0x00007f902ebc95a5 llvm::function_ref<void ()>::operator()() const /home/phaneesh/IREE/iree/third_party/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68:12
#20 0x00007f902ebc95a5 void mlir::MLIRContext::executeAction<mlir::PassExecutionAction, mlir::Pass&>(llvm::function_ref<void ()>, llvm::ArrayRef<mlir::IRUnit>, mlir::Pass&) /home/phaneesh/IREE/iree/third_party/llvm-project/mlir/include/mlir/IR/MLIRContext.h:275:7
#21 0x00007f902ebc95a5 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) /home/phaneesh/IREE/iree/third_party/llvm-project/mlir/lib/Pass/Pass.cpp:479:21
#22 0x00007f902ebc9d28 mlir::LogicalResult::failed() const /home/phaneesh/IREE/iree/third_party/llvm-project/mlir/include/mlir/Support/LogicalResult.h:44:33
#23 0x00007f902ebc9d28 mlir::failed(mlir::LogicalResult) /home/phaneesh/IREE/iree/third_party/llvm-project/mlir/include/mlir/Support/LogicalResult.h:72:58
#24 0x00007f902ebc9d28 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) /home/phaneesh/IREE/iree/third_party/llvm-project/mlir/lib/Pass/Pass.cpp:551:9
#25 0x00007f902ebcc09b mlir::PassManager::run(mlir::Operation*) /home/phaneesh/IREE/iree/third_party/llvm-project/mlir/lib/Pass/Pass.cpp:0:0
#26 0x00007f902e999d12 mlir::LogicalResult::failed() const /home/phaneesh/IREE/iree/third_party/llvm-project/mlir/include/mlir/Support/LogicalResult.h:44:33
#27 0x00007f902e999d12 mlir::failed(mlir::LogicalResult) /home/phaneesh/IREE/iree/third_party/llvm-project/mlir/include/mlir/Support/LogicalResult.h:72:58
#28 0x00007f902e999d12 mlir::iree_compiler::embed::(anonymous namespace)::Invocation::runPipeline(iree_compiler_pipeline_t) /home/phaneesh/IREE/iree/compiler/src/iree/compiler/API/Internal/Embed.cpp:788:7
#29 0x00007f902e999d12 ireeCompilerInvocationPipeline /home/phaneesh/IREE/iree/compiler/src/iree/compiler/API/Internal/Embed.cpp:1216:23
#30 0x00007f902eb95131 mlir::iree_compiler::runIreecMain(int, char**)::$_4::operator()(iree_compiler_source_t*) const /home/phaneesh/IREE/iree/compiler/src/iree/compiler/Tools/iree_compile_lib.cc:215:11
#31 0x00007f902eb94af1 mlir::iree_compiler::runIreecMain(int, char**) /home/phaneesh/IREE/iree/compiler/src/iree/compiler/Tools/iree_compile_lib.cc:0:10
#32 0x00007f9029629d90 (/lib/x86_64-linux-gnu/libc.so.6+0x29d90)
#33 0x00007f9029629e40 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e40)
#34 0x0000558f3cd756c5 _start (../IREE/iree-build/tools/iree-compile+0x16c5)

benvanik commented 1 year ago

This is an issue with our distribution using tensor dimensions without any indirection. There are some things we can do if this is a fundamental limitation (i.e., the workgroup count xyz can't cover all of the workgroups we need), but I suspect this is the issue we've had before where we're just taking dim 1 and shoving it into workgroup count y.
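As a rough illustration of what such an indirection could look like, the sketch below folds a linear workgroup total into an (x, y, z) triple that respects per-dimension limits. It is a generic technique, not IREE's actual distribution logic, and `fold_workgroup_count` is a hypothetical helper. The limits used are the `maxComputeWorkGroupCount` values from the target env string in the stack dump above, and 131072 is the `groupCountY` from the validation error.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive operands."""
    return -(-a // b)


def fold_workgroup_count(total: int, limits: tuple) -> tuple:
    """Split `total` workgroups into (x, y, z), each within `limits`.

    The product x * y * z may exceed `total`, so a kernel using this
    scheme would have to recompute its linear id and exit early when
    the id is >= total.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    max_x, max_y, max_z = limits
    x = min(total, max_x)
    y = min(ceil_div(total, x), max_y)
    z = ceil_div(total, x * y)
    if z > max_z:
        raise ValueError("total does not fit within the device limits")
    return x, y, z


if __name__ == "__main__":
    # 131072 workgroups do not fit in y (limit 65535) but do fit in x.
    print(fold_workgroup_count(131072, (2147483647, 65535, 65535)))
```

Inside the shader, the linear id would be recovered as `x_id + x * (y_id + y * z_id)`, which is the extra level of indirection between tensor dimensions and dispatch dimensions.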