nod-ai / SHARK-ModelDev

Unified compiler/runtime for interfacing with PyTorch Dynamo.
Apache License 2.0
95 stars, 48 forks

Memory issue? #878

Open pdhirajkumarprasad opened 2 weeks ago

pdhirajkumarprasad commented 2 weeks ago

For the attached IR, iree-compile crashes with the following assertion failure:

iree-compile: /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/mlir/lib/Transforms/Utils/DialectConversion.cpp:2868: SmallVector<mlir::Value> mlir::TypeConverter::materializeTargetConversion(mlir::OpBuilder &, mlir::Location, mlir::TypeRange, mlir::ValueRange, mlir::Type) const: Assertion `TypeRange(ValueRange(result)) == resultTypes && "callback produced incorrect number of values or values with " "incorrect types"' failed.
Please report issues to https://github.com/iree-org/iree/issues and include the crash backtrace.
Stack dump:
0.  Program arguments: iree-compile --iree-hal-target-backends=llvm-cpu model.torch_onnx.mlir -o abc.vmfb
 #0 0x00007f46bb61a9b7 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/llvm/lib/Support/Unix/Signals.inc:723:13
 #1 0x00007f46bb618bf0 llvm::sys::RunSignalHandlers() /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/llvm/lib/Support/Signals.cpp:106:18
 #2 0x00007f46bb61b07a SignalHandler(int) /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/llvm/lib/Support/Unix/Signals.inc:413:1
 #3 0x00007f46b5855520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x00007f46b58a99fc __pthread_kill_implementation ./nptl/./nptl/pthread_kill.c:44:76
 #5 0x00007f46b58a99fc __pthread_kill_internal ./nptl/./nptl/pthread_kill.c:78:10
 #6 0x00007f46b58a99fc pthread_kill ./nptl/./nptl/pthread_kill.c:89:10
 #7 0x00007f46b5855476 gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #8 0x00007f46b583b7f3 abort ./stdlib/./stdlib/abort.c:81:7
 #9 0x00007f46b583b71b _nl_load_domain ./intl/./intl/loadmsgcat.c:1177:9
#10 0x00007f46b584ce96 (/lib/x86_64-linux-gnu/libc.so.6+0x39e96)
#11 0x00007f46bfb1978f (/proj/xhdhdstaff6/dhirajp/localBuild/iree-build/lib/libIREECompiler.so+0x9eb278f)
#12 0x00007f46bfb19596 llvm::SmallVectorBase<unsigned int>::empty() const /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/llvm/include/llvm/ADT/SmallVector.h:81:46
#13 0x00007f46bfb19596 mlir::TypeConverter::materializeTargetConversion(mlir::OpBuilder&, mlir::Location, mlir::Type, mlir::ValueRange, mlir::Type) const /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/mlir/lib/Transforms/Utils/DialectConversion.cpp:2851:14
#14 0x00007f46bfb16d6d legalizeUnresolvedMaterialization(mlir::RewriterBase&, (anonymous namespace)::UnresolvedMaterializationRewrite*) /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/mlir/lib/Transforms/Utils/DialectConversion.cpp:0:0
#15 0x00007f46bfb16d6d mlir::OperationConverter::convertOperations(llvm::ArrayRef<mlir::Operation*>) /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/mlir/lib/Transforms/Utils/DialectConversion.cpp:2528:18
#16 0x00007f46bfb1ca5b mlir::applyPartialConversion(llvm::ArrayRef<mlir::Operation*>, mlir::ConversionTarget const&, mlir::FrozenRewritePatternSet const&, mlir::ConversionConfig) /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/mlir/lib/Transforms/Utils/DialectConversion.cpp:3258:22
#17 0x00007f46bfb1ca5b mlir::applyPartialConversion(mlir::Operation*, mlir::ConversionTarget const&, mlir::FrozenRewritePatternSet const&, mlir::ConversionConfig) /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/mlir/lib/Transforms/Utils/DialectConversion.cpp:3264:10
#18 0x00007f46bd033bb1 mlir::iree_compiler::IREE::VM::ConversionPass::runOnOperation() /proj/xhdhdstaff6/dhirajp/localBuild/iree/compiler/src/iree/compiler/Dialect/VM/Transforms/Conversion.cpp:168:16
#19 0x00007f46bb80a835 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int)::$_7::operator()() const /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/mlir/lib/Pass/Pass.cpp:0:17
#20 0x00007f46bb80a835 void llvm::function_ref<void ()>::callback_fn<mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int)::$_7>(long) /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:45:12
#21 0x00007f46bb80a835 llvm::function_ref<void ()>::operator()() const /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68:12
#22 0x00007f46bb80a835 void mlir::MLIRContext::executeAction<mlir::PassExecutionAction, mlir::Pass&>(llvm::function_ref<void ()>, llvm::ArrayRef<mlir::IRUnit>, mlir::Pass&) /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/mlir/include/mlir/IR/MLIRContext.h:280:7
#23 0x00007f46bb80a835 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/mlir/lib/Pass/Pass.cpp:520:21
#24 0x00007f46bb80afa8 llvm::LogicalResult::failed() const /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/llvm/include/llvm/Support/LogicalResult.h:43:43
#25 0x00007f46bb80afa8 llvm::failed(llvm::LogicalResult) /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/llvm/include/llvm/Support/LogicalResult.h:71:58
#26 0x00007f46bb80afa8 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/mlir/lib/Pass/Pass.cpp:592:9
#27 0x00007f46bb80d2f9 mlir::PassManager::run(mlir::Operation*) /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/mlir/lib/Pass/Pass.cpp:0:0
#28 0x00007f46bb56cd60 llvm::LogicalResult::failed() const /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/llvm/include/llvm/Support/LogicalResult.h:43:43
#29 0x00007f46bb56cd60 llvm::failed(llvm::LogicalResult) /proj/xhdhdstaff6/dhirajp/localBuild/iree/third_party/llvm-project/llvm/include/llvm/Support/LogicalResult.h:71:58
#30 0x00007f46bb56cd60 mlir::iree_compiler::embed::(anonymous namespace)::Invocation::runPipeline(iree_compiler_pipeline_t) /proj/xhdhdstaff6/dhirajp/localBuild/iree/compiler/src/iree/compiler/API/Internal/CompilerDriver.cpp:1008:7
#31 0x00007f46bb56cd60 ireeCompilerInvocationPipeline /proj/xhdhdstaff6/dhirajp/localBuild/iree/compiler/src/iree/compiler/API/Internal/CompilerDriver.cpp:1447:23
#32 0x00007f46bb796528 mlir::iree_compiler::runIreecMain(int, char**)::$_2::operator()(iree_compiler_source_t*) const /proj/xhdhdstaff6/dhirajp/localBuild/iree/compiler/src/iree/compiler/Tools/iree_compile_lib.cc:254:11
#33 0x00007f46bb795d61 mlir::iree_compiler::runIreecMain(int, char**) /proj/xhdhdstaff6/dhirajp/localBuild/iree/compiler/src/iree/compiler/Tools/iree_compile_lib.cc:0:10
#34 0x00007f46b583cd90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#35 0x00007f46b583ce40 call_init ./csu/../csu/libc-start.c:128:20
#36 0x00007f46b583ce40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#37 0x0000557b30e026d5 _start (/proj/xhdhdstaff6/dhirajp/localBuild/iree-build/tools/iree-compile+0x16d5)
Abort (core dumped)

1> The crash is intermittent: if we run the command 10 times, it works fine 2/3 times, while the rest of the time it produces the stack trace above.
2> The IR contains a few commented-out lines that are not needed. If I delete them, compilation works fine most of the time.

Command: iree-compile --iree-hal-target-backends=llvm-cpu model.torch_onnx.mlir -o abc.vmfb

Attached IR: tt.mlir.txt
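
Since the failure is intermittent, it can help to measure how often it occurs before and after a change to the IR. The following is a minimal sketch, not part of the original report: the helper name `count_outcomes` is hypothetical, and it only assumes that the command exits with 0 on success and is killed by a signal (such as SIGABRT from a failed assertion) on a crash.

```python
import subprocess
import sys


def count_outcomes(cmd, runs=10):
    """Run `cmd` `runs` times and tally how each run ended.

    Hypothetical helper for illustration: it classifies a run as "ok"
    (exit code 0), "signal" (killed by a signal, e.g. SIGABRT after an
    assertion failure; POSIX only) or "error" (any other non-zero exit).
    """
    tally = {"ok": 0, "error": 0, "signal": 0}
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if proc.returncode == 0:
            tally["ok"] += 1
        elif proc.returncode < 0:
            # On POSIX, subprocess reports death by signal N as -N.
            tally["signal"] += 1
        else:
            tally["error"] += 1
    return tally
```

For the command in this report, the call would be `count_outcomes(["iree-compile", "--iree-hal-target-backends=llvm-cpu", "model.torch_onnx.mlir", "-o", "abc.vmfb"], runs=10)`. Running it once on the original IR and once with the commented-out lines removed gives comparable failure counts.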

AmosLewis commented 2 weeks ago

iree-compile --iree-hal-target-backends=llvm-cpu model.mlir -o model.vmfb --dump-compilation-phases-to=./tmp/

In the dumped phases output, the HAL phase is generated successfully. The error occurs while lowering from HAL to VM:

module {
^
<unknown>:0: error: failed to legalize unresolved materialization from ('i64') to ('index') that remained live after conversion
<unknown>:0: note: see current operation: %18 = "builtin.unrealized_conversion_cast"(%17) : (i64) -> index
model.mlir:865:12: note: see existing live user here: %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %19, %0, %1
    %867 = torch.operator "onnx.Add"(%866, %813) : (!torch.vtensor<[?,256,768],f32>, !torch.vtensor<[1,256,768],f32>) -> !torch.vtensor<[?,256,768],f32> 
           ^
model.mlir:1:1: error: conversion to vm.module failed
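
When comparing logs from several runs, the type pair in the "unresolved materialization" diagnostic is the part worth extracting. The sketch below is an illustration only: the function name `parse_materialization_errors` is hypothetical, and the regular expression is written against the message format shown above, so it assumes simple type names without nested parentheses or commas.

```python
import re

# Matches the diagnostic text as it appears in the log above.
PATTERN = re.compile(
    r"failed to legalize unresolved materialization "
    r"from \((?P<src>[^)]*)\) to \((?P<dst>[^)]*)\)"
)


def _split_types(raw):
    """Split "'i64', 'index'" into ["i64", "index"] (simple types only)."""
    return [t.strip().strip("'") for t in raw.split(",") if t.strip()]


def parse_materialization_errors(log_text):
    """Return a list of (source_types, target_types) pairs found in the log."""
    return [
        (_split_types(m.group("src")), _split_types(m.group("dst")))
        for m in PATTERN.finditer(log_text)
    ]
```

Applied to the output above, this yields a single pair with source type `i64` and target type `index`, matching the `builtin.unrealized_conversion_cast` shown in the note.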

The VM module is created successfully if I delete all the code after this line:

%866 = torch.operator "onnx.Add"(%838, %865) : (!torch.vtensor<[?,256,768],f32>, !torch.vtensor<[?,256,768],f32>) -> !torch.vtensor<[?,256,768],f32>

AmosLewis commented 2 weeks ago

Here is the smallest reproducer:

iree-compile --iree-hal-target-backends=llvm-cpu model.mlir -o model.vmfb --dump-compilation-phases-to=./tmp/

module {
  func.func @tf2onnx(%arg0: !torch.vtensor<[?,768],f32>, %arg1: !torch.vtensor<[3],si64>, %arg2: !torch.vtensor<[?,256,768],f32>) -> ( !torch.vtensor<[?,256,768],f32>) attributes {torch.onnx_meta.ir_version = 7 : si64, torch.onnx_meta.opset_version = 21 : si64, torch.onnx_meta.producer_name = "tf2onnx", torch.onnx_meta.producer_version = "1.5.2"} {
    %reshape = torch.operator "onnx.Reshape"(%arg0, %arg1) : (!torch.vtensor<[?,768],f32>, !torch.vtensor<[3],si64>) -> !torch.vtensor<[?,256,768],f32> 
    %866 = torch.operator "onnx.Add"(%reshape, %arg2) : (!torch.vtensor<[?,256,768],f32>, !torch.vtensor<[?,256,768],f32>) -> !torch.vtensor<[?,256,768],f32> 
    return %866 :  !torch.vtensor<[?,256,768],f32>
  }
}