iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Seg fault for llama.8b.fp16.mlir model #18353

Closed pdhirajkumarprasad closed 2 weeks ago

pdhirajkumarprasad commented 2 weeks ago

What happened?

The following MLIR file hits a segmentation fault during compilation:

https://raw.githubusercontent.com/nod-ai/llm-dev/main/models/llama.8b/llama.8b.fp16.mlir

Steps to reproduce your issue

wget https://raw.githubusercontent.com/nod-ai/llm-dev/main/models/llama.8b/llama.8b.fp16.mlir

iree-compile --iree-hal-target-backends=rocm --iree-input-demote-i64-to-i32 --iree-hip-target=gfx942 llama.8b.fp16.mlir

What component(s) does this issue relate to?

Compiler

Version information

No response

Additional context

No response

nirvedhmeshram commented 2 weeks ago

I think this is an issue with weight parameter handling; from the full crash dump, see this line:

#20 0x00007e0514d070b9 mlir::iree_compiler::IREE::VM::ZIPArchiveWriter::flush(mlir::iree_compiler::FlatbufferBuilder&)
 /home/nmeshram/iree/compiler/src/iree/compiler/Dialect/VM/Target/Bytecode/ArchiveWriter.cpp:661:20

I have seen this when weights were not correctly elided, but in this model's input IR I don't see such an issue. The provided model has these weights:

  util.global private @__auto.token_embd.weight = #stream.parameter.named<"model"::"token_embd.weight"> : tensor<128256x4096xf16>
  util.global private @__auto.blk.0.attn_norm.weight = #stream.parameter.named<"model"::"blk.0.attn_norm.weight"> : tensor<4096xf32>
  util.global private @__auto.blk.0.attn_q.weight = #stream.parameter.named<"model"::"blk.0.attn_q.weight"> : tensor<4096x4096xf16>
...

which I assume will be provided at runtime, yet we hit this crash at compile time. @benvanik @MaheshRavishankar do you see something wrong in the input MLIR?
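
For reference, here is a minimal sketch (invented names and tiny shapes, not taken from the model) of the distinction: a parameter-backed global only records a scope/key that is resolved from a parameter archive at runtime, while an inline-initialized global must carry its bytes in the .mlir so the compiler can serialize them.

  // Hypothetical example, not from the model: resolved at runtime, no bytes needed at compile time.
  util.global private @weight_external = #stream.parameter.named<"model"::"some.weight"> : tensor<4xf16>
  // Inline initializer: the data must be present in the file and is serialized at compile time.
  util.global private @weight_inline = dense<[1.0, 2.0, 3.0, 4.0]> : tensor<4xf32>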

Edit: oh, I see this on line 7 of the provided MLIR:

util.global private @__auto.constant_8192_64_torch.complex64 = dense_resource<__auto.constant_8192_64_torch.complex64> : tensor<8192x64xcomplex<f32>>

Isn't this the same thing as the weights being elided without the proper annotation?
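
For context, a dense_resource initializer is only a key; the bytes live in a dialect_resources section at the end of the .mlir file. Below is a trimmed sketch of the general shape (invented name and a tiny tensor, not the model's actual data). If that blob entry is deleted while the global keeps its dense_resource reference, the IR still parses but there is no data left to serialize.

  // Hypothetical example with a tiny tensor; the model's real blobs are much larger.
  util.global private @example_const = dense_resource<example_blob> : tensor<2xf32>

  // File-level resource section: the "0x08000000" prefix encodes the blob alignment,
  // followed by the raw element bytes (here two f32 values, 1.0 and 2.0).
  {-#
    dialect_resources: {
      builtin: {
        example_blob: "0x080000000000803f00000040"
      }
    }
  #-}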

benvanik commented 2 weeks ago

Same as the issues before - someone manually deleted resources that are required for correct processing of the IR. Is there some workflow people are now doing that involves deleting critical lines of MLIR files?

benvanik commented 2 weeks ago

(We should have guards so we don't crash and instead emit an error, but an error is the best we can do in these cases, so it's worth both having better errors and figuring out what workflow is causing this, as we've had multiple people hit it this week.)

MaheshRavishankar commented 2 weeks ago

@pdhirajkumarprasad this seems like a user error. Please close if this is fixed. I am taking it out of the compilation error tracking project.

pdhirajkumarprasad commented 2 weeks ago

This issue is no longer present in the nightly build.