iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0
2.82k stars 608 forks source link

Bytecode from `iree-import-tflite` causes debug info to be dropped in `translateModuleToLLVMIR` #16781

Open bjacob opened 7 months ago

bjacob commented 7 months ago

This is to continue the investigation in https://github.com/openxla/iree/pull/15756, which is being reverted now.

Summary:

https://github.com/openxla/iree/pull/15756 should have been a 1-line PR, simply adding -gline-tables-only to the clang flags that we use to compile ukernels to bitcode. All it does is generate some debug info in the compiled ukernels bitcode.

This runs into a bug discussed on this comment thread: https://github.com/openxla/iree/pull/15756/files#r1411583898

bjacob commented 7 months ago

@benvanik @ScottTodd @jpienaar

bjacob commented 7 months ago

If at the very start of the serializeExecutable function I dump variantOp, the resulting (text) dump is identical in the bad and good cases.

If I dump the parent mlir::ModuleOp, the only difference is in the namespace string; if I replace the name (sed 's/iree_PoseNet_fp32_tflite_/llvm_module/g') then the ModuleOp is identical.

I think at this point we have narrowed this bug down to something about bitcode. In the bad case, the module as loaded from bitcode is corrupted in a way that somehow causes debug info to be dropped in translateModuleToLLVMIR, but that corruption doesn't survive text-serialization, so we can't see it in any kind of textual IR dump.

bjacob commented 7 months ago

Confirmed with this experiment:

If I process the bad input file with iree-opt now with --emit-bytecode then the problem is still there in the output.

Whereas as noted earlier, without --emit-bytecode the problem goes away.

If I process first without --emit-bytecode (so I get text) and then I process that with --emit-bytecode then the problem is still gone. So the first processing to text IR did resolve some kind of corruption in the original IR. It's not like this MLIR module is consistently showing an issue whenever reencoded back to bytecode.

dan-garvey commented 7 months ago

Fascinating, I would have gotten lost the moment the bug failed to express as text

bjacob commented 7 months ago

Fully traced through BytecodeReader, here is the divergence:

Call stack at point where things diverge:

(gdb) bt
#0  (anonymous namespace)::DialectReader::readVarInt (this=0x7fffffff9eb0, result=@0x7fffffff9ce0: 15) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:993
#1  0x00007fffeac5e811 in (anonymous namespace)::readAttribute (context=0x5555555e90a0, reader=...) at llvm-project/tools/mlir/include/mlir/IR/BuiltinDialectBytecode.cpp.inc:360
#2  0x00007fffeac5e715 in (anonymous namespace)::BuiltinDialectBytecodeInterface::readAttribute (this=0x5555555f6540, reader=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/IR/BuiltinDialectBytecode.cpp:94
#3  0x00007ffff204c5d3 in (anonymous namespace)::AttrTypeReader::parseCustomEntry<mlir::Attribute> (this=0x7fffffffb510, entry=..., reader=..., entryType=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:1328
#4  0x00007ffff204c11f in (anonymous namespace)::AttrTypeReader::resolveEntry<mlir::Attribute> (this=0x7fffffffb510, entries=..., index=0, entryType=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:1236
#5  0x00007ffff204bf21 in (anonymous namespace)::AttrTypeReader::resolveAttribute (this=0x7fffffffb510, index=0) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:824
#6  0x00007ffff204be91 in (anonymous namespace)::AttrTypeReader::parseAttribute (this=0x7fffffffb510, reader=..., result=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:833
#7  0x00007ffff2050e6f in (anonymous namespace)::AttrTypeReader::parseAttribute<mlir::LocationAttr> (this=0x7fffffffb510, reader=..., result=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:859
#8  0x00007ffff204ff29 in mlir::BytecodeReader::Impl::parseAttribute<mlir::LocationAttr> (this=0x7fffffffb4b0, reader=..., result=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:1454
#9  0x00007ffff2046817 in mlir::BytecodeReader::Impl::parseOpWithoutRegions (this=0x7fffffffb4b0, reader=..., readState=..., isIsolatedFromAbove=@0x7fffffffab97: false) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:2273
#10 0x00007ffff2045da7 in mlir::BytecodeReader::Impl::parseRegions (this=0x7fffffffb4b0, regionStack=std::vector of length 1, capacity 1 = {...}, readState=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:2190
#11 0x00007ffff204406b in mlir::BytecodeReader::Impl::parseIRSection (this=0x7fffffffb4b0, sectionData=..., block=0x7fffffffbca8) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:2126
#12 0x00007ffff204211c in mlir::BytecodeReader::Impl::read(mlir::Block*, llvm::function_ref<bool (mlir::Operation*)>) (this=0x7fffffffb4b0, block=0x7fffffffbca8, lazyOpsCallback=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:1741
#13 0x00007ffff2047ddf in readBytecodeFileImpl (buffer=..., block=0x7fffffffbca8, config=..., bufferOwnerRef=std::shared_ptr<llvm::SourceMgr> (empty) = {...}) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:2641
#14 0x00007ffff2047bab in mlir::readBytecodeFile (buffer=..., block=0x7fffffffbca8, config=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:2649
#15 0x00007fffeadb12e2 in mlir::parseSourceFile (sourceMgr=..., block=0x7fffffffbca8, config=..., sourceFileLoc=0x7fffffffbcf0) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Parser/Parser.cpp:30
#16 0x00007fffea951f1f in mlir::detail::parseSourceFile<mlir::ModuleOp, llvm::SourceMgr const&> (config=..., args=...) at /home/benoit/iree/third_party/llvm-project/mlir/include/mlir/Parser/Parser.h:159
#17 0x00007fffea9506c8 in mlir::parseSourceFile<mlir::ModuleOp> (sourceMgr=..., config=...) at /home/benoit/iree/third_party/llvm-project/mlir/include/mlir/Parser/Parser.h:178
#18 0x00007fffea918263 in mlir::iree_compiler::embed::(anonymous namespace)::Invocation::parseSource (this=0x5555555b26c0, source=...) at /home/benoit/iree/compiler/src/iree/compiler/API/Internal/CompilerDriver.cpp:822
#19 0x00007fffea918182 in ireeCompilerInvocationParseSource (inv=0x5555555b26c0, source=0x555555648080) at /home/benoit/iree/compiler/src/iree/compiler/API/Internal/CompilerDriver.cpp:1371
#20 0x00007fffeae3e820 in mlir::iree_compiler::runIreecMain(int, char**)::$_2::operator()(iree_compiler_source_t*) const (this=0x7fffffffc040, source=0x555555648080) at /home/benoit/iree/compiler/src/iree/compiler/Tools/iree_compile_lib.cc:241
#21 0x00007fffeae3dd49 in mlir::iree_compiler::runIreecMain (argc=9, argv=0x7fffffffda08) at /home/benoit/iree/compiler/src/iree/compiler/Tools/iree_compile_lib.cc:348
#22 0x00007fffea96167b in ireeCompilerRunMain (argc=9, argv=0x7fffffffda08) at /home/benoit/iree/compiler/src/iree/compiler/API/Internal/IREECompileToolEntryPoint.cpp:12
#23 0x00005555555557a2 in main (argc=9, argv=0x7fffffffda08) at /home/benoit/iree/tools/iree-compile-main.cc:9

Looking at frame #1, in the bad case, the readVarInt(kind) produces kind = 15, while in the good case it produces kind = 11.

Since it's in generated code I can't link it easily so here's a paste. This is at ${BUILD_DIR}/llvm-project/tools/mlir/include/mlir/IR/BuiltinDialectBytecode.cpp.inc:360.

static Attribute readAttribute(MLIRContext* context, DialectBytecodeReader &reader) {
  uint64_t kind;
  if (failed(reader.readVarInt(kind)))
    return Attribute();
  switch (kind) {
    case 0:
      return readArrayAttr(context, reader);
    case 1:
      return readDictionaryAttr(context, reader);
    case 2:
      return readStringAttr(context, reader);
    case 3:
      return readStringAttrWithType(context, reader);
    case 4:
      return readFlatSymbolRefAttr(context, reader);
    case 5:
      return readSymbolRefAttr(context, reader);
    case 6:
      return readTypeAttr(context, reader);
    case 7:
      return readUnitAttr(context, reader);
    case 8:
      return readIntegerAttr(context, reader);
    case 9:
      return readFloatAttr(context, reader);
    case 10:
      return readCallSiteLoc(context, reader);
    case 11:
      return readFileLineColLoc(context, reader);
    case 12:
      return readFusedLoc(context, reader);
    case 13:
      return readFusedLocWithMetadata(context, reader);
    case 14:
      return readNameLoc(context, reader);
    case 15:
      return readUnknownLoc(context, reader);
    case 16:
      return readDenseResourceElementsAttr(context, reader);
    case 17:
      return readDenseArrayAttr(context, reader);
    case 18:
      return readDenseIntOrFPElementsAttr(context, reader);
    case 19:
      return readDenseStringElementsAttr(context, reader);
    case 20:
      return readSparseElementsAttr(context, reader);
    case 21:
      return readDistinctAttr(context, reader);
    default:
      reader.emitError() << "unknown attribute code: " << kind;
      return Attribute();
  }
  return Attribute();
}

So:

Interpretation:

What's exactly the bug here? If simply round-tripping via text IR is enough to fix those locations, maybe that means that they should have been good locations in the first place, so the bug is iree-import-tflite producing UnknownLoc in the first place? Alternatively, could translateModuleToLLVMIR do something more graceful that just drop all debug info when it sees UnknownLoc ?

ScottTodd commented 7 months ago

The good bytecode file (obtained from the bad one by round-tripping through text IR) has FileLineColLoc attributes.

Can you give a sample value for one of those? Does it reference the .tflite file?

bjacob commented 7 months ago

Again the good bytecode file is obtained from the bad bytecode file here by two successive iree-opt commands round-tripping via text IR and back to bytecode. So there isn't any actual new information in the good bytecode file compared to what there is in the bad bytecode file. I'll still dump what those FileLineColLoc contain, moment.

ScottTodd commented 7 months ago

Here are some hypotheses:

We should also decouple the existence of source locations from whatever other debug information may or may not be included.

bjacob commented 7 months ago

Indeed:

Things I considered:

  1. I pondered whether it is a bug that iree-import-tflite creates UnknownLoc locations, but I couldn't convince myself that it is, as tflite files are binaries with no canonical text representation, so it's hard to see how any notion of location would be actually usable as debug information. Being a binary format is a size vs usability trade-off in the first place, and generating UnknownLoc is merely going with that flow.
  2. I pondered for a second whether it would make sense then to store text MLIR form as the source that debug info could refer to... and then I realized that that is exactly what https://github.com/openxla/iree/pull/16757 just did.
  3. I tried hacking the logic inside translateModuleToLLVMIR to handle UnknownLoc gracefully. I found that the relevant logic is in third_party/llvm-project/mlir/lib/Target/LLVMIR/DebugTranslation.cpp, where UnknownLoc is special-cased in a couple of places. The problem, as explained in a comment, is that LLVM IR does not have a notion of "unknown loc", so all locations have to be real locations. I tried returning a make-believe line-column location with (0, 0) but stumbled on errors complaining that the scope was invalid. I guess I just learned why the existing code had to special-case UnknownLoc. At this point I thought back about what I had just learned about https://github.com/openxla/iree/pull/16757 and decided to give up and instead do as follows:

What I now think we should do:

Remember that I need this in the first place to have debug info in ukernels, and the motivation for this is having a good Tracy source view. Now, as of https://github.com/openxla/iree/pull/16757 , users who care about having a good source view are users who pass --iree-hal-executable-debug-level=3, in which case CaptureExecutableSourcesPass runs and we have an actual source that debug info can refer to. So I now think that we should:

  1. When debugLevel < 3, strip debug info from ukernel bitcode, since the user won't get a good source view in the parent scopes anyway, not much point fighting for a good source view in the leaf ukernel scopes.
  2. When debugLevel >= 3, ensure that we do have source locations. As noted earlier, --iree-hal-dump-executable-sources-to fixes the problem by creating source locations. Now, https://github.com/openxla/iree/pull/16757 notes: "Unlike the DumpExecutableSourcesPass this does not change the original source locations [...]" and indeed I can see that passing --iree-hal-executable-debug-level=3 does not have the same effect of fixing my bug. But I'm hoping that some minor change here will do the trick (e.g. perhaps making this add source locations like --iree-hal-dump-executable-sources-to does? I'm not clear enough about the nuances here to be sure I'm making sense...)

@benvanik @ScottTodd

ScottTodd commented 7 months ago

(e.g. perhaps making this add source locations like --iree-hal-dump-executable-sources-to does? I'm not clear enough about the nuances here to be sure I'm making sense...)

The decoupling there was intentional. Tracing showing snapshotted sources produced by --iree-hal-dump-executable-sources was a clever trick. The new mechanism is explicit rather than implicit, so it should be easier to understand and work with.

I think the best paths forward are some mix of

Whether that is connected with --iree-hal-executable-debug-level... idk. The numbers there are arbitrary and mostly based on how much they increase binary size. We haven't done too much consistently with that option yet. Could possibly snapshot locations if they are missing and the level is >= 1?

benvanik commented 7 months ago

We need the upstream bug fixed. There's no reason debug info should fail if there's an UnknownLoc. I think there's a bit too much focus on workarounds that dictate how features behave in ways that should be orthogonal. --iree-hal-executable-debug-level= is explicitly designed to not reassign source locations for reasons.

bjacob commented 7 months ago

I meant something like Scott alludes to in the last sentence of his above comment:

Could possibly snapshot locations if they are missing and the level is >= 1?

Emphasis on "if they are missing".

The situation here is the locations are UnknownLoc, so there is a need to add actual locations to get a useful source view. This is likely a preexisting problem: I bet that if we try to tracy those tflite models with UnknownLoc, source view isn't showing anything. My point was, this doesn't even feel like an "upstream bug" - there is no source location because there is no source, and even that may not be a bug given that TFLite is a binary format. So like Scott suggests in the above pasted sentence, I was thinking that for high enough debugLevel (I was suggesting debugLevel >= 3) we could actually rewrite UnknownLoc locations with the snapshotted source locations. I'm OK with leaving alone any existing actual locations (anything else than UnknownLoc).

If you still see an "upstream bug" here that needs fixing, I would like to understand what that is. I can think of two potential candidates but I'm way out of my depth in either:

  1. DebugTranslation.cpp special-casing UnknownLoc and completely giving up when it sees it.
  2. The lower-level LLVM error when a function with debug info is inlined at a call site without debug info. Certainly that seems unnecessarily rigid, but I'm extremely out of my depth there.
bjacob commented 7 months ago

After posting the previous comment I thought: Wait, either I'm missing something, or #16757 does not actually fix #15699.

I am at current top-of-tree (https://github.com/openxla/iree/commit/ee32fc7b74b4aad49fc46d67ea6c1f617416c4d4). I did an experiment on a OPT-1.3b model from shark_tank, so this is a PyTorch model not TFLite. The command lines are in this gist: https://gist.github.com/bjacob/75717f6335603628d09c58313aa0d47d . It's totally self contained with a link to download the imported MLIR from.

I just added the usual Tracy-ing flags, and ran two experiments:

  1. With just --iree-hal-executable-debug-level=3 added to the compilation command line. Result: no source view (first screenshot below).
  2. With also --iree-hal-dump-executable-sources-to=/tmp. Result: source view (second screenshot below).

I thought that "#16757 fixes #15699" meant that --iree-hal-executable-debug-level=3 by itself would suffice to obtain the Tracy source view, without adding --iree-hal-dump-executable-sources-to=/tmp.

Screenshots:

  1. No source view with --iree-hal-executable-debug-level=3 :
    • bad
  2. Source view with --iree-hal-dump-executable-sources-to=/tmp:
    • good
ScottTodd commented 7 months ago

I just added the usual Tracy-ing flags, and ran two experiments:

  1. With just --iree-hal-executable-debug-level=3 added to the compilation command line. Result: no source view (first screenshot below).
  2. With also --iree-hal-dump-executable-sources-to=/tmp. Result: source view (second screenshot below).

For Tracy, you should prefer the new --iree-hal-capture-executable-sources pass. That stays in process (instead of dumping files to disk) and is explicit about the link between source files and tracing. (I'll update the docs soon)

ScottTodd commented 7 months ago

I bet that if we try to tracy those tflite models with UnknownLoc, source view isn't showing anything.

Source locations and debug information should be separable. We don't want to have a tight coupling between those concepts. If one frontend like TFLite doesn't include one type of metadata, that shouldn't stop us from including other kinds of metadata.

bjacob commented 7 months ago

For Tracy, you should prefer the new --iree-hal-capture-executable-sources pass.

Ah, do I need to patch some PR? Trying --iree-hal-capture-executable-sources gives me:

iree-compile: Unknown command line argument '--iree-hal-capture-executable-sources'.  Try: 'tools/iree-compile --help'
iree-compile: Did you mean '--iree-hal-substitute-executable-source'?

I tried the suggested one but it needs a value?

That stays in process (instead of dumping files to disk) and is explicit about the link between source files and tracing. (I'll update the docs soon)

I see, that sounds great. I had assumed that --iree-hal-executable-debug-level=3 would have that behavior as of https://github.com/openxla/iree/pull/16757 . Why wouldn't it? When I specify --iree-hal-executable-debug-level=3 , I'm already conveying "go ahead and bloat my module but make Tracy work!"

Source locations and debug information should be separable. We don't want to have a tight coupling between those concepts.

I need a lecture on this topic. I just don't really understand at the moment.

ScottTodd commented 7 months ago

Let me confirm how the new pass is working... it might be actually be added automatically with debug level >= 3.

Ah, do I need to patch some PR?

Depends on what version/commit you're building at :p. You do need Ben's PR / https://github.com/openxla/iree/commit/15d903960dff911f5ebaf28b467507f62876a5eb

bjacob commented 7 months ago

Yep, as mentioned above I am at current top-of-tree (https://github.com/openxla/iree/commit/ee32fc7b74b4aad49fc46d67ea6c1f617416c4d4) which is past that (I checked).

benvanik commented 7 months ago

the extra sources are embedded with debug-level>=3

What source locations are passed to LLVM and what source files we embed in the executables are orthogonal. It is useful to be able to point at our new captured sources, but that should be a separate flag. We could have an iree-hal-executable-source-stage= flag that does the location rewrite to a captured stage by name. This way you could have the debug info reference tensor linalg, buffer linag, LLVM dialect, etc. If we added that flag we could remove the implicit source location overwriting of dump-sources, as that was always a hack.

But we want that to be optional and off by default: to users (not internal devs) the source locations should always point back to their source programs by default. And if because of a bug that means this fails to work we need to fix that bug.

So:

bjacob commented 7 months ago

I see.

What I'm seeing here is that with both TFLite and Pytorch sources, often we just don't have source locations out of the import process. So tracy source view doesn't show anything by default. Could we in that case and under debug-level>=3 set source locations to the embedded executable sources (which are tensor-linalg) ? Way more useful than no source at all.

Also a technicality - I get your explanation of "What source locations are passed to LLVM and what source files we embed in the executables are orthogonal", but then I don't understand how that works. If the source files in the embedded executable sources are orthogonal to the source locations passed to LLVM, how is the mapping done between locations in the final object code and locations in the embedded executable sources ?

benvanik commented 7 months ago

I think I'm missing something - are you saying there are zero locations from tflite and pytorch? Or just some missing locations? I don't think I understand how there can be zero locations - there's always some source file, unless the user strips them and sets them all to UnknownLoc - but users should never do that (and if there's an option for that in those tools it should be removed). If there's some partial missing information then that's the bug to fix.

The pass that captures the locations (which runs several times, each time with a compilation stage name) can set the flag to replace the source locations to point at the captured source - that's exactly what dump-sources does today - the only difference is in one we update the locations and in the other we don't.

bjacob commented 7 months ago

Aah so to actually answer that question I had to patch the above-mentioned generated .cpp.inc parser code and my diff is there: https://gist.github.com/bjacob/ef04e138d571348edea0f5808c43fc3c

And you're right actually the "bad" imported bytecode has only one UnknownLoc and a large number of NameLoc. The "good" bytecode obtained by round-trip via text, has all FileLineLoc.

In the below table, "bad" means exhibits the pathologies discussed here, no source view etc.

Input to iree-compile UnknownLoc NameLoc FileLineColLoc
PoseNet from iree-import-tflite (bad) 1 85 0
the same, round-tripped via text (good) 0 0 128

Now I tried running the same experiment on that OPT-1.3 model but realized that it is text IR to begin with so it just doesn't go through that bytecode-reader code, and I have no idea why it exhibits the pathology.

bjacob commented 7 months ago

These are the NameLoc values in this PoseNet model: https://gist.github.com/bjacob/31fdeb0902a42461d5cca6902aef6080

bjacob commented 7 months ago

At this point I think I agree that it seems really unnecessarily rigid of the current DebugTranslation.cpp code to just give up on this perfectly good debug info just because of one UnknownLoc.

bjacob commented 7 months ago

Note though, at this point we are trying to solve 2 problems, and a fix at that level would only solve 1 not 2:

  1. Failure to inline ukernels that have debug info.
  2. Failure to show a useful source view.
bjacob commented 7 months ago

I tried again fixing translateLoc in DebugTranslation.cpp and noticed that in our situation where all the locations are NameLoc, it emits no LLVM DI at all. Indeed, here is what it does with NameLoc:

https://github.com/llvm/llvm-project/blob/6800f422c22ffd672b1e31f0d0a3fa29d19b7a13/mlir/lib/Target/LLVMIR/DebugTranslation.cpp#L434

So it's just ignoring the name string in the NameLoc and walking straight to the ChildLoc, ending with an UnknownLoc, returning null.

jpienaar commented 7 months ago

Note: the round tripped via text is actually bad. It loses all traces to the original Tflite file and references an intermediate file. It accidentally works. The bug is how namedloc is being treated, well and indeed mixing two concerns together (but one definitely can have a namedloc and debug info and that should work).