Open bjacob opened 7 months ago
@benvanik @ScottTodd @jpienaar
If at the very start of the serializeExecutable
function I dump variantOp
, the resulting (text) dump is identical in the bad and good cases.
If I dump the parent mlir::ModuleOp
, the only difference is in the namespace string; if I replace the name (sed 's/iree_PoseNet_fp32_tflite_/llvm_module/g'
) then the ModuleOp
is identical.
I think at this point we have narrowed this bug down to something about bitcode. In the bad case, the module as loaded from bitcode is corrupted in a way that somehow causes debug info to be dropped in translateModuleToLLVMIR
, but that corruption doesn't survive text-serialization, so we can't see it in any kind of textual IR dump.
Confirmed with this experiment:
If I process the bad input file with iree-opt
now with --emit-bytecode
then the problem is still there in the output.
Whereas as noted earlier, without --emit-bytecode
the problem goes away.
If I process first without --emit-bytecode
(so I get text) and then I process that with --emit-bytecode
then the problem is still gone. So the first processing to text IR did resolve some kind of corruption in the original IR. It's not like this MLIR module is consistently showing an issue whenever reencoded back to bytecode.
Fascinating, I would have gotten lost the moment the bug failed to express as text
Fully traced through BytecodeReader, here is the divergence:
Call stack at point where things diverge:
(gdb) bt
#0 (anonymous namespace)::DialectReader::readVarInt (this=0x7fffffff9eb0, result=@0x7fffffff9ce0: 15) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:993
#1 0x00007fffeac5e811 in (anonymous namespace)::readAttribute (context=0x5555555e90a0, reader=...) at llvm-project/tools/mlir/include/mlir/IR/BuiltinDialectBytecode.cpp.inc:360
#2 0x00007fffeac5e715 in (anonymous namespace)::BuiltinDialectBytecodeInterface::readAttribute (this=0x5555555f6540, reader=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/IR/BuiltinDialectBytecode.cpp:94
#3 0x00007ffff204c5d3 in (anonymous namespace)::AttrTypeReader::parseCustomEntry<mlir::Attribute> (this=0x7fffffffb510, entry=..., reader=..., entryType=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:1328
#4 0x00007ffff204c11f in (anonymous namespace)::AttrTypeReader::resolveEntry<mlir::Attribute> (this=0x7fffffffb510, entries=..., index=0, entryType=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:1236
#5 0x00007ffff204bf21 in (anonymous namespace)::AttrTypeReader::resolveAttribute (this=0x7fffffffb510, index=0) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:824
#6 0x00007ffff204be91 in (anonymous namespace)::AttrTypeReader::parseAttribute (this=0x7fffffffb510, reader=..., result=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:833
#7 0x00007ffff2050e6f in (anonymous namespace)::AttrTypeReader::parseAttribute<mlir::LocationAttr> (this=0x7fffffffb510, reader=..., result=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:859
#8 0x00007ffff204ff29 in mlir::BytecodeReader::Impl::parseAttribute<mlir::LocationAttr> (this=0x7fffffffb4b0, reader=..., result=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:1454
#9 0x00007ffff2046817 in mlir::BytecodeReader::Impl::parseOpWithoutRegions (this=0x7fffffffb4b0, reader=..., readState=..., isIsolatedFromAbove=@0x7fffffffab97: false) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:2273
#10 0x00007ffff2045da7 in mlir::BytecodeReader::Impl::parseRegions (this=0x7fffffffb4b0, regionStack=std::vector of length 1, capacity 1 = {...}, readState=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:2190
#11 0x00007ffff204406b in mlir::BytecodeReader::Impl::parseIRSection (this=0x7fffffffb4b0, sectionData=..., block=0x7fffffffbca8) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:2126
#12 0x00007ffff204211c in mlir::BytecodeReader::Impl::read(mlir::Block*, llvm::function_ref<bool (mlir::Operation*)>) (this=0x7fffffffb4b0, block=0x7fffffffbca8, lazyOpsCallback=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:1741
#13 0x00007ffff2047ddf in readBytecodeFileImpl (buffer=..., block=0x7fffffffbca8, config=..., bufferOwnerRef=std::shared_ptr<llvm::SourceMgr> (empty) = {...}) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:2641
#14 0x00007ffff2047bab in mlir::readBytecodeFile (buffer=..., block=0x7fffffffbca8, config=...) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Bytecode/Reader/BytecodeReader.cpp:2649
#15 0x00007fffeadb12e2 in mlir::parseSourceFile (sourceMgr=..., block=0x7fffffffbca8, config=..., sourceFileLoc=0x7fffffffbcf0) at /home/benoit/iree/third_party/llvm-project/mlir/lib/Parser/Parser.cpp:30
#16 0x00007fffea951f1f in mlir::detail::parseSourceFile<mlir::ModuleOp, llvm::SourceMgr const&> (config=..., args=...) at /home/benoit/iree/third_party/llvm-project/mlir/include/mlir/Parser/Parser.h:159
#17 0x00007fffea9506c8 in mlir::parseSourceFile<mlir::ModuleOp> (sourceMgr=..., config=...) at /home/benoit/iree/third_party/llvm-project/mlir/include/mlir/Parser/Parser.h:178
#18 0x00007fffea918263 in mlir::iree_compiler::embed::(anonymous namespace)::Invocation::parseSource (this=0x5555555b26c0, source=...) at /home/benoit/iree/compiler/src/iree/compiler/API/Internal/CompilerDriver.cpp:822
#19 0x00007fffea918182 in ireeCompilerInvocationParseSource (inv=0x5555555b26c0, source=0x555555648080) at /home/benoit/iree/compiler/src/iree/compiler/API/Internal/CompilerDriver.cpp:1371
#20 0x00007fffeae3e820 in mlir::iree_compiler::runIreecMain(int, char**)::$_2::operator()(iree_compiler_source_t*) const (this=0x7fffffffc040, source=0x555555648080) at /home/benoit/iree/compiler/src/iree/compiler/Tools/iree_compile_lib.cc:241
#21 0x00007fffeae3dd49 in mlir::iree_compiler::runIreecMain (argc=9, argv=0x7fffffffda08) at /home/benoit/iree/compiler/src/iree/compiler/Tools/iree_compile_lib.cc:348
#22 0x00007fffea96167b in ireeCompilerRunMain (argc=9, argv=0x7fffffffda08) at /home/benoit/iree/compiler/src/iree/compiler/API/Internal/IREECompileToolEntryPoint.cpp:12
#23 0x00005555555557a2 in main (argc=9, argv=0x7fffffffda08) at /home/benoit/iree/tools/iree-compile-main.cc:9
Looking at frame #1
, in the bad case, the readVarInt(kind)
produces kind = 15
, while in the good case it produces kind = 11
.
Since it's in generated code I can't link it easily so here's a paste. This is at ${BUILD_DIR}/llvm-project/tools/mlir/include/mlir/IR/BuiltinDialectBytecode.cpp.inc:360
.
static Attribute readAttribute(MLIRContext* context, DialectBytecodeReader &reader) {
uint64_t kind;
if (failed(reader.readVarInt(kind)))
return Attribute();
switch (kind) {
case 0:
return readArrayAttr(context, reader);
case 1:
return readDictionaryAttr(context, reader);
case 2:
return readStringAttr(context, reader);
case 3:
return readStringAttrWithType(context, reader);
case 4:
return readFlatSymbolRefAttr(context, reader);
case 5:
return readSymbolRefAttr(context, reader);
case 6:
return readTypeAttr(context, reader);
case 7:
return readUnitAttr(context, reader);
case 8:
return readIntegerAttr(context, reader);
case 9:
return readFloatAttr(context, reader);
case 10:
return readCallSiteLoc(context, reader);
case 11:
return readFileLineColLoc(context, reader);
case 12:
return readFusedLoc(context, reader);
case 13:
return readFusedLocWithMetadata(context, reader);
case 14:
return readNameLoc(context, reader);
case 15:
return readUnknownLoc(context, reader);
case 16:
return readDenseResourceElementsAttr(context, reader);
case 17:
return readDenseArrayAttr(context, reader);
case 18:
return readDenseIntOrFPElementsAttr(context, reader);
case 19:
return readDenseStringElementsAttr(context, reader);
case 20:
return readSparseElementsAttr(context, reader);
case 21:
return readDistinctAttr(context, reader);
default:
reader.emitError() << "unknown attribute code: " << kind;
return Attribute();
}
return Attribute();
}
So:
readUnknownLoc
.readFileLineColLoc
.Interpretation:
iree-import-tflite
has UnknownLoc
attributes. These result in eventually dropping most debug information at the end of compilation in translateModuleToLLVMIR
.FileLineColLoc
attributes. These result in real debug information surviving translateModuleToLLVMIR
.What's exactly the bug here? If simply round-tripping via text IR is enough to fix those locations, maybe that means that they should have been good locations in the first place, so the bug is iree-import-tflite
producing UnknownLoc
in the first place? Alternatively, could translateModuleToLLVMIR
do something more graceful that just drop all debug info when it sees UnknownLoc
?
The good bytecode file (obtained from the bad one by round-tripping through text IR) has
FileLineColLoc
attributes.
Can you give a sample value for one of those? Does it reference the .tflite
file?
Again the good bytecode file is obtained from the bad bytecode file here by two successive iree-opt
commands round-tripping via text IR and back to bytecode. So there isn't any actual new information in the good bytecode file compared to what there is in the bad bytecode file. I'll still dump what those FileLineColLoc
contain, moment.
Here are some hypotheses:
iree-import-tflite
does not include source informationWe should also decouple the existence of source locations from whatever other debug information may or may not be included.
Indeed:
iree-import-tflite
creates MLIR bytecode with UnknownLoc
locations..mlirbc
to text .mlir
results in locations being created, referencing lines/columns in that text .mlir
file. Then converting back to .mlirbc
preserves these locations into that text .mlir
file. That is how round-tripping avoids the issue.iree-import-tflite
creates UnknownLoc
locations, but I couldn't convince myself that it is, as tflite files are binaries with no canonical text representation, so it's hard to see how any notion of location would be actually usable as debug information. Being a binary format is a size vs usability trade-off in the first place, and generating UnknownLoc
is merely going with that flow.translateModuleToLLVMIR
to handle UnknownLoc
gracefully. I found that the relevant logic is in third_party/llvm-project/mlir/lib/Target/LLVMIR/DebugTranslation.cpp
, where UnknownLoc
is special-cased in a couple of places. The problem, as explained in a comment, is that LLVM IR does not have a notion of "unknown loc", so all locations have to be real locations. I tried returning a make-believe line-column location with (0, 0) but stumbled on errors complaining that the scope was invalid. I guess I just learned why the existing code had to special-case UnknownLoc. At this point I thought back about what I had just learned about https://github.com/openxla/iree/pull/16757 and decided to give up and instead do as follows:Remember that I need this in the first place to have debug info in ukernels, and the motivation for this is having a good Tracy source view. Now, as of https://github.com/openxla/iree/pull/16757 , users who care about having a good source view are users who pass --iree-hal-executable-debug-level=3
, in which case CaptureExecutableSourcesPass
runs and we have an actual source that debug info can refer to. So I now think that we should:
debugLevel < 3
, strip debug info from ukernel bitcode, since the user won't get a good source view in the parent scopes anyway, not much point fighting for a good source view in the leaf ukernel scopes.debugLevel >= 3
, ensure that we do have source locations. As noted earlier, --iree-hal-dump-executable-sources-to
fixes the problem by creating source locations. Now, https://github.com/openxla/iree/pull/16757 notes: "Unlike the DumpExecutableSourcesPass this does not change the original source locations [...]" and indeed I can see that passing --iree-hal-executable-debug-level=3
does not have the same effect of fixing my bug. But I'm hoping that some minor change here will do the trick (e.g. perhaps making this add source locations like --iree-hal-dump-executable-sources-to
does? I'm not clear enough about the nuances here to be sure I'm making sense...)@benvanik @ScottTodd
(e.g. perhaps making this add source locations like
--iree-hal-dump-executable-sources-to
does? I'm not clear enough about the nuances here to be sure I'm making sense...)
The decoupling there was intentional. Tracing showing snapshotted sources produced by --iree-hal-dump-executable-sources
was a clever trick. The new mechanism is explicit rather than implicit, so it should be easier to understand and work with.
I think the best paths forward are some mix of
Whether that is connected with --iree-hal-executable-debug-level
... idk. The numbers there are arbitrary and mostly based on how much they increase binary size. We haven't done too much consistently with that option yet. Could possibly snapshot locations if they are missing and the level is >= 1?
We need the upstream bug fixed. There's no reason debug info should fail if there's an UnknownLoc. I think there's a bit too much focus on workarounds that dictate how features behave in ways that should be orthogonal. --iree-hal-executable-debug-level=
is explicitly designed to not reassign source locations for reasons.
I meant something like Scott alludes to in the last sentence of his above comment:
Could possibly snapshot locations if they are missing and the level is >= 1?
Emphasis on "if they are missing".
The situation here is the locations are UnknownLoc, so there is a need to add actual locations to get a useful source view. This is likely a preexisting problem: I bet that if we try to tracy those tflite models with UnknownLoc, source view isn't showing anything. My point was, this doesn't even feel like an "upstream bug" - there is no source location because there is no source, and even that may not be a bug given that TFLite is a binary format. So like Scott suggests in the above pasted sentence, I was thinking that for high enough debugLevel (I was suggesting debugLevel >= 3) we could actually rewrite UnknownLoc locations with the snapshotted source locations. I'm OK with leaving alone any existing actual locations (anything else than UnknownLoc).
If you still see an "upstream bug" here that needs fixing, I would like to understand what that is. I can think of two potential candidates but I'm way out of my depth in either:
After posting the previous comment I thought: Wait, either I'm missing something, or #16757 does not actually fix #15699.
I am at current top-of-tree (https://github.com/openxla/iree/commit/ee32fc7b74b4aad49fc46d67ea6c1f617416c4d4). I did an experiment on a OPT-1.3b model from shark_tank, so this is a PyTorch model not TFLite. The command lines are in this gist: https://gist.github.com/bjacob/75717f6335603628d09c58313aa0d47d . It's totally self contained with a link to download the imported MLIR from.
I just added the usual Tracy-ing flags, and ran two experiments:
--iree-hal-executable-debug-level=3
added to the compilation command line. Result: no source view (first screenshot below).--iree-hal-dump-executable-sources-to=/tmp
. Result: source view (second screenshot below).I thought that "#16757 fixes #15699" meant that --iree-hal-executable-debug-level=3
by itself would suffice to obtain the Tracy source view, without adding --iree-hal-dump-executable-sources-to=/tmp
.
Screenshots:
--iree-hal-executable-debug-level=3
:
--iree-hal-dump-executable-sources-to=/tmp
:
I just added the usual Tracy-ing flags, and ran two experiments:
- With just
--iree-hal-executable-debug-level=3
added to the compilation command line. Result: no source view (first screenshot below).- With also
--iree-hal-dump-executable-sources-to=/tmp
. Result: source view (second screenshot below).
For Tracy, you should prefer the new --iree-hal-capture-executable-sources
pass. That stays in process (instead of dumping files to disk) and is explicit about the link between source files and tracing. (I'll update the docs soon)
I bet that if we try to tracy those tflite models with UnknownLoc, source view isn't showing anything.
Source locations and debug information should be separable. We don't want to have a tight coupling between those concepts. If one frontend like TFLite doesn't include one type of metadata, that shouldn't stop us from including other kinds of metadata.
For Tracy, you should prefer the new --iree-hal-capture-executable-sources pass.
Ah, do I need to patch some PR? Trying --iree-hal-capture-executable-sources
gives me:
iree-compile: Unknown command line argument '--iree-hal-capture-executable-sources'. Try: 'tools/iree-compile --help'
iree-compile: Did you mean '--iree-hal-substitute-executable-source'?
I tried the suggested one but it needs a value?
That stays in process (instead of dumping files to disk) and is explicit about the link between source files and tracing. (I'll update the docs soon)
I see, that sounds great. I had assumed that --iree-hal-executable-debug-level=3
would have that behavior as of https://github.com/openxla/iree/pull/16757 . Why wouldn't it? When I specify --iree-hal-executable-debug-level=3
, I'm already conveying "go ahead and bloat my module but make Tracy work!"
Source locations and debug information should be separable. We don't want to have a tight coupling between those concepts.
I need a lecture on this topic. I just don't really understand at the moment.
Let me confirm how the new pass is working... it might be actually be added automatically with debug level >= 3.
Ah, do I need to patch some PR?
Depends on what version/commit you're building at :p. You do need Ben's PR / https://github.com/openxla/iree/commit/15d903960dff911f5ebaf28b467507f62876a5eb
Yep, as mentioned above I am at current top-of-tree (https://github.com/openxla/iree/commit/ee32fc7b74b4aad49fc46d67ea6c1f617416c4d4) which is past that (I checked).
the extra sources are embedded with debug-level>=3
What source locations are passed to LLVM and what source files we embed in the executables are orthogonal. It is useful to be able to point at our new captured sources, but that should be a separate flag. We could have an iree-hal-executable-source-stage= flag that does the location rewrite to a captured stage by name. This way you could have the debug info reference tensor linalg, buffer linag, LLVM dialect, etc. If we added that flag we could remove the implicit source location overwriting of dump-sources, as that was always a hack.
But we want that to be optional and off by default: to users (not internal devs) the source locations should always point back to their source programs by default. And if because of a bug that means this fails to work we need to fix that bug.
So:
I see.
What I'm seeing here is that with both TFLite and Pytorch sources, often we just don't have source locations out of the import process. So tracy source view doesn't show anything by default. Could we in that case and under debug-level>=3 set source locations to the embedded executable sources (which are tensor-linalg) ? Way more useful than no source at all.
Also a technicality - I get your explanation of "What source locations are passed to LLVM and what source files we embed in the executables are orthogonal", but then I don't understand how that works. If the source files in the embedded executable sources are orthogonal to the source locations passed to LLVM, how is the mapping done between locations in the final object code and locations in the embedded executable sources ?
I think I'm missing something - are you saying there are zero locations from tflite and pytorch? Or just some missing locations? I don't think I understand how there can be zero locations - there's always some source file, unless the user strips them and sets them all to UnknownLoc - but users should never do that (and if there's an option for that in those tools it should be removed). If there's some partial missing information then that's the bug to fix.
The pass that captures the locations (which runs several times, each time with a compilation stage name) can set the flag to replace the source locations to point at the captured source - that's exactly what dump-sources does today - the only difference is in one we update the locations and in the other we don't.
Aah so to actually answer that question I had to patch the above-mentioned generated .cpp.inc
parser code and my diff is there: https://gist.github.com/bjacob/ef04e138d571348edea0f5808c43fc3c
And you're right actually the "bad" imported bytecode has only one UnknownLoc
and a large number of NameLoc
. The "good" bytecode obtained by round-trip via text, has all FileLineLoc
.
In the below table, "bad" means exhibits the pathologies discussed here, no source view etc.
Input to iree-compile |
UnknownLoc |
NameLoc |
FileLineColLoc |
---|---|---|---|
PoseNet from iree-import-tflite (bad) |
1 | 85 | 0 |
the same, round-tripped via text (good) | 0 | 0 | 128 |
Now I tried running the same experiment on that OPT-1.3 model but realized that it is text IR to begin with so it just doesn't go through that bytecode-reader code, and I have no idea why it exhibits the pathology.
These are the NameLoc values in this PoseNet model: https://gist.github.com/bjacob/31fdeb0902a42461d5cca6902aef6080
At this point I think I agree that it seems really unnecessarily rigid of the current DebugTranslation.cpp code to just give up on this perfectly good debug info just because of one UnknownLoc
.
Note though, at this point we are trying to solve 2 problems, and a fix at that level would only solve 1 not 2:
I tried again fixing translateLoc
in DebugTranslation.cpp and noticed that in our situation where all the locations are NameLoc, it emits no LLVM DI at all. Indeed, here is what it does with NameLoc:
So it's just ignoring the name string in the NameLoc and walking straight to the ChildLoc, ending with an UnknownLoc, returning null.
Note: the round tripped via text is actually bad. It loses all traces to the original Tflite file and references an intermediate file. It accidentally works. The bug is how namedloc is being treated, well and indeed mixing two concerns together (but one definitely can have a namedloc and debug info and that should work).
This is to continue the investigation in https://github.com/openxla/iree/pull/15756, which is being reverted now.
Summary:
https://github.com/openxla/iree/pull/15756 should have been a 1-line PR, simply adding
-gline-tables-only
to the clang flags that we use to compile ukernels to bitcode. All it does is generate some debug info in the compiled ukernels bitcode.This runs into a bug discussed on this comment thread: https://github.com/openxla/iree/pull/15756/files#r1411583898
iree-import-tflite
is run throughiree-opt
with no flags...mlir-opt
, or even just dumping the IR immediately after loading the input file inmlir-opt
...--iree-hal-dump-executable-sources-to
(https://github.com/openxla/iree/pull/15756/files#r1518324689). The reason is that that causes debug info to be reintroduced after the bug had already caused it to be dropped..codegen.bc
actually does have some limited debug info (on each function, but not on the ops inside each function). But the debug info somehow gets dropped inside ofmlir::translateModuleToLLVMIR
(https://github.com/openxla/iree/pull/15756/files#r1518324689). I haven't been able to debug exactly where, but that's the closest i've been able to get.