Open wangkuiyi opened 1 year ago
To triage a nondeterministic issue like this, it would be very helpful to be able to run the reproduction steps with sanitizers: AddressSanitizer and, separately, ThreadSanitizer. This page says:
You can’t use Thread Sanitizer to diagnose iOS, tvOS, and watchOS apps running on a device. Use Thread Sanitizer only on your 64-bit macOS app, or to diagnose your 64-bit iOS, tvOS, or watchOS app running in Simulator.
Since you write above that this reproduces in Simulator, let's then focus on that.
In particular, task==NULL sounds like the kind of thing that could be associated with issues that ThreadSanitizer would diagnose.
Even a negative outcome (the sanitizer doesn't see anything) would be useful information in itself, as that would help rule out classes of issues.
We have sanitizers docs here, https://github.com/openxla/iree/blob/main/docs/developers/developing_iree/sanitizers.md
But I wrote that a while ago and it's not optimal. Here are the important steps:
To enable ThreadSanitizer (TSan):
Use the RelWithDebInfo build type:
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .
Compile your .vmfb module by adding these flags to your iree-compile command line:
iree-compile ... --iree-llvm-sanitize=thread --iree-llvm-link-embedded=false
Then re-build the IREE runtime (iree-run-module or anything else you're using to load the compiled module) with the IREE_ENABLE_TSAN CMake option:
cmake -DIREE_ENABLE_TSAN=ON .
cmake --build .
If the reproducing program is your own (finetune, if I read the issue description correctly), then rebuild that with this C/C++ compiler flag: -fsanitize=thread. That is all that IREE_ENABLE_TSAN does to IREE binaries. But you still need to rebuild the IREE runtime (that it links to) with IREE_ENABLE_TSAN.
Then re-run your iree-run-module command line reproducing this issue, using both the TSan-enabled iree-run-module and the TSan-enabled compiled .vmfb module.
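The TSan steps above can be sketched as a small dry-run script. This is only a sketch under assumptions: the run helper, the DRY_RUN switch, and the default MODEL and OUT paths are hypothetical and not part of IREE, while the CMake options and iree-compile flags are the ones quoted in this thread. By default it prints the commands instead of executing them.

```shell
#!/bin/sh
# Dry-run sketch of the TSan steps described above.
# By default each command is only printed; set DRY_RUN=0 to execute.
# Assumptions (not from the thread): the MODEL and OUT defaults, and that
# this runs from an already-configured IREE build directory.

MODEL="${MODEL:-/tmp/gpt2.mlir}"
OUT="${OUT:-/tmp/gpt2-tsan.vmfb}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

tsan_workflow() {
  # 1. Use the RelWithDebInfo build type.
  run cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo . || return 1
  # 2. Compile the module with TSan instrumentation.
  run iree-compile "$MODEL" \
    --iree-input-type=mhlo \
    --iree-hal-target-backends=llvm-cpu \
    --iree-llvm-sanitize=thread \
    --iree-llvm-link-embedded=false \
    -o "$OUT" || return 1
  # 3. Rebuild the runtime with TSan enabled.
  run cmake -DIREE_ENABLE_TSAN=ON . || return 1
  run cmake --build . || return 1
  # 4. Re-run your usual iree-run-module command with the new module.
}

tsan_workflow
```

Keeping the default in dry-run mode makes it easy to review the exact command lines before spending time on a full rebuild.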
To enable AddressSanitizer (ASan): nothing special is needed for the .vmfb. Just re-compile the IREE runtime with the CMake option IREE_ENABLE_ASAN=ON:
cmake -DIREE_ENABLE_ASAN=ON .
cmake --build .
If the reproducing program is your own (finetune, if I read the issue description correctly), then rebuild that with this C/C++ compiler flag: -fsanitize=address. That is all that IREE_ENABLE_ASAN does to IREE binaries. But you still need to rebuild the IREE runtime (that it links to) with IREE_ENABLE_ASAN.
Thanks @bjacob! I rebuilt the IREE compiler and runtime for macOS/M1 with the following additional CMake flags:
-DCMAKE_BUILD_TYPE=RelWithDebInfo
-DIREE_ENABLE_ASAN=ON
-DIREE_ENABLE_TSAN=ON
-DIREE_BYTECODE_MODULE_ENABLE_TSAN=ON
-DIREE_BYTECODE_MODULE_FORCE_LLVM_SYSTEM_LINKER=ON
-DIREE_ENABLE_MSAN=ON
The build went fine, except that I had to patch libyaml slightly: https://github.com/yaml/libyaml/pull/267
Then, I compiled gpt2.mlir with the following command:
iree-compile /tmp/gpt2.mlir \
--iree-input-type=mhlo \
--iree-hal-target-backends=llvm-cpu \
-o /tmp/gpt2-san.vmfb \
--iree-llvm-sanitize=thread --iree-llvm-link-embedded=false 2>&1 | tee /tmp/log
It gave me errors like the following. (The more complete error message is at https://gist.github.com/wangkuiyi/b4ef1a867e6f129fe3287a0ef0e1d600. The complete one is too big to upload to GitHub.)
Undefined symbols for architecture arm64:
"___tsan_func_entry", referenced from:
_encode_dispatch_0_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
_encode_dispatch_1_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
_encode_dispatch_2_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
_encode_dispatch_3_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
_encode_dispatch_4_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
_encode_dispatch_5_generic_768x8 in gpt2_module_linked_llvm_cpu-dff9a6.o
_encode_dispatch_6_matmul_2304x8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
...
"___tsan_func_exit", referenced from:
_encode_dispatch_0_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
_encode_dispatch_1_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
_encode_dispatch_2_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
_encode_dispatch_3_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
_encode_dispatch_4_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
_encode_dispatch_5_generic_768x8 in gpt2_module_linked_llvm_cpu-dff9a6.o
_encode_dispatch_6_matmul_2304x8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
...
ld: symbol(s) not found for architecture arm64
Linking failed; escaped command line returned exit code 256:
It works if I remove --iree-llvm-sanitize=thread --iree-llvm-link-embedded=false.
I don't know the fix for these linking errors, but, FYI:
-DCMAKE_BUILD_TYPE=RelWithDebInfo -DIREE_ENABLE_ASAN=ON -DIREE_ENABLE_TSAN=ON -DIREE_BYTECODE_MODULE_ENABLE_TSAN=ON -DIREE_BYTECODE_MODULE_FORCE_LLVM_SYSTEM_LINKER=ON -DIREE_ENABLE_MSAN=ON
The IREE_ENABLE_*SAN options should be regarded as mutually exclusive. In effect, they are probably overriding each other, passing -fsanitize={address,thread,memory}, where the one passed last overrides the others. So here, drop -DIREE_ENABLE_ASAN=ON and -DIREE_ENABLE_MSAN=ON.
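A quick guard against combining sanitizers could look like the sketch below. The function name check_single_sanitizer is hypothetical, and it only checks the three options named in this thread (ASAN, TSAN, MSAN); it is not something IREE's build system provides.

```shell
#!/bin/sh
# Counts how many of -DIREE_ENABLE_{ASAN,TSAN,MSAN}=ON appear in a list of
# CMake arguments and fails if more than one is present.
# Hypothetical helper for illustration; not part of IREE.
check_single_sanitizer() {
  count=0
  for arg in "$@"; do
    case "$arg" in
      -DIREE_ENABLE_ASAN=ON|-DIREE_ENABLE_TSAN=ON|-DIREE_ENABLE_MSAN=ON)
        count=$((count + 1)) ;;
    esac
  done
  if [ "$count" -gt 1 ]; then
    echo "error: $count sanitizers enabled; pick one" >&2
    return 1
  fi
  return 0
}

# The flag set from this thread with ASan and MSan dropped passes the check.
check_single_sanitizer \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DIREE_ENABLE_TSAN=ON \
  -DIREE_BYTECODE_MODULE_ENABLE_TSAN=ON \
  -DIREE_BYTECODE_MODULE_FORCE_LLVM_SYSTEM_LINKER=ON \
  && echo "ok: one sanitizer"
```

The patterns match whole arguments, so -DIREE_BYTECODE_MODULE_ENABLE_TSAN=ON is not counted as a second sanitizer.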
Interesting! The linker command line from your gist is
/usr/bin/ld -o /var/folders/hd/6q8jftdn7b1fygsrzdkp5ww40000gn/T/gpt2_module_linked_llvm_cpu-dff9a6.so -static -dylib -flat_namespace -L /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib -lSystem /var/folders/hd/6q8jftdn7b1fygsrzdkp5ww40000gn/T/gpt2_module_linked_llvm_cpu-dff9a6.o
and it is itself generated by this code: https://github.com/openxla/iree/blob/1148f720be7e267f248e034b3cfb488633884980/compiler/src/iree/compiler/Dialect/HAL/Target/LLVM/internal/UnixLinkerTool.cpp#L82-L92
It looks as if, on Apple platforms, the TSan instrumentation library needs to be explicitly linked in (?). We need someone with Apple experience here... maybe @powderluv?
Maybe try adding "-fsanitize=thread" to the linker flags (code linked in previous comment). It's suggested in various places, including https://github.com/google/sanitizers/issues/701 .
That is, at UnixLinkerTool.cpp:90 (above linked code), add unconditionally
flags.push_back("-fsanitize=thread");
If that works, we'll figure out how to do that conditionally.
clang -v -fsanitize=thread helped me. The following command
clang -fsanitize=thread /tmp/a.c -o /tmp/a
is equivalent to the following two:
clang /tmp/a.c -c -o /tmp/a.o
and
ld /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/14.0.0/lib/darwin/libclang_rt.tsan_osx_dynamic.dylib \
-rpath @executable_path \
-rpath /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/14.0.0/lib/darwin \
/tmp/a.o -o /tmp/a \
-lSystem -syslibroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
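To see which TSan runtime library the clang driver would add on a given machine without building anything, something like the following sketch may help. It relies on clang's -### option, which prints the sub-commands the driver would run instead of executing them. The function name find_tsan_runtime is hypothetical, and the splitting assumes the toolchain paths contain no spaces.

```shell
#!/bin/sh
# Prints the ThreadSanitizer runtime library that clang would put on the
# link line, without compiling or linking anything.
# `clang -###` only prints the driver's sub-commands (on stderr).
# Assumption: toolchain paths contain no spaces (we split on spaces).
find_tsan_runtime() {
  if ! command -v clang >/dev/null 2>&1; then
    echo "clang not found" >&2
    return 1
  fi
  clang '-###' -fsanitize=thread -x c /dev/null -o /dev/null 2>&1 \
    | tr ' ' '\n' | tr -d '"' | grep 'clang_rt.tsan'
}

find_tsan_runtime || echo "no TSan runtime reported on the clang link line"
```

On a macOS setup like the one above, the printed path would be expected to point at libclang_rt.tsan_osx_dynamic.dylib inside the toolchain, which is the library a hand-written ld command has to name explicitly.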
I suspect that this issue is not coming from generated code. In that case, you may be able to get away with just building the iree runtime with sanitizers and not fiddling with the iree-llvm-sanitize=thread compiler flags.
(It would obviously be good if this all worked better on Apple platforms, so I'm just offering an option that might lead through the maze faster. It is still useful to figure out how to fully enable sanitizers.)
Other things that can be done to bisect the area that is having the problem:
I suggest the last one because that error makes me think there is something going on with the task scheduler in threaded mode. Narrowing down which piece is crashing can help scope debugging activity.
I suspect that this issue is not coming from generated code. In that case, you may be able to get away with just building the iree runtime with sanitizers and not fiddling with the iree-llvm-sanitize=thread compiler flags.
Agreed that this issue does not look like it comes from the generated code, but TSan specifically (as opposed to other sanitizers) does not let us take advantage of that, because a TSan-enabled IREE runtime can only call TSan-enabled module code (TSan is an ABI break). Well, it will run, but it will crash.
compile with vmvx (slow but unlikely to crash on generated code)
Ah good idea, that does enable running a TSan-enabled IREE-runtime without having to get TSan to work in module code. My above objection is specific to llvm-cpu target backend.
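As a sketch of the vmvx suggestion, the compile command from earlier in this thread could be adapted by swapping the target backend and dropping the sanitizer flags. The function name vmvx_compile_cmd and the output path are hypothetical; the snippet only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# Prints (does not run) a vmvx variant of the iree-compile command used
# earlier in this thread. No --iree-llvm-sanitize flag is passed, since the
# point is to avoid needing TSan-instrumented module code.
MODEL="${MODEL:-/tmp/gpt2.mlir}"
OUT="${OUT:-/tmp/gpt2-vmvx.vmfb}"   # hypothetical output path

vmvx_compile_cmd() {
  echo "iree-compile $MODEL --iree-input-type=mhlo --iree-hal-target-backends=vmvx -o $OUT"
}

vmvx_compile_cmd
```

The resulting module would then be loaded by the TSan-enabled runtime built with IREE_ENABLE_TSAN.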
I suggest the last one because that error makes me think there is something going on with the task scheduler in threaded mode. Narrowing down which piece is crashing can help scope debugging activity.
+1
@bjacob @wangkuiyi Looks like this went a bit stale, any further update?
Deferring to @wangkuiyi .
@allieculp and @bjacob - I got GPT-2 fine-tuning working a month ago, but via @antiagainst's Metal GPU backend. This issue occurs with the CPU backend, but not with the Metal GPU one.
What happened?
After we fixed https://github.com/openxla/iree/issues/12369, I can make GPT-2 generate text well, so I'm moving on to fine-tuning GPT-2.
In https://github.com/iree-org/iree-jax/pull/58, I added a loss function to the file iree-jax/models/gpt2/model.py. In JAX-Python, the fine-tuning works well.
Then, in https://github.com/iree-org/iree-jax/pull/59, I added the fine-tuning feature as an MLIR function. The compilation went well, and I got the file /tmp/gpt2.vmfb.
I can run the module using iree-run-module.
Because the finetune function only updates the parameters and does not return anything, the above run prints only EXEC @finetune.
To check if the fine-tuning really works on macOS, I wrote a C++ program to run this vmfb file. Sometimes it works well, but sometimes it crashes with Bus error: 10.
By putting the C++ program into an iOS app written in Objective-C, I can run the app on my iPhone 13 or the iOS Simulator. On these two platforms, the program crashes with EXC_BAD_ACCESS almost every time. I am attaching a stack trace from Xcode.
Steps to reproduce your issue
Build gpt2.vmfb.
Run gpt2.vmfb on macOS/M1.
Run gpt2.vmfb on the iOS Simulator or an iPhone.
What component(s) does this issue relate to?
Runtime
Version information
IREE da22c84fa2261cf5df566029725402d565d1e7b0
Additional context
macOS M1 Max