iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Undeterministically task==NULL at runtime when finetuning GPT-2 #12529

Open wangkuiyi opened 1 year ago

wangkuiyi commented 1 year ago

What happened?

After we fixed https://github.com/openxla/iree/issues/12369, I can make GPT-2 generate text well, so I'm moving on to fine-tuning GPT-2.

In https://github.com/iree-org/iree-jax/pull/58, I added a loss function to the file iree-jax/models/gpt2/model.py. In JAX-Python, the fine-tuning works well.

Then, in https://github.com/iree-org/iree-jax/pull/59, I added the fine-tuning feature as an MLIR function. The compilation went well, and I got the file /tmp/gpt2.vmfb.

I can run the module using iree-run-module:

15:09 $ iree-run-module --module=/tmp/gpt2.vmfb --device=local-task --function=finetune --input="1x64xi32=13" --input="1x64xi32=13" --input="1xi32=10"
EXEC @finetune

Because the finetune function only updates the parameters and does not return anything, the above run prints only EXEC @finetune.

To check if the finetuning really works on macOS, I wrote a C++ program to run this vmfb file. Sometimes it works well, but sometimes it crashes with Bus error: 10.

(base) ✔ ~/w/iree-ios/iree-jax/models/gpt2/finetune [export_finetune|●4✚ 3…6]
14:51 $ ./build.sh && ./finetune /tmp/gpt2.vmfb ~/w/iree-ios/IREESampleApp/IREESampleApp
clang: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
Got id = 679
Yi Wang has two dogs. He's a good dog, but he's not a good dog.
Got id = 1881
Yi Wang has two dogs. One is the other is the other is the other is the other is the other is
Got id = 1881
Yi Wang has two dogs. One of the other dogs is in the other two.
Got id = 1881
Yi Wang has two dogs. One is the other is the other is the other is the other is the other is
Got id = 1881
Yi Wang has two dogs. One is Joy.
Got id = 1881
Yi Wang has two dogs. One is Joy, the other is Joy.
Got id = 1881
Yi Wang has two dogs. One is Relaxie.
Got id = 1881
Yi Wang has two dogs. One is Relaxie.
Got id = 1881
Yi Wang has two dogs. One is Joy, the other is Relaxie.
Got id = 1881
Yi Wang has two dogs. One is Joy, the other is Relaxie.
Got id = 1881
Yi Wang has two dogs. One is Joy, the other is Relaxie.
(base) ✔ ~/w/iree-ios/iree-jax/models/gpt2/finetune [export_finetune|●4✚ 3…5]
14:51 $ ./build.sh && ./finetune /tmp/gpt2.vmfb ~/w/iree-ios/IREESampleApp/IREESampleApp
clang: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
Got id = 679
Yi Wang has two dogs. He's a good dog, but he's not a good dog.
Got id = 1881
Yi Wang has two dogs. One is the other is the other is the other is the other is the other is
Got id = 1881
Yi Wang has two dogs. One of the other dogs is in the other two.
Got id = 1881
Yi Wang has two dogs. One is the other is the other is the other is the other is the other is
Got id = 1881
Yi Wang has two dogs. One is Joy.
Bus error: 10

By putting the C++ program into an iOS app written in Objective-C, I can run the app on my iPhone 13 or the iOS Simulator. On these two platforms, the program crashes with EXC_BAD_ACCESS almost every time. I am attaching a stack trace from Xcode.

[Screenshot: Xcode stack trace, 2023-03-06]

Steps to reproduce your issue

  1. Build a very recent version of IREE after the fix https://github.com/openxla/iree/issues/12369
  2. Use the branch of IREE-JAX in https://github.com/iree-org/iree-jax/pull/59/ to generate gpt2.vmfb
  3. Build the sample C++ program that executes gpt2.vmfb on macOS/M1.
  4. Build the sample iOS app that executes gpt2.vmfb on the iOS Simulator or an iPhone.

What component(s) does this issue relate to?

Runtime

Version information

IREE da22c84fa2261cf5df566029725402d565d1e7b0

Additional context

macOS M1 Max

bjacob commented 1 year ago

To triage a nondeterministic issue like this, it would be very helpful to be able to run the reproduction steps with sanitizers: AddressSanitizer and, separately, ThreadSanitizer. This page says:

> You can’t use Thread Sanitizer to diagnose iOS, tvOS, and watchOS apps running on a device. Use Thread Sanitizer only on your 64-bit macOS app, or to diagnose your 64-bit iOS, tvOS, or watchOS app running in Simulator.

Since you write above that this reproduces in Simulator, let's then focus on that.

In particular, task==NULL sounds like the kind of thing that could be associated with issues that ThreadSanitizer would diagnose.
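To make that intuition concrete, here is a minimal C++ sketch. It is not IREE's task scheduler code; `Task`, `g_slot`, and `PublishAndConsume` are hypothetical names chosen for illustration. It shows the well-defined way to hand a pointer from one thread to another. If `g_slot` were a plain `Task*` written and read without synchronization, the accesses would form a data race, which is the class of bug ThreadSanitizer is designed to report.

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-in for a unit of work; not an IREE type.
struct Task {
  int id;
};

// Shared slot used to hand a task from a producer thread to a consumer.
// std::atomic makes the hand-off well-defined; a plain `Task*` accessed
// from two threads without synchronization would be a data race.
std::atomic<Task*> g_slot{nullptr};

// Publishes `task` (must be non-null) from a second thread and returns
// the id that the calling thread observes through the shared slot.
int PublishAndConsume(Task* task) {
  g_slot.store(nullptr, std::memory_order_relaxed);
  std::thread producer([task] {
    // Release ordering makes the task's fields visible before the pointer.
    g_slot.store(task, std::memory_order_release);
  });
  Task* seen = nullptr;
  // Wait until the pointer is published instead of assuming it is non-NULL.
  while ((seen = g_slot.load(std::memory_order_acquire)) == nullptr) {
    std::this_thread::yield();
  }
  producer.join();
  return seen->id;
}
```

A consumer that skipped the wait loop and dereferenced the slot immediately could observe NULL on some runs and a valid pointer on others, which would match the intermittent nature of the crash reported here.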

Even a negative outcome (the sanitizer doesn't see anything) would be useful information in itself, as that would help rule out classes of issues.

We have sanitizers docs here, https://github.com/openxla/iree/blob/main/docs/developers/developing_iree/sanitizers.md

But I wrote that a while ago and it's not optimal. Here are the important steps:

  1. For both sanitizers, select the RelWithDebInfo build type.
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .
  2. For ThreadSanitizer, first re-compile your .vmfb module by adding these flags to your iree-compile command line:
iree-compile ...  --iree-llvm-sanitize=thread --iree-llvm-link-embedded=false

Then re-build the IREE runtime (iree-run-module or anything else you're using to load the compiled module) with the IREE_ENABLE_TSAN CMake option:

cmake -DIREE_ENABLE_TSAN=ON .
cmake --build .

If the reproducing program is your own (finetune, if I read the Issue description correctly) then rebuild that with this C/C++ compiler flag: -fsanitize=thread. That is all that IREE_ENABLE_TSAN does to IREE binaries. But you still need to rebuild the IREE runtime (that it links to) with IREE_ENABLE_TSAN.

Then re-run your iree-run-module command line reproducing this issue, using both the TSan-enabled iree-run-module and the TSan-enabled compiled .vmfb module.

  3. For AddressSanitizer, it's easier as you don't need to re-compile the .vmfb. Just re-compile the IREE runtime with the CMake option IREE_ENABLE_ASAN=ON.
cmake -DIREE_ENABLE_ASAN=ON .
cmake --build .

If the reproducing program is your own (finetune, if I read the Issue description correctly) then rebuild that with this C/C++ compiler flag: -fsanitize=address. That is all that IREE_ENABLE_ASAN does to IREE binaries. But you still need to rebuild the IREE runtime (that it links to) with IREE_ENABLE_ASAN.

wangkuiyi commented 1 year ago

Thanks @bjacob! I rebuilt the IREE compiler and runtime for macOS/M1 with the following additional CMake flags:

-DCMAKE_BUILD_TYPE=RelWithDebInfo
-DIREE_ENABLE_ASAN=ON 
-DIREE_ENABLE_TSAN=ON 
-DIREE_BYTECODE_MODULE_ENABLE_TSAN=ON 
-DIREE_BYTECODE_MODULE_FORCE_LLVM_SYSTEM_LINKER=ON 
-DIREE_ENABLE_MSAN=ON

The build went fine, except that I had to patch libyaml slightly: https://github.com/yaml/libyaml/pull/267

Then, I compiled the gpt2.mlir with the following command:

 iree-compile /tmp/gpt2.mlir \
   --iree-input-type=mhlo \
   --iree-hal-target-backends=llvm-cpu  \
   -o /tmp/gpt2-san.vmfb \
   --iree-llvm-sanitize=thread --iree-llvm-link-embedded=false 2>&1 | tee /tmp/log

It gave me errors like the following. (The more complete error message is at https://gist.github.com/wangkuiyi/b4ef1a867e6f129fe3287a0ef0e1d600. The complete one is too big to upload to GitHub.)

 Undefined symbols for architecture arm64:
  "___tsan_func_entry", referenced from:
      _encode_dispatch_0_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
      _encode_dispatch_1_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
      _encode_dispatch_2_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
      _encode_dispatch_3_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
      _encode_dispatch_4_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
      _encode_dispatch_5_generic_768x8 in gpt2_module_linked_llvm_cpu-dff9a6.o
      _encode_dispatch_6_matmul_2304x8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
      ...
  "___tsan_func_exit", referenced from:
      _encode_dispatch_0_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
      _encode_dispatch_1_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
      _encode_dispatch_2_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
      _encode_dispatch_3_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
      _encode_dispatch_4_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
      _encode_dispatch_5_generic_768x8 in gpt2_module_linked_llvm_cpu-dff9a6.o
      _encode_dispatch_6_matmul_2304x8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
      ...
ld: symbol(s) not found for architecture arm64
Linking failed; escaped command line returned exit code 256:

It works if I remove --iree-llvm-sanitize=thread --iree-llvm-link-embedded=false.

bjacob commented 1 year ago

I don't know the fix for these linking errors, but, FYI:

-DCMAKE_BUILD_TYPE=RelWithDebInfo -DIREE_ENABLE_ASAN=ON -DIREE_ENABLE_TSAN=ON -DIREE_BYTECODE_MODULE_ENABLE_TSAN=ON -DIREE_BYTECODE_MODULE_FORCE_LLVM_SYSTEM_LINKER=ON -DIREE_ENABLE_MSAN=ON

The IREE_ENABLE_*SAN options should be regarded as mutually exclusive. In effect, they are probably overriding each other, passing -fsanitize={address,thread,memory} where the one passed last overrides others. So here, drop -DIREE_ENABLE_ASAN=ON and -DIREE_ENABLE_MSAN=ON.
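As a rough illustration of the "mutually exclusive" rule, a configuration check could look like the sketch below. `SanitizerOptions`, `EnabledSanitizers`, and `IsValidSanitizerConfig` are hypothetical names that only mirror the three CMake switches; this is not how IREE's build system implements them.

```cpp
#include <string>
#include <vector>

// Hypothetical mirror of the three CMake switches discussed above.
struct SanitizerOptions {
  bool asan = false;  // IREE_ENABLE_ASAN
  bool tsan = false;  // IREE_ENABLE_TSAN
  bool msan = false;  // IREE_ENABLE_MSAN
};

// Returns the -fsanitize= names of all enabled sanitizers.
std::vector<std::string> EnabledSanitizers(const SanitizerOptions& opts) {
  std::vector<std::string> enabled;
  if (opts.asan) enabled.push_back("address");
  if (opts.tsan) enabled.push_back("thread");
  if (opts.msan) enabled.push_back("memory");
  return enabled;
}

// A configuration is treated as valid when at most one sanitizer is enabled.
bool IsValidSanitizerConfig(const SanitizerOptions& opts) {
  return EnabledSanitizers(opts).size() <= 1;
}
```

With the flag set quoted above (ASan, TSan, and MSan all ON) this check would return false; with only TSan enabled it would return true.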

bjacob commented 1 year ago

Interesting! The linker command line from your gist is

/usr/bin/ld -o /var/folders/hd/6q8jftdn7b1fygsrzdkp5ww40000gn/T/gpt2_module_linked_llvm_cpu-dff9a6.so -static -dylib -flat_namespace -L /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib -lSystem /var/folders/hd/6q8jftdn7b1fygsrzdkp5ww40000gn/T/gpt2_module_linked_llvm_cpu-dff9a6.o

and it is itself generated by this code: https://github.com/openxla/iree/blob/1148f720be7e267f248e034b3cfb488633884980/compiler/src/iree/compiler/Dialect/HAL/Target/LLVM/internal/UnixLinkerTool.cpp#L82-L92

It looks as if, on Apple platforms, the TSan instrumentation library needs to be linked in explicitly (?). We need someone with Apple experience here... maybe @powderluv?

bjacob commented 1 year ago

Maybe try adding "-fsanitize=thread" to the linker flags (code linked in previous comment). It's suggested at various places including https://github.com/google/sanitizers/issues/701 .

That is, at UnixLinkerTool.cpp:90 (above linked code), add unconditionally

flags.push_back("-fsanitize=thread"); 

If that works, we'll figure out how to do that conditionally.
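For reference, the conditional form might look roughly like this self-contained sketch. `SanitizerKind` and `AppendSanitizerLinkFlags` are hypothetical names, not the actual option plumbing in UnixLinkerTool.cpp, and the sketch assumes the link command is one that accepts `-fsanitize=thread` (it is a compiler-driver flag, so this may not hold when `ld` is invoked directly).

```cpp
#include <string>
#include <vector>

// Hypothetical option type; IREE's real option plumbing may differ.
enum class SanitizerKind { kNone, kAddress, kThread };

// Appends the TSan flag to `flags` only when ThreadSanitizer was requested.
// Other sanitizer kinds are left untouched in this sketch.
void AppendSanitizerLinkFlags(SanitizerKind kind,
                              std::vector<std::string>& flags) {
  if (kind == SanitizerKind::kThread) {
    flags.push_back("-fsanitize=thread");
  }
}
```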

wangkuiyi commented 1 year ago

clang -v -fsanitize=thread helped me. The following command

clang -fsanitize=thread /tmp/a.c -o /tmp/a

is equivalent to the following two:

clang /tmp/a.c -c -o /tmp/a.o

and

ld /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/14.0.0/lib/darwin/libclang_rt.tsan_osx_dynamic.dylib \
  -rpath @executable_path \
  -rpath /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/14.0.0/lib/darwin \
  /tmp/a.o -o /tmp/a \
  -lSystem -syslibroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
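Based on that expansion, the extra arguments a direct `ld` invocation would need can be sketched as below. `TsanDarwinLinkArgs` is a hypothetical helper, and the runtime directory is passed in by the caller because it depends on the installed Xcode/clang version; only the dylib name and the two `-rpath` entries are taken from the output above.

```cpp
#include <string>
#include <vector>

// Builds the TSan-specific linker arguments seen in the clang -v output:
// the TSan runtime dylib plus two rpath entries.
// `runtime_dir` is the directory that contains
// libclang_rt.tsan_osx_dynamic.dylib (toolchain-specific, so not hardcoded).
std::vector<std::string> TsanDarwinLinkArgs(const std::string& runtime_dir) {
  return {
      runtime_dir + "/libclang_rt.tsan_osx_dynamic.dylib",
      "-rpath", "@executable_path",
      "-rpath", runtime_dir,
  };
}
```

These arguments would be added alongside the existing object file, `-lSystem`, and SDK arguments of the link command.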

stellaraccident commented 1 year ago

I suspect that this issue is not coming from generated code. In that case, you may be able to get away with just building the iree runtime with sanitizers and not fiddling with the iree-llvm-sanitize=thread compiler flags.

(It would obviously be good if this all worked better on Apple platforms, so I'm just offering an option that might lead through the maze faster. It is still useful to figure out how to fully enable sanitizers.)

stellaraccident commented 1 year ago

Other things that can be done to bisect the area that is having the problem:

I suggest the last one because that error makes me think there is something going on with the task scheduler in threaded mode. Narrowing down which piece is crashing can help scope debugging activity.

bjacob commented 1 year ago

> I suspect that this issue is not coming from generated code. In that case, you may be able to get away with just building the iree runtime with sanitizers and not fiddling with the iree-llvm-sanitize=thread compiler flags.

Agreed that this issue does not look like it comes from the generated code... but TSan specifically (as opposed to other sanitizers) does not let us take advantage of that, because a TSan-enabled IREE runtime can only call TSan-enabled module code (TSan is an ABI break). Well, it will run, but it will crash.

> compile with vmvx (slow but unlikely to crash on generated code)

Ah, good idea, that does enable running a TSan-enabled IREE runtime without having to get TSan to work in module code. My above objection is specific to the llvm-cpu target backend.

> I suggest the last one because that error makes me think there is something going on with the task scheduler in threaded mode. Narrowing down which piece is crashing can help scope debugging activity.

+1

allieculp commented 1 year ago

@bjacob @wangkuiyi Looks like this went a bit stale, any further update?

bjacob commented 1 year ago

Deferring to @wangkuiyi .

wangkuiyi commented 1 year ago

@allieculp and @bjacob - I got GPT-2 fine-tuning to work a month ago, but via @antiagainst's Metal GPU backend. This issue occurs with the CPU backend, but not with the Metal GPU one.