iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Segmentation Fault when running GPT2 #12281

Open mariecwhite opened 1 year ago

mariecwhite commented 1 year ago

What happened?

I successfully imported and compiled GPT2 TF with IREE, but running it through the Python bindings produces a segmentation fault:

collected 4 items / 3 deselected / 1 selected

tank/test_models.py::SharkModuleTest::test_module_gpt2_tf_static_cpu Fatal Python error: Fatal Python error: Fatal Python error: Segmentation fault

Segmentation faultSegmentation fault

Thread 0x00007fb2554731c0 (most recent call first):
  File "/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/iree/runtime/function.py", line 154 in _invoke
  File "/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/iree/runtime/function.py", line 130 in __call__
  File "/home/mariewhite/github/SHARK/shark/iree_utils/compile_utils.py", line 383 in get_results
  File "/home/mariewhite/github/SHARK/shark/shark_runner.py", line 93 in run
  File "/home/mariewhite/github/SHARK/shark/shark_inference.py", line 138 in __call__
  File "/home/mariewhite/github/SHARK/tank/test_models.py", line 186 in create_and_check_module
  File "/home/mariewhite/github/SHARK/tank/test_models.py", line 353 in test_module
  File "/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/parameterized/parameterized.py", line 533 in standalone_func
  File "/usr/lib/python3.10/unittest/case.py", line 549 in _callTestMethod
  File "/usr/lib/python3.10/unittest/case.py", line 591 in run
  File "/usr/lib/python3.10/unittest/case.py", line 650 in __call__
  File "/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/_pytest/unittest.py", line 330 in runtest
  File "/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
  File "/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/_pytest/runner.py", line 260 in <lambda>
  File "/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/_pytest/runner.py", line 339 in from_call
  File "/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File "/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/_pytest/runner.py", line 220 in call_and_report
  File "/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/_pytest/runner.py", line 131 in runtestprotocol
  File "/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
  File "/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/pluggy/_callers.py"

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, yaml._yaml, charset_normalizer.md, google.protobuf.pyext._message, grpc._cython.cygrpc (total: 17)
Segmentation fault (core dumped)
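Because the crash takes down the whole pytest process, one way to narrow it down is to run the failing call in a child interpreter and inspect how it exited. The following is a minimal sketch using only the Python standard library on a POSIX system; `run_isolated` and `describe_exit` are hypothetical helper names, not part of SHARK or IREE, and the child snippet here merely simulates a crash by sending itself SIGSEGV.

```python
import signal
import subprocess
import sys


def run_isolated(code: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child interpreter with faulthandler enabled.

    Hypothetical helper: the parent survives even if the child segfaults.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def describe_exit(proc: subprocess.CompletedProcess) -> str:
    """Summarize how the child ended; on POSIX a negative code means a signal."""
    if proc.returncode < 0:
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exited with code {proc.returncode}"


# Stand-in for the real model invocation: the child sends itself SIGSEGV.
crash_snippet = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
result = run_isolated(crash_snippet)
print(describe_exit(result))
print(result.stderr)
```

In a real investigation the snippet would be replaced by the code that loads the vmfb and invokes `forward`, so that the faulthandler dump comes from a single process instead of several interleaved writers.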

This seems to happen only through the Python bindings: running iree-benchmark-module separately with the compiled vmfb and input parameters works:

iree-benchmark-module --driver=local-task --module=./mhlo_cpu.vmfb --function=forward --input="1x16xi32=[50257 50257 50257 50257 50257 50257 50257 50257 15496 11 428 318 262 4277 2420 13]" --input="1x16xi32=[0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1]"
2023-02-19T17:36:58-08:00
Running /usr/local/google/home/mariewhite/github/SHARK/iree.venv/lib/python3.10/site-packages/iree/runtime/scripts/iree_benchmark_module/../../iree-benchmark-module
Run on (72 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x36)
  L1 Instruction 32 KiB (x36)
  L2 Unified 1024 KiB (x36)
  L3 Unified 25344 KiB (x2)
Load Average: 1.87, 2.67, 2.98
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
BM_forward/process_time/real_time       25.5 ms          156 ms           23 items_per_second=39.2562/s
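The `--input` values in the command above take the form of shape, element type, and a bracketed list of values. As a rough sketch, a small helper can build such strings from Python lists so the same data can be passed to both the command-line tool and the Python test; `format_input_flag` is a hypothetical name, and the format is inferred only from the command shown in this issue.

```python
from typing import Sequence


def format_input_flag(shape: Sequence[int], dtype: str, values: Sequence[int]) -> str:
    """Build an --input argument like the ones in the command above.

    Hypothetical helper; the layout is inferred from this issue's command line.
    """
    expected = 1
    for dim in shape:
        expected *= dim
    if expected != len(values):
        raise ValueError(f"shape {list(shape)} needs {expected} values, got {len(values)}")
    dims = "x".join(str(dim) for dim in shape)
    body = " ".join(str(value) for value in values)
    return f"--input={dims}x{dtype}=[{body}]"


attention_mask = [0] * 8 + [1] * 8
mask_flag = format_input_flag([1, 16], "i32", attention_mask)
print(mask_flag)
```

The returned string has no surrounding shell quotes, so it is meant to be passed as a single element of an argument list (for example to `subprocess.run`) and not pasted into a shell unquoted.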

From initial debugging, we are calling the vm module here with device inputs:

[<IREE DeviceArray: shape=[1, 16], dtype=int32>, <IREE DeviceArray: shape=[1, 16], dtype=int32>]

And input values:

array([[50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 15496, 11,   428,   318,   262,  4277,  2420,    13]], dtype=int32)
array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)
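The two arrays above look like a left-padded token sequence and its attention mask: eight copies of 50257 followed by eight real tokens, with zeros in the mask over the padding. The sketch below reproduces that layout with plain lists. It assumes left padding and a pad id of 50257 purely from the values shown here; `left_pad` is a hypothetical helper, not the tokenizer code used by the test.

```python
from typing import List, Sequence, Tuple


def left_pad(token_ids: Sequence[int], length: int, pad_id: int) -> Tuple[List[int], List[int]]:
    """Left-pad token ids to a fixed length and build the matching attention mask.

    Hypothetical helper; the padding side and pad id are assumptions
    inferred from the input values in this issue.
    """
    if len(token_ids) > length:
        raise ValueError("sequence is longer than the target length")
    num_pad = length - len(token_ids)
    input_ids = [pad_id] * num_pad + list(token_ids)
    attention_mask = [0] * num_pad + [1] * len(token_ids)
    return input_ids, attention_mask


tokens = [15496, 11, 428, 318, 262, 4277, 2420, 13]
input_ids, attention_mask = left_pad(tokens, length=16, pad_id=50257)
print(input_ids)
print(attention_mask)
```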

Steps to reproduce your issue

git clone https://github.com/mariecwhite/SHARK.git
cd SHARK
git checkout gpt2

PYTHON=python3.10 VENV_DIR=iree.venv BENCHMARK=1 IMPORTER=1 USE_IREE=1 ./setup_venv.sh
source iree.venv/bin/activate

python3 generate_sharktank.py
pytest --benchmark tank/test_models.py -k "cpu and gpt2 and tf"

After running, reproducers are saved in directories named gpt2_tf_False_cpu* and the vmfb is saved to ./mhlo_cpu.vmfb

What component(s) does this issue relate to?

Runtime

Version information

Uses iree-compile and iree-runtime versions 20230219.435

allieculp commented 1 year ago

@jpienaar Can you help update the status for this one?

jpienaar commented 1 year ago

Could you confirm whether this still reproduces? It looks very similar to an issue with the caching allocator that has since been fixed.

allieculp commented 1 year ago

@mariecwhite Assigning to you to see if this is still reproducible.

allieculp commented 1 year ago

@mariecwhite Were you able to reproduce this?

mariecwhite commented 1 year ago

It's currently failing due to other reasons. Let's deprioritize this since GPT2 is no longer in our focus set.