nod-ai / SHARK-Studio

SHARK Studio -- Web UI for SHARK+IREE High Performance Machine Learning Distribution
Apache License 2.0

`send_to_host=True` causes segmentation fault when benchmarking using cuda in `tank/test_models.py` #1197

Open phoenix-meadowlark opened 1 year ago

phoenix-meadowlark commented 1 year ago

Running `pytest --benchmark --tf32 tank/test_models.py -k cuda` on the GPU GCP instance I'm using causes an ambiguous `Fatal Python error: Segmentation fault` whenever a test passes after another test has benchmarked its model (regardless of whether or not that other test passes):

Fatal Python error: Segmentation fault

Thread 0x00007f1a0ffc2640 (most recent call first):
  File "/usr/local/lib/python3.11/threading.py", line 324 in wait
  File "/usr/local/lib/python3.11/threading.py", line 622 in wait
  File ".../site-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/local/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/usr/local/lib/python3.11/threading.py", line 995 in _bootstrap

Current thread 0x00007f1a323f3740 (most recent call first):
  File "/home/meadowlark/SHARK/tank/test_models.py", line 362 in test_module
  File ".../site-packages/parameterized/parameterized.py", line 533 in standalone_func
  File "/usr/local/lib/python3.11/unittest/case.py", line 579 in _callTestMethod
  File "/usr/local/lib/python3.11/unittest/case.py", line 623 in run
  File "/usr/local/lib/python3.11/unittest/case.py", line 678 in __call__
  File ".../site-packages/_pytest/unittest.py", line 330 in runtest
  File ".../site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
  File ".../site-packages/pluggy/_callers.py", line 39 in _multicall
  File ".../site-packages/pluggy/_manager.py", line 80 in _hookexec
  File ".../site-packages/pluggy/_hooks.py", line 265 in __call__
  File ".../site-packages/_pytest/runner.py", line 260 in <lambda>
  File ".../site-packages/_pytest/runner.py", line 339 in from_call
  File ".../site-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File ".../site-packages/_pytest/runner.py", line 220 in call_and_report
  File ".../site-packages/_pytest/runner.py", line 131 in runtestprotocol
  File ".../site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
  File ".../site-packages/pluggy/_callers.py", line 39 in _multicall
  File ".../site-packages/pluggy/_manager.py", line 80 in _hookexec
  File ".../site-packages/pluggy/_hooks.py", line 265 in __call__
  File ".../site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File ".../site-packages/pluggy/_callers.py", line 39 in _multicall
  File ".../site-packages/pluggy/_manager.py", line 80 in _hookexec
  File ".../site-packages/pluggy/_hooks.py", line 265 in __call__
  File ".../site-packages/_pytest/main.py", line 324 in _main
  File ".../site-packages/_pytest/main.py", line 270 in wrap_session
  File ".../site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File ".../site-packages/pluggy/_callers.py", line 39 in _multicall
  File ".../site-packages/pluggy/_manager.py", line 80 in _hookexec
  File ".../site-packages/pluggy/_hooks.py", line 265 in __call__
  File ".../site-packages/_pytest/config/__init__.py", line 167 in main
  File ".../site-packages/_pytest/config/__init__.py", line 190 in console_main
  File "/home/meadowlark/SHARK/shark.venv/bin/pytest", line 8 in <module>

This happens after `create_and_check_module` finishes executing, but before anything can be printed in the `with` statement on the following line.

Adding `send_to_host=False` to this call in `create_and_check_module` fixes the segmentation fault without affecting the numerical validation. The issue appears to stem from this `DeviceArray.to_host()` call and might be related to garbage collection. Adding a fake `pytest.xfail` at the end of `create_and_check_module` delays the segfault for a few minutes, but it just ends up occurring in a different pytest frame down the line:

Fatal Python error: Segmentation fault

Thread 0x00007f64bb495640 (most recent call first):
  File "/usr/local/lib/python3.11/threading.py", line 324 in wait
  File "/usr/local/lib/python3.11/threading.py", line 622 in wait
  File "...site-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/local/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/usr/local/lib/python3.11/threading.py", line 995 in _bootstrap

Current thread 0x00007f64dd8c5740 (most recent call first):
  Garbage-collecting
  File "/usr/local/lib/python3.11/ast.py", line 50 in parse
  File "...site-packages/_pytest/_code/source.py", line 185 in getstatementrange_ast
  File "...site-packages/_pytest/_code/code.py", line 263 in getsource
  File "...site-packages/_pytest/_code/code.py", line 722 in _getentrysource
  File "...site-packages/_pytest/_code/code.py", line 814 in repr_traceback_entry
  File "...site-packages/_pytest/_code/code.py", line 871 in repr_traceback
  File "...site-packages/_pytest/_code/code.py", line 944 in repr_excinfo
  File "...site-packages/_pytest/_code/code.py", line 669 in getrepr
  File "...site-packages/_pytest/nodes.py", line 484 in _repr_failure_py
  File "...site-packages/_pytest/python.py", line 1823 in repr_failure
  File "...site-packages/_pytest/reports.py", line 349 in from_item_and_call
  File "...site-packages/_pytest/runner.py", line 366 in pytest_runtest_makereport
  File "...site-packages/pluggy/_callers.py", line 39 in _multicall
  File "...site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "...site-packages/pluggy/_hooks.py", line 265 in __call__
  File "...site-packages/_pytest/runner.py", line 222 in call_and_report
  File "...site-packages/_pytest/runner.py", line 131 in runtestprotocol
  File "...site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
  File "...site-packages/pluggy/_callers.py", line 39 in _multicall
  File "...site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "...site-packages/pluggy/_hooks.py", line 265 in __call__
  File "...site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "...site-packages/pluggy/_callers.py", line 39 in _multicall
  File "...site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "...site-packages/pluggy/_hooks.py", line 265 in __call__
  File "...site-packages/_pytest/main.py", line 324 in _main
  File "...site-packages/_pytest/main.py", line 270 in wrap_session
  File "...site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "...site-packages/pluggy/_callers.py", line 39 in _multicall
  File "...site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "...site-packages/pluggy/_hooks.py", line 265 in __call__
  File "...site-packages/_pytest/config/__init__.py", line 167 in main
  File "...site-packages/_pytest/config/__init__.py", line 190 in console_main
  File "/home/meadowlark/SHARK/shark.venv/bin/pytest", line 8 in <module>

Version info: Python 3.11.1, SHARK 0225434389bf3c60ace7efeeb3ca66e6da60d195, IREE 20230314.458

powderluv commented 1 year ago

@phoenix-meadowlark does this break the IREE SHARK tests? (If so, it is a showstopper for us.)

phoenix-meadowlark commented 1 year ago

If the CI is passing, then I would guess that this is something more machine-specific. I'll update everything to head and see if the issue persists.

phoenix-meadowlark commented 1 year ago

The same error still occurs, but I'm unsure how to debug it further. The GPU is an NVIDIA A100-SXM4-40GB.

mariecwhite commented 1 year ago

This breaks the nightly IREE SHARK benchmarks. Only one CUDA test passes, and after that first run we see the same segfault: https://storage.googleapis.com/shark-benchmark-artifacts/latest/summary.html

powderluv commented 1 year ago

@phoenix-meadowlark / @mariecwhite Have you investigated this any further from the IREE API side?

@monorimet let's treat this as a high priority to get back up. Could it be due to any recent changes to the test infra?

phoenix-meadowlark commented 1 year ago

The IREE team said they couldn't think of any changes to the API that would cause this. I tried to create a more consistent reproducer by peppering `gc.collect()` statements throughout the test, but the failure refused to be pinned down, and the exact details of what appears to cause or prevent it have changed slightly since last week. Notably, setting `send_to_host=False` no longer always prevents the segmentation fault. It seems like a good idea to extract a minimal failure from the test suite, but I don't have a great dev environment on the VM to do so easily.
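
For illustration, the probing looked roughly like the sketch below; the call shape and placement are approximate assumptions rather than the exact test code.

```python
import gc

# Hypothetical probe inside the test body: force a collection right after
# the suspect call so that, if the crash is triggered by garbage collection
# of device buffers, it surfaces at a deterministic point instead of minutes
# later in an unrelated pytest frame.
results = shark_module("forward", inputs, send_to_host=True)
gc.collect()
```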

monorimet commented 1 year ago

I am working on reproducing this on one of our A100 instances. It's likely an issue localized to the benchmarking flow -- there are loads of potential failure points as soon as the frontend benchmarks, try/excepts, and workarounds are introduced.

To be clear, can you share the method by which you are setting up SHARK? For CUDA benchmarks it should be something like `IMPORTER=1 BENCHMARK=1 ./setup_venv.sh`, though if this occurs on the CI jobs it's probably unrelated to setup. I do have persistent issues with CUDA benchmarks through SHARK pytests, which I will give a more detailed report on shortly (as I believe I've seen different failures).

monorimet commented 1 year ago

Additionally, I recommend using the `pytest-forked` package to handle garbage collection/memory in SHARK pytests. It's a hard thing to get right, and the only way I've gotten CUDA pytest benchmarks to work as a suite is via the `--forked` option (which simply runs each test in a subprocess, but handles the pytest integration quite seamlessly).
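
As a sketch (assuming `pytest-forked` is installed in the venv; the test name below is illustrative), isolation can be requested either for the whole run or per test:

```python
# Option 1: isolate every selected test in its own subprocess from the CLI:
#   pytest --benchmark --tf32 tank/test_models.py -k cuda --forked
#
# Option 2: mark individual CUDA benchmark tests so each one gets its own
# subprocess, preventing CUDA state and garbage-collection effects from
# leaking between tests.
import pytest


@pytest.mark.forked
def test_cuda_benchmark_isolated():
    # ... compile the module and run the CUDA benchmark here ...
    pass
```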

monorimet commented 1 year ago

Quick update: the issue almost certainly pertains to CUDA device management in PyTorch+CUDA under a subprocess. If I set up SHARK with `USE_IREE=1 ./setup_venv.sh` and run benchmarks (effectively disabling the frontend benchmarks), I no longer experience the segfaults or the uninitialized CUDA driver errors.

Here are the torch BERT model benchmarks run with tf32 on just SHARK/IREE:

(shark.venv) ean@nod-shared-a100-ubuntu:~/SHARK$ pytest --benchmark --tf32 -k "torch and static and bert and cuda" -x --forked
=============================== test session starts ===============================
platform linux -- Python 3.11.2, pytest-7.2.2, pluggy-1.0.0 -- /home/ean/SHARK/shark.venv/bin/python3
cachedir: .pytest_cache
rootdir: /home/ean/SHARK, configfile: pytest.ini
plugins: anyio-3.6.2, forked-1.6.0, xdist-3.2.1
collected 268 items / 262 deselected / 6 selected

tank/test_models.py::SharkModuleTest::test_module_albert_base_v2_torch_static_cuda XFAIL (issue with aten.tanh in torch-mlir) [ 16%]
tank/test_models.py::SharkModuleTest::test_module_bert_base_cased_torch_static_cuda PASSED [ 33%]
tank/test_models.py::SharkModuleTest::test_module_bert_base_uncased_fp16_torch_static_cuda XFAIL (Numerics Mismatch: Use -s flag to print stderr during pytests.) [ 50%]
tank/test_models.py::SharkModuleTest::test_module_bert_base_uncased_torch_static_cuda PASSED
tank/test_models.py::SharkModuleTest::test_module_bert_large_uncased_torch_static_cuda PASSED [ 83%]
tank/test_models.py::SharkModuleTest::test_module_google_mobilebert_uncased_torch_static_cuda PASSED [100%]

=============== 4 passed, 262 deselected, 2 xfailed in 720.11s (0:12:00) ===============

aaron-schneider commented 1 year ago

Hi -- double-checking on this P0 issue. Anything more to say? OK to close or lower the priority? Thanks!

powderluv commented 1 year ago

I think this can be lowered in priority now, since we know the issue and the fix is to use `pytest-forked`.

allieculp commented 1 year ago

@jpienaar Can you add workstream?

allieculp commented 1 year ago

@monorimet Any update on this one?