phoenix-meadowlark opened this issue 1 year ago
@phoenix-meadowlark does this break the IREE SHARK tests? (If so, it is a show-stopper for us.)
If the CI is passing then I would guess that this is something more machine specific. I'll update everything to head and see if the issue persists.
The same error still occurs, but I'm unsure how to debug it further. The GPU is an NVIDIA A100-SXM4-40GB.
This breaks the nightly IREE SHARK benchmarks. Only one CUDA test passes, and after that first run we see the same segfault: https://storage.googleapis.com/shark-benchmark-artifacts/latest/summary.html
@phoenix-meadowlark / @mariecwhite Have you done any further investigation on this from the IREE API side?
@monorimet Let's treat this as a high priority to get back up. Could it be due to any recent changes to the test infra?
The IREE team couldn't think of any API changes that would cause this. I tried to create a more consistent reproducer by peppering gc.collect() statements throughout the test, but the failure refused to be pinned down, and the exact details of what appears to cause or prevent it have shifted slightly since last week. Notably, setting send_to_host=False no longer always prevents the segmentation fault. It seemed like a good idea to extract a minimal failure from the test suite, but I don't have a great dev environment on the VM to do so easily.
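For reference, the repro attempt had roughly the shape sketched below. The helper and its arguments are placeholders, not the actual SHARK test-suite code; the sketch only shows where the gc.collect() calls and the send_to_host toggle sat relative to the module invocation.

```python
# Rough shape of the repro experiment described above. `module`, `inputs`, and
# the helper itself are placeholders, not real SHARK test helpers.
import gc

def run_once(module, inputs, send_to_host=True):
    result = module.forward(inputs)   # result stays on the device
    if send_to_host:
        result = result.to_host()     # the host transfer that send_to_host=False skips
    gc.collect()                      # peppered in to try to force the crash deterministically
    return result
```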
I am working on reproducing this on one of our A100 instances. It's likely an issue localized to the benchmarking flow -- there are plenty of potential failure points once the frontends, try/excepts, and workarounds come into play.
To be clear, can you share how you are setting up SHARK? For CUDA benchmarks it should be something like IMPORTER=1 BENCHMARK=1 ./setup_venv.sh, though if this occurs on the CI jobs it's probably unrelated to setup. I do have persistent issues with CUDA benchmarks through SHARK pytests, on which I will give a more detailed report shortly (as I believe I've seen different failures).
Additionally, I recommend using the pytest-forked package to handle garbage collection/memory in SHARK pytests. It's a hard thing to get right, and the only way I've gotten CUDA pytest benchmarks to work as a suite is via the --forked option (which simply runs each test in a subprocess, but handles the pytest integration quite seamlessly).
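For anyone reproducing this, here is a minimal sketch of the forked approach, assuming pytest-forked is installed: instead of passing --forked globally, individual CUDA benchmark tests can also be isolated with the marker the plugin provides.

```python
# Sketch: isolate a single CUDA benchmark test in its own subprocess using the
# marker from pytest-forked (equivalent to running it under --forked).
# The test name and body are placeholders.
import pytest

@pytest.mark.forked
def test_bert_base_cased_torch_static_cuda():
    ...  # compile and benchmark here; the CUDA context is torn down with the subprocess
```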
Quick update: the issue almost certainly pertains to CUDA device management in pytorch+cuda under a subprocess.
If I set up SHARK with USE_IREE=1 ./setup_venv.sh and run benchmarks (effectively disabling the frontend benchmarks), I no longer experience the segfaults or the uninitialized CUDA driver errors.
Here are the torch BERT model benchmarks run with tf32 on just SHARK/IREE:
(shark.venv) ean@nod-shared-a100-ubuntu:~/SHARK$ pytest --benchmark --tf32 -k "torch and static and bert and cuda" -x --forked
======================================================================================================================================================= test session starts =======================================================================================================================================================
platform linux -- Python 3.11.2, pytest-7.2.2, pluggy-1.0.0 -- /home/ean/SHARK/shark.venv/bin/python3
cachedir: .pytest_cache
rootdir: /home/ean/SHARK, configfile: pytest.ini
plugins: anyio-3.6.2, forked-1.6.0, xdist-3.2.1
collected 268 items / 262 deselected / 6 selected
tank/test_models.py::SharkModuleTest::test_module_albert_base_v2_torch_static_cuda XFAIL (issue with aten.tanh in torch-mlir) [ 16%]
tank/test_models.py::SharkModuleTest::test_module_bert_base_cased_torch_static_cuda PASSED [ 33%]
tank/test_models.py::SharkModuleTest::test_module_bert_base_uncased_fp16_torch_static_cuda XFAIL (Numerics Mismatch: Use -s flag to print stderr during pytests.) [ 50%]
tank/test_models.py::SharkModuleTest::test_module_bert_base_uncased_torch_static_cuda PASSED
tank/test_models.py::SharkModuleTest::test_module_bert_large_uncased_torch_static_cuda PASSED [ 83%]
tank/test_models.py::SharkModuleTest::test_module_google_mobilebert_uncased_torch_static_cuda PASSED [100%]
==================================================================================================================================== 4 passed, 262 deselected, 2 xfailed in 720.11s (0:12:00) =====================================================================================================================================
Hi - double-checking on this P0 issue. Anything more to report? OK to close or lower the priority? Thanks!
I think this can now be lowered in priority, since we know the issue: we need to use pytest-forked.
@jpienaar Can you add workstream?
@monorimet Any update on this one?
Running pytest --benchmark --tf32 tank/test_models.py -k cuda on the GPU GCP instance I'm using causes an ambiguous Fatal Python error: Segmentation fault when a test passes after another test benchmarks its model (regardless of whether or not that test passes). This happens after create_and_check_module finishes executing, but before anything can be printed in the with statement on the following line.

Adding send_to_host=False to this call in create_and_check_module fixes the segmentation fault without affecting the numerical validation. The issue appears to stem from this DeviceArray.to_host() call and might be related to garbage collection. Adding a fake pytest.xfail at the end of create_and_check_module delays the segfault for a few minutes, but it just ends up occurring in a different pytest frame down the line.

Version Info:
Python: 3.11.1
SHARK: 0225434389bf3c60ace7efeeb3ca66e6da60d195
IREE: 20230314.458
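For illustration, a minimal sketch of the send_to_host workaround described above, assuming the device result exposes the DeviceArray-style to_host() mentioned here and can still be compared numerically without an explicit host transfer, as the thread implies (the helper itself is hypothetical, not the actual create_and_check_module code):

```python
# Hedged sketch of the workaround: either skip the to_host() transfer entirely,
# or copy eagerly so the host data no longer depends on the DeviceArray's
# lifetime once garbage collection kicks in.
import numpy as np

def validate(device_result, golden, send_to_host=True, rtol=1e-4, atol=1e-4):
    if send_to_host:
        host = np.array(device_result.to_host())  # eager copy detaches from the device buffer
    else:
        host = np.asarray(device_result)          # rely on NumPy coercion of the device result
    return np.allclose(host, golden, rtol=rtol, atol=atol)
```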