vega / vl-convert

Utilities for converting Vega-Lite specs from the command line and Python
BSD 3-Clause "New" or "Revised" License

vl-convert-python crashes with ARM64 PointerAuthentication error #67

Closed sacundim closed 1 year ago

sacundim commented 1 year ago

Environment:

I am porting a Python app of mine from Vega-Altair 4.2.x to 5.0.1, and from altair_saver to vl-convert-python, and after some number of SVG exports in the same process (I haven't measured exactly how many), I get errors like this:

#
# Fatal error in , line 0
# Check failed: Deoptimizer::IsValidReturnAddress(PointerAuthentication::StripPAC(pc), isolate_).
#
#
#
#FailureMessage Object: 0xffff98f6e880
==== C stack trace ===============================

    /usr/local/lib/python3.9/site-packages/vl_convert/vl_convert.cpython-39-aarch64-linux-gnu.so(+0x11f9624) [0xffff8876d624]
    /usr/local/lib/python3.9/site-packages/vl_convert/vl_convert.cpython-39-aarch64-linux-gnu.so(+0x910b24) [0xffff87e84b24]
    /usr/local/lib/python3.9/site-packages/vl_convert/vl_convert.cpython-39-aarch64-linux-gnu.so(+0x11f2820) [0xffff88766820]
    /usr/local/lib/python3.9/site-packages/vl_convert/vl_convert.cpython-39-aarch64-linux-gnu.so(+0xe43d5c) [0xffff883b7d5c]
    /usr/local/lib/python3.9/site-packages/vl_convert/vl_convert.cpython-39-aarch64-linux-gnu.so(+0x97ac54) [0xffff87eeec54]
    /usr/local/lib/python3.9/site-packages/vl_convert/vl_convert.cpython-39-aarch64-linux-gnu.so(+0x977b2c) [0xffff87eebb2c]
    /usr/local/lib/python3.9/site-packages/vl_convert/vl_convert.cpython-39-aarch64-linux-gnu.so(+0x1064234) [0xffff885d8234]

I don't have much of a clue how to troubleshoot this; any suggestions welcome. Searching the web a bit, this PointerAuthentication/StripPAC machinery is related to ARM64 security features that detect whether a pointer has been tampered with, but I have no clue whether this points to a stack overflow or to, e.g., the JavaScript engine doing some business with JIT code generation or garbage collection.

Two things that I have managed to establish are:

  1. Running the application multiple times in a row seems to repeatably fail at the same point in its execution (well, at least so far in about 3-4 tries).
  2. However, the input JSON spec that the app attempts to export at that point is not the trigger for the failure. I can tell this because:
    • The app is a single-threaded loop through a list of independent tasks that generate a Vega-Lite JSON spec and export it as SVG;
    • I can trivially change the order in which these tasks are executed;
    • Doing so causes the subtask where I originally observed the failure to succeed, and one that I originally saw succeed to fail.

I'm still investigating whether, e.g., there's a fixed number of SVG export invocations that triggers the failure, or whether one subtask succeeds but leaves the process in a bad state that causes subsequent ones to fail. I don't know of a more direct way to figure out the cause, so I thought I'd file this ticket right away.
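One direct way to probe the fixed-count hypothesis is to export a single known-good spec in a loop and see whether the process dies after a consistent number of exports. A minimal sketch, assuming vl-convert-python's `vegalite_to_svg` as the export call (the spec path is hypothetical):

```python
# Probe: export one known-good spec repeatedly in a single process.
# If the crash lands after a consistent count, the trigger is cumulative
# rather than tied to any particular input spec.
import vl_convert as vlc

spec = open("chart.vl.json").read()  # hypothetical: any spec that exports fine once

for n in range(1, 5001):
    vlc.vegalite_to_svg(spec)
    if n % 100 == 0:
        print(f"{n} exports OK")
```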

sacundim commented 1 year ago

Update: I just ran the same codebase on an x86 Mac and it worked fine there.

sacundim commented 1 year ago

Searching and reading a bit more, the error message ("Check failed: Deoptimizer::IsValidReturnAddress(PointerAuthentication::StripPAC(pc), isolate_)") seems to come from the V8 engine. Perhaps some deep nuance in the way vl-convert interfaces with V8?

sacundim commented 1 year ago

I wrote a very small standalone test program to try to reproduce this issue: it just bundles all the Vega-Lite specs generated by the application where I see the problem and passes them straight into vl-convert-python.
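The shape of such a harness can be very small. A sketch under assumptions (this is not the actual test program; the `specs/` directory is a hypothetical stand-in for the bundled specs, and the export call is vl-convert-python's `vegalite_to_svg`):

```python
# Batch harness sketch: pass every bundled Vega-Lite spec straight
# through vl-convert-python's SVG export in a single process.
from pathlib import Path

import vl_convert as vlc

SPEC_DIR = Path("specs")  # hypothetical location of the bundled *.vl.json specs

for spec_path in sorted(SPEC_DIR.glob("*.vl.json")):
    svg = vlc.vegalite_to_svg(spec_path.read_text())
    print(f"{spec_path.name}: {len(svg)} characters of SVG")
```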

But no dice: the test program runs just fine (about 45 seconds on my Mac). The issue must lie either in the Rust interface to the V8 engine, or in how that Rust library interacts with the Python interpreter it gets embedded in.

sacundim commented 1 year ago

Success! I changed the test program a bit to run multiple rounds of the whole set of files, and now it does reproduce the issue.
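In terms of the sketch above, that change amounts to an outer loop over the whole batch, so the cumulative export count keeps growing within one process (the round count here is a hypothetical placeholder):

```python
# Multi-round variant: loop the whole batch several times in one process;
# the crash then shows up after enough cumulative SVG exports.
from pathlib import Path

import vl_convert as vlc

SPEC_DIR = Path("specs")  # hypothetical, as in the sketch above
ROUNDS = 6                # hypothetical; 6- and 12-round runs are mentioned below

for round_no in range(1, ROUNDS + 1):
    for spec_path in sorted(SPEC_DIR.glob("*.vl.json")):
        vlc.vegalite_to_svg(spec_path.read_text())
    print(f"round {round_no}/{ROUNDS} complete")
```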

jonmmease commented 1 year ago

Thanks for the report, investigation, and repro @sacundim. This is really helpful.

I don't have any ideas off the top of my head, but one thing that comes to mind is that we've previously run into an issue with the arm64 Linux wheels that cropped up only when cross-compiling from x86 Linux. That issue went away when I compiled locally inside an arm64 Docker container (running on an M1 Mac). In the end we worked around it by downgrading deno/v8 and continuing to cross-compile with GitHub Actions.

So the first thing I want to try is to see whether wheels compiled inside an arm64 Docker container (rather than cross-compiled) behave differently.

jonmmease commented 1 year ago

Ok, I was able to reproduce the crash (thanks again for the easy-to-use repro @sacundim).

I rebuilt the vl_convert_python-0.10.3-cp39-cp39-manylinux_2_31_aarch64.whl wheel locally using a Rust base image on my M1 Mac, then reran the tests with it, and it now completes successfully for me.

Could you give this wheel a try and see if it fixes the issue for you as well?

https://filedn.com/lo5VE4SmtWKXIvNsinHVy7F/packages/vl_convert_python-0.10.3-cp39-cp39-manylinux_2_31_aarch64.whl

The takeaway is that it looks like we can't trust cross-compilation for building aarch64 wheels. I'll start looking into options for building on aarch64 directly on GitHub Actions.

sacundim commented 1 year ago

I'm trying the hand-compiled one now. A 6-round run of the test case just succeeded; now I'm trying 12 rounds to really stress it. I'll try the original program later.

I wonder: if you got the compilation process to dump out all the exact compiler options it uses, and compared the cross-compilation and native environments, would you see that they differ? Obviously I could be wrong, but my inclination is to suspect that cross-compilation should produce the same object code given the same source files and compiler options (modulo nondeterministic factors like random seeds affecting hash-table ordering and such).

sacundim commented 1 year ago

The 12-round run with the recompiled package succeeds, and two runs of the original app do so as well.

jonmmease commented 1 year ago

Ok, sounds good! I've had some luck building working packages on GitHub Actions under QEMU emulation. I'll ping you again when I have another wheel to try out. Thanks!

jonmmease commented 1 year ago

Here's a new wheel that I built on GitHub Actions using QEMU (in https://github.com/vega/vl-convert/pull/68). It's also built with manylinux2014 support, so it should be just as compatible as the current cross-compiled wheels.

vl_convert_python-0.10.3-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

This is running the repro successfully for me, so hopefully this takes care of the issue!

jonmmease commented 1 year ago

Should be fixed in the just-released 0.11.0. Thanks again for the repro; this wouldn't have gotten fixed otherwise!

sacundim commented 1 year ago

I tried 0.11.0 and just confirmed that my app runs successfully. Thanks for the quick turnaround!