scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License
847 stars 89 forks source link

avro reader crash #3188

Open martindurant opened 4 months ago

martindurant commented 4 months ago

Version of Awkward Array

ak 2.6.5 ak-cpp 34

Description and code to reproduce

I tried to read the avro file at this location https://github.com/Teradata/kylo/blob/master/samples/sample-data/avro/userdata5.avro (downloaded to local) using ak.from_avro_file and got segmentation fault. The file loads fine with fastavro.

macOS 13.6.7, conda install of python 3.10

jpivarski commented 4 months ago

Although there's not much demand (that I know of) for the Avro reader, it's a canary in the coalmine for AwkwardForth and therefore Uproot. If the Avro is not handled correctly, it can fail with a Forth runtime error, but it should not be segfaulting under any circumstances, so this is serious.

I'll try to reproduce it sometime when I have access to my main computer again and see if I can narrow in on the point of failure.

martindurant commented 4 months ago

In addition, from_avro_file fails if passed a file-like rather than a str filename. On this line, the attribute should be .outcontents (as in the branch above, L50) not .outarr.

jpivarski commented 3 months ago

Has PR #3167 solved this issue?

martindurant commented 3 months ago

I didn't see that PR. I can test.

martindurant commented 3 months ago

Updating awkward to 55a9a432 (current main) still crashes. Is there a chance of any cached state from a previous run? awkward-cpp == 36

jpivarski commented 3 months ago

Is there a chance of any cached state from a previous run?

I'm not sure. GitHub Actions recompiles awkward-cpp in each test, so we shouldn't cache from previous runs. (@agoose77, did the test ever have a cache like that?)

Meanwhile, is this an awkward-cpp-only reproducer?

python -c 'from awkward_cpp.lib._ext import ForthMachine32 ; vm = ForthMachine32("1 2 3 4 5"); vm.run(); print(vm.stack)'

I'm thinking that we should bisect awkward-cpp itself, removing parts of it until there isn't a segfault and then see which part makes the difference. Awkward relies on many parts of awkward-cpp, but awkward-cpp itself is mostly a bag of functions and classes; it should be easy to remove large parts of it and still be able to rely on the above test.

martindurant commented 3 months ago

That line runs fine. The avro reader can read other files (I have a workign test in akimbo for avro), in case that wasn't clear - it only segfaults with the example at the top, a file which reads fine with fastavro.

martindurant commented 3 months ago

but I just did

In [1]: from awkward_cpp.lib._ext import ForthMachine32

In [2]: machine = ForthMachine32("1 2 3 4 5")

In [3]: machine = ForthMachine32("1 2 3 4 5")

In [4]: machine.<tab>
segmentation fault

(only happens if you make the machine twice)

jpivarski commented 3 months ago

The branch jpivarski/test-minimal-awkward-cpp (in do-not-merge PR #3196) removes everything from awkward-cpp except for AwkwardForth. The first commit (0c860e82641f4b1a522cb85d0d1d321765c30215) removes files and the second commit (268076168306d1928b8ec8029cb80f933ccc5d49) reverts to the modern (non-classic) AppleClang loader.

This should be a sufficient test:

nox -s prepare
pip install -vv ./awkward-cpp/
python -c 'from awkward_cpp.lib._ext import ForthMachine32 ; vm = ForthMachine32("1 2 3 4 5"); vm.run(); print(vm.stack); vm = ForthMachine32("1 2 3 4 5"); vm.run(); print(vm.stack)'

I think the GitHub Actions (which will never pass!) provide artifacts.

This works on my M2 Mac (same for awkward-cpp without removing anything). I have Apple clang version 15.0.0 (clang-1500.3.9.4), so in principle, it shouldn't work. If I understand things, the above doesn't work for you: it segfaults on the second ForthMachine.

martindurant commented 3 months ago

I got segfault only if I made two machines without running the first one

jpivarski commented 3 months ago

Okay, so the test that shows the segfault for you is

python -c 'from awkward_cpp.lib._ext import ForthMachine32 ; vm = ForthMachine32("1 2 3 4 5"); vm = ForthMachine32("1 2 3 4 5"); vm.run(); print(vm.stack)'

Interestingly, it still doesn't segfault for me. But anyway, we have a much smaller surface area to find the bug. @ianna, are you able to reproduce it? (@henryiii said that it's independent of Intel vs Apple Silicon.)

martindurant commented 3 months ago

No, that doesn't fail either; I got the crash only with <tab> as in my example code (in ipython - not sure exactly what this calls, but it's not dir(), which is fine). I can't find a simple line like yours (yet) which fails, except ak.from_avro_file :| .

ariostas commented 3 months ago

I was curious about this, so I decided to look into it a bit. I'm not sure if this was already known, but I'll document what I found here.

I was only able to reproduce the segfault with ak.from_avro_file, but not with the minimal version Jim made. The segfault is caused due to passing a negative number of elements to ForthOutputBufferOf<unsigned char>::write_uint8. Changing this line

https://github.com/scikit-hep/awkward/blob/c3c4f6f8eca5ac76fb54c90874c5a499894aa1d5/awkward-cpp/src/libawkward/forth/ForthMachine.cpp#L2984

to something like

num_items = stack_pop();
if (num_items < 0) num_items = 0;

makes it so that at least it crashes gracefully. As for why negative numbers end up being popped from the stack and interpreted as a length, I'm not sure. I'm not familiar with this code, so I couldn't figure it out. Also, I'm not sure if this is related to the segfault seen from the minimal version.

If someone wants to give me a quick rundown of how the code works, I can look into this further.

jpivarski commented 3 months ago

@ariostas, you found a language-level bug in AwkwardForth (as opposed to a bug in the Avro-reader, which wouldn't worry me as much).

The reason a value popped from the stack is being interpreted as a number of items to read is because this is the implementation of vectorized read words like #i-> (to read $N$ integers from the input source, where $N$ is a number pulled from the Forth stack) and #d-> (to do the same for $N$ doubles). Not checking to see if the value from the stack is negative is a bug: AwkwardForth is not supposed to have any undefined behavior, only valid states and error states.

Most uses of num_items are in for loops like

https://github.com/scikit-hep/awkward/blob/c3c4f6f8eca5ac76fb54c90874c5a499894aa1d5/awkward-cpp/src/libawkward/forth/ForthMachine.cpp#L2999

which shouldn't pose a problem: if the num_items is negative, the loop is skipped just as if num_items were zero. One exception is the unusual length integer words, like #2bit->, #3bit->, #4bit->, etc. They set

https://github.com/scikit-hep/awkward/blob/c3c4f6f8eca5ac76fb54c90874c5a499894aa1d5/awkward-cpp/src/libawkward/forth/ForthMachine.cpp#L3073

and count-down items_remaining until it reaches zero:

https://github.com/scikit-hep/awkward/blob/c3c4f6f8eca5ac76fb54c90874c5a499894aa1d5/awkward-cpp/src/libawkward/forth/ForthMachine.cpp#L3078

and

https://github.com/scikit-hep/awkward/blob/c3c4f6f8eca5ac76fb54c90874c5a499894aa1d5/awkward-cpp/src/libawkward/forth/ForthMachine.cpp#L3090

These items_remaining != 0 should be changed to items_remaining > 0 to avoid an infinite loop.

However, the Avro reader couldn't be hitting this problem because Avro doesn't have any of these unusual (but fixed) length integers. (Avro has variable length integers, which are implemented with count < num_items.)

The one case that might cause a segfault is in the write-to-stack macros:

https://github.com/scikit-hep/awkward/blob/c3c4f6f8eca5ac76fb54c90874c5a499894aa1d5/awkward-cpp/src/libawkward/forth/ForthMachine.cpp#L3255-L3257

https://github.com/scikit-hep/awkward/blob/c3c4f6f8eca5ac76fb54c90874c5a499894aa1d5/awkward-cpp/src/libawkward/forth/ForthMachine.cpp#L3273-L3275

https://github.com/scikit-hep/awkward/blob/c3c4f6f8eca5ac76fb54c90874c5a499894aa1d5/awkward-cpp/src/libawkward/forth/ForthMachine.cpp#L3294-L3296

A negative value could break that, and it might happen in the Avro reader when a contiguous array of floating-point or boolean values are read. (The other cases are variable-length integers.)

Your fix of ensuring that num_items is nonnegative would fix that, but so would

https://github.com/scikit-hep/awkward/blob/c3c4f6f8eca5ac76fb54c90874c5a499894aa1d5/awkward-cpp/src/libawkward/forth/ForthMachine.cpp#L3253

(I'm only saying that because this is highly streamlined code, and doing the check later would slow down fewer cases. I doubt the jump-compare-zero and assignment is much of a speedbump, though.)

So this is great! With this fix, I'd like to know if @martindurant still sees any segfault. In https://github.com/scikit-hep/awkward/issues/3188#issuecomment-2263548792, he found one with tab-complete, which can't be related to this, but I wonder if that's still reproducible. (It's such an odd case, since dir doesn't do it, but tab-complete does. Maybe there was residual left in memory...?)

ariostas commented 3 months ago

Thank you @jpivarski, that sounds great.

Your fix of ensuring that num_items is nonnegative would fix that, but so would

  • changing items_remaining != 0 to items_remaining > 0 and
  • putting the if (num_items < 0) num_items = 0; inside the else { of

You also need the check inside this else if, which is where it segfaults with this file (with the WRITE_DIRECTLY macro).

https://github.com/scikit-hep/awkward/blob/c3c4f6f8eca5ac76fb54c90874c5a499894aa1d5/awkward-cpp/src/libawkward/forth/ForthMachine.cpp#L3211

As for speed, since it's a one-liner I would expect the compiler to figure out the best option, but maybe it can be faster to force it to be branchless with

num_items *= (num_items >= 0)
jpivarski commented 3 months ago

Do you want to write the PR or me?

ariostas commented 3 months ago

I can do it