Open martindurant opened 4 months ago
Although there's not much demand (that I know of) for the Avro reader, it's a canary in the coalmine for AwkwardForth and therefore Uproot. If the Avro is not handled correctly, it can fail with a Forth runtime error, but it should not be segfaulting under any circumstances, so this is serious.
I'll try to reproduce it sometime when I have access to my main computer again and see if I can narrow in on the point of failure.
In addition, from_avro_file fails if passed a file-like rather than a str filename. On this line, the attribute should be .outcontents
(as in the branch above, L50) not .outarr
.
Has PR #3167 solved this issue?
I didn't see that PR. I can test.
Updating awkward to 55a9a432 (current main) still crashes. Is there a chance of any cached state from a previous run? awkward-cpp == 36
Is there a chance of any cached state from a previous run?
I'm not sure. GitHub Actions recompiles awkward-cpp in each test, so we shouldn't cache from previous runs. (@agoose77, did the test ever have a cache like that?)
Meanwhile, is this an awkward-cpp-only reproducer?
python -c 'from awkward_cpp.lib._ext import ForthMachine32 ; vm = ForthMachine32("1 2 3 4 5"); vm.run(); print(vm.stack)'
I'm thinking that we should bisect awkward-cpp itself, removing parts of it until there isn't a segfault and then see which part makes the difference. Awkward relies on many parts of awkward-cpp, but awkward-cpp itself is mostly a bag of functions and classes; it should be easy to remove large parts of it and still be able to rely on the above test.
That line runs fine. The avro reader can read other files (I have a workign test in akimbo for avro), in case that wasn't clear - it only segfaults with the example at the top, a file which reads fine with fastavro.
but I just did
In [1]: from awkward_cpp.lib._ext import ForthMachine32
In [2]: machine = ForthMachine32("1 2 3 4 5")
In [3]: machine = ForthMachine32("1 2 3 4 5")
In [4]: machine.<tab>
segmentation fault
(only happens if you make the machine twice)
The branch jpivarski/test-minimal-awkward-cpp
(in do-not-merge PR #3196) removes everything from awkward-cpp except for AwkwardForth. The first commit (0c860e82641f4b1a522cb85d0d1d321765c30215) removes files and the second commit (268076168306d1928b8ec8029cb80f933ccc5d49) reverts to the modern (non-classic) AppleClang loader.
This should be a sufficient test:
nox -s prepare
pip install -vv ./awkward-cpp/
python -c 'from awkward_cpp.lib._ext import ForthMachine32 ; vm = ForthMachine32("1 2 3 4 5"); vm.run(); print(vm.stack); vm = ForthMachine32("1 2 3 4 5"); vm.run(); print(vm.stack)'
I think the GitHub Actions (which will never pass!) provide artifacts.
This works on my M2 Mac (same for awkward-cpp without removing anything). I have Apple clang version 15.0.0 (clang-1500.3.9.4), so in principle, it shouldn't work. If I understand things, the above doesn't work for you: it segfaults on the second ForthMachine.
I got segfault only if I made two machines without running the first one
Okay, so the test that shows the segfault for you is
python -c 'from awkward_cpp.lib._ext import ForthMachine32 ; vm = ForthMachine32("1 2 3 4 5"); vm = ForthMachine32("1 2 3 4 5"); vm.run(); print(vm.stack)'
Interestingly, it still doesn't segfault for me. But anyway, we have a much smaller surface area to find the bug. @ianna, are you able to reproduce it? (@henryiii said that it's independent of Intel vs Apple Silicon.)
No, that doesn't fail either; I got the crash only with <tab>
as in my example code (in ipython - not sure exactly what this calls, but it's not dir(), which is fine). I can't find a simple line like yours (yet) which fails, except ak.from_avro_file :| .
I was curious about this, so I decided to look into it a bit. I'm not sure if this was already known, but I'll document what I found here.
I was only able to reproduce the segfault with ak.from_avro_file
, but not with the minimal version Jim made. The segfault is caused due to passing a negative number of elements to ForthOutputBufferOf<unsigned char>::write_uint8
. Changing this line
to something like
num_items = stack_pop();
if (num_items < 0) num_items = 0;
makes it so that at least it crashes gracefully. As for why negative numbers end up being popped from the stack and interpreted as a length, I'm not sure. I'm not familiar with this code, so I couldn't figure it out. Also, I'm not sure if this is related to the segfault seen from the minimal version.
If someone wants to give me a quick rundown of how the code works, I can look into this further.
@ariostas, you found a language-level bug in AwkwardForth (as opposed to a bug in the Avro-reader, which wouldn't worry me as much).
The reason a value popped from the stack is being interpreted as a number of items to read is because this is the implementation of vectorized read words like #i->
(to read $N$ integers from the input source, where $N$ is a number pulled from the Forth stack) and #d->
(to do the same for $N$ doubles). Not checking to see if the value from the stack is negative is a bug: AwkwardForth is not supposed to have any undefined behavior, only valid states and error states.
Most uses of num_items
are in for loops like
which shouldn't pose a problem: if the num_items
is negative, the loop is skipped just as if num_items
were zero. One exception is the unusual length integer words, like #2bit->
, #3bit->
, #4bit->
, etc. They set
and count-down items_remaining
until it reaches zero:
and
These items_remaining != 0
should be changed to items_remaining > 0
to avoid an infinite loop.
However, the Avro reader couldn't be hitting this problem because Avro doesn't have any of these unusual (but fixed) length integers. (Avro has variable length integers, which are implemented with count < num_items
.)
The one case that might cause a segfault is in the write-to-stack macros:
A negative value could break that, and it might happen in the Avro reader when a contiguous array of floating-point or boolean values are read. (The other cases are variable-length integers.)
Your fix of ensuring that num_items
is nonnegative would fix that, but so would
items_remaining != 0
to items_remaining > 0
andif (num_items < 0) num_items = 0;
inside the else {
of(I'm only saying that because this is highly streamlined code, and doing the check later would slow down fewer cases. I doubt the jump-compare-zero and assignment is much of a speedbump, though.)
So this is great! With this fix, I'd like to know if @martindurant still sees any segfault. In https://github.com/scikit-hep/awkward/issues/3188#issuecomment-2263548792, he found one with tab-complete, which can't be related to this, but I wonder if that's still reproducible. (It's such an odd case, since dir
doesn't do it, but tab-complete does. Maybe there was residual left in memory...?)
Thank you @jpivarski, that sounds great.
Your fix of ensuring that num_items is nonnegative would fix that, but so would
- changing
items_remaining != 0
toitems_remaining > 0
and- putting the
if (num_items < 0) num_items = 0;
inside theelse {
of
You also need the check inside this else if
, which is where it segfaults with this file (with the WRITE_DIRECTLY
macro).
As for speed, since it's a one-liner I would expect the compiler to figure out the best option, but maybe it can be faster to force it to be branchless with
num_items *= (num_items >= 0)
Do you want to write the PR or me?
I can do it
Version of Awkward Array
ak 2.6.5 ak-cpp 34
Description and code to reproduce
I tried to read the avro file at this location https://github.com/Teradata/kylo/blob/master/samples/sample-data/avro/userdata5.avro (downloaded to local) using
ak.from_avro_file
and gotsegmentation fault
. The file loads fine with fastavro.macOS 13.6.7, conda install of python 3.10