Closed ExpandingMan closed 6 years ago
I think I don't fully understand the issue here. The various array types in Arrow.jl have a reference to the underlying data vector, right? Doesn't that keep the memory alive?
It would, but your problem is that you're assuming that those ArrowVector
s have to always exist.
When you do a getindex
of a contiguous subset of an ArrowVector
it does unsafe_wrap
. That prevents free
from being called on those pointers, but it does not protect the underlying data buffer that they originated from. Therefore, you can still have the underlying buffer GC'd away and those unsafe_wrap
ed arrays wind up with their data swept out from under them and you segfault.
So, here's what happens: I have a bunch of ArrowVector
s, I take some slices of them which become unsafe_wrap
ed Vector
s, and I pass those Vector
s off to some other part of the program. Meanwhile, I'm not using the ArrowVector
s anymore so they get GC'd. Then, you try to access data from one of those Vector
s you created and you segfault.
Feather.materialize
was the perfect scenario for making this happen: I'd create a Feather.Source
containing all of the ArrowVector
s and convert them into Vector
s (that's what materialize did). But this conversion was a lie, it was just a call to unsafe_wrap
. So, I return all of these Vector
s, but as soon as I exit the materialize
function itself the Feather.Source
is gone and with it all references to the original buffer. Then, accessing the Vector
immediately segfaults (usually immediately, depends on when the GC depends to kick in).
The solution to this is to stop abusing unsafe_wrap
to cheat and create fictitious views wherever possible. I will update you with more details here.
Ah, thanks, got it!
I guess another option would be to return a custom array type in these cases that keeps holding a reference to the underlying data array, right? But that is probably too much hassle. At least from the Query.jl integration point of view your current plan sounds excellent.
Nope, not a hassle at all, in fact this is already done for me by Base
! This is what SubArray
does, so if you take a view
, you can propagate that view anywhere you like and still don't have to worry about the data buffer getting GC'd. I've thought pretty carefully about the getindex
and view
semantics by now, and I've come to realize the wisdom of how things are done in Base
.
Like I said, I'm also about to create the arrowview
function (name not final) which will return to you view
s as other ArrowVector
s instead of SubArray
s. This will be useful in cases in which you have arrow formatted data and want to write only a subset of those arrow formatted arrays into another arrow formatted buffer.
Ah, yes, of course! Ok, so essentially this issue here can just be closed, right? Everything seems sorted out on that front.
Yup, all sorted out.
Not necessarily directly relevant to this package, but see my comment. Again, I'm not seeing any really elegant solutions to this.