queryverse / FeatherLib.jl

Low level Julia library for reading feather files
Other
4 stars 3 forks source link

`Feather.materialize` can easily cause segfaults #3

Closed ExpandingMan closed 6 years ago

ExpandingMan commented 6 years ago

Not necessarily directly relevant to this package, but see my comment. Again, I'm not seeing any really elegant solutions to this.

davidanthoff commented 6 years ago

I think I don't fully understand the issue here. The various array types in Arrow.jl have a reference to the underlying data vector, right? Doesn't that keep the memory alive?

ExpandingMan commented 6 years ago

It would, but your problem is that you're assuming that those ArrowVectors have to always exist.

When you do a getindex of a contiguous subset of an ArrowVector it does unsafe_wrap. That prevents free from being called on those pointers, but it does not protect the underlying data buffer that they originated from. Therefore, you can still have the underlying buffer GC'd away and those unsafe_wraped arrays wind up with their data swept out from under them and you segfault.

So, here's what happens: I have a bunch of ArrowVectors, I take some slices of them which become unsafe_wraped Vectors, and I pass those Vectors off to some other part of the program. Meanwhile, I'm not using the ArrowVectors anymore so they get GC'd. Then, you try to access data from one of those Vectors you created and you segfault.

Feather.materialize was the perfect scenario for making this happen: I'd create a Feather.Source containing all of the ArrowVectors and convert them into Vectors (that's what materialize did). But this conversion was a lie, it was just a call to unsafe_wrap. So, I return all of these Vectors, but as soon as I exit the materialize function itself the Feather.Source is gone and with it all references to the original buffer. Then, accessing the Vector immediately segfaults (usually immediately, depends on when the GC depends to kick in).

The solution to this is to stop abusing unsafe_wrap to cheat and create fictitious views wherever possible. I will update you with more details here.

davidanthoff commented 6 years ago

Ah, thanks, got it!

I guess another option would be to return a custom array type in these cases that keeps holding a reference to the underlying data array, right? But that is probably too much hassle. At least from the Query.jl integration point of view your current plan sounds excellent.

ExpandingMan commented 6 years ago

Nope, not a hassle at all, in fact this is already done for me by Base! This is what SubArray does, so if you take a view, you can propagate that view anywhere you like and still don't have to worry about the data buffer getting GC'd. I've thought pretty carefully about the getindex and view semantics by now, and I've come to realize the wisdom of how things are done in Base.

Like I said, I'm also about to create the arrowview function (name not final) which will return to you views as other ArrowVectors instead of SubArrays. This will be useful in cases in which you have arrow formatted data and want to write only a subset of those arrow formatted arrays into another arrow formatted buffer.

davidanthoff commented 6 years ago

Ah, yes, of course! Ok, so essentially this issue here can just be closed, right? Everything seems sorted out on that front.

ExpandingMan commented 6 years ago

Yup, all sorted out.