scikit-hep / awkward-0.x

Manipulate arrays of complex data structures as easily as Numpy.
BSD 3-Clause "New" or "Revised" License

Create special exception in jagged fromoffsets for zero-copy arrow conversion #221

Closed nsmith- closed 4 years ago

nsmith- commented 4 years ago

I'm still not sure this is the best idea.

jpivarski commented 4 years ago

I'll wait until you give me the go-ahead.

nsmith- commented 4 years ago

I think this is doing what we want. For

import awkward
a = awkward.fromiter([[1, 2, 3], [], [4]])
b = awkward.toarrow(a)
c = awkward.fromarrow(b)

the buffer addresses end up the same:

>>> c.offsets.base.address == b.buffers()[1].address
True
>>> c.content.base.base.address == b.buffers()[3].address
True
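(For reference, comparing raw buffer addresses isn't the only way to verify that no copy was made; numpy itself can check whether two arrays overlap. A minimal sketch, using plain numpy views rather than arrow buffers:)

```python
import numpy as np

x = np.arange(5)
y = x[1:]      # a view: same underlying buffer, offset start
z = x.copy()   # a fresh buffer

# np.shares_memory reports whether the two arrays' buffers overlap at all
assert np.shares_memory(x, y)
assert not np.shares_memory(x, z)
```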
nsmith- commented 4 years ago

Also,

>>> c.offsetsaliased(c.starts, c.stops)
True
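(For context, the aliasing that offsetsaliased detects can be sketched with plain numpy views; the offsets values here are illustrative, matching the [[1, 2, 3], [], [4]] example above:)

```python
import numpy as np

# starts and stops are overlapping views into a single offsets buffer
offsets = np.array([0, 3, 3, 4], dtype=np.int64)
starts, stops = offsets[:-1], offsets[1:]

assert starts.base is offsets and stops.base is offsets  # same underlying buffer
# stops begins exactly one item after starts in that shared buffer
assert stops.ctypes.data == starts.ctypes.data + offsets.dtype.itemsize
```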
nsmith- commented 4 years ago

There may be a double-frombuffer happening in the content?

nsmith- commented 4 years ago

This does what is promised at least.

nsmith- commented 4 years ago

@lgray @jpivarski I think this does what we want and is passing. Maybe we focus the other PR on adding the new LargeListArray type?

lgray commented 4 years ago

So the additional bit in offsetsaliased is important as well; otherwise it does more copies later on. I rebased the other PR on this one to get this set of changes.

lgray commented 4 years ago

An interesting side result is that some peculiarities of the behavior of numpy array bases change between patch releases of numpy, giving different performance depending on which version you have installed.

I'll change my PR to just putting in LargeListArray, and this is sufficient.

nsmith- commented 4 years ago

So the detail is that in the long list of conditions checked inside offsetsaliased, two take much longer than the rest. For numpy==1.17.3, the check starts.ctypes.data == starts.base.ctypes.data and the corresponding stops one each take about 22 µs on my machine, while every other check is <100 ns. In numpy==1.17.4 these long checks are reduced to 3 µs. They are essentially digging up the actual pointer address from the pyarrow buffer, as indeed starts.base.base.address == starts.ctypes.data. I would guess that numpy has some slowness in how it interfaces with arrow, as the integration is so new. It seems to be getting better, though.
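(A minimal way to measure that pointer lookup in isolation, on a plain numpy view rather than an arrow-backed array; the absolute timings will of course vary by machine and numpy version:)

```python
import timeit
import numpy as np

starts = np.arange(1_000, dtype=np.int64)[:-1]  # a view, like JaggedArray.starts

# time the raw-pointer lookup that dominates offsetsaliased
n = 10_000
t = timeit.timeit(lambda: starts.ctypes.data, number=n)
print(f"{t / n * 1e9:.0f} ns per .ctypes.data lookup")
```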

nsmith- commented 4 years ago

Actually, even regular numpy arrays take noticeably longer to dig up the raw pointer than the other checks done in the function. In any case, I don't think there's a way to shortcut checking the condition that the stops pointer is one itemsize after the starts without going through ctypes; the check that they share the same buffer has already been done by that point. I think this is as good as it gets.

jpivarski commented 4 years ago

In any case, I don't think there's a way to shortcut checking the condition that the stops pointer is one itemsize after the starts without going through ctypes.

There is an alternative: array.data rather than array.ctypes.data. But whereas the latter returns an integer, for which it's easy to add itemsize and compare, the former returns a <memory> object.

I'm flabbergasted that NumPy takes any appreciable time returning a pointer as an integer. They must have been doing something very wrong (performance bug) before.

You might want to start looking at Awkward 1, which has record types and all of the jagged and non-jagged lists (solving the problem above), with the beginnings of a high-level awkward1.Array (checked in yesterday), and everything is equally accessible in Numba. It might be getting close enough to a usable state for your first exploratory tests. In fact, I probably ought to write the little function that would translate old awkward arrays to and from new ones (with the understanding that the old MaskedArray, UnionArray, ChunkedArray, etc. have no equivalent in the new library yet).

nsmith- commented 4 years ago

the former returns a <memory> object

I couldn't figure out how to compare the raw pointer on two memory objects.
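(For the record, one approach that could work, assuming the buffer is writable, is to recover the address through ctypes; this is a sketch of the idea, not something the PR uses, and it raises for read-only buffers:)

```python
import ctypes

def mv_address(mv):
    # ctypes exposes the base pointer of a writable buffer;
    # from_buffer raises TypeError for read-only memoryviews
    return ctypes.addressof(ctypes.c_char.from_buffer(mv))

buf = bytearray(b"abcdefgh")
mv1 = memoryview(buf)
mv2 = mv1[1:]           # same buffer, shifted by one byte

assert mv_address(mv2) == mv_address(mv1) + 1
```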