intake / akimbo

For when your data won't fit in your dataframe
https://akimbo.readthedocs.io
BSD 3-Clause "New" or "Revised" License
34 stars 6 forks source link

Alternative merge algorithm without the builder #78

Closed martindurant closed 2 months ago

martindurant commented 2 months ago

@jpivarski , the code in akimbo.io includes, for now, both the previous builder version and this new version of the JOIN algorithm, for you to look at. While the builder is very dynamic and ought to be slower, it also features a growable buffer for the index data, whereas the new version is allocated as big as it can possibly be (it could be resized when finished). Is it possible to use a growable buffer here without the builder?

The merge function takes a trivial amount of time compared to the other things going on in the join function, so I am not sure if this is actually any faster at all. Am I missing something?

jpivarski commented 2 months ago

That _merge function is going to be Numba'd, right?

There is an ak.numba.GrowableBuffer that you can use instead of over-allocating and trimming. (Does trim really release the memory for other objects, or is it just a slice that holds a reference to the original allocation as its base?) This was part of a LayoutBuilder implementation in Numba (scikit-hep/awkward#2408) that never ended up being faster than ArrayBuilder in Numba, and we don't know why.

In principle, a numba.typed.List of numbers is the same kind of thing as a GrowableBuffer, and presumably this is better optimized than our GrowableBuffer. It could be because our GrowableBuffer passes around two array objects, which are separately reference-counted, while Numba's typed list would have only one reference count.

In both the Numba-LayoutBuilder project and yours, we're seeing the dynamic ArrayBuilder being faster than it's supposed to be. It's even accessed through external pointers (no possibility for LLVM to optimize), so I really don't know what's going on there. It shouldn't be too hard to beat, if there aren't other bottlenecks elsewhere, hiding the performance of this part.

martindurant commented 2 months ago

Does trim really release the memory for other objects

Yes it does: it must own the memory and the memory must be contiguous. You would pass a flag asserting that no other reference points to the same array (although python references would be fine and updated in place).

martindurant commented 2 months ago

we're seeing the dynamic ArrayBuilder being faster than it's supposed to be

do you think there's any point in this PR, then? At scale, the over-allocation would be costly, so I would need one of the growable implementations anyway.

jpivarski commented 2 months ago

I don't think there's anything wrong with the PR. In principle, one should be able to do better than ArrayBuilder. I've always taken it as some issue hiding in the implementation of ak.numba.LayoutBuilder, but your project is independent of that. (And I'm assuming that you'll use some growable buffer, eventually.)

As an alternative to the alternative, can the merge be implemented by creating "tags" and "index" arrays of integers using Numba and then pass them into ak.contents.UnionArray.simplified to mix them appropriately? The return value of this function is not necessarily a union array; it tries to homogenize types.

martindurant commented 2 months ago

As an alternative to the alternative

Maybe next time! I think this is nice for now.