lincc-frameworks / nested-pandas

Efficient Pandas representation for nested associated datasets.
https://nested-pandas.readthedocs.io
MIT License

Improve performance of series.ext_array.replace_with_mask() #52

Open hombit opened 5 months ago

hombit commented 5 months ago

Currently, Arrow does not support pyarrow.compute.replace_with_mask for struct arrays: https://github.com/apache/arrow/issues/29558

That's why we have our own implementation, used by NestedExtensionArray.__setitem__(). It has the overhead of creating a len(self)-sized struct array to perform the replacement. This works well when we replace many elements, but when we replace just a few, it produces a large memory footprint and probably takes a while.

An alternative approach would be to copy the original array to an np.ndarray of pa.StructScalar objects, replace the elements in place, and convert it back:

import numpy as np
import pyarrow as pa

def replace_with_mask(array: pa.ChunkedArray, mask: pa.BooleanArray, value: pa.Array) -> pa.ChunkedArray:
    """Replace the elements of the array with the value where the mask is True"""
    # Copy to a NumPy object array holding one pa.StructScalar per element
    np_array = np.fromiter(array, dtype=object)
    np_array[mask] = value
    new_pa_array = pa.array(np_array)
    return pa.chunked_array([new_pa_array])

We should create a benchmark to see which approach is faster and has the smaller memory footprint.

hombit commented 5 months ago

Benchmarks reveal a problem with single-element assignment performance. The run time rose after we switched from ArrowExtensionArray to a custom implementation, NestedExtensionArray:

https://lincc-frameworks.github.io/nested-pandas/#benchmarks.AssignSingleDfToNestedSeries.time_run