Improve performance of series.ext_array.replace_with_mask()

Currently, Arrow does not support pyarrow.compute.replace_with_mask for struct arrays: https://github.com/apache/arrow/issues/29558

That's why we have our own implementation, used by NestedExtensionArray.__setitem__(). It has the overhead of creating a len(self)-sized struct array to perform the replacement. This works well when we replace many elements, but when we replace just a few, it produces a large memory footprint and is probably slow.

An alternative approach would be to copy the original array to np.ndarray[pa.StructScalar], replace the elements in-place, and convert it back:

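import numpy as np
import pyarrow as pa

def replace_with_mask(array: pa.ChunkedArray, mask: pa.BooleanArray, value: pa.Array) -> pa.ChunkedArray:
    """Replace the elements of the array with the value where the mask is True"""
    # Copy the scalars into an object ndarray, one pa.StructScalar per element
    np_array = np.fromiter(array, dtype=object, count=len(array))
    # NumPy needs a boolean ndarray for indexing; len(value) must equal the number of True items
    np_mask = mask.to_numpy(zero_copy_only=False)
    np_array[np_mask] = np.fromiter(value, dtype=object, count=len(value))
    new_pa_array = pa.array(np_array, type=array.type)
    return pa.chunked_array([new_pa_array])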

We should create a benchmark to see which approach is faster and has a smaller memory footprint.
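A possible starting point for such a benchmark is sketched below. It is only a minimal harness, not part of nested-pandas: the names `benchmark` and `compare` are hypothetical, and it uses nothing but the standard library. Each candidate is timed over several runs, and peak memory is measured in a separate traced run so that tracing does not skew the timing.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict


def benchmark(func: Callable[..., Any], *args: Any, repeat: int = 5) -> Dict[str, float]:
    """Measure best wall-clock time and peak Python-traced memory of func(*args)."""
    if repeat < 1:
        raise ValueError("repeat must be at least 1")

    # Take the best of several runs to reduce noise from other processes.
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)

    # Measure memory in a separate run, because tracing slows execution down.
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()

    return {"seconds": best, "peak_bytes": peak}


def compare(
    candidates: Dict[str, Callable[..., Any]], *args: Any, repeat: int = 5
) -> Dict[str, Dict[str, float]]:
    """Run every candidate implementation on the same arguments."""
    return {name: benchmark(func, *args, repeat=repeat) for name, func in candidates.items()}
```

Both implementations could then be passed to `compare` with the same array, mask, and value, repeating the comparison for masks with few and with many True elements. Note that tracemalloc only sees allocations made through Python's allocator, so Arrow buffers are likely not counted; checking pa.total_allocated_bytes() before and after a call may be needed to capture Arrow-side memory.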
