Open hombit opened 5 months ago
Benchmarks reveal a performance regression in single-element assignment. The slowdown appeared after we switched from ArrowExtensionArray to a custom implementation of NestedExtensionArray: https://lincc-frameworks.github.io/nested-pandas/#benchmarks.AssignSingleDfToNestedSeries.time_run
Currently, Arrow does not support pyarrow.compute.replace_with_mask for struct arrays: https://github.com/apache/arrow/issues/29558
That's why we have our own implementation, used by NestedExtensionArray.__setitem__(). The implementation has the overhead of creating a len(self)-sized struct array to perform the replacement. This approach works well when we replace many elements, but when we replace just a few, it produces a large memory footprint and probably takes a while.
An alternative approach would be to copy the original array to np.ndarray[pa.StructScalar], replace the elements in place, and convert it back.
We should create a benchmark and see which approach is faster and has a smaller memory footprint.