Closed: mcovalt closed this 1 day ago.
Thanks for the suggestion! Your strategy appears to be the same one that I used in the vega arrow loader: https://github.com/vega/vega-loader-arrow/blob/main/src/arrow.js
It should indeed be much better than Proxy objects, but I’d love to see some benchmark numbers comparing it with the current native-object approach. There are also nuances around non-memoized extraction (e.g., for utf8 strings) that can harm performance under repeated property lookups. In short, there are virtues to simplicity, and I’m not yet sure if we want/need to optimize row objects further. More evidence around memory-pressure issues and their practical impacts would also be useful. Thanks again!
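To make the memoization point concrete, here is a toy sketch (mock data, not the library's actual code) contrasting a getter that decodes utf8 bytes on every read with one that caches the decoded string:

```javascript
// Pretend this is one utf8-encoded cell from a string column.
const encoder = new TextEncoder();
const decoder = new TextDecoder();
const bytes = encoder.encode("hello");

// Non-memoized: every property lookup pays the decode cost again.
const nonMemoized = {
  get value() {
    return decoder.decode(bytes); // decodes on each access
  }
};

// Memoized: decode once, then serve the cached string.
let cached;
const memoized = {
  get value() {
    if (cached === undefined) cached = decoder.decode(bytes);
    return cached;
  }
};
```

In a loop that touches the same field repeatedly, the non-memoized form re-decodes on every iteration, which is the repeated-lookup cost described above.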
I added a PR (#12) that explores this idea further. We find better performance (for single-access use) in addition to reduced memory demand. However, the trade-off is that the proxy objects do not support common object utilities such as `Object.keys`, `Object.values`, and spreading (`{ ...object }`). I tested further wrapping these proxies with an actual `Proxy` handler that redirects only the "own" properties lookup, but this significantly degrades performance.
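For reference, the "own" properties redirect described above can be sketched roughly like this (illustrative names and mock columns, not the PR's actual code): a `Proxy` whose `ownKeys` and `getOwnPropertyDescriptor` traps report the schema's field names, so `Object.keys` and spreading work again, at the cost of an extra trap on every enumeration:

```javascript
// Mock column data standing in for a RecordBatch.
const fields = ["id", "name"];
const data = { id: [1, 2], name: ["a", "b"] };

function makeRow(index) {
  const target = { __index: index };
  return new Proxy(target, {
    get(t, prop) {
      return fields.includes(prop) ? data[prop][t.__index] : t[prop];
    },
    // Redirect own-key enumeration to the schema's field names,
    // so Object.keys / Object.values / spread see the columns.
    ownKeys() {
      return fields;
    },
    getOwnPropertyDescriptor(t, prop) {
      if (fields.includes(prop)) {
        return {
          enumerable: true,
          configurable: true,
          value: data[prop][t.__index]
        };
      }
      return Object.getOwnPropertyDescriptor(t, prop);
    }
  });
}

const row = makeRow(1);
// Object.keys(row) → ["id", "name"]; { ...row } → { id: 2, name: "b" }
```

Each enumeration now runs through two traps per field, which is consistent with the performance degradation observed.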
@mcovalt, let me know if you have any thoughts or reactions.
Added in #15.
First off, great library!
The `rowObject` concept is used for transitioning software with row-wise access patterns to reading Arrow data. Currently, `rowObject` fully deserializes a single row from an Arrow `RecordBatch`, creating a new object. This approach provides native read performance in the browser, but increases garbage-collection pressure due to the required data copy.

The official Arrow JS library uses Proxy objects instead, but as your benchmarks show, they’re significantly slower for read access.
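As a baseline, the copy-based approach can be sketched like this (mock columns and an illustrative function name, not the library's actual API): every field is copied eagerly into a fresh plain object:

```javascript
// Mock stand-in for a RecordBatch: plain arrays as columns.
const batch = {
  numRows: 3,
  columns: { id: [1, 2, 3], name: ["a", "b", "c"] }
};

// Fully deserialize one row into a plain object.
function rowObject(batch, index) {
  const row = {};
  for (const name of Object.keys(batch.columns)) {
    row[name] = batch.columns[name][index]; // eager copy of every field
  }
  return row; // fast native reads, but one fresh object per row (GC pressure)
}
```

Reads on the result are plain property accesses, which is why this is fast; the cost is the per-row allocation and copy.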
Proposal
I’ve experimented with an alternative that balances read performance and memory usage. Instead of using Proxy objects, I suggest using a factory to create a prototype object with pre-defined getters:
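A minimal sketch of the idea, using mock columns and illustrative names rather than the library's actual API: the factory builds one prototype per `RecordBatch` whose getters close over the columns, and each row instance only stores its index:

```javascript
// Build a row factory for one batch: the prototype carries the getters,
// each row instance carries only an integer index.
function rowObjectFactory(batch) {
  const proto = {};
  for (const name of Object.keys(batch.columns)) {
    const column = batch.columns[name]; // captured per batch in a closure
    Object.defineProperty(proto, name, {
      get() {
        return column[this.__index]; // lazy read straight from the column
      }
    });
  }
  // Instantiation is cheap: one small object plus an integer assignment.
  return index => {
    const row = Object.create(proto);
    row.__index = index;
    return row;
  };
}

// Mock stand-in for a RecordBatch: plain arrays as columns.
const batch = { columns: { id: [1, 2, 3], name: ["a", "b", "c"] } };
const makeRow = rowObjectFactory(batch);
const row = makeRow(2);
// row.id → 3, row.name → "c"
```

Note that getters defined via `Object.defineProperty` are non-enumerable by default, which matches the caveat above: `Object.keys`, `Object.values`, and spreading will not see the fields unless the descriptors are made enumerable.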
Key Points

- Creating `RowObject`s is cheap: just an integer assignment. The objects are lightweight, putting minimal pressure on the garbage collector.
- Closure-based access to the underlying `RecordBatch` is very fast.
- `RowObject`-per-`RecordBatch`: better for large `RecordBatch`es with many rows.
- `RowObject`-per-`Schema`: better for handling multiple smaller `RecordBatch`es, though slightly slower due to added indirection.

Considerations
Creating `RowObject`s on access works well for single-object access but can degrade performance in loops. Possible solutions include using iterators, creating arrays at instantiation, or caching them, each with trade-offs.

Summary
A prototype-based `RowObject` can achieve near-native access speed with minimal memory usage, but it adds complexity and has side effects, especially with array access. While this approach is beneficial for legacy systems needing row-wise data, it’s not aligned with the optimal use of Arrow’s columnar structure, and thus may not be worth maintaining.

Would you be interested in including this approach in the library? If so, I can start working on a PR.