Feature Proposal: Optimized `RowObject` Implementation

mcovalt commented 2 weeks ago

First off, great library!

The rowObject concept is used for transitioning software with row-wise access patterns to reading Arrow data. Currently, rowObject fully deserializes a single row from an Arrow RecordBatch, creating a new object. This approach provides native read performance in the browser but increases garbage collection pressure due to the required data copy.

The official Arrow JS library uses Proxy objects instead, but as your benchmarks show, they’re significantly slower for read access.

Proposal

I’ve experimented with an alternative that balances read performance and memory usage. Instead of using Proxy objects, I suggest using a factory to create a prototype object with pre-defined getters:

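// Simplified example using Arrow JS.
// isStruct and RowObjectConstructor are helpers that are not shown here.
import { Struct, TypeMap, Vector } from "apache-arrow";

const RowIndex = Symbol();

class RowObjectProto {
  private [RowIndex]: number;
  constructor(i: number) {
    this[RowIndex] = i;
  }
}

function rowFactory<T extends TypeMap>(vector: Vector<Struct<T>>): RowObjectConstructor<T> {
  // One property descriptor per field; all are installed on the prototype below.
  const props: PropertyDescriptorMap = {};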
  const commonProps = { enumerable: true, configurable: false };

  for (let i = 0; i < vector.numChildren; i++) {
    const fieldName = vector.type.children[i]!.name;
    const childVector = vector.getChildAt(i)!;

    if (isStruct(childVector)) {
      const NestedRowObjectConstructor = rowFactory(childVector);
      props[fieldName] = {
        get() {
          return new NestedRowObjectConstructor(this[RowIndex]);
        },
        ...commonProps,
      };
    } else {
      props[fieldName] = {
        get() {
          return childVector.get(this[RowIndex]);
        },
        ...commonProps,
      };
    }
  }

  const RowObject = class extends RowObjectProto {};
  Object.defineProperties(RowObject.prototype, props);
  return RowObject;
}

// Use example
const ManufacturedRowObject = rowFactory(someStructVector);
const obj = new ManufacturedRowObject(0);
// obj reads like a normal object representing someStructVector at index 0, but it only stores the row index

Key Points

Performance: Creating new RowObjects is cheap, just an integer assignment. The objects are lightweight, putting minimal pressure on the garbage collector. Closure-based access to the underlying RecordBatch is very fast.
Memory: This method minimizes memory usage by avoiding full deserialization and reduces GC overhead.
Closure vs indirection: There are two approaches:
- RowObject-per-RecordBatch: Better for large RecordBatches with many rows.
- RowObject-per-Schema: Better for handling multiple smaller RecordBatches, though slightly slower due to added indirection.
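
As a rough illustration of the two approaches above, here is a minimal TypeScript sketch. It assumes plain arrays as stand-ins for Arrow columns; `Batch`, `perBatchFactory`, and `perSchemaFactory` are hypothetical names for this example and are not part of Arrow JS or this library.

```typescript
// Assumption: a "batch" is modeled as plain arrays keyed by field name,
// standing in for the child vectors of an Arrow RecordBatch.
type Batch = Record<string, unknown[]>;
type Row = Record<string, unknown>;

const kIndex = Symbol("rowIndex");
const kBatch = Symbol("batch");

// RowObject-per-RecordBatch: each getter closes over one column of one batch.
function perBatchFactory(batch: Batch): new (index: number) => Row {
  class BatchRow {
    [kIndex]: number;
    constructor(index: number) {
      this[kIndex] = index;
    }
  }
  for (const [name, column] of Object.entries(batch)) {
    Object.defineProperty(BatchRow.prototype, name, {
      get(this: BatchRow) {
        return column[this[kIndex]];
      },
      enumerable: true,
    });
  }
  return BatchRow as unknown as new (index: number) => Row;
}

// RowObject-per-Schema: one class serves every batch with the same fields.
// Each row stores a batch reference, so a read goes through one extra lookup.
function perSchemaFactory(fieldNames: string[]): new (batch: Batch, index: number) => Row {
  class SchemaRow {
    [kBatch]: Batch;
    [kIndex]: number;
    constructor(batch: Batch, index: number) {
      this[kBatch] = batch;
      this[kIndex] = index;
    }
  }
  for (const name of fieldNames) {
    Object.defineProperty(SchemaRow.prototype, name, {
      get(this: SchemaRow) {
        return this[kBatch][name]![this[kIndex]];
      },
      enumerable: true,
    });
  }
  return SchemaRow as unknown as new (batch: Batch, index: number) => Row;
}
```

The per-batch class has to be rebuilt for every batch, while the per-schema class is built once but stores two fields per row instead of one.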

Considerations

No Caching: The lack of caching in the getters avoids performance penalties from branching but could be revisited.
Array Access: Lazily creating arrays of RowObjects on access works well for single object access but can degrade performance in loops. Possible solutions include using iterators, creating arrays at instantiation, or caching them—each with trade-offs.
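
To make the array-access concern concrete, here is a small sketch. It assumes a list-of-struct column modeled as offsets into a flat child array; `OrderRow`, `listOffsets`, `childIds`, and `arraysBuilt` are hypothetical names used only for illustration. It contrasts a lazy array getter with an iterator.

```typescript
// Assumption: a list-of-struct column modeled as offsets into a flat child column.
const listOffsets = [0, 2, 5]; // row 0 -> children 0..1, row 1 -> children 2..4
const childIds = [10, 11, 20, 21, 22];

type Item = { id: number };
let arraysBuilt = 0; // counts how often the lazy getter materializes an array

class OrderRow {
  private index: number;
  constructor(index: number) {
    this.index = index;
  }

  // Lazy array getter: a fresh array is built on every property access.
  get items(): Item[] {
    arraysBuilt++;
    const out: Item[] = [];
    const end = listOffsets[this.index + 1]!;
    for (let j = listOffsets[this.index]!; j < end; j++) {
      out.push({ id: childIds[j]! });
    }
    return out;
  }

  // Iterator alternative: yields items one at a time without building an array.
  *itemIterator(): IterableIterator<Item> {
    const end = listOffsets[this.index + 1]!;
    for (let j = listOffsets[this.index]!; j < end; j++) {
      yield { id: childIds[j]! };
    }
  }
}

const order = new OrderRow(1);

// Indexed loop: order.items is re-evaluated in both the condition and the body.
let sumIndexed = 0;
for (let k = 0; k < order.items.length; k++) {
  sumIndexed += order.items[k]!.id;
}

// Iterator loop: a single pass with no intermediate array.
let sumIterated = 0;
for (const item of order.itemIterator()) {
  sumIterated += item.id;
}
```

In the indexed loop the getter runs on every evaluation of `order.items`, so the array is rebuilt many times for a single pass. Hoisting `order.items` into a local variable before the loop avoids this, but it relies on callers knowing that the property is not a plain field.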

Summary

A prototype-based RowObject can achieve near-native access speed with minimal memory usage, but it adds complexity and has side effects, especially with array access. While this approach benefits legacy systems that need row-wise data, it does not align with the optimal use of Arrow’s columnar structure, so it may not be worth maintaining.

Would you be interested in including this approach in the library? If so, I can start working on a PR.

jheer commented 2 weeks ago

Thanks for the suggestion! Your strategy appears to be the same one that I used in the vega arrow loader: https://github.com/vega/vega-loader-arrow/blob/main/src/arrow.js

It should indeed be much better than Proxy objects, but I’d love to see some benchmark numbers to compare with the current native object approach. There will also be nuances around non-memoized extraction (e.g., for utf8 strings) that can harm performance in the case of repeated property lookups. In short, there are virtues to simplicity and I’m not yet sure if we want/need to optimize row objects further. More evidence around memory pressure issues and their practical impacts would also be useful. Thanks again!
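
For illustration, the repeated-lookup cost can be sketched as follows. This is a minimal example under stated assumptions: the byte buffer, offsets, `decodeAscii`, `PlainRow`, and `MemoRow` are hypothetical stand-ins (ASCII only), not code from either library. The memoizing variant shadows the prototype getter with an own data property after the first read.

```typescript
// Assumption: a string column modeled as ASCII bytes plus offsets.
const utf8Bytes = Uint8Array.from([97, 98, 99, 100]); // "ab", "cd"
const utf8Offsets = [0, 2, 4];
let decodeCount = 0; // counts how many times a string is decoded

// Stand-in for real utf8 decoding; handles ASCII only.
function decodeAscii(row: number): string {
  decodeCount++;
  let s = "";
  const end = utf8Offsets[row + 1]!;
  for (let j = utf8Offsets[row]!; j < end; j++) {
    s += String.fromCharCode(utf8Bytes[j]!);
  }
  return s;
}

// Non-memoized getter: every property read decodes again.
class PlainRow {
  private index: number;
  constructor(index: number) {
    this.index = index;
  }
  get name(): string {
    return decodeAscii(this.index);
  }
}

// Memoized getter: the first read decodes, then an own data property
// shadows the prototype getter so later reads return the cached value.
class MemoRow {
  private index: number;
  constructor(index: number) {
    this.index = index;
  }
  get name(): string {
    const value = decodeAscii(this.index);
    Object.defineProperty(this, "name", { value, enumerable: true });
    return value;
  }
}
```

The memoized variant trades memory for speed: each row that is read keeps its decoded value and gains an own property, which gives back some of the savings the getter approach was meant to provide.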

jheer commented 2 weeks ago

I added a PR (#12) that explores this idea further. We find better performance (for single-access use) in addition to reduced memory demand. However, the trade-off is that these getter-based proxy objects do not support common object utilities such as Object.keys, Object.values, and spreading ({ ...object }), because the getters live on the prototype rather than as own properties. I tested further wrapping these objects with an actual Proxy handler that redirects only the "own" properties lookup, but this significantly degrades performance.
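
The limitation can be reproduced with a small sketch. It assumes plain arrays as stand-in columns; `makeRowClass` and `toPlainObject` are hypothetical helpers for this example, not part of the PR. It also shows one possible workaround based on `for...in`, which visits inherited enumerable properties.

```typescript
const kRow = Symbol("rowIndex");
type RowLike = Record<string, unknown>;

// Assumption: columns are modeled as plain arrays keyed by field name.
function makeRowClass(columns: Record<string, unknown[]>): new (index: number) => RowLike {
  class GetterRow {
    [kRow]: number;
    constructor(index: number) {
      this[kRow] = index;
    }
  }
  for (const [name, column] of Object.entries(columns)) {
    Object.defineProperty(GetterRow.prototype, name, {
      get(this: GetterRow) {
        return column[this[kRow]];
      },
      enumerable: true,
    });
  }
  return GetterRow as unknown as new (index: number) => RowLike;
}

// for...in also visits enumerable properties inherited from the prototype,
// so it can copy getter-backed fields into a plain object when needed.
function toPlainObject(row: object): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key in row) {
    out[key] = (row as Record<string, unknown>)[key];
  }
  return out;
}

const PointRow = makeRowClass({ x: [1, 2, 3], y: ["a", "b", "c"] });
const point = new PointRow(1);

const ownKeys = Object.keys(point); // own string keys only, so the getters are not listed
const spreadCopy = { ...point }; // copies own properties only, so x and y are missing
const plainCopy = toPlainObject(point); // materializes the inherited getters
```

Copying through `toPlainObject` reintroduces the full-row deserialization cost, so it only makes sense at the boundary where a caller genuinely needs a plain object.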

@mcovalt, let me know if you have any thoughts or reactions.

jheer commented 1 day ago

Added in #15.

uwdata / flechette