stdlib-js / stdlib

✨ Standard library for JavaScript and Node.js. ✨
https://stdlib.io
Apache License 2.0

[RFC]: a simple flat array format for ndarrays #1140

Open kgryte opened 7 months ago

kgryte commented 7 months ago

Description

This RFC proposes a simple flat array format for ndarrays, inspired by work on integrating stdlib into Google Sheets. The motivation is to provide a human-readable, non-binary, JSON-compatible format for serializing and deserializing ndarrays.

At a high level, the format consists of a version, a header, and a list of data buffer elements.

<version> | <header> | <data>

The version component would consist of two elements:

[ 'version', '<semver>', ... ]

The first element is the string literal 'version' and is followed by a version string in semver format. It is not anticipated that the patch field of the version string will be used. Only the major (breaking changes) and minor (new features or header fields) fields are expected to change over time.
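A reader will likely want to check the version component before parsing the rest of the array. The following is a minimal sketch of such a check. The function names and the policy of accepting any minor version under a single supported major version are assumptions made for illustration and are not part of this proposal.

```javascript
// Hypothetical helpers; not an existing stdlib API.
const SEMVER = /^(\d+)\.(\d+)\.(\d+)$/;

// Split a plain `major.minor.patch` string into numeric fields.
function parseVersion(str) {
    const m = SEMVER.exec(str);
    if (m === null) {
        throw new Error('invalid version string: ' + str);
    }
    return { major: Number(m[1]), minor: Number(m[2]), patch: Number(m[3]) };
}

// Assumed policy: a reader supports exactly one major version.
function isSupported(arr, supportedMajor) {
    if (arr[0] !== 'version' || typeof arr[1] !== 'string') {
        return false;
    }
    return parseVersion(arr[1]).major === supportedMajor;
}
```

Under this assumed policy, a reader written for major version 1 would accept '1.0.0' and '1.3.0' and reject '2.0.0'.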

The header component would be laid out as follows:

'ndarray' | shape | strides | offset | order | dtype | length | capacity | 'data'

and as part of the serialized array

[ ..., 'ndarray', 'shape', ...shape, 'strides', ...strides, 'offset', offset, 'order', order, 'dtype', dtype, 'length', length, 'capacity', capacity, 'data', ... ]

where the following rules apply. The 'ndarray' string literal must be the first header element, and the 'data' string literal must be the last. Every other header element, i.e., a string literal followed by its associated value(s), can appear in any order. E.g.,

[ ..., 'ndarray', 'capacity', capacity, 'length', length, 'dtype', dtype, 'order', order, 'offset', offset, 'strides', ...strides, 'shape', ...shape, 'data', ... ]

would be valid. Parsers should not assume any particular string literal and value pair order and should instead identify a sub-header element by the string literal indicating its beginning.
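An order-independent parser along these lines could be sketched as follows. The function name, the returned object layout, and the decision to throw on unrecognized header elements are assumptions for illustration. The sketch also assumes that 'shape' and 'strides' are each followed by a run of numbers and that every other header element has exactly one value.

```javascript
// Hypothetical parser; not an existing stdlib API.
const VARIADIC = new Set(['shape', 'strides']); // followed by a run of numbers
const SCALAR = new Set(['offset', 'order', 'dtype', 'length', 'capacity']); // followed by one value

function parseFlat(arr) {
    if (arr[0] !== 'version' || typeof arr[1] !== 'string') {
        throw new Error('expected a version component');
    }
    if (arr[2] !== 'ndarray') {
        throw new Error("expected 'ndarray' as the first header element");
    }
    const out = { version: arr[1] };
    let i = 3;
    // Walk the header, dispatching on each string literal rather than on position.
    while (i < arr.length && arr[i] !== 'data') {
        const key = arr[i];
        i += 1;
        if (VARIADIC.has(key)) {
            const values = [];
            while (i < arr.length && typeof arr[i] === 'number') {
                values.push(arr[i]);
                i += 1;
            }
            out[key] = values;
        } else if (SCALAR.has(key)) {
            out[key] = arr[i];
            i += 1;
        } else {
            throw new Error('unrecognized header element: ' + String(key));
        }
    }
    if (arr[i] !== 'data') {
        throw new Error("expected 'data' as the last header element");
    }
    // Everything after 'data' is the linear data buffer.
    out.data = arr.slice(i + 1);
    return out;
}
```

Because the loop keys off the string literals, the reordered header shown above and the canonical ordering parse to the same result.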

The data component is the linear data buffer atop which the serialized ndarray is a view. This data buffer is allowed to contain elements which are outside the view bounds and are not indexed by the view.
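To illustrate a buffer containing elements outside the view bounds, the following sketch resolves a multi-dimensional index to a buffer position as the offset plus the sum of each stride multiplied by its index. The helper name and the object layout are hypothetical. Strides are assumed to be expressed in elements rather than bytes, consistent with the 2x2 example in this RFC.

```javascript
// Hypothetical helper; not an existing stdlib API.
// Resolves a multi-dimensional index to a position in the linear data buffer.
function getElement(view, indices) {
    let idx = view.offset;
    for (let k = 0; k < view.shape.length; k++) {
        if (indices[k] < 0 || indices[k] >= view.shape[k]) {
            throw new RangeError('index out of bounds along dimension ' + k);
        }
        idx += view.strides[k] * indices[k];
    }
    return view.data[idx];
}

// A 2x2 view over an 8-element buffer; the zeros lie outside the view bounds.
const view = {
    shape: [2, 2],
    strides: [2, 1],
    offset: 2,
    data: [0, 0, 1, 2, 3, 4, 0, 0]
};

const topLeft = getElement(view, [0, 0]);     // reads data[2]
const bottomRight = getElement(view, [1, 1]); // reads data[5]
```

Here the two leading and two trailing buffer elements are never reached by any valid index into the view.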

Example

The following is an example of a 2x2 ndarray serialized to the proposed linear exchange format:

[
    'version',
    '1.0.0',
    'ndarray',
    'shape',
    2,
    2,
    'strides',
    2,
    1,
    'offset',
    0,
    'order',
    'row-major',
    'dtype',
    'float64',
    'length',
    4,
    'capacity',
    4,
    'data',
    1,
    2,
    3,
    4
]
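A serializer producing the array above could be sketched as follows. The function name and the shape of the input object are hypothetical, and the 'length' and 'capacity' values are passed through as given rather than derived, since this sketch makes no assumption about how they would be computed.

```javascript
// Hypothetical serializer; not an existing stdlib API.
function serializeFlat(x) {
    return [
        'version', '1.0.0',
        'ndarray',
        'shape', ...x.shape,
        'strides', ...x.strides,
        'offset', x.offset,
        'order', x.order,
        'dtype', x.dtype,
        'length', x.length,
        'capacity', x.capacity,
        'data', ...x.data
    ];
}

const flat = serializeFlat({
    shape: [2, 2],
    strides: [2, 1],
    offset: 0,
    order: 'row-major',
    dtype: 'float64',
    length: 4,
    capacity: 4,
    data: [1, 2, 3, 4]
});

// The result contains only strings and numbers, so it survives a JSON round trip.
const roundTripped = JSON.parse(JSON.stringify(flat));
```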

Note that this particular linear format is easily extendable to CSV/DSV serialization, where each column could represent a different ndarray.
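As a rough sketch of the column-per-ndarray idea, the following writes several serialized arrays side by side as comma-separated text, leaving cells empty where one column is shorter than another. The function name is hypothetical, and the sketch assumes that no value contains a comma, quote, or newline, so no CSV quoting is applied.

```javascript
// Hypothetical helper; not an existing stdlib API.
// Each input array becomes one column; row r holds element r of every column.
function toColumns(arrays) {
    const nrows = Math.max(...arrays.map((a) => a.length));
    const lines = [];
    for (let r = 0; r < nrows; r++) {
        const cells = arrays.map((a) => (r < a.length ? String(a[r]) : ''));
        lines.push(cells.join(','));
    }
    return lines.join('\n');
}

const a = ['version', '1.0.0', 'ndarray', 'shape', 2, 'strides', 1, 'offset', 0,
    'order', 'row-major', 'dtype', 'float64', 'length', 2, 'capacity', 2, 'data', 1, 2];
const b = ['version', '1.0.0', 'ndarray', 'shape', 3, 'strides', 1, 'offset', 0,
    'order', 'row-major', 'dtype', 'float64', 'length', 3, 'capacity', 3, 'data', 4, 5, 6];

const csv = toColumns([a, b]);
```

Reading a column back would amount to collecting its non-empty cells and converting numeric cells back to numbers before parsing.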

Proposal

As part of this RFC, the following packages are proposed

where [base/] indicates both base and non-base package versions.

The format name and associated package names are not set in stone. Any naming suggestions are welcome.

Prior Art

Related Issues

None.

Questions

No.

Other

No.

Checklist

kgryte commented 7 months ago

cc @Planeshifter

Snehil-Shah commented 4 months ago

@kgryte From what I understand, we just have to write methods to serialize and deserialize a multi-dimensional ndarray into a linear array that preserves the original metadata, right? Is this issue open for contribution? I would like to work on this.

kgryte commented 4 months ago

@Snehil-Shah Let me circle back on this, as we may already have an implementation written elsewhere.

Snehil-Shah commented 4 months ago

> @Snehil-Shah Let me circle back on this, as we may already have an implementation written elsewhere.

Cool, I was also interested in contributing to the Google Sheets integration project. If you don't mind, can you point me to the hows and wheres? Thanks!

kgryte commented 3 months ago

@Snehil-Shah For that, see https://github.com/stdlib-js/gsheets.