[RFC]: a simple flat array format for ndarrays

kgryte commented 7 months ago

Description

This RFC proposes introducing a simple flat array format for ndarrays and is inspired by work involving the integration of stdlib in Google Sheets. The motivation is to provide a human-readable, non-binary, JSON-compatible format for serializing and deserializing ndarrays.

At a high level, the format is comprised of a version, header, and list of data buffer elements.

<version> | <header> | <data>

The version component would be comprised of two elements:

[ 'version', '<semver>', ... ]

The first element is the string literal 'version' and is followed by a version string in semver format. It is not anticipated that the patch field of the version string will be used. Only major (breaking changes) and minor (new features/header fields) version fields should update over time.
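As a rough illustration of how a reader might act on the version component, the sketch below accepts any version whose major field matches the one it was written for. The function name `isSupportedVersion` and the policy of rejecting unknown major versions are assumptions for illustration and are not part of the RFC. The pattern only matches plain `major.minor.patch` strings, not the full semver grammar.

```javascript
'use strict';

// Hypothetical helper: returns true if a reader written for `supportedMajor`
// could attempt to parse a serialized array carrying the given version string.
function isSupportedVersion( version, supportedMajor ) {
    // Simplified pattern: three dot-separated nonnegative integers.
    const match = /^(\d+)\.(\d+)\.(\d+)$/.exec( version );
    if ( match === null ) {
        return false;
    }
    // Only the major field signals breaking changes.
    return Number( match[ 1 ] ) === supportedMajor;
}

console.log( isSupportedVersion( '1.0.0', 1 ) ); // => true
console.log( isSupportedVersion( '1.3.0', 1 ) ); // => true
console.log( isSupportedVersion( '2.0.0', 1 ) ); // => false
```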

The header component would be laid out as follows:

'ndarray' | shape | strides | offset | order | dtype | length | capacity | 'data'

and would appear in the serialized array as

[ ..., 'ndarray', 'shape', ...shape, 'strides', ...strides, 'offset', offset, 'order', order, 'dtype', dtype, 'length', length, 'capacity', capacity, 'data', ... ]

where

'ndarray' is the string literal 'ndarray'.
'shape' is the string literal 'shape'.
...shape is 0 or more dimension sizes. For a zero-dimensional array, no dimension sizes should be present.
'strides' is the string literal 'strides'.
...strides is 1 or more dimension strides. For a zero-dimensional array, one stride should be present, which should be equal to 0.
'offset' is the string literal 'offset'.
offset is a nonnegative integer indicating the index offset in the data buffer marking the first indexed element. The offset of the first indexed element in the serialized format would be version_length + header_length + offset, where one must take into account the version and header lengths.
'order' is the string literal 'order'.
order is either 'row-major' or 'column-major'.
'dtype' is the string literal 'dtype'.
dtype is the ndarray data type string (e.g., 'float64', 'complex128', 'int32', etc).
'length' is the string literal 'length'.
length is a nonnegative integer indicating how many elements are indexed by the ndarray. For a zero-dimensional array, this should equal 1. For non-zero-dimensional arrays, this should be equal to the product of dimension sizes, as listed in shape.
'capacity' is the string literal 'capacity'.
capacity is a nonnegative integer indicating how many elements are in the data buffer. This value should be compatible with the specified ndarray metadata (i.e., shape, strides, offset). For zero-dimensional arrays, this should be greater than or equal to 1.
'data' is the string literal 'data' and should be followed by data buffer elements.
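The constraints on length and capacity listed above can be checked mechanically. The following is a minimal sketch, assuming the header has already been parsed into a plain object; the helper names (`expectedLength`, `indexBounds`, `isConsistent`) are hypothetical and are not part of the proposal. It interprets "compatible" as meaning that every buffer index reachable by the view lies within `[0, capacity)`.

```javascript
'use strict';

// Hypothetical helper: number of elements indexed by the view.
function expectedLength( shape ) {
    // The empty product is 1, which matches the zero-dimensional case.
    return shape.reduce( ( acc, dim ) => acc * dim, 1 );
}

// Hypothetical helper: smallest and largest buffer index reachable by the view.
function indexBounds( shape, strides, offset ) {
    let min = offset;
    let max = offset;
    for ( let i = 0; i < shape.length; i++ ) {
        if ( shape[ i ] === 0 ) {
            // An empty view indexes no elements at all.
            return null;
        }
        const extent = ( shape[ i ] - 1 ) * strides[ i ];
        if ( extent > 0 ) {
            max += extent;
        } else {
            min += extent;
        }
    }
    return { min, max };
}

// Hypothetical helper: checks `length` and `capacity` against the other fields.
function isConsistent( meta ) {
    if ( meta.length !== expectedLength( meta.shape ) ) {
        return false;
    }
    const bounds = indexBounds( meta.shape, meta.strides, meta.offset );
    if ( bounds === null ) {
        return meta.capacity >= 0;
    }
    return bounds.min >= 0 && bounds.max < meta.capacity;
}

console.log( isConsistent( { shape: [ 2, 2 ], strides: [ 2, 1 ], offset: 0, length: 4, capacity: 4 } ) ); // => true
console.log( isConsistent( { shape: [], strides: [ 0 ], offset: 0, length: 1, capacity: 1 } ) ); // => true
console.log( isConsistent( { shape: [ 2, 2 ], strides: [ 2, 1 ], offset: 1, length: 4, capacity: 4 } ) ); // => false
```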

The 'ndarray' string literal is required to be the first header element. The 'data' string literal is required to be the last header element. For the other header elements, each string literal and associated value pair can be arranged in any order. E.g.,

[ ..., 'ndarray', 'capacity', capacity, 'length', length, 'dtype', dtype, 'order', order, 'offset', offset, 'strides', ...strides, 'shape', ...shape, 'data', ... ]

would be valid. Parsers should not assume any particular string literal and value pair order and should instead identify a sub-header element by the string literal indicating its beginning.
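To make the order-independence requirement concrete, here is a minimal parsing sketch. It is not the proposed implementation; `parseFlat` is a hypothetical name. It assumes that shape and stride values are JavaScript numbers, so that a variable-length run ends at the next string literal, and it chooses to throw on unrecognized header fields, which is one possible policy among several.

```javascript
'use strict';

// Sub-header fields holding a variable-length run of numbers.
const LIST_FIELDS = [ 'shape', 'strides' ];

// Sub-header fields holding exactly one value.
const SCALAR_FIELDS = [ 'offset', 'order', 'dtype', 'length', 'capacity' ];

// Hypothetical parser: splits a flat array into version, header fields, and data.
function parseFlat( arr ) {
    if ( arr[ 0 ] !== 'version' || arr[ 2 ] !== 'ndarray' ) {
        throw new Error( 'invalid input: expected a version component followed by "ndarray"' );
    }
    const out = { version: arr[ 1 ] };
    let i = 3;
    // Walk the header, dispatching on each string literal as it is encountered.
    while ( i < arr.length && arr[ i ] !== 'data' ) {
        const key = arr[ i ];
        i += 1;
        if ( LIST_FIELDS.includes( key ) ) {
            const values = [];
            while ( typeof arr[ i ] === 'number' ) {
                values.push( arr[ i ] );
                i += 1;
            }
            out[ key ] = values;
        } else if ( SCALAR_FIELDS.includes( key ) ) {
            out[ key ] = arr[ i ];
            i += 1;
        } else {
            throw new Error( 'unrecognized header field: ' + key );
        }
    }
    if ( arr[ i ] !== 'data' ) {
        throw new Error( 'invalid input: missing "data" literal' );
    }
    // Everything after 'data' is the linear data buffer.
    out.data = arr.slice( i + 1 );
    return out;
}

const reordered = [
    'version', '1.0.0', 'ndarray',
    'capacity', 4, 'length', 4, 'dtype', 'float64', 'order', 'row-major',
    'offset', 0, 'strides', 2, 1, 'shape', 2, 2,
    'data', 1, 2, 3, 4
];
const parsed = parseFlat( reordered );
console.log( parsed.shape ); // => [ 2, 2 ]
console.log( parsed.strides ); // => [ 2, 1 ]
console.log( parsed.data ); // => [ 1, 2, 3, 4 ]
```

In a CSV/DSV setting where every cell arrives as a string, the `typeof` test would need to be replaced by a check against the set of known header literals.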

The data component is the linear data buffer atop which the serialized ndarray is a view. This data buffer is allowed to contain elements which are outside the view bounds and are not indexed by the view.
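The sketch below illustrates this point by serializing a view together with its entire underlying buffer, including elements the view does not index. The function `toFlat` and the plain-object view description are hypothetical stand-ins and do not use the stdlib ndarray API; the hard-coded `'1.0.0'` version is likewise an assumption.

```javascript
'use strict';

// Hypothetical serializer: emits the version, header, and the whole data buffer.
function toFlat( meta, buffer ) {
    // Number of elements indexed by the view (1 for a zero-dimensional array).
    const length = meta.shape.reduce( ( acc, dim ) => acc * dim, 1 );
    return [
        'version', '1.0.0',
        'ndarray',
        'shape', ...meta.shape,
        'strides', ...meta.strides,
        'offset', meta.offset,
        'order', meta.order,
        'dtype', meta.dtype,
        'length', length,
        'capacity', buffer.length,
        'data', ...buffer
    ];
}

// A 2x2 view over a six-element buffer; the view indexes buffer elements 1 through 4.
const buffer = [ 0, 1, 2, 3, 4, 5 ];
const view = {
    shape: [ 2, 2 ],
    strides: [ 2, 1 ],
    offset: 1,
    order: 'row-major',
    dtype: 'float64'
};

const flat = toFlat( view, buffer );
console.log( flat[ flat.indexOf( 'length' ) + 1 ] ); // => 4
console.log( flat[ flat.indexOf( 'capacity' ) + 1 ] ); // => 6
console.log( flat.slice( flat.indexOf( 'data' ) + 1 ) ); // => [ 0, 1, 2, 3, 4, 5 ]
```

Because the elements 0 and 5 survive serialization, a consumer could later construct a differently sized view over the same buffer.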

Example

The following is an example of a 2x2 ndarray serialized to the proposed linear exchange format:

[
    'version',
    '1.0.0',
    'ndarray',
    'shape',
    2,
    2,
    'strides',
    2,
    1,
    'offset',
    0,
    'order',
    'row-major',
    'dtype',
    'float64',
    'length',
    4,
    'capacity',
    4,
    'data',
    1,
    2,
    3,
    4
]
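For this example, the position of any element within the serialized array follows from the version length, the header length, the offset, and the strides. The sketch below hard-codes the strides and offset from the example for brevity; the helper name `flatIndex` is hypothetical.

```javascript
'use strict';

const serialized = [
    'version', '1.0.0',
    'ndarray',
    'shape', 2, 2,
    'strides', 2, 1,
    'offset', 0,
    'order', 'row-major',
    'dtype', 'float64',
    'length', 4,
    'capacity', 4,
    'data', 1, 2, 3, 4
];

// Version component: the 'version' literal plus the semver string.
const versionLength = 2;

// Header component: everything from 'ndarray' through 'data', inclusive.
const headerLength = serialized.indexOf( 'data' ) - versionLength + 1;

// Metadata copied from the example header above.
const strides = [ 2, 1 ];
const offset = 0;

// Hypothetical helper: maps ndarray subscripts to an index in the serialized array.
function flatIndex( subscripts ) {
    let idx = versionLength + headerLength + offset;
    for ( let i = 0; i < subscripts.length; i++ ) {
        idx += subscripts[ i ] * strides[ i ];
    }
    return idx;
}

console.log( headerLength ); // => 18
console.log( flatIndex( [ 1, 0 ] ) ); // => 22
console.log( serialized[ flatIndex( [ 1, 0 ] ) ] ); // => 3
```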

Note that this particular linear format is easily extendable to CSV/DSV serialization, where each column could represent a different ndarray.
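One way this could look in practice is sketched below, with each serialized ndarray written down its own column. The function name `toCSV` and the choice to pad shorter columns with empty cells are assumptions, not part of the RFC. No quoting or escaping is performed, which is only safe as long as no cell contains the delimiter.

```javascript
'use strict';

// Hypothetical helper: lays out serialized ndarrays as CSV columns.
function toCSV( columns ) {
    // The longest column determines the number of rows.
    const nrows = Math.max( ...columns.map( ( col ) => col.length ) );
    const rows = [];
    for ( let r = 0; r < nrows; r++ ) {
        // Shorter columns are padded with empty cells (an assumption).
        const cells = columns.map( ( col ) => ( r < col.length ) ? String( col[ r ] ) : '' );
        rows.push( cells.join( ',' ) );
    }
    return rows.join( '\n' );
}

// Two zero-dimensional arrays; the second has a larger data buffer than the first.
const a = [ 'version', '1.0.0', 'ndarray', 'shape', 'strides', 0, 'offset', 0, 'order', 'row-major', 'dtype', 'float64', 'length', 1, 'capacity', 1, 'data', 3.14 ];
const b = [ 'version', '1.0.0', 'ndarray', 'shape', 'strides', 0, 'offset', 1, 'order', 'row-major', 'dtype', 'int32', 'length', 1, 'capacity', 2, 'data', 7, 8 ];

console.log( toCSV( [ a, b ] ) );
```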

Proposal

As part of this RFC, the following packages are proposed:

@stdlib/ndarray/[base/]to-linear-exchange-format: serializes an ndarray to the proposed format.
@stdlib/ndarray/[base/]from-linear-exchange-format: converts a serialized ndarray to an ndarray instance.

where [base/] indicates both base and non-base package versions.

The format name and associated package names are not set in stone. Any naming suggestions are welcome.

Prior Art

ndarray objects can already be serialized to JSON, using the ndarray#toJSON method; however, the format is not a linear data structure (nor should it necessarily be) and does not serialize the data buffer outside of the array view. This prevents creating subsequent views of different sizes atop the same data buffer.
NumPy has an *.npy format; however, this does not include some of the metadata proposed in this RFC and is not human-readable.
NumPy also has an API, savetxt, for saving an ndarray to text; however, this is primarily oriented toward formatting, similar to @stdlib/string/format.

Related Issues

None.

Questions

No.

Other

No.

Checklist

[X] I have read and understood the Code of Conduct.
[X] Searched for existing issues and pull requests.
[X] The issue name begins with RFC:.

kgryte commented 7 months ago

cc @Planeshifter

Snehil-Shah commented 4 months ago

@kgryte From what I understand, we just have to write methods to serialize and deserialize a multi-dimensional ndarray into a linear array that preserves the original metadata, right? Is this issue open for contribution? I would like to work on this.

kgryte commented 4 months ago

@Snehil-Shah Let me circle back on this, as we may already have an implementation written elsewhere.

Snehil-Shah commented 4 months ago

> @Snehil-Shah Let me circle back on this, as we may already have an implementation written elsewhere.

Cool, I was also interested in contributing to the Google Sheets integration project. If you don't mind, can you point me to the hows and wheres? Thanks!

kgryte commented 3 months ago

@Snehil-Shah For that, see https://github.com/stdlib-js/gsheets.
