mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

Support for multidimensional arrays in Croissant #649

Open pierrot0 opened 4 months ago

pierrot0 commented 4 months ago

There is currently no way to express that a field should be a multidimensional array, for example a 4x4 matrix.

An example of a dataset with this need is MatrixCity (https://github.com/city-super/MatrixCity), whose data includes a rotation matrix field (distributed as JSON in this example):

        {
            "frame_index": 0,
            "rot_mat": [
                [
                    -0.009902680292725563,
                    0.0010966990375891328,
                    -0.0008568363264203072,
                    -590.0
                ],
                [
                    -0.0013917317846789956,
                    -0.0078034186735749245,
                    0.006096699275076389,
                    590.0
                ],
                [
                    -8.448758914703092e-10,
                    0.0061566149815917015,
                    0.007880106568336487,
                    200.0
                ],
                [
                    0.0,
                    0.0,
                    0.0,
                    1.0
                ]
            ],
            "euler": [
                0.6632251739501953,
                8.44875884808971e-08,
                -3.0019662380218506
            ]
        },

One possibility might be to use JSON schema to represent such an array:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "array",
  "items": {
    "type": "array",
    "items": {"type": "number"},
    "minItems": 4,
    "maxItems": 4
  },
  "minItems": 4,
  "maxItems": 4
}

The benefit here is that JSON schema is quite complete, so it would be possible to express complex cases, including arrays of different types (useful in multimodal prompts for example).

The downside is that the range of possible schemas is quite large, and there is a risk that some datasets would end up with a single field defined in Croissant whose type is a complex object described in JSON schema. That would also significantly increase the implementation complexity.
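
To make the trade-off concrete, here is a minimal sketch of checking a value against the 4x4 schema above. It is a hand-rolled check covering only the keywords used in that schema (`type`, `items`, `minItems`, `maxItems`), not a full JSON schema validator; the names `conforms` and `MATRIX_4X4_SCHEMA` are made up for illustration.

```python
# Hand-rolled check for the small JSON Schema subset used above
# (type, items, minItems, maxItems). Not a full JSON Schema validator.

MATRIX_4X4_SCHEMA = {
    "type": "array",
    "items": {
        "type": "array",
        "items": {"type": "number"},
        "minItems": 4,
        "maxItems": 4,
    },
    "minItems": 4,
    "maxItems": 4,
}


def conforms(value, schema):
    """Return True if `value` satisfies the supported subset of `schema`."""
    kind = schema.get("type")
    if kind == "number":
        # bool is a subclass of int in Python, but it is not a JSON number.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "array":
        if not isinstance(value, list):
            return False
        if len(value) < schema.get("minItems", 0):
            return False
        if "maxItems" in schema and len(value) > schema["maxItems"]:
            return False
        item_schema = schema.get("items")
        return item_schema is None or all(conforms(v, item_schema) for v in value)
    raise ValueError(f"unsupported schema type: {kind!r}")


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(conforms(identity, MATRIX_4X4_SCHEMA))      # True
print(conforms(identity[:3], MATRIX_4X4_SCHEMA))  # False: only 3 rows
```

Even this tiny subset needs a recursive validator, which gives a feel for the implementation cost of supporting arbitrary schemas.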

A possible alternative might be to define our own Array dataType in the croissant namespace, similarly to cr:BoundingBox. For example, something like:

{
  "@type": "cr:Field",
  "@id": "recordsetName/rotation_matrix",
  "description": "The rotation matrix.",
  "dataType": "cr:Array",
  "dataTypeParams": {
    "dimensions": [4, 4],
    "dataType": "sc:Float"
  },
  "source": {
    "fileSet": { ... },
    "extract": {
      "jsonPath": "..."
    }
  }
}
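
For illustration, a consumer could check a parsed value against such a field roughly as follows. This is only a sketch: `dataTypeParams` and `dimensions` are the names proposed above and not part of the Croissant specification, and the helper `matches_array_field` and the `PYTHON_TYPES` mapping are assumptions made for this example.

```python
# Hypothetical check of a value against the proposed "cr:Array" field.
# "dataTypeParams" and "dimensions" come from the proposal above; they are
# not part of the Croissant specification.

# Assumed mapping from element dataType names to accepted Python types.
# Integers are accepted for sc:Float because JSON does not distinguish 1 from 1.0.
PYTHON_TYPES = {"sc:Float": (int, float), "sc:Integer": (int,)}

field = {
    "@type": "cr:Field",
    "dataType": "cr:Array",
    "dataTypeParams": {"dimensions": [4, 4], "dataType": "sc:Float"},
}


def matches_array_field(value, field):
    """Return True if the nested lists in `value` match the declared dimensions and element type."""
    params = field["dataTypeParams"]
    element_types = PYTHON_TYPES[params["dataType"]]

    def check(node, dims):
        if not dims:
            # Leaf: must be a scalar of the element type (bool excluded).
            return isinstance(node, element_types) and not isinstance(node, bool)
        return (
            isinstance(node, list)
            and len(node) == dims[0]
            and all(check(child, dims[1:]) for child in node)
        )

    return check(value, params["dimensions"])


matrix = [[float(4 * row + col) for col in range(4)] for row in range(4)]
print(matrix_ok := matches_array_field(matrix, field))  # True
print(matches_array_field(matrix[0], field))            # False: a 1-D list of 4 floats
```

The check only needs two parameters, which is the main appeal compared with interpreting an arbitrary JSON schema.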

What do you folks think?

marcenacp commented 4 months ago

Could it also be implemented as a transform (e.g., by having a new reshape attribute)?

pierrot0 commented 4 months ago

If we do implement this as a transform, what datatype would you use? repeated Number?

One would still need to look at the transform to understand the kind of data to expect, no? Also, in the above example the data is already provided as a 4x4 matrix, which is what we want, so it would seem odd to me to apply a reshape to it.
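
To illustrate the concern, here is a sketch of what a reshape transform would do when the source provides a flat list of numbers. The `reshape` helper is hypothetical (Croissant has no such attribute today); the point is that the declared type would stay a flat list of numbers and the 4x4 shape would only be visible in the transform.

```python
from math import prod


def reshape(flat, shape):
    """Reshape a flat list into nested lists of the given shape (row-major order)."""
    if prod(shape) != len(flat):
        raise ValueError(f"cannot reshape {len(flat)} values into shape {shape}")
    if not shape:
        # Empty shape: a single scalar.
        return flat[0]
    if len(shape) == 1:
        return list(flat)
    step = prod(shape[1:])  # number of values in each sub-array
    return [reshape(flat[i * step:(i + 1) * step], shape[1:]) for i in range(shape[0])]


flat = list(range(16))         # what a flat "repeated Number" field would hold
matrix = reshape(flat, [4, 4])
print(matrix[1])               # [4, 5, 6, 7]
```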

marcenacp commented 4 months ago

I see. Indeed, in that case the shape would be implicit, which is not great. I was thinking of a NumPy-like approach where even scalars would be arrays:

>>> import numpy as np
>>> np.array(1).dtype, np.array(1).shape
(dtype('int64'), ())

So we wouldn't need cr:Array at all:

{
  "@type": "cr:Field",
  "@id": "recordsetName/rotation_matrix",
  "description": "The rotation matrix.",
  "dataType": "sc:Float",
  "shape": [4, 4]
}
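
A rough sketch of how a reader of the metadata could use such a `shape` attribute, with scalars treated as arrays of shape `()` as in NumPy. The `shape` key is the one proposed above; the helpers `infer_shape` and `matches_field` are hypothetical and written in plain Python so the example has no dependencies.

```python
def infer_shape(value):
    """Return the shape of nested lists as a tuple; scalars have shape ()."""
    if not isinstance(value, list):
        return ()
    if not value:
        return (0,)
    child_shapes = {infer_shape(child) for child in value}
    if len(child_shapes) != 1:
        raise ValueError("ragged nesting: children have different shapes")
    return (len(value),) + child_shapes.pop()


def matches_field(value, field):
    """Check `value` against a field's "shape"; a missing "shape" means a scalar."""
    return infer_shape(value) == tuple(field.get("shape", ()))


matrix_field = {"@type": "cr:Field", "shape": [4, 4]}
scalar_field = {"@type": "cr:Field"}  # no "shape": scalar, i.e. shape ()

matrix = [[0.0] * 4 for _ in range(4)]
print(infer_shape(matrix))                  # (4, 4)
print(matches_field(matrix, matrix_field))  # True
print(matches_field(1.0, scalar_field))     # True
```

With this convention existing scalar fields need no change, since omitting `shape` is equivalent to declaring an empty shape.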

pierrot0 commented 4 months ago

I like your example above.