Open pierrot0 opened 4 months ago
Could it also be implemented as a transform (e.g., by having a new reshape attribute)?
If we do implement this as a transform, what datatype would you use? repeated Number?
One would still need to look at the transform to understand the kind of data to expect, no? Also in the above example, the data is already provided as a 4x4 matrix, which is what we want, so it would seem odd to me to apply a reshape on this.
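To make the concern concrete, here is a minimal sketch of what a reshape transform would do, assuming the field is stored as a flat "repeated Number" of 16 values. The `reshape` helper and the `flat_rotation` data are hypothetical and only for illustration; they are not part of Croissant.

```python
from math import prod


def reshape(flat, shape):
    """Reshape a flat list into nested lists with the given shape (row-major)."""
    if prod(shape) != len(flat):
        raise ValueError(
            f"cannot reshape {len(flat)} values into shape {tuple(shape)}"
        )
    if not shape:
        return flat[0]  # empty shape: a single scalar
    if len(shape) == 1:
        return list(flat)
    step = prod(shape[1:])  # number of values in each sub-array
    return [
        reshape(flat[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    ]


# Hypothetical field content, declared only as a flat list of numbers.
flat_rotation = [float(i) for i in range(16)]
matrix = reshape(flat_rotation, [4, 4])
```

Note that the declared data type says nothing about the 4x4 structure here; a reader only learns it from the transform's arguments.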
I see. Indeed in that case, the shape would be implicit which is not great. I was thinking of a NumPy-like approach where even scalars would be arrays:
>>> import numpy as np
>>> np.array(1).dtype, np.array(1).shape
(dtype('int64'), ())
So we wouldn't need cr:Array at all:
{
"@type": "cr:Field",
"@id": "recordsetName/rotation_matrix",
"description": "The rotation matrix.",
"dataType": "cr:Float",
"shape": [4, 4]
}
I like your above example.
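As a rough sketch of how a consumer could use the proposed "shape" attribute, the snippet below infers the shape of a nested list and compares it with the declared one, treating a missing "shape" as a scalar (shape `()`), in the spirit of the NumPy behaviour shown above. The helpers `infer_shape` and `matches_field` are hypothetical names, not an existing Croissant API.

```python
def infer_shape(value):
    """Return the shape of a nested list as a tuple; scalars have shape ()."""
    if not isinstance(value, list):
        return ()
    if not value:
        return (0,)
    shapes = {infer_shape(item) for item in value}
    if len(shapes) != 1:
        raise ValueError("ragged nested list: items have different shapes")
    return (len(value),) + shapes.pop()


def matches_field(value, field):
    """Check a value against a field's declared shape (default: scalar)."""
    return infer_shape(value) == tuple(field.get("shape", []))


# Field definition mirroring the proposal in this thread.
field = {
    "@type": "cr:Field",
    "@id": "recordsetName/rotation_matrix",
    "dataType": "cr:Float",
    "shape": [4, 4],
}
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
is_valid = matches_field(identity, field)
```

With this convention, a plain `cr:Float` field without "shape" keeps accepting scalars, so existing definitions would not need to change.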
There is currently no way to express that a field should be a multidimensional array, for example a 4x4 matrix.
An example of a dataset with such a need: MatrixCity (https://github.com/city-super/MatrixCity), which has a rotation matrix field in its data (distributed as JSON in this example):
One possibility might be to use JSON schema to represent such an array:
The benefit here is that JSON schema is quite complete, so it would be possible to express complex cases, including arrays of different types (useful in multimodal prompts for example).
The downside is that the range of possible schemas is quite large, and there is a risk that some datasets would end up with a single field defined in Croissant, whose type is a complex object described entirely in JSON schema... That would also significantly increase the implementation complexity.
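For illustration only (this is not the example from the original post), a 4x4 matrix of numbers can be described with standard JSON Schema keywords (`type`, `items`, `minItems`, `maxItems`). The sketch below writes such a schema as a Python dict and checks values with a hand-written `conforms` helper, which is hypothetical and supports only this tiny subset of JSON Schema.

```python
# A JSON Schema for a 4x4 matrix of numbers, written as a Python dict.
MATRIX_4X4_SCHEMA = {
    "type": "array",
    "minItems": 4,
    "maxItems": 4,
    "items": {
        "type": "array",
        "minItems": 4,
        "maxItems": 4,
        "items": {"type": "number"},
    },
}


def conforms(value, schema):
    """Check a value against the small JSON Schema subset used above."""
    if schema["type"] == "number":
        # JSON numbers exclude booleans, which Python treats as ints.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if schema["type"] == "array":
        if not isinstance(value, list):
            return False
        min_items = schema.get("minItems", 0)
        max_items = schema.get("maxItems", len(value))
        if not min_items <= len(value) <= max_items:
            return False
        item_schema = schema.get("items")
        return item_schema is None or all(
            conforms(item, item_schema) for item in value
        )
    raise NotImplementedError(f"unsupported type: {schema['type']}")


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
ok = conforms(identity, MATRIX_4X4_SCHEMA)
```

Even this small case needs a nested schema and a recursive checker, which gives a feel for the implementation cost mentioned above.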
A possible alternative might be to define our own Array dataType in the croissant namespace, similarly to cr:BoundingBox. For example, something like:
What do you folks think?