linkml / linkml-arrays

Support for loading and dumping N-dimensional arrays in LinkML.
https://linkml.github.io/linkml-arrays
BSD 3-Clause "New" or "Revised" License

Division between array serialization and specification #6

Open sneakers-the-rat opened 8 months ago

sneakers-the-rat commented 8 months ago

I've said this a few times when we've talked on Zoom during the hackathons, so I don't mean to be a broken record, but one of the places where a lot of prior schema languages have gone wrong with array specification is taking on too much of the weight of specifying the actual encoding of the arrays, rather than being a schematic description that is generic across serializations.

The generality of the current form is pretty good! One way that I see us buying more complexity than we need, though, is in this GroupingByArrayOrder idea: https://github.com/linkml/linkml-model/blob/aab9842be0e230c0040688dfc6ffa26696c97827/linkml_model/model/schema/array.yaml#L67-L94

That's an implementation detail of how arrays are stored and indexed. I don't think we should touch the storage part in the schema, and the indexing part is handled by the rest of the array specification, right? I could be missing something that requires it to be specified in the schema, but in general I think it would be good to make a clear separation of concerns here. A decent test is: "can this array specification be satisfied in such a way that the schema knows absolutely nothing about the way that the array is serialized?" The responsibility for getting the array ordering correct would then be that of the dumper/loader, similar to how we would expect the dumper/loader to correctly handle chunking and other serialization details.
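
To make that test concrete, here is a rough sketch in plain Python. Everything in it (`ArraySpec`, `conforms`, `dump`, `load`) is a made-up name for illustration, not anything in linkml-arrays or linkml-model. The point is only that the spec check looks at the logical array, while row-major vs column-major ordering lives entirely in the dumper/loader pair.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


@dataclass(frozen=True)
class ArraySpec:
    """Hypothetical schema-side description: dimensions only, no storage details."""
    shape: Tuple[int, int]


def conforms(array: Matrix, spec: ArraySpec) -> bool:
    """Check the logical array against the spec; never inspects buffers or ordering."""
    rows, cols = spec.shape
    return len(array) == rows and all(len(row) == cols for row in array)


def dump(array: Matrix, order: str) -> List[float]:
    """Flatten to a 1-D buffer in row-major ("C") or column-major ("F") order."""
    if order == "C":
        return [value for row in array for value in row]
    if order == "F":
        return [row[j] for j in range(len(array[0])) for row in array]
    raise ValueError(f"unknown order: {order!r}")


def load(buffer: Sequence[float], shape: Tuple[int, int], order: str) -> Matrix:
    """Rebuild the logical array; the ordering is the loader's problem, not the schema's."""
    rows, cols = shape
    if order == "C":
        return [list(buffer[i * cols:(i + 1) * cols]) for i in range(rows)]
    if order == "F":
        return [[buffer[j * rows + i] for j in range(cols)] for i in range(rows)]
    raise ValueError(f"unknown order: {order!r}")


spec = ArraySpec(shape=(2, 3))
logical = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

c_buffer = dump(logical, "C")  # row-major layout
f_buffer = dump(logical, "F")  # column-major layout

# Two different stored layouts, one logical array, one spec that never mentions order.
from_c = load(c_buffer, spec.shape, "C")
from_f = load(f_buffer, spec.shape, "F")
```

Both round trips give back the same logical array and both satisfy `spec`, even though the two buffers differ, which is the sense in which the schema can stay ignorant of the serialization.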

This is actually what I want to work on at the hackashop: a second set of specifications for declaring serializations, so in a linked data context one would be able to say "this particular array has n linked serializations - this numpy format, that zarr format, etc." without having that be specified in the array's schema. That would be a way of saying "this particular hash of a binary stream is annotated as being a numpy ndarray with shape (x,y)", plus all the other details needed to handle the serialization/deserialization, that could be consumed by generalized dumpers/loaders. So we may want to just talk about this next week :)
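
As a very rough sketch of what I mean by annotating a hash of a binary stream: the names below (`SerializationAnnotation`, `annotate`, `load`, the `"raw-float64-le"` format label) are all hypothetical, and I'm using a toy raw-bytes format from the standard library as a stand-in for numpy or zarr. The annotation sits next to the bytes, not in the array's schema, and a generic loader dispatches on it.

```python
import hashlib
import struct
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class SerializationAnnotation:
    """Hypothetical record describing how one particular byte stream is encoded."""
    sha256: str              # identifies the binary stream being described
    format: str              # toy label; a real one might name numpy or zarr
    shape: Tuple[int, ...]   # shape of the stored array


def decode_raw_float64_le(data: bytes, shape: Tuple[int, ...]) -> List[float]:
    """Decode little-endian float64 values; returns them flat for brevity."""
    count = 1
    for dim in shape:
        count *= dim
    return list(struct.unpack(f"<{count}d", data))


# Registry a generalized loader could consult; one entry per known format.
DECODERS: Dict[str, Callable[[bytes, Tuple[int, ...]], List[float]]] = {
    "raw-float64-le": decode_raw_float64_le,
}


def annotate(data: bytes, fmt: str, shape: Tuple[int, ...]) -> SerializationAnnotation:
    """Attach format and shape to the hash of a byte stream."""
    return SerializationAnnotation(hashlib.sha256(data).hexdigest(), fmt, shape)


def load(data: bytes, annotation: SerializationAnnotation) -> List[float]:
    """Generic loader: verify the hash, then dispatch on the annotated format."""
    if hashlib.sha256(data).hexdigest() != annotation.sha256:
        raise ValueError("annotation does not describe this byte stream")
    return DECODERS[annotation.format](data, annotation.shape)


payload = struct.pack("<6d", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
note = annotate(payload, "raw-float64-le", (2, 3))
values = load(payload, note)
```

One array could then carry several such annotations, one per linked serialization, and the schema for the array itself would not need to change when a new format is added to the registry.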