Closed ialarmedalien closed 1 year ago
Great detective work and valiant attempts, but indeed this is not supported, largely by intention. LinkML tries to be unopinionated, but doing things such as modeling dates as lists does go against the grain.
Having said that, we should have some kind of support. I envision 3 alternative approaches: multidimensional data, object-as-list serialization, and list types.
I'll attempt to explain each.
Multidimensional
For the multidimensional approach, it helps to think of a use case with multidimensional data. This is the kind of thing the HDMF format excels at. Let's say I have a dataset of temperatures at points on the globe at different time points.
The more LinkMLesque way to model this is something like
Dataset:
  attributes:
    observations:
      range: Observation
      multivalued: true
Observation:
  attributes:
    lat:
      range: decimal
    long:
      range: decimal
    height_in_meters:
      range: float
      unit:
        ucum_code: m
    temp_in_kelvin:
      range: float
      unit:
        ucum_code: K
Assume all fields are required. Optionally, we can add a compound unique key of (lat, long, height).
You can also imagine adding a time dimension here
There isn't any need for LoLs in our modeling. However, the default way of storing this as json/yaml may be inefficient, and data scientists may prefer to manipulate multidimensional arrays.
There are a few different ways of serializing, including a flat list with accompanying metadata on the dimensions.
But this could also be serialized as a LoLoL:
observations:
  _dimensions:
    lat: [100, 120, 140]
    long: [100, 120, 140]
    height:
      - 0
      - 20
      - 40
  _data:
    - - - 292
        - 293
        - 292
      - - 292
        - 290
        - 294
      - - 293
        - 293
        - 291
    - - - 296
        - 295
        - 293
      - - 296
        - 296
        - 296
      - - 296
        - 296
        - 296
    - - - 296
        - 298
        - 298
      - - 298
        - 298
        - 296
      - - 297
        - 297
        - 298
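As a rough sketch of how flat Observation records might be pivoted into the `_dimensions`/`_data` layout shown above: this is plain Python, not LinkML functionality, and `to_grid`, its parameters, and the sample values are hypothetical names chosen for illustration.

```python
from itertools import product


def to_grid(observations, dims=("lat", "long", "height_in_meters"),
            value="temp_in_kelvin"):
    """Pivot flat observation dicts into sorted axes plus a nested list."""
    # One sorted axis per dimension, built from the values actually observed.
    axes = {d: sorted({obs[d] for obs in observations}) for d in dims}
    # Map each coordinate tuple to its measured value.
    lookup = {tuple(obs[d] for d in dims): obs[value] for obs in observations}

    def build(prefix, remaining):
        if not remaining:
            # None marks a grid cell that has no observation.
            return lookup.get(prefix)
        head, *tail = remaining
        return [build(prefix + (v,), tail) for v in axes[head]]

    return {"_dimensions": axes, "_data": build((), list(dims))}


# Made-up sample data on a 2 x 2 x 2 grid.
observations = [
    {"lat": lat, "long": lon, "height_in_meters": h, "temp_in_kelvin": 290.0 + i}
    for i, (lat, lon, h) in enumerate(product([100, 120], [100, 120], [0, 20]))
]
grid = to_grid(observations)
```

The nesting order of `_data` follows the order of `dims`, so a consumer needs the `_dimensions` metadata to interpret an index such as `grid["_data"][1][0][1]`.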
A specific example of this kind of serialization in our domain is the biom format.
Object-as-list Serialization
Here the basic idea is that you give each slot a rank and add an annotation that states the object should be serialized as a list rather than as key-value pairs.
The internal representation would still be an object though.
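A minimal Python sketch of the idea, assuming a hypothetical per-slot `rank` annotation (stored here in dataclass field metadata). None of these names exist in LinkML; they only illustrate that the object stays an object internally while the serialized form is a positional list.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Date:
    # "rank" is a hypothetical annotation giving each slot's list position.
    year: int = field(metadata={"rank": 1})
    month: int = field(metadata={"rank": 2})
    day: int = field(metadata={"rank": 3})


def ranked_fields(cls):
    """Return the dataclass fields ordered by their rank annotation."""
    return sorted(fields(cls), key=lambda f: f.metadata["rank"])


def to_list(obj):
    """Serialize an object as a list of slot values in rank order."""
    return [getattr(obj, f.name) for f in ranked_fields(type(obj))]


def from_list(cls, values):
    """Rebuild an object from a rank-ordered list of slot values."""
    names = [f.name for f in ranked_fields(cls)]
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} values, got {len(values)}")
    return cls(**dict(zip(names, values)))
```

With this in place, `to_list(Date(2022, 8, 9))` gives `[2022, 8, 9]`, and `from_list(Date, [2022, 8, 9])` restores the object.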
List types
We would have a builtin class called List (possibly other types too) that would have a single slot for members, defaulting to Any, but this could be subclassed with different constraints on members. We would need to extend existing parsers and serializers such that members is hidden and just the direct yaml/json structure is used. It's not clear what the best internal python implementation would be; we'd have to think about how this would work with dataclasses, pydantic, etc.
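One way this could look with plain dataclasses, purely as a thought experiment: `ListOf`, `member_type`, `dump`, and `load` are hypothetical names, not an existing or planned LinkML API. The wrapper keeps a `members` slot internally, while `dump` and `load` hide it so that only the bare nested list appears in yaml/json.

```python
from dataclasses import dataclass, field


@dataclass
class ListOf:
    """Hypothetical builtin list class with a single 'members' slot."""
    members: list = field(default_factory=list)
    # Unannotated, so this is a class attribute rather than a dataclass field.
    # 'object' plays the role of Any; subclasses narrow it.
    member_type = object

    def __post_init__(self):
        for m in self.members:
            if not isinstance(m, self.member_type):
                raise TypeError(f"{m!r} is not a {self.member_type.__name__}")


class IntList(ListOf):
    member_type = int  # note: bool is a subclass of int in Python


class IntListList(ListOf):
    member_type = IntList


def dump(value):
    """Serialize to plain nested lists, hiding the 'members' slot."""
    if isinstance(value, ListOf):
        return [dump(m) for m in value.members]
    return value


def load(cls, raw):
    """Parse plain nested lists into wrapper objects, validating members."""
    inner = cls.member_type
    if issubclass(inner, ListOf):
        return cls([load(inner, item) for item in raw])
    return cls(list(raw))
```

Here `dump(load(IntListList, [[2022, 8, 9]]))` round-trips to `[[2022, 8, 9]]`, and loading `[["2022"]]` raises a `TypeError`.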
Which to choose?
The multidimensional data approach is part of a less specified longer term roadmap for LinkML, in collaboration with our LBNL colleagues who develop HDMF
I suspect the other two would fit your use case better. I think the object-as-list serialization is likely easiest
For the short term your best bet is an initial transform of the LoL data into a list of objects (LoObjects), which may not be wholly satisfactory...
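Such a pre-processing step could be as small as the following sketch. The key names `year`, `month`, and `day` are an assumption about what each position in the inner list means, and `dates_to_objects` is a hypothetical helper name.

```python
def dates_to_objects(rows, keys=("year", "month", "day")):
    """Convert a list of lists into a list of objects (dicts).

    Assumes every inner list has exactly one value per key, in key order.
    """
    objects = []
    for row in rows:
        if len(row) != len(keys):
            raise ValueError(f"expected {len(keys)} values, got {row!r}")
        objects.append(dict(zip(keys, row)))
    return objects


converted = dates_to_objects([[2022, 8, 9], [2021, 12, 31]])
```

The resulting dicts can then be modeled as an ordinary class with three integer slots, which is already expressible in a schema.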
I would imagine that suggestions that the data provider change their date representations to something more sensible would get zero traction, unfortunately. It's a little frustrating as they provide YYYY-MM-DD and epoch timestamps elsewhere.
Due to other quirks with the data source, JSONschema, and linkML, it looks like I will have to do some transformations on the source data before it can be loaded, so converting the LoLs into LoOs (😆 -> 🚽) would be the most pragmatic approach.
I hadn't looked at JSONschema for a while, but it seems the newer drafts (post v7) have much more powerful ways of specifying data structures involving arrays. Worth bearing in mind if/when linkml is extended to cover nested arrays?
You win the internet for that emoji chain!
Good call with the json-schema link. In fact I think conceiving of Object-as-list Serialization simply as Tuples is the most elegant solution to your date representation and other analogous issues, and it would work well with an internal tuple representation in most languages, e.g.
>>> import yaml
>>> print(yaml.safe_dump((2022, 8, 9)))
- 2022
- 8
- 9
@ialarmedalien, we finally have first-class array support in the metamodel.
We are now opening more targeted issues, e.g
Describe the bug
The data source I'm parsing represents dates as an array of arrays of integers, i.e.
I am looking for a way to represent and validate this using linkml, but without a list or array type, I'm not sure how to do it.
In JSONschema, it would be represented as
To Reproduce
Source data, list_of_lists_data.json
linkml schema, list_of_lists.yaml
As-is, the data file passes validation but there is no validation of the inner array of list_of_lists. If you add range: string to the definition of list_of_lists_of_ints, i.e.
...it passes validation, even though the contents of the field should not be parsed as a string.
Creating custom types
Since the default validation doesn't work, I tried creating a specific type.
...but there's no way to describe the type further as multivalued and range are not valid fields.
OK, how about creating a custom class?
Doesn't work (not surprisingly).
From the python perspective, it should be possible to validate an array of arrays of ints, but how does one specify this in linkML?
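For reference, the check in plain Python is short. This is only a sketch of the intended validation, not something linkml generates, and `is_list_of_int_lists` is a hypothetical helper name.

```python
def is_list_of_int_lists(value):
    """Return True if value is a list whose items are all lists of ints."""

    def is_int(item):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(item, int) and not isinstance(item, bool)

    return isinstance(value, list) and all(
        isinstance(inner, list) and all(is_int(item) for item in inner)
        for inner in value
    )
```

Under this check `[[2022, 8, 9]]` is accepted, while `[["2022", "8", "9"]]` and a bare string are rejected.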