linkml / linkml

Linked Open Data Modeling Language
https://linkml.io/linkml

Specifying and validating an array of arrays of ... #895

Closed ialarmedalien closed 1 year ago

ialarmedalien commented 2 years ago

Describe the bug

The data source I'm parsing represents dates as an array of arrays of integers, i.e.

"date": [
    [ 2022, 8, 8 ]
]

I am looking for a way to represent and validate this using linkml, but without a list or array type, I'm not sure how to do it.

In JSONschema, it would be represented as

  "date": {
    "type": "array",
    "items": {
      "type": "array",
      "items": {
        "type": "integer"
      }
    }
  }

To Reproduce

Source data, list_of_lists_data.json

{ 
  "thing": {
    "list_of_lists": [
      [ 2020, 10, 8 ]
    ]
  }
}

linkml schema, list_of_lists.yaml

id: https://www.example/com/list_of_lists
name: list_of_lists
description: linkml spec containing a list of lists

prefixes:
  linkml: https://w3id.org/linkml/

imports:
  - linkml:types

default_range: string

slots:
  list_of_lists_of_ints:
    examples:
    - value: [[2020, 10, 8]]
    multivalued: true

classes:
  RootClass:
    tree_root: true
    attributes:
      thing:
        range: Thing

  Thing:
    slots:
    - list_of_lists_of_ints

As-is, the data file passes validation but there is no validation of the inner array of list_of_lists. If you add range: string to the definition of list_of_lists_of_ints, i.e.

slots:
  list_of_lists_of_ints:
    examples:
    - value: [[2020, 10, 8]]
    multivalued: true
    range: string

...it passes validation, even though the contents of the field should not be parsed as a string.

Creating custom types

Since the default validation doesn't work, I tried creating a specific type.

slots:
  list_of_lists_of_ints:
    examples:
    - value: [[2020, 10, 8]]
    multivalued: true
    range: ListOfIntsType

...
types:
  ListOfIntsType:  ## representing [2020, 10, 8]
    uri: rdf:List
    base: list  #  `base`: python base type that implements this type definition
    multivalued: true
    range: integer

...but there's no way to describe the type further as multivalued and range are not valid fields.

OK, how about creating a custom class?

slots:
  list_of_lists_of_ints:
    examples:
    - value: [[2020, 10, 8]]
    multivalued: true
    range: ListOfIntsClass  ## this is going to be representing [2020, 10, 8]

...
classes:
  ListOfIntsClass:
    attributes:
      '': # try to represent an anonymous array
        multivalued: true
        range: integer

Doesn't work (not surprisingly).

From the python perspective, it should be possible to validate an array of arrays of ints, but how does one specify this in linkML?
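For reference, the check being asked for is straightforward in plain Python. The following is only a minimal sketch of the intended validation, not anything LinkML generates; the helper names `is_int` and `is_list_of_int_lists` are made up for illustration.

```python
from typing import Any


def is_int(value: Any) -> bool:
    """True for real integers; bool is excluded even though it subclasses int."""
    return isinstance(value, int) and not isinstance(value, bool)


def is_list_of_int_lists(value: Any) -> bool:
    """Check that value is a list whose items are all lists of integers."""
    if not isinstance(value, list):
        return False
    return all(
        isinstance(inner, list) and all(is_int(item) for item in inner)
        for inner in value
    )


data = {"thing": {"list_of_lists": [[2020, 10, 8]]}}
print(is_list_of_int_lists(data["thing"]["list_of_lists"]))  # True
print(is_list_of_int_lists([["2020", 10, 8]]))  # False: inner string
print(is_list_of_int_lists([2020, 10, 8]))  # False: not nested
```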

cmungall commented 2 years ago

Great detective work and valiant attempts, but indeed this is not supported, largely by intention. LinkML tries to be unopinionated, but doing things such as modeling dates as lists does go against the grain.

Having said that, we should have some kind of support. I envision 3 alternative approaches:

  1. Multidimensional data approach
  2. Fixed column serializations
  3. Add an explicit list type

I'll attempt to explain each.

Multidimensional

For the multidimensional approach, it helps to think of a use case with multidimensional data. This is the kind of thing the HDMF format excels at. Let's say I have a dataset of temperatures at points on the globe at different time points.

image

The more LinkMLesque way to model this is something like

Dataset:
  attributes:
    observations:
      range: Observation
      multivalued: true
Observation:
  attributes:
    lat:
      range: decimal
    long:
      range: decimal
    height_in_meters:
      range: float
      unit:
        ucum_code: m
    temp_in_kelvin:
      range: float
      unit:
        ucum_code: K

Assume all fields are required. Optionally, we can add a compound unique key of (lat, long, height).

You can also imagine adding a time dimension here.

There isn't any need for LoLs in our modeling. However, the default way of storing this as json/yaml may be inefficient, and data scientists may prefer to manipulate multidimensional arrays.

There are a few different ways of serializing, including a flat list with accompanying metadata on the dimensions:

image

But this could also be serialized as a LoLoL (a list of lists of lists):

observations:
  _dimensions:
    lat: [100, 120, 140]
    long: [100, 120, 140]
    height:
    - 0
    - 20
    - 40
  _data:
  - - - 292
      - 293
      - 292
    - - 292
      - 290
      - 294
    - - 293
      - 293
      - 291
  - - - 296
      - 295
      - 293
    - - 296
      - 296
      - 296
    - - 296
      - 296
      - 296
  - - - 296
      - 298
      - 298
    - - 298
      - 298
      - 296
    - - 297
      - 297
      - 298

A specific example of this in our domain is the BIOM format.
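To make the relationship between the LoLoL serialization and the Observation objects concrete, here is a rough Python sketch that unfolds the nested `_data` array back into one record per cell. It assumes each dimension is a list of three values and that the nesting order of `_data` is lat, then long, then height; neither is stated explicitly above. The function name `unfold` is hypothetical.

```python
from itertools import product

# Dimension values and data copied from the YAML example above.
dimensions = {
    "lat": [100, 120, 140],
    "long": [100, 120, 140],
    "height": [0, 20, 40],
}
data = [
    [[292, 293, 292], [292, 290, 294], [293, 293, 291]],
    [[296, 295, 293], [296, 296, 296], [296, 296, 296]],
    [[296, 298, 298], [298, 298, 296], [297, 297, 298]],
]


def unfold(dimensions, data):
    """Turn a nested array plus dimension metadata into a list of flat records."""
    names = list(dimensions)  # assumed nesting order: outermost dimension first
    index_ranges = [range(len(dimensions[name])) for name in names]
    records = []
    for indices in product(*index_ranges):
        cell = data
        for i in indices:  # walk one nesting level per dimension
            cell = cell[i]
        record = {name: dimensions[name][i] for name, i in zip(names, indices)}
        record["temp_in_kelvin"] = cell
        records.append(record)
    return records


observations = unfold(dimensions, data)
print(len(observations))  # 27
print(observations[0])
# {'lat': 100, 'long': 100, 'height': 0, 'temp_in_kelvin': 292}
```

The two forms carry the same information; the nested form just avoids repeating the coordinates for every cell.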

Object-as-list Serialization

Here the basic idea is that you give each slot a rank and add an annotation stating that the object should be serialized as a list rather than as key-value pairs.

The internal representation would still be an object though.
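As a rough illustration of the idea (not an existing LinkML feature), the rank could drive both directions of the conversion. In this sketch the `rank` metadata key, the `Date` class, and the helpers `to_list` and `from_list` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Date:
    # "rank" is a hypothetical annotation giving each slot's position in the list.
    year: int = field(metadata={"rank": 1})
    month: int = field(metadata={"rank": 2})
    day: int = field(metadata={"rank": 3})


def ranked_fields(cls):
    """Return the dataclass fields ordered by their rank annotation."""
    return sorted(fields(cls), key=lambda f: f.metadata["rank"])


def to_list(obj):
    """Serialize an object as a list, ordered by slot rank."""
    return [getattr(obj, f.name) for f in ranked_fields(type(obj))]


def from_list(cls, values):
    """Build an object from a list whose positions follow the slot ranks."""
    slots = ranked_fields(cls)
    if len(values) != len(slots):
        raise ValueError(f"expected {len(slots)} values, got {len(values)}")
    return cls(**{f.name: v for f, v in zip(slots, values)})


d = from_list(Date, [2022, 8, 8])
print(d)           # Date(year=2022, month=8, day=8)
print(to_list(d))  # [2022, 8, 8]
```

The object keeps named slots internally, and only the serialized form is positional.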

List types

We would have a builtin class called List (possibly other types too) that would have a single slot for members, with a range defaulting to Any, but this could be subclassed with different constraints on members. We would need to extend existing parsers and serializers such that members is hidden and just the direct yaml/json structure is used. It's not clear what the best internal python implementation would be; we'd have to think about how this would work with dataclasses, pydantic, etc.

Which to choose?

The multidimensional data approach is part of a less specified longer term roadmap for LinkML, in collaboration with our LBNL colleagues who develop HDMF.

I suspect the other two would fit your use case better. I think the object-as-list serialization is likely easiest.

For the short term your best bet is an initial transform of the LoL data into LoObjects (a list of objects), which may be not wholly satisfactory...
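A pre-processing step along those lines could look like the sketch below. The field names `year`, `month`, and `day` are an assumption, since the source data only gives positions, and `lol_to_loo` is a made-up helper name.

```python
from typing import Dict, List

# Hypothetical field names; the source data only provides positions.
DATE_FIELDS = ("year", "month", "day")


def lol_to_loo(rows: List[List[int]]) -> List[Dict[str, int]]:
    """Convert a list of [year, month, day] lists into a list of objects."""
    objects = []
    for row in rows:
        if len(row) != len(DATE_FIELDS):
            raise ValueError(f"expected {len(DATE_FIELDS)} values, got {row!r}")
        objects.append(dict(zip(DATE_FIELDS, row)))
    return objects


print(lol_to_loo([[2022, 8, 8]]))
# [{'year': 2022, 'month': 8, 'day': 8}]
```

Each resulting object can then be described by an ordinary class with three integer attributes, used as the range of a multivalued slot.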

ialarmedalien commented 2 years ago

I would imagine that suggestions that the data provider change their date representations to something more sensible would get zero traction, unfortunately. It's a little frustrating as they provide YYYY-MM-DD and epoch timestamps elsewhere.

Due to other quirks with the data source, JSONschema, and linkML, it looks like I will have to do some transformations on the source data before it can be loaded, so converting the LoLs into LoOs (😆 -> 🚽) would be the most pragmatic approach.

I hadn't looked at JSONschema for a while, but it seems the newer drafts (post v7) have much more powerful ways of specifying data structures involving arrays. Worth bearing in mind if/when linkml is extended to cover nested arrays?

cmungall commented 2 years ago

You win the internet for that emoji chain!

Good call with the json-schema link. In fact, I think conceiving of Object-as-list Serialization simply as Tuples is the most elegant solution to your date representation and other analogous issues, and it would work well with an internal tuple representation in most languages, e.g.

>>> import yaml
>>> print(yaml.safe_dump((2022,8,9)))
- 2022
- 8
- 9

cmungall commented 8 months ago

@ialarmedalien, we finally have first-class array support in the metamodel.

We are now opening more targeted issues, e.g