smart-data-models / data-models

Data Models in common use based on real world use-cases. These definitions underpin a digital market of interoperable and replicable smart solutions.
https://smartdatamodels.org/

Subschemas are not deterministic #41

Closed auphofBSF closed 1 year ago

auphofBSF commented 1 year ago

I believe there is an issue with the determinism of subschemas in a FIWARE SmartDataModel (SDM) model. For example, in Device the subschema "$ref": "https://smart-data-models.github.io/data-models/common-schema.json#/definitions/GSMA-Commons" can change without a user being aware.

The issue I will demonstrate and discuss is the effect on downstream uses of a schema.json when the versioning and configuration management of referenced subschemas such as GSMA-Commons is not deterministic.

This is exemplified by an experience I just had, which I detail below.

I have an automated process whereby I generate Python pydantic model objects for any FIWARE SDM. This pydantic model is then used to interact with or extend a legacy data structure. Through this mechanism I would like to offer our existing data to 3rd parties or new processes, or consume data (with appropriate access), against a FIWARE SDM standard that is deterministic.

During development and testing I regenerated one of these SDM models as a pydantic object and ended up with a significantly different model object, whereas over the prior days the pydantic model had been generated repeatedly and deterministically. The model's schema.json did not change, so I had to figure out what had changed!

I believe the cause is changes in the subschemas that the $refs point to. These subschema links point to snapshots and are not deterministic. I cannot identify what changed or when, although I can see that some backend activity must have happened. Unfortunately, for the subschema commons-GSMA, even looking at the source repo and at its full history with git log -p -- commons-gsma.json, I could not find a commit for that subschema that would have generated the pydantic model that I had committed 3 days prior.

I had assumed that the SDM schema.json at a particular commit would deterministically regenerate the pydantic model as long as the generator and schema were the same. I now believe users of any SDM schema are vulnerable to subschema changes introducing potentially breaking changes.
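One way to detect this kind of silent drift is to fingerprint the root schema together with every document its $refs reach, and store that fingerprint next to the generated model. The following is a minimal sketch, not part of any SDM tooling: the function names are hypothetical, the documents are assumed to have been fetched already into a dict keyed by URL, and $refs are assumed to be either absolute URLs or local "#/..." fragments.

```python
import hashlib
import json


def collect_refs(node):
    """Yield every "$ref" string found anywhere inside a parsed JSON schema."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                yield value
            else:
                yield from collect_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_refs(item)


def schema_fingerprint(root_url, documents):
    """Hash the root schema plus every external document reachable via $ref.

    `documents` maps a document URL (without fragment) to its parsed JSON.
    A KeyError means a referenced document was not supplied.
    """
    seen = set()
    pending = [root_url]
    while pending:
        url = pending.pop()
        if url in seen:
            continue
        seen.add(url)
        for ref in collect_refs(documents[url]):
            target = ref.split("#", 1)[0]
            if target:  # an empty target is a local "#/definitions/..." reference
                pending.append(target)

    digest = hashlib.sha256()
    for url in sorted(seen):
        # Canonical serialisation so key order and whitespace do not matter.
        canonical = json.dumps(documents[url], sort_keys=True, separators=(",", ":"))
        digest.update(url.encode("utf-8"))
        digest.update(canonical.encode("utf-8"))
    return digest.hexdigest()
```

If the fingerprint recorded with a previously generated model differs from the one computed today, some document in the reference tree changed even though the root schema.json did not.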

Configuration

The pydantic model that showed the problem uses the following Subject and DataModel:

    subject_name="dataModel.Device",
    data_model_name="Device",

The pydantic model generator has had no changes (same GitHub commit).

I am accessing the schema.json for Device from a local clone of the repo https://github.com/smart-data-models/dataModel.Device.git. I pulled it a few days ago at the then-latest commit and have made no updates since. My git log is still at 30th Aug:

git log
commit f8c87c97cb8e1add4687a70b1f65bdd5409d706b (HEAD -> master, origin/master, origin/HEAD)
Author: Alberto Abella <alberto.abella@fiware.org>
Date:   Tue Aug 30 09:55:39 2022 +0200

    beta version of DTDL digital twin

commit e220b82c7f4eb20fc6ffe3c4ba341769b016eea3
Author: Alberto Abella <alberto.abella@fiware.org>
Date:   Tue Aug 30 09:55:17 2022 +0200

    Exported example to csv example.jsonld

Analysis

What changed, and how do we control for change?

The model generator is under git version control and confirmed to have no changes. It builds out from schema.json and retrieves definitions from the referenced subschemas.

In the case of Device this root schema.json is uniquely versioned in the source repo. Hence my pydantic model object can be versioned to this schema commit https://github.com/smart-data-models/dataModel.Device/blob/f8c87c97cb8e1add4687a70b1f65bdd5409d706b/Device/schema.json#L3. In most cases there is also a version attribute in the schema, i.e. "$schemaVersion": "0.0.7".

Example of this root schema.json


  "$schema": "http://json-schema.org/schema#",
  "$schemaVersion": "0.0.7",
  "modelTags": "",
  "$id": "https://smart-data-models.github.io/dataModel.Device/Device/schema.json",
  "title": " Smart Data Models - Device schema",
  "description": "An apparatus (hardware + software + firmware) intended to accomplish a particular task (sensing the environment, actuating, etc.).",
  "type": "object",
  "allOf": [
    {
      "$ref": "https://smart-data-models.github.io/data-models/common-schema.json#/definitions/GSMA-Commons"
    },
    {
      "$ref": "https://smart-data-models.github.io/data-models/common-schema.json#/definitions/Location-Commons"
    },
    {
      "$ref": "https://smart-data-models.github.io/dataModel.Device/device-schema.json#/definitions/Device-Commons"
    },

The root schema for Device can be deterministically versioned. Two attributes can be assigned to my pydantic model: "SDMVersion": "0.0.7.f8c87c97cb8e1add4687a70b1f65bdd5409d706b", in the form <$schemaVersion>.<GitCommit>, and "SDMrootSchema": "https://raw.githubusercontent.com/smart-data-models/dataModel.Device/f8c87c97cb8e1add4687a70b1f65bdd5409d706b/Device/schema.json", in the form <$immutableSourceURL>. Together they link my pydantic model to a deterministic root schema.
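As an illustration of how these two attributes could be derived, here is a small sketch. The function name and its parameters are hypothetical; only the two attribute names and the URL layout come from the example above.

```python
def sdm_version_metadata(schema, owner_repo, commit, path):
    """Build the SDMVersion / SDMrootSchema attributes for a generated model.

    schema     -- the parsed root schema.json (a dict)
    owner_repo -- e.g. "smart-data-models/dataModel.Device"
    commit     -- full git commit hash of the root schema
    path       -- path of schema.json inside the repo, e.g. "Device/schema.json"
    """
    # Not every schema carries $schemaVersion, so fall back to a marker value.
    schema_version = schema.get("$schemaVersion", "unknown")
    return {
        "SDMVersion": f"{schema_version}.{commit}",
        "SDMrootSchema": f"https://raw.githubusercontent.com/{owner_repo}/{commit}/{path}",
    }


metadata = sdm_version_metadata(
    {"$schemaVersion": "0.0.7"},
    "smart-data-models/dataModel.Device",
    "f8c87c97cb8e1add4687a70b1f65bdd5409d706b",
    "Device/schema.json",
)
```

This only pins the root schema; it says nothing about the subschemas the root pulls in through $ref.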

The issue in the case of Device is that the subschemas introduced by

"allOf": [
    {"$ref":….},
    {"$ref":….},
    …… 
    ]

are not deterministic: there is no versioning in the $ref subschema URLs. Any change to a referenced schema could introduce a breaking change to a pydantic model derived from the root schema.json.

The question then becomes where this versioning should be managed. Any change in an nth-tier subschema will necessitate a dependency-driven version increment in all parent schemas above it.
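To see which references in a schema are exposed to this, the $refs can be sorted into local, commit-pinned and floating. This is a rough sketch with hypothetical names. Treating a URL as "pinned" when it contains a 40-character hex path segment is my own heuristic based on the raw.githubusercontent.com URL layout, not an SDM convention.

```python
import re

# Heuristic assumption: a full git commit hash appears as its own path segment.
COMMIT_SEGMENT = re.compile(r"/[0-9a-f]{40}/")


def iter_refs(node):
    """Yield every "$ref" string found anywhere inside a parsed JSON schema."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                yield value
            else:
                yield from iter_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_refs(item)


def classify_refs(schema):
    """Group $refs into local fragments, commit-pinned URLs and floating URLs."""
    result = {"local": [], "pinned": [], "floating": []}
    for ref in iter_refs(schema):
        if ref.startswith("#"):
            result["local"].append(ref)
        elif COMMIT_SEGMENT.search(ref):
            result["pinned"].append(ref)
        else:
            result["floating"].append(ref)
    return result
```

Under this heuristic, all three $refs in the Device allOf excerpt above would land in the "floating" group.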

Options

  1. Centrally control dependency management for referenced subschemas. I believe this would need to be done in the FIWARE SDM Subject repo, i.e. the $ref changes from https://smart-data-models.github.io/data-models/common-schema.json#/definitions/GSMA-Commons to https://raw.githubusercontent.com/smart-data-models/data-models/c4ee5d39bcbacdc30700bcd2d916aaf2c50dc86e/common-schema.json#/definitions/GSMA-Commons. This would be my preferred option.

    • All parent schemas of a changed subschema need to have the $ref updated and be committed with the new subschema reference. Change control and version dependency management are then easily visible to all consumers.

    • Migrations can be planned

    • Viability depends on whether this is human or script managed. The depth of the tree of referenced subschemas has some bearing, but it is not impossible for an algorithm to process. Without a deep look into all models, I manually viewed the SDM dataModel.Device/Device and particularly definitions/GSMA-Commons. There do not appear to be any schema references external to this $ref.

      However, some of the $refs are relative and some are full URLs. Ideally they should all be relative, see

      
          "owner": {
            "type": "array",
            "description": "Property. A List containing a JSON encoded sequence of characters referencing the unique Ids of the owner(s)",
            "items": {
              "$ref": "https://smart-data-models.github.io/data-models/common-schema.json#/definitions/EntityIdentifierType"
            }
          },

      From https://raw.githubusercontent.com/smart-data-models/data-models/c4ee5d39bcbacdc30700bcd2d916aaf2c50dc86e/common-schema.json

      vs the `Contact-Commons` in the same schema

          "email": {
            "$ref": "#/definitions/email"
          },

      From https://raw.githubusercontent.com/smart-data-models/data-models/c4ee5d39bcbacdc30700bcd2d916aaf2c50dc86e/common-schema.json

  2. In the absence of Option 1, dependency management is done by the user of a schema. A map is generated for all referenced schemas, including subschemas. These are pulled and versioned. The user then changes all references to the appropriate fixed dependencies. The user maintains version management, and the resultant schema may not be shareable. Potential issues: no central version registry, and incompatibility between objects in two different systems that don't have access to common repositories.
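Both options come down to rewriting floating $ref URLs to commit-pinned ones; they differ only in who maintains the mapping. A minimal sketch of that rewrite step, assuming the mapping from floating document URL to pinned document URL already exists (the function and the `pins` parameter are hypothetical names):

```python
def pin_refs(schema, pins):
    """Return a copy of `schema` with external $refs rewritten to pinned URLs.

    `pins` maps a floating document URL (no fragment) to its pinned URL.
    Fragments are preserved; local "#/..." refs and unmapped URLs are unchanged.
    """

    def pin_one(ref):
        base, sep, fragment = ref.partition("#")
        return pins.get(base, base) + sep + fragment

    def rewrite(node):
        if isinstance(node, dict):
            return {
                key: pin_one(value) if key == "$ref" and isinstance(value, str) else rewrite(value)
                for key, value in node.items()
            }
        if isinstance(node, list):
            return [rewrite(item) for item in node]
        return node

    return rewrite(schema)


pins = {
    "https://smart-data-models.github.io/data-models/common-schema.json":
        "https://raw.githubusercontent.com/smart-data-models/data-models/"
        "c4ee5d39bcbacdc30700bcd2d916aaf2c50dc86e/common-schema.json",
}
pinned_schema = pin_refs(
    {"allOf": [{"$ref": "https://smart-data-models.github.io/data-models/common-schema.json#/definitions/GSMA-Commons"}]},
    pins,
)
```

Note that the pinned common-schema.json itself still contains full-URL $refs (such as the EntityIdentifierType one shown above), so the same rewrite would have to be applied to each pulled subschema as well.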

albertoabellagarcia commented 1 year ago

What you say is completely right. Eventually, a change in GSMA-Commons (or Location-Commons) would alter the schemas that refer to it. But we have very good reasons for not changing it, or for making sure that a change does not affect current models (i.e. a new attribute is added while all old ones are kept). In fact, our policy is that we never release backward-incompatible versions. Could this happen in the future? Nothing is impossible, but I cannot envision a reason to do it, and of course never without deep consultation with the users and contributors. Anyhow, we are open to adopting additional mechanisms to fix this.

albertoabellagarcia commented 1 year ago

Besides, we now have a draft of a database of data models' versions, so you can determine when the extract was created. Grabbing the information from model.yaml (which has the $ref attributes) should solve this.

albertoabellagarcia commented 1 year ago

Regarding these two options 1) https://smart-data-models.github.io/data-models/common-schema.json#/definitions/GSMA-Commons or 2) https://raw.githubusercontent.com/smart-data-models/data-models/c4ee5d39bcbacdc30700bcd2d916aaf2c50dc86e/common-schema.json#/definitions/GSMA-Commons

1) In the model.yaml of the data model these references are brought in, so it is like your solution 2. Quite deterministic. Is this a solution for you?

albertoabellagarcia commented 1 year ago

Finally, what you could do is point not to the JSON schema (non-deterministic) but to the YAML version, which is deterministic for every version.