sdmx-twg / sdmx-json

This repository is used for maintaining the SDMX-JSON message specifications.

Proposal to reference DSD from data message via `agencyID` and `version` #145

Open hoehrmann opened 7 months ago

hoehrmann commented 7 months ago

Introduction:

The current SDMX-JSON data message specification requires that the data.structure object reference the Data Structure Definition (DSD) through a link to the Data Provision Agreement (DPA) or Dataflow (DF). This approach necessitates multiple lookups and parsing URN references to identify the relevant DSD. It also presents challenges when dealing with non-URN references or situations where the referenced web version becomes unavailable.

Proposed Improvement:

To enhance clarity, convenience, and validation capabilities, this proposal recommends including the agencyID and version of the referenced DSD directly within the data.structure object of SDMX-JSON messages. This eliminates the need for intermediary references through DPA or DF and streamlines the process of identifying the relevant DSD.
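As a rough illustration of the proposal, the sketch below shows how a reader could compute the DSD URN directly if data.structure carried agencyID and version next to id. The property layout and the helper name `dsd_urn` are assumptions made for this example, and the URN class name `DataStructure` follows the usual SDMX URN pattern rather than anything stated in this issue.

```python
def dsd_urn(structure: dict) -> str:
    """Build a DSD URN from direct properties on a structure object.

    Assumes the (proposed, not yet specified) properties agencyID and
    version exist alongside id.
    """
    return (
        "urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure="
        f"{structure['agencyID']}:{structure['id']}({structure['version']})"
    )


# Hypothetical structure object in the proposed shape.
proposed = {"id": "EXAMPLE", "agencyID": "EXAMPLE", "version": "1.0.0"}
print(dsd_urn(proposed))
```

With direct properties, no link traversal or URN parsing is needed on the reading side.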

Benefits:

Implementation:

Conclusion:

This proposal offers a more efficient and reliable approach to referencing DSDs within SDMX-JSON messages. Direct inclusion of agencyID and version within the data.structure object simplifies data access, enhances validation, and ensures persistent DSD references, fostering a more streamlined and robust data exchange experience.

Alternative:

Having a datastructure URN reference would also be okay, but then id becomes redundant.

(If in doubt, please handle this as a public review comment on SDMX 3.1 once the comment period begins.)

dosse commented 7 months ago

Hi @hoehrmann, the need is not clear:

hoehrmann commented 7 months ago

Summary: The intent of the current requirement seems to be to ensure that the DSD can be identified from data messages, but the requirement as written is insufficient to achieve this. The idiomatic fix would be to add the agencyID and version properties, or to add a new mandatory property, such as datastructure, that contains the URN. It would also be possible to require that the dataStructure link has a urn property. However, implementations that read data messages would then have to go through all links and all rel values and check the urn field of each, which is more complicated than reading direct properties. A JSON Schema validation rule that checks that the DSD URN is present or can be computed would also be more complicated.

Details:

Right now the requirement in the specification is satisfied by:

data:
  structures:
    - links:
        - rel: dataStructure
          href: https://example.org/X347.xml

There is no way to infer the URN of the DSD from this. The message is valid because the specification only requires some link to the DSD; it does not require that link to specify the URN, and the DSD does not have to be hosted on an SDMX-REST web service (otherwise you could guess the needed URN parts from the URL). This could be addressed by requiring the urn property on the dataStructure link.
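To make the reader-side cost concrete, here is a minimal sketch of the lookup a consumer has to do today. The helper name `find_dsd_urn` is hypothetical; the input mirrors the minimal structure shown above.

```python
from typing import Optional


def find_dsd_urn(structure: dict) -> Optional[str]:
    """Return the urn of the first dataStructure link, or None if absent."""
    for link in structure.get("links", []):
        if link.get("rel") == "dataStructure" and "urn" in link:
            return link["urn"]
    return None


# The minimal structure from the example above: a link without a urn.
minimal = {
    "links": [
        {"rel": "dataStructure", "href": "https://example.org/X347.xml"}
    ]
}
print(find_dsd_urn(minimal))
```

For the minimal example the function returns None, so the reader is left with only an href to an arbitrary document.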

As for validation, take this example https://sdmx.oecd.org/public/rest/v2/data/dataflow/OECD.SDD.NAD.SEEA/DSD_NAT_RES@DF_NAT_RES/1.0/CAN.A.T.LEAD.*.A?. That claims to be an SDMX-JSON 2.0.0 data message. Ignoring the error in the contentLanguages property, the incorrect use of ~ in dimension index lists, and the wrong time format for TIME_PERIOD, the message is valid according to the JSON Schema for SDMX-JSON data messages, even though it does not have the required link (it puts the link on the dataSet instead of the structure).

It is probably possible to amend the schema to require that in this specific case there must be one link with rel containing dataStructure, but it would increase the complexity of the schema.
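One possible shape for such a rule is sketched below as a Python dict, using the JSON Schema `contains` keyword (available from draft-06 onwards). The fragment is hypothetical and not taken from the published SDMX-JSON schema; it assumes rel holds exactly the value dataStructure. The function `satisfies_rule` is a hand-written equivalent so the example runs without a validator library.

```python
import json

# Hypothetical schema fragment for one item of data.structures.
STRUCTURE_LINK_RULE = {
    "type": "object",
    "required": ["links"],
    "properties": {
        "links": {
            "type": "array",
            # At least one array item must match this subschema.
            "contains": {
                "type": "object",
                "required": ["rel"],
                "properties": {"rel": {"const": "dataStructure"}},
            },
        }
    },
}


def satisfies_rule(structure: dict) -> bool:
    """Plain-Python equivalent of STRUCTURE_LINK_RULE."""
    links = structure.get("links")
    if not isinstance(links, list):
        return False
    return any(
        isinstance(link, dict) and link.get("rel") == "dataStructure"
        for link in links
    )


print(json.dumps(STRUCTURE_LINK_RULE, indent=2))
```

A rule like this only checks that a link exists; it still cannot check that a URN is present or computable without further conditions.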

As for lookups, the current requirement is satisfied by referencing a provision agreement. Even if you are lucky and it references an SDMX-REST endpoint, you can probably only get the dataflow with a single request. In theory you could use references=descendants or references=all, but those are likely to return an unreasonable amount of data and/or might be disabled or throttled on public endpoints as denial-of-service protection.

As for persistence, if I knew the URN I could try to look it up elsewhere (e.g., I may have old data and the web server just changed its address) but without it I would have to guess the URN (based on the IDs of the fields).

dosse commented 7 months ago

Thanks for the clarifications. The intent of the current link requirement is to ensure that the artefact (either DF, DSD or ProvisionAgreement) for which data have been requested can be fully identified from the data message. The choice of the artefact type is not arbitrary but must correspond to the artefact used in the original data request. This is what was meant by the wording "At least the link to the Data Structure Definition, Dataflow or Data Provision Agreement to which the data relates is required.", but the wording in the field guide can be improved. E.g., if data was requested for a dataflow, then the dataflow identification is required. If data was requested for a DSD, then the DSD identification is required. Also, in order for the full artefact identification to be available immediately, that link requires the use of 'self' for the relationship and indeed the URN of the artefact, e.g.,

    "href": "https://registry.sdmx.org/ws/rest/dataflow/ECB.DISS/BSI_PUB/1.0",
    "rel": "self",
    "urn": "urn:sdmx:org.sdmx.infomodel.datastructure.dataflow=ECB.DISS:BSI_PUB(1.0)"

This information is sufficient to retrieve all required structure artefacts in one single request. This can be further clarified in the field guide.
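For illustration, a client could split such a URN into its identifying parts roughly as follows. The regular expression and the helper name `parse_sdmx_urn` are assumptions for this sketch, not part of the specification; the input is the URN from the example link above.

```python
import re

# Pattern for urn:sdmx:org.sdmx.infomodel.<package>.<class>=<agency>:<id>(<version>)
URN_RE = re.compile(
    r"^urn:sdmx:org\.sdmx\.infomodel\."
    r"(?P<package>[^.]+)\.(?P<cls>[^=]+)="
    r"(?P<agency>[^:]+):(?P<id>[^()]+)\((?P<version>[^()]+)\)$"
)


def parse_sdmx_urn(urn: str) -> dict:
    """Split an SDMX URN into package, class, agency, id and version."""
    match = URN_RE.match(urn)
    if match is None:
        raise ValueError(f"not a recognised SDMX URN: {urn}")
    return match.groupdict()


parts = parse_sdmx_urn(
    "urn:sdmx:org.sdmx.infomodel.datastructure.dataflow=ECB.DISS:BSI_PUB(1.0)"
)
print(parts["agency"], parts["id"], parts["version"])
```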

If a client requires the structure information at a later time, then the client is free to extract and store the structure information at the same time as the data. If you need to get just the DSD for a DF, then you could use the references=datastructure parameter. Disabling structure retrieval through references as a 'denial of service' protection seems to me an unreasonable approach. Compared to data extractions, structure messages are usually much smaller.
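A small sketch of how a client might derive that request from a link's href, assuming the href points at an SDMX-REST structure resource as in the example link above. The helper name `with_references` is hypothetical, and only the URL is built here; no request is sent.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_references(href: str, references: str = "datastructure") -> str:
    """Return href with the references query parameter set."""
    parts = urlsplit(href)
    query = dict(parse_qsl(parts.query))
    query["references"] = references
    return urlunsplit(parts._replace(query=urlencode(query)))


url = with_references(
    "https://registry.sdmx.org/ws/rest/dataflow/ECB.DISS/BSI_PUB/1.0"
)
print(url)
```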

I would conclude that this ticket specifically requests that the URN of the underlying artefact can be found in a more straightforward way (without looping through the links array), either by taking it out of the links array and adding it as a separate structure property (similar to the SDMX-ML data messages) or, e.g., by requiring that link to be the first value of the links array.

For issues you find in the practical SDMX implementation https://sdmx.oecd.org/public/rest/, could you please open tickets in this separate code repository https://gitlab.com/sis-cc/.stat-suite/dotstatsuite-core-sdmxri-nsi-ws/-/issues/ ?

hoehrmann commented 7 months ago

I would like to add the following point: in a structure message external structures are referenced like this

data:
  dataStructures:
    - id: EXAMPLE
      agencyID: EXAMPLE
      version: "1.0.0"
      name: Example
      isExternalReference: true

Using links to reference a DSD in data messages is inconsistent with this pattern.