sdmx-twg / sdmx-im

SDMX Information Model - UML model and functional description, definition of classes, associations and attributes

Schema attribute attachment at series & group level #7

Open Tzaphkiel opened 6 years ago

Tzaphkiel commented 6 years ago

We have noticed that a schema generated by the registry duplicates attributes at both the sibling group and series levels, even when the DSD (in SDMX-ML 2.0 format) specifies a group-level attachment:

SDMX-ML 2.0 DSD

<str:Attribute assignmentStatus="Mandatory" attachmentLevel="Group" codelist="CL_UNIT" codelistAgency="ECB" codelistVersion="1.0" conceptSchemeAgency="ECB" conceptSchemeRef="ECB_CONCEPTS" conceptRef="UNIT" conceptVersion="1.0">
<str:AttachmentGroup>Group</str:AttachmentGroup>
</str:Attribute>

SDMX-2.1 schema image

When generating the SDMX-ML 2.1 DSD and schema, the attributes are attached to both the series and the group at the same time. This was apparently done to optimize for streaming when exchanging data.

The problem is that the specification leaves this open to interpretation, so it needs to be clarified.

agent96 commented 6 years ago

To comment on this, with the following notes from Xavier:

  1. ECB DSDs typically define sibling groups, to which attributes are attached.
  2. As we know, groups were introduced in SDMX as an optimization: attributes could be attached once and not repeated for every series. Whether the expected gain materializes actually depends on the data in the dataset. If you have defined sibling groups and your series vary by frequency, you will see some gains. If your dataset only contains, say, quarterly series, using groups will actually make your file larger…
  3. Besides the questionable benefits of groups shown above, you also cannot use them in streaming web services. In SDMX 2.1, all group elements have to come first in the data message, followed by the series elements. You cannot do that in a streaming web service because, by definition, you don’t have the entire result set in memory, so you can’t start by outputting all groups. Of course, we could send one query for the groups and then one for the series but, if optimal performance is a concern, that path is closed.
  4. The compromise was then to use the Dimension Group, which allows attaching attributes wherever a dimension (or combination of dimensions) is available. That would be the Series element in the case of the streaming SDMX 2.1 web services, or the Group elements, in case of batch data exchanges in SDMX-EDI or SDMX-ML 2.0. In short, both Lisbeth and I were happy ;-).
  5. But for validation to succeed, the attributes need to be repeated at every level where the dimension (or combination of dimensions) can be found. Hence the duplication.

So to summarize, Dimension Group/Group duplication was introduced to support streaming in v2.1 whilst maintaining backwards compatibility with v2.0.
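
As a rough illustration of note 2, the sketch below counts how many times a group-attached attribute value would be written under each attachment level. The dimension ids (`FREQ`, `REF_AREA`, `INDICATOR`), the dict layout of the series keys and the function name are assumptions made for this example; they are not taken from any DSD or library.

```python
def attribute_occurrences(series_keys, group_dims):
    """Count how often a group-attached attribute value is written.

    series_keys: list of dicts mapping dimension id -> value (hypothetical layout).
    group_dims:  dimension ids that make up the group key.
    Returns (count if attached per series, count if attached per group).
    """
    group_keys = {tuple(key[d] for d in group_dims) for key in series_keys}
    return len(series_keys), len(group_keys)


# Sibling group: every dimension except FREQ.
SIBLING_DIMS = ["REF_AREA", "INDICATOR"]

# Three frequencies for each of two reference areas.
mixed = [
    {"FREQ": f, "REF_AREA": a, "INDICATOR": "GDP"}
    for f in ("A", "Q", "M")
    for a in ("DE", "FR")
]
# Same data restricted to quarterly series.
quarterly_only = [k for k in mixed if k["FREQ"] == "Q"]

print(attribute_occurrences(mixed, SIBLING_DIMS))           # (6, 2)
print(attribute_occurrences(quarterly_only, SIBLING_DIMS))  # (2, 2)
```

With mixed frequencies the attribute is written twice instead of six times. With quarterly series only, the counts are equal, so the extra group elements are pure overhead.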

To tighten up the documentation, we need to decide what to do for the following use case:

  1. DSD has Group, and an Attribute defined which is attached to a group of Dimensions (Dimension Group) which matches the dimensions referenced by the Group.
  2. User exports the DSD in v2.1: it shows the Attribute attached to the Dimension Group
  3. User exports the DSD in v2.0 or SDMX-EDI: it shows the Attribute attached to the Group

User Creates a Schema in v2.0 - it should state the attribute value is reported against the Group

User Creates a Schema in v2.1 - currently the expected behaviour is ambiguous, as it is not documented. The above bug is raised against an implementation that allows the user to choose whether to report against the Group or the Dimension Group.

Proposal is to either:

agent96 commented 6 years ago

A statement needs to be added to section 6 technical notes, at the end of section 9.2.1.

This could be either:

When a Data Structure Definition defines an Attribute which is both a member of a Group and a Dimension Group containing the same Dimensions, the following rules apply for Schema Generation:

  1. When generating a version 2.1 schema the Attribute should be defined at the level of the Series
  2. When generating a version 2.0 or version 1.0 schema the Attribute should be defined at the level of the Group

The above rules have the implication that transforming a version 2.0 dataset into a version 2.1 dataset may require moving group-level attributes to the series level.
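
The movement of group-level attributes to series level could look roughly like the sketch below. The in-memory layout (dicts with `key` and `attributes`) and the function name are hypothetical, and the choice to let an attribute already present on the series override the group value is an assumption, not something the specification states.

```python
def push_group_attributes_to_series(groups, series, group_dims):
    """Copy each group's attributes onto every series whose key matches it.

    groups:     list of {"key": {...}, "attributes": {...}} (hypothetical layout)
    series:     list of {"key": {...}, "attributes": {...}}
    group_dims: dimension ids that make up the group key
    Returns new series dicts; the inputs are not modified.
    """
    by_key = {
        tuple(g["key"][d] for d in group_dims): g["attributes"] for g in groups
    }
    result = []
    for s in series:
        inherited = by_key.get(tuple(s["key"][d] for d in group_dims), {})
        # Assumption: a value already reported on the series takes precedence.
        merged = {**inherited, **s["attributes"]}
        result.append({"key": dict(s["key"]), "attributes": merged})
    return result


groups = [{"key": {"REF_AREA": "DE"}, "attributes": {"UNIT": "EUR"}}]
series = [
    {"key": {"FREQ": "A", "REF_AREA": "DE"}, "attributes": {}},
    {"key": {"FREQ": "Q", "REF_AREA": "DE"}, "attributes": {}},
]
for s in push_group_attributes_to_series(groups, series, ["REF_AREA"]):
    print(s["key"], s["attributes"])
```

Both series end up carrying `UNIT`, which is the repetition that group-level attachment was meant to avoid.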

OR (as we have currently implemented in the Fusion Registry)

When a Data Structure Definition defines an Attribute which is both a member of a Group and a Dimension Group containing the same Dimensions, the following rules apply for Schema Generation:

  1. When generating a version 2.1 schema the Attribute should be defined at both the level of the Series and the level of the Group
  2. When generating a version 2.0 or version 1.0 schema the Attribute should be defined at the level of the Group

The above rules have the implication that, when a version 2.0 dataset is transformed into a version 2.1 dataset, group-level attributes may either remain at the group level or be moved to the series level. In addition, data reporters have a choice of attachment level for the attribute.
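
To compare the two candidate rule sets side by side, they can be written as a small lookup. The labels `"A"` (series only) and `"B"` (series and group, as described for the Fusion Registry) and the function name are introduced here purely for illustration.

```python
def attachment_levels(schema_version, option):
    """Return the levels at which the Attribute is defined in a generated schema.

    schema_version: "1.0", "2.0" or "2.1"
    option: "A" = first proposal (series only in 2.1),
            "B" = second proposal (series and group in 2.1).
    Applies only to an Attribute that is a member of both a Group and a
    Dimension Group containing the same Dimensions.
    """
    if option not in ("A", "B"):
        raise ValueError(f"unknown option: {option!r}")
    if schema_version in ("1.0", "2.0"):
        # Both proposals agree for pre-2.1 schemas.
        return {"Group"}
    if schema_version == "2.1":
        return {"Series"} if option == "A" else {"Series", "Group"}
    raise ValueError(f"unsupported schema version: {schema_version!r}")


for option in ("A", "B"):
    for version in ("2.0", "2.1"):
        print(option, version, sorted(attachment_levels(version, option)))
```

The only difference between the proposals is the 2.1 case: under the first, a 2.0 to 2.1 conversion must move the attribute; under the second, it may stay on the group.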