sdmx-twg / sdmx-im

SDMX Information Model - UML model and functional description, definition of classes, associations and attributes

Use of Regular Expressions to define constraints #8

Closed egreising closed 9 months ago

egreising commented 6 years ago

Several sets of category items (variants) can be associated with one classification, but only one is used in a given dataflow (DF) or provision agreement (PA), depending on the context (e.g. the ISIC versions in CL_ACTIVITY). This can be achieved with content constraints attached to the dataflow or provision agreement. However, the constraint must be specified “by extension”, that is, by enumerating every code to be included or excluded, and the length of this list can be a problem. For example, a content constraint that includes the items of the ISIC Rev. 4 variant from the CL_ACTIVITY code list needs approximately 770 items.

Using a common prefix to identify all codes belonging to a variant helps in managing long code lists with many variants, such as CL_ACTIVITY (e.g. ISIC4, ISIC3, NACE2, AGG_, etc.). It has been proposed to allow regular expressions in constraints in order to reduce the length and complexity of their definition. However, to deal with multiple variants, and given the adoption of the “prefixed codes” practice, a single wildcard character would be enough; full regular expressions are not advisable because they would overload the solution. It is suggested to simply use the percent sign (%) as the wildcard character. In the example above, the list of some 770 items can be reduced to a single wildcarded element.
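To make the proposal concrete, here is a minimal Python sketch of how a single trailing-wildcard element could select the same codes as an enumerated list. The code IDs (including the underscore after each prefix) and the `select_by_prefix` helper are invented for illustration; they are not taken from the real CL_ACTIVITY code list or from any SDMX tool.

```python
# Hypothetical excerpt of a CL_ACTIVITY-style code list. The IDs are made up
# for illustration; only the "common prefix per variant" idea is from the issue.
CODELIST = ["ISIC4_A", "ISIC4_A01", "ISIC4_B", "ISIC3_A", "NACE2_A", "AGG_TOTAL"]


def select_by_prefix(codes, pattern):
    """Return the codes matched by a pattern with exactly one trailing '%'."""
    if not pattern.endswith("%") or "%" in pattern[:-1]:
        raise ValueError("this sketch only supports a single trailing '%'")
    prefix = pattern[:-1]
    # Keep the code list order so the result is comparable to an enumeration.
    return [code for code in codes if code.startswith(prefix)]


# Constraint "by extension": every code of the variant is listed explicitly.
enumerated = ["ISIC4_A", "ISIC4_A01", "ISIC4_B"]

# Constraint with one wildcarded element selecting the same variant.
wildcarded = select_by_prefix(CODELIST, "ISIC4_%")
```

With the real code list the enumerated form would hold about 770 entries, while the wildcarded form stays a single element however many codes the variant contains.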


stratosn commented 9 months ago

This is already supported in SDMX 3.0, as explained in Section 6 (Technical Notes), section 10.3.4.2 “Combination of Constraints” (lines 1831-1836), which reads: “A Member Selection may include wildcarding of values (using character ‘%’ to represent zero or more occurrences of any character), as well as cascading through hierarchic structures (e.g., parents in Codelist), or localised values (e.g., text for English only). Lack of locale means any language may match. Cascading values are mutual exclusive to localised values, as the former refer to coded values, while the latter refer to uncoded values.”
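For readers who want to check the quoted semantics, the following Python sketch shows one way to evaluate a ‘%’ pattern, assuming ‘%’ may appear anywhere in the value and that every other character is literal. The function names and sample values are hypothetical, and this is not how any particular SDMX implementation does it; only the "zero or more occurrences of any character" rule comes from the text above.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a '%' wildcard pattern into a compiled regular expression."""
    # Split on '%', escape each literal piece so characters such as '.' keep
    # their literal meaning, then join the pieces with '.*' (zero or more of
    # any character).
    literal_parts = pattern.split("%")
    return re.compile(".*".join(re.escape(part) for part in literal_parts), re.DOTALL)


def matches(pattern, value):
    """Return True if the whole value matches the '%' wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None


# "Zero or more": the bare prefix matches as well as longer codes.
assert matches("ISIC4_%", "ISIC4_")
assert matches("ISIC4_%", "ISIC4_A01")
assert not matches("ISIC4_%", "ISIC3_A01")
```

Escaping the literal parts keeps the wildcard from turning into a full regular expression, in line with the concern raised in the original proposal.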

More specifically, see the examples: