gs1 / gs1-syntax-dictionary

GS1 Syntax Dictionary and reference Linters
Apache License 2.0
13 stars 1 forks source link

Replace iso3166list with multiple optional iso3166 #9

Open KDean-Dolphin opened 6 days ago

KDean-Dolphin commented 6 days ago

The use of iso3166list is a problem for simple validators that are concerned only with structural validation (i.e., type and length). The way AIs 423 and 425 are defined, a structural validator would allow "1234" to pass.

https://github.com/gs1/gs1-syntax-dictionary/blob/ee18e07a6e66e87a7f2c1e43dd5ab6aeff5b10d8/gs1-syntax-dictionary.txt#L213-L215

I suggest these be rewritten using the optional terminology as follows:

423         ?  N3,iso3166 [N3,iso3166] [N3,iso3166] [N3,iso3166] [N3,iso3166] req=01,02 ex=426                     # COUNTRY - INITIAL PROCESS
424         ?  N3,iso3166                    req=01,02 ex=426                     # COUNTRY - PROCESS
425         ?  N3,iso3166 [N3,iso3166] [N3,iso3166] [N3,iso3166] [N3,iso3166] req=01,02 ex=426                     # COUNTRY - DISASSEMBLY

An alternative would be to borrow the repetition syntax from regex:

423         ?  N3,iso3166{1,5}               req=01,02 ex=426                     # COUNTRY - INITIAL PROCESS
424         ?  N3,iso3166                    req=01,02 ex=426                     # COUNTRY - PROCESS
425         ?  N3,iso3166{1,5}               req=01,02 ex=426                     # COUNTRY - DISASSEMBLY
terryburton commented 6 days ago

We have a similar issues with minimum/specific lengths with AIs such as (8007) "X..34,iban" where an ISBN cannot practically be as short as a single character and it is only the linter than enforces the length. The same with the key linter that (when not hooking into an external GCP lookup function) ensures a minimum of (currently) four leading digits.

I've never considered the current behaviour to be a problem since we provide the linters, as well as reference code (the Syntax Engine) for how to apply them, so code that is concerned only with structural validation can be easily enhanced. If people wanted a structure-only solution then maybe they could resort to the published regexs instead.

Nevertheless, I'm still tempted by the suggested N3,iso3166 [N3],iso3166 [N3],iso3166 ... syntax for the ISO 3166 list examples by virtue of the fact that is makes a linter redundant. Being smart by "doing more with less" is a worthwhile goal!

The suggested syntax works in the specific cases that you mention, but we have to be very careful about introducing a syntax that selects between more than a single optional component since it may easily lead to ambiguity, e.g. when the linters or character sets differ. Given X3 [X5],wibble [N5],wobble and an input ABC12345, do we parse 12345 using csetX and process it with wibble or do we parse it with csetN and process it with wobble?

We must carefully assert that the validation process involves a left to right match against each component of the AI's format specification for consecutive format-defined-size prefixes of the given data, declaring success if only optional components remain when we run out of data.

I've pushed a Syntax Dictionary update (currently with another test AI (85)) as well as some corresponding unit tests to a "multi-optional" branch of the Syntax Engine: https://github.com/gs1/gs1-syntax-engine/commit/2c765a6b345e3f5a2a435c2fc7905a7fa170af93

The existing Syntax Engine code already does everything you would expect when it encounters multiple optional trailing components. It also survives fuzzing which generally is a good way of finding unexpected issues with changes of syntax. I'll give it some more thought and see if anything comes up.

I would feel more reassured by the approach if we can introduce constraints into the GenSpecs for the AI system as a whole:

Either (preferably), forbid AIs from being defined with multiple optional components (currently there is only ever one optional final component) so that we are free to define fine-grained substructure in the BSR without fear of the syntax clashing with the GenSpecs in the future. This is the status quo, but I have not see it codified anywhere, so it's not clear whether it will remain so.

Or (less ideal), define (1) that all optional components must be at the end of the format specification (just as is currently the case) and that (2) all non-terminal components, whether mandatory or optional, must have fixed lengths (plainly obvious to avoid ambiguity). Then define a single correct matching action for the case of multiple optional components — preferably using left-to-right matching described above, and certainly not resorting to a "feasibility check" in order to resolve ambiguity.

Without some codified constrains I think that we risk defining a behaviour for the BSR that gets redefined in the future by updates to the GenSpecs, possibly even with the GenSpec's behaviour being defined on an individual AI basis (for future AIs). This would not align with the BSRs existing notion of left-to-right componentisation.

Whilst the alternative N3,iso3166{1,5} syntax is perhaps less open to misinterpretation (since it cannot be misconstrued as selecting between differing optional components) I am reluctant to extend the current syntax with repetition so that it is not a basic left-to-right match (or give a false impression that a regex engine is required).

terryburton commented 6 days ago

Done in 6d37ba2. Thanks for the suggestion.

Leaving this open until we've captured the GS1 GenSpecs AI design constraints issue elsewhere.