gs1 / gs1-syntax-dictionary

GS1 Barcode Syntax Dictionary and Syntax Tests

Add "decimal" linter; percent-encodability during DL URI conversion #15

Open KDean-Dolphin opened 4 months ago

KDean-Dolphin commented 4 months ago

Numerous AIs are numeric with an implied decimal point whose position is given by the last digit of the AI.

Automatic encoding and decoding would be aided by adding a "decimal" linter type, e.g.:

...
3100-3105  *?  N6,decimal                    req=01,02 ex=310n                    # NET WEIGHT (kg)
3110-3115  *?  N6,decimal                    req=01,02 ex=311n                    # LENGTH (m)
...
3900-3909   ?  N..15,decimal                 req=255,8020 ex=390n,391n,394n,8111  # AMOUNT
3910-3919   ?  N3,iso4217 N..15,decimal      req=8020 ex=391n                     # AMOUNT
...
terryburton commented 4 months ago

What would the action of such a linter be when passed the data for the component that it is tagged against? Note that a linter function only sees the component data, not the AI itself nor the other components.


I think that we need to consider issues related to encoding/decoding of AI values using a separate (non-linter) framework. This belief is primarily driven by behaviours that might be expected for percent-encoded AIs...

To my present understanding, the raw value of an AI is whatever the user puts in it, conforming to CSET 82 or whatever format specification is in effect. So, irrespective of whether an AI is "percent-encoding enabled" or not, an AI value containing "ABC%40XYZ" has length 9, not 7, i.e. it would satisfy X9 but not X7. The treatment of all AIs is identical within the system: percent encoding is an in-data means of expressing non-CSET-82 characters in selected AIs, defined because it is useful in the content of those AIs, not some overarching new property. It is left to the application to render the contents of such fields appropriately, based on whether percent encoding is permitted, just as it must decide whether to pretty-print dates, decimals, etc.
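To illustrate the length point with a quick Python sketch (purely illustrative):

raw = "ABC%40XYZ"
len(raw)  # 9, so it satisfies X9 but not X7, pcenc-enabled or not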

If it were any other way, I think implementors would get a shock, as I suspect many field sizes are fixed based on the format specification, and they would have trouble fitting ABC%40XYZ into a 7-character field. We would have to tell them that the onus is now on them to decode the raw AI data to ABC@XYZ before storing or processing it (but only for the "percent-encodable" AIs, not regular AIs where % is in common use with no special meaning), and that the percent-encoded representation is only intended for certain intermediate representations, e.g. barcodes. But what about when such an AI value is represented in a GS1 DL URI?...

Currently a DL URI containing such a data value, ABC%40XYZ, is double-escaped, resulting in https://id.gs1.org/01/12312312312319?99=ABC%2540XYZ. Note that %40 in the AI value becomes %2540 in the URI query parameter.
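For illustration only, the current behaviour matches plain RFC 3986 escaping of the raw value (a Python sketch using urllib; nothing here is BSR API):

from urllib.parse import quote, unquote

raw = "ABC%40XYZ"
quote(raw, safe="")     # 'ABC%2540XYZ': the '%' is itself escaped for the URI
unquote("ABC%2540XYZ")  # 'ABC%40XYZ': parsing the URI restores the raw value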

There's a school of thought that it shouldn't be this way for percent-encoding-enabled AI values, which I appreciate, because double escaping is wasteful.

However, to achieve this in a way that does not introduce ambiguities based on whether the AI is percent-encoding enabled or not, we would have to formally tag (or subtype, or whatever) the special AIs with an encoder/decoder pair and describe exactly when and where a raw value (i.e. ABC%40XYZ in the case of percent encoding) should be used versus an interpreted/rendered value (i.e. ABC@XYZ).

Such treatment means that the interpretation of a percent-encoded value encoded in a DL URI parameter varies according to whether the AI is percent-encoding enabled or not. That adds bloat to lightweight processes such as the barcode reader shim that extracts element strings from URIs, which would then also need to carry a table of percent-encoding-enabled AIs.

So it's not clear that we want to add such complexity to save a few characters, and adding such complexity would add a new concept (encoder/decoder pairs) to the GS1 system that would touch many standards.

Aside: The encoding/decoding processes are nuanced (not merely lift-and-shift), even for percent encoding, and require a full decode/encode round trip between percent-encoded AI values and DL URIs, and vice versa. In percent-encoded AI values the encoding is used to escape non-CSET 82 characters, which is different from its use in URIs, where it escapes URI-unsafe characters. E.g. "A&B" is valid content in CSET 82 AIs but must be escaped in URI parameters; "A~B" is valid content in a URI parameter but must be escaped in percent-encoding-enabled CSET 82 AIs (and rejected otherwise).
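To make the two regimes concrete, here is a rough Python sketch (the CSET82 constant and both helper names are mine, not anything defined by the Syntax Dictionary):

from urllib.parse import quote

# CSET 82: the 82-character GS1 AI encodable character set.
CSET82 = set(
    "!\"%&'()*+,-./0123456789:;<=>?"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ_"
    "abcdefghijklmnopqrstuvwxyz"
)

def pcenc_ai_value(s):
    # Escape anything outside CSET 82, plus the escape character itself.
    out = []
    for ch in s:
        if ch in CSET82 and ch != "%":
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)

def uri_param_value(s):
    # RFC 3986 escaping for a URI query component.
    return quote(s, safe="")

pcenc_ai_value("A&B")   # 'A&B'    ('&' is valid CSET 82 content)
uri_param_value("A&B")  # 'A%26B'  ('&' must be escaped in a URI parameter)
pcenc_ai_value("A~B")   # 'A%7EB'  ('~' is not in CSET 82)
uri_param_value("A~B")  # 'A~B'    ('~' is unreserved in a URI)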

KDean-Dolphin commented 4 months ago

From the linter perspective, "decimal" would do nothing; the content is just a string of numbers so the regular number validation would work just fine. As you suggest, though, I'm looking at it from an encoding/decoding perspective that can be aware of individual AIs. The framework I'm developing uses the syntax dictionary to handle as many generic cases as possible in a way that minimizes the work required by the user.

For example, an element string that contains AIs 01, 10, and 17 would be decoded into an array containing, in the same order, a string (with the character set and check digit validated), a string (with the character set validated), and a date/time object (with the individual fields validated and the year adjusted according to the 50-year window rule). The dictionary-based decoding fails for the "decimal" AIs above because it sees them as whole numbers, when in fact the values that should be extracted are value / 10^(last digit of AI), because the decoder can't tell the difference between them and other numeric (whole number) AIs.

I want to take as much of the requirement to learn the intricacies of the AI system away from the user as I can. So, when constructing an element string, if the user says "add 'LENGTH (m)' of 123.45 to the element string", the encoder should automatically generate AI 3112 based on a) the title and b) the position of the decimal.
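As a rough sketch of what I have in mind (Python; the function names are illustrative, not part of any API):

def decode_decimal(ai, value):
    # Implied-decimal rule: digits divided by 10^(last digit of the AI).
    return int(value) / 10 ** int(ai[-1])

def encode_decimal(ai_prefix, length, value):
    # Derive the final AI digit from the decimal position; zero-pad the digits.
    from decimal import Decimal
    d = Decimal(str(value)).normalize()
    exp = max(0, -d.as_tuple().exponent)  # digits after the decimal point
    digits = int(d.scaleb(exp))           # value * 10**exp as an integer
    return ai_prefix + str(exp), str(digits).zfill(length)

decode_decimal("3112", "012345")  # 123.45
encode_decimal("311", 6, 123.45)  # ('3112', '012345')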

The percent encoding question you pose is an interesting one, and I think it's worth taking up in the Web Technology SMG. Here are my thoughts on the matter:

Because of the above, there should be no double percent encoding when encoding these AIs in a GS1 Digital Link URI. If a specification is defined with "pcenc", the linter should accept the string unconditionally when the content comes from a URI, because it will already have been decoded (the same way lot "a/b" would have been decoded from "a%2Fb" prior to going through the linter).

Tagging @mgh128 and @philarcher for input.

terryburton commented 4 months ago

> From the linter perspective, "decimal" would do nothing; the content is just a string of numbers so the regular number validation would work just fine. As you suggest, though, I'm looking at it from an encoding/decoding perspective that can be aware of individual AIs. The framework I'm developing uses the syntax dictionary to handle as many generic cases as possible in a way that minimizes the work required by the user.
>
> For example, an element string that contains AIs 01, 10, and 17 would be decoded into an array containing, in the same order, a string (with the character set and check digit validated), a string (with the character set validated), and a date/time object (with the individual fields validated and the year adjusted according to the 50-year window rule). The dictionary-based decoding fails for the "decimal" AIs above because it sees them as whole numbers, when in fact the values that should be extracted are value / 10^(last digit of AI), because the decoder can't tell the difference between them and other numeric (whole number) AIs.
>
> I want to take as much of the requirement to learn the intricacies of the AI system away from the user as I can. So, when constructing an element string, if the user says "add 'LENGTH (m)' of 123.45 to the element string", the encoder should automatically generate AI 3112 based on a) the title and b) the position of the decimal.

I'll try to give it some thought, but I don't think the goals of decomposing AIs into components for the purpose of validation and for the purpose of semantics/presentation are necessarily well aligned.

So NAK on this particular suggestion for the time being, but I too would like to see this, and the broader issue of encoding/decoding from raw values to "presentation formats", comprehensively tackled. Whether the solution involves a common "semantic decomposition of the AI" step (modification of the current component scheme to serve both purposes) or the use of separate component specifications for validation vs formatting, I'm not sure yet.

I agree that we should take this and the wider issue to the Web Tech SMG. I'll try to work up a white paper for discussion, but it won't happen this month and I want to solicit requirements from the other Open Source stakeholders in the Syntax Dictionary prior to growing the scope.


Thanks also for your helpful thoughts on percent encoding. I'll try to lay out the issue of generic encoding/decoding in the white paper. I don't want to get too focused on one particular encoding method; rather, it would be helpful to review what is required to build a framework for tackling the general issue.

Happy to hear what others think, of course. But it may be a few weeks before I can devote the required attention.

KDean-Dolphin commented 4 months ago

Thanks. I'll happily contribute to the white paper.

mgh128 commented 4 months ago

Hi Kevin, Terry, Phil,

What Kevin is proposing sounds very similar to the work on semantics of GS1 Application Identifiers, which we started during the work on GS1 Digital Link but which is incomplete and more broadly applicable, irrespective of whether element string or GS1 Digital Link URI syntax is used.

In addition to chapter 9 of https://ref.gs1.org/standards/digital-link/, these two draft documents may be helpful:

https://docs.google.com/document/d/1htR_74P0-SGKQoCvtW_l5CWmGCbmp09S-MC0DwdDHD4/edit?usp=drivesdk

https://docs.google.com/document/d/1XpvaD7H_KbSPPU6ISmFlSOrwimYGRRgtRkEQnh5NnyA/edit?usp=drivesdk

I think that rather than overloading / bloating the Barcode Syntax Resource dictionary with details of semantics, we should instead extend the GS1 ATA dataset ( https://ref.gs1.org/ai/GS1_Application_Identifiers.jsonld ) because that is less constrained by memory size and more easily ingestible, being JSON/JSON-LD without requiring a dedicated parser for a particular compact text format.

There's also an incomplete prototype toolkit for semantics of GS1 AIs within https://gs1.github.io/GS1DigitalLinkCompressionPrototype/ (source: https://github.com/gs1/GS1DigitalLinkCompressionPrototype).

Of course, all of this should be done within GSMP, even if some of us further develop ideas and prototypes that later feed into a formal reviewed standardisation effort on this topic.

Best wishes

Mark

philarcher commented 4 months ago

Whether a given application standard declares that DL URI syntax can or can't be used is, I think, a second-order issue best left to the developers of those documents. Our task is, I think, to provide the solid groundwork.

Looking at the very early work being done on the 'Identification Standard' that is targeted for the Jan '26 GenSpecs (perhaps with some bits appearing in Jan '25), there is no mention there of semantics. Nor do I think they'll appear as I had thought some time ago. @PetaDing-GS1 might be able to comment at a future date.

What we're talking about here is machine-readable reference files and some core open-source libraries that use them. I feel more comfortable defining the reference files first and thinking about software functions second. The BSR is focused on the AIDC world in which URI syntax is effectively alien so that converting between the AIDC and online worlds is an important function. I'd hope we can avoid double percent encoding. It's only needed in URI syntax so it really only comes up when generating or parsing a DL URI, no?

If the value to be encoded in AIDC-land is %40, then you need to encode the % character to put it in the URI, which makes it %2540, yes. The original %40 doesn't itself ever need to be decoded, does it? I may be misunderstanding that.

To cut to the chase, I agree with @mgh128 that we should complete the work of defining the semantics of each AI and make sure that's part of the ATA. I had thought this should probably be done by the ID SMG, but in reality it probably needs to be done primarily in the Web Tech SMG with a strong liaison with the ID SMG, with the end goal of including all the semantic definitions, which could well include the kind of thing @KDean-Dolphin began this thread with: presentation detail and so on. SHACL supports regular expressions, so you could go all the way from 3103 to a property gs1:netWeight with a value that matched a given regex (there may be a better way of doing this, but SHACL is likely to have the kind of powerful validation we need).

We need to have a clear idea of the status of the ATA, which AFAIK hasn't been formalized yet. But that wouldn't stop us completing the work to define the semantics and, I hope, creating tools that convert between a bunch of AIs and their values and a block of JSON-LD (and vice versa, of course).

terryburton commented 4 months ago

> The BSR is focused on the AIDC world in which URI syntax is effectively alien so that converting between the AIDC and online worlds is an important function. I'd hope we can avoid double percent encoding. It's only needed in URI syntax so it really only comes up when generating or parsing a DL URI, no?

Correct. The BSR performs no interpretation of raw AI values during data processing. They are effectively opaque during data conversion operations, because no process has thus far required them to be interpreted as anything other than the raw byte values that they are. Avoiding double escaping during conversion to/from a GS1 DL URI would be the first process that introduces a requirement to consider the raw AI value to be a representation of some more fundamental value.

> If the value to be encoded in AIDC-land is %40, then you need to encode the % character to put it in the URI, which makes it %2540, yes. The original %40 doesn't itself ever need to be decoded, does it? I may be misunderstanding that.

Consider the AI element string: (99)ABC%40XYZ(4300)ABC%40XYZ

In a world where we want to avoid double escaping of percent-encoded AI values within DL URIs, we have no choice but to deal with the concept of an intermediate "interpretation" of an AI's raw value content, i.e. the platonic meaning outside of any particular encoding scheme (with UTF-8 serving as the universal proxy for such non-worldly purposes in many systems).

So the breakdown of the above element string is as follows:

(99)   ABC%40XYZ     -->  means "ABC%40XYZ"  (since it is not a percent-encodable AI)
(4300) ABC%40XYZ     -->  means "ABC@XYZ"    (since it is a percent-encodable AI)

               |   (Converted to a GS1 DL URI)
               v

http://... ? 99 = ABC%2540XYZ & 4300 = ABC%40XYZ

In the above, (4300) has successfully avoided double escaping (decode to an interpreted value, then re-encode for the URI; equivalent to "lift and shift"); (99) isn't double-escaped because it wasn't percent-escaped in the first instance.

It shows that in a world where the GS1 system assigns meaning to raw AI values, and where it is that meaning (rather than the raw AI value) that is respected during format conversions, we have to treat the conversion process differently depending upon whether or not an AI's raw value has a percent-encodable representation.

Concretely, when a reader operating the GS1 DL URI to element string shim encounters the following URI:

http://... ? 99 = ABC%2540XYZ & 4300 = ABC%40XYZ

Were it not to take the percent-encodability of the AIs into account, it would arrive at the following element string:

...(99)ABC%40XYZ(4300)ABC@XYZ

Note that this is not where we started: (4300) has been corrupted from "ABC%40XYZ" to "ABC@XYZ" by the round trip through a GS1 DL URI.

To avoid such issues, the shim would need to recognise that (4300) is a percent-encodable AI and handle it specially. Dynamic tables => no longer a static algorithm suitable for embedding.
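A sketch of what the pcenc-aware shim would then have to do (Python; PCENC_AIS stands in for exactly the table being objected to, and pcenc_ai_value is the hypothetical re-encoder sketched earlier in this thread):

from urllib.parse import unquote

PCENC_AIS = {"4300"}  # ...plus every other percent-encoding-enabled AI

def uri_param_to_raw_ai_value(ai, param):
    interpreted = unquote(param)  # undo the URI-layer escaping
    if ai in PCENC_AIS:
        # Re-apply percent encoding to recover the raw pcenc AI value.
        return pcenc_ai_value(interpreted)
    return interpreted

uri_param_to_raw_ai_value("99", "ABC%2540XYZ")  # 'ABC%40XYZ'
uri_param_to_raw_ai_value("4300", "ABC%40XYZ")  # 'ABC%40XYZ', not 'ABC@XYZ'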

KDean-Dolphin commented 4 months ago

I'm struggling to understand the problem with percent encoding. Let me start with the "regular" AIs, i.e., those that don't have "pcenc" as a specification decorator.

In non-DLURI representations (HRI and barcode), the serial number ABC%40XYZ is represented verbatim. As long as you stay within the AI 82 character set, you can move between these representations without misinterpretation, except for the HRI problem that would be created by serial number ABC(10)XYZ, which is being dealt with in issue #8.

When putting serial number ABC%40XYZ in a DLURI, we have to use percent encoding and end up with ABC%2540XYZ in the path. At the other end, we apply percent decoding to all path elements and query parameters and end up with ABC%40XYZ, as expected. That rule has nothing to do with GS1; it's a requirement for the proper operation of the web, and the GS1 syntax rules apply only after deconstruction and percent decoding of the URI. The domain of the serial number is the AI 82 character set, full stop.

AI 4300 and its ilk are different. The domain of the ship-to company name is pretty much any UTF-8 character, and we run into a different set of rules.

GS1 barcodes don't support characters outside the AI 82 character set, so such strings have to be encoded in some way, and percent encoding is a logical choice (with special handling for the tilde '~'). If you're using a generic library, it wastes some space by encoding characters that don't need to be encoded for a barcode, but that's a reasonable tradeoff. That same representation would go into the HRI format because we don't really expect the element string to be processed by humans. The non-HRI representation, however, would be in full UTF-8. Although the GenSpecs is silent on the subject, it would be nonsensical for the percent-encoded version to be printed alongside "SHIP TO COMP" on the label. We see this already with things like the expiry date, which can be printed in non-HRI as a full date.

Putting AI 4300 into a DLURI requires percent encoding just like any other AI, and we don't care about the tilde in that case. What we don't need to do is percent-encode it as if it's going into a barcode and then percent-encode it again to put it into the DLURI.
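In code terms (a sketch reusing the hypothetical pcenc_ai_value from earlier in the thread):

from urllib.parse import quote

name = "Näme & Co"                    # interpreted AI 4300 value
barcode_value = pcenc_ai_value(name)  # 'N%C3%A4me%20&%20Co' for barcode/HRI
uri_value = quote(name, safe="")      # 'N%C3%A4me%20%26%20Co' for the DLURI
# but not quote(barcode_value, safe=""), which would double-escape every '%'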

This does mean, though, that any validator needs to be aware of the representation that it's validating. For any AI that doesn't have "pcenc" as a specification decorator, the validator won't care. For AI 4300 etc., the validator can apply the validation to a barcode data stream or HRI text and accept it verbatim for a DLURI. This can leave a gap, though, as it would be possible to generate an attribute that, if percent-encoded, would exceed the length limitation. A better solution might be something like this:

if "pcenc" in specification.decorators and representation == "DLURI":
    validation_value = pcencode(value)
else:
    validation_value = value

To Terry's point:

> To avoid such issues, the shim would need to recognise that (4300) is a percent-encodable AI and handle it specially. Dynamic tables => no longer a static algorithm suitable for embedding.

I think there is still a static algorithm. Let's assume that the input to the algorithm is a set of AI-value pairs. For all except the "pcenc" AIs, constructing the barcode data stream or HRI is exactly the same as before; if an AI is percent-encoded, then there's an additional step before adding the value to the barcode data stream or HRI. It's much more a documentation issue and yes, there are legacy applications to deal with, but I think that any legacy application that's going to get into Scan4Transport or patient demographics will have more to deal with than "Oh, by the way, there are some new rules...".

In short:

input = (AI, value)+
  value in its raw domain (numeric, AI 82, AI 39, full UTF-8)

intermediate = (AI, value*)+
  value in the encoded domain suitable for the target format

output = target constructed from intermediate
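For example (a sketch; DICT and pcenc_ai_value are hypothetical stand-ins for a syntax dictionary lookup and the percent encoder):

def to_element_string(pairs):
    out = []
    for ai, value in pairs:
        if "pcenc" in DICT[ai].decorators:  # hypothetical dictionary lookup
            value = pcenc_ai_value(value)   # the one extra step, for pcenc AIs only
        out.append("(" + ai + ")" + value)
    return "".join(out)

to_element_string([("99", "ABC%40XYZ"), ("4300", "ABC@XYZ")])
# '(99)ABC%40XYZ(4300)ABC%40XYZ'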