Ability to parse version string of schema (ASN.1)

ronaldtse commented 3 years ago

In metanorma/annotated-express#38 there is a need to find out parts of an EXPRESS schema version string.

The SCHEMA version string can be in ASN.1 format:

From 10303-11:

SCHEMA geometry_schema 'version_1';

=> schema.version.string => 'version_1'

From 10303-11:

SCHEMA support_resource_schema '{ISO standard 10303 part(41) object(1) version(8)}';

=>

schema.version.string
=> '{ISO standard 10303 part(41) object(1) version(8)}'

schema.version.asn1[3].name
=> '10303'

schema.version.asn1[4].name
=> 'part'

schema.version.asn1[4].value
=> '41'

In STEPmod:

SCHEMA action_schema '{iso standard 10303 part(41) version(8) object(1) action_schema(1)}';

Then we can do:

{% comment %} `schemas` is all schemas; `schema` is one schema {% endcomment %}
{% assign reference_froms = schema.interfaces | where: "kind", :REFERENCE %}
{% for from in reference_froms %}
{% assign ref_schema = schemas | where: "id", from.schema.id | first %}
{{ from.schema.id }}:: ISO {{ ref_schema.version.asn1[3].name }}-{{ ref_schema.version.asn1[4].value }}
{% endfor %}

ronaldtse commented 3 years ago

This task does not block metanorma/annotated-express#38 because all referenced schemas are within Part 41 (i.e. internal referencing), but it will apply to other document parts.

ronaldtse commented 3 years ago

https://github.com/sdaubert/rasn1 seems like a pretty good gem for parsing ASN.1 in Ruby.

zakjan commented 3 years ago

https://github.com/sdaubert/rasn1 parses binary ASN1 (DER / BER encoding), which is used for example for TLS certificates. See https://github.com/sdaubert/rasn1/wiki/Parsing and https://github.com/sdaubert/rasn1/wiki/Encoding for details.

Where is it mentioned in the spec that {iso standard 10303 part(41) version(8) object(1) action_schema(1)} is ASN1? I couldn't find such ASN1 encoding anywhere else. It seems to me that it needs to be parsed with custom code (split(' ') etc.).

zakjan commented 3 years ago

There is this note in ISO 10303-11 spec:

NOTE For schemas defined in ISO 10303, the use of an information object identifier is specified that includes a version identifier. The meaning of the object identifier is defined in ISO/IEC 8824-1, and is described in ISO 10303-1. The use of this object identifier as a schema version identifier is encouraged.

Do we have access to ISO/IEC 8824-1 and/or ISO 10303-1?

ronaldtse commented 3 years ago

Let me find out.

ronaldtse commented 3 years ago

@zakjan this is text from N3554 (thanks @TRThurman ) that describes the usage best.

5.4.4 Use of ASN.1 Identifiers in SC 4 standards (optional)

5.4.4.1 Information object registration annex

Some SC 4 standards can have an “Information object registration annex”. This annex defines the information object identifier for the standard as specified by ISO/IEC 8824-1 [9]. As a consequence, the SC 4 standard shall specify a reference to ISO/IEC 8824-1.

The status of the annex, in which the information object registration is specified, varies according to its referencing within the text.

The structure of the annex, in which the information object registration is specified, varies according to the nature of the standard.

NOTE In ISO 10303 this annex is the last normative annex.

In a SC 4 standard that includes one or more schemas, the annex has two subclauses: Document identification and Schema identification. In a standard that does not include schemas, there is no subdivision; the content of the annex corresponds to that of the Document identification subclause in the first case.

The complete mechanism of the information object registration is described in ISO/DIS 10303-1:2020 Clause 7 Information object registration scheme.

A tutorial on the ASN.1 identifier is provided in Annex A.

5.4.4.2 Document identification

An example of the text to introduce the document identification for standards prepared by SC 4 reads as follows (see 5.3 for implementing the required text).

[General:SC4_annex_obj_reg-d]

To provide for unambiguous identification of an information object in an open system, the object identifier

{iso standard 8000 part(065) version(1)}

is assigned to this document. The meaning of this value is defined in ISO/IEC 8824-1, and is described in ISO 10303-1.

[end_General]

5.4.4.3 Schema identification

For the SC 4 standards that define schemas or include an electronic insert, include a subclause titled “Schema identification” within the information object registration annex. If the standard includes more than one schema or electronic insert then further subdivide this subclause such that each subdivision identifies one schema or one electronic insert. Order these subclauses in the same sequence as the schemas themselves. If the standard includes only one schema then use the text given below within the “Schema identification” subclause.

An example of text is provided below:

[General:SC4_annex_obj_reg-s]

To provide for unambiguous identification of the schema-name in an open information system, the object identifier

{iso standard 10303 part(42) version(10) object(1) topology-schema(2)}

is assigned to the topology-schema schema. The meaning of this value is defined in ISO/IEC 8824-1, and is described in ISO 10303-1.

[end_General]

Annex A (informative) Tutorial on ASN.1 identifiers

A.1 Introduction

SC 4 uses three separate terms to manage the various components of its standards. These terms are as follows:

edition;
version;
release.

The term “edition” identifies a published document. The method of identifying the edition is by the year of publication. Thus, we refer to Part 21 as ISO 10303-21:1994. When a second edition of Part 21 is published, its identifier will differ by the year of its publication.

The term “version” identifies the normative content of a standard. For the initial edition of each SC 4 standard, there is a one to one correspondence between the edition and the version. However, if technical corrigenda or amendments to the standard are published, the version of the technical content changes. In general, there is not a simple relationship between edition and version. The version is meant to identify the technical content to which conformance may be claimed.

For example, because the published version of Part 21 contained technical errors, it was necessary to issue a technical corrigendum to this part. The version of the original publication is 1; the version of the effective standard, after applying the technical corrigendum, is 2. Note that version 2 of Part 21 is documented in the two publications, the IS of Part 21 and the TC to Part 21, and not just a single document. Similarly, if an amendment to Part 21 is adopted, the version of the effective standard will be 3, and will be documented in the three publications taken together.

The version is defined as part of the object identifier defined in an annex of each SC 4 standard. The value of that object identifier is described below.

The term “release” is used by SC 4 to manage the publications of groups of parts; historically, this term has been primarily applied to ISO 10303, although future planning for SC 4 standards includes releases that comprise parts of several standards. The release is not explicitly defined within any part of a standard. It is used solely for managing the development of STEP.

A.2 Object identifiers

An object identifier is a primitive data type defined in ASN.1, ISO/IEC 8824-1. The value identifies a node in a tree structure by providing a sequence of (positive) integers, each of which identifies a link in the tree. The notation used in SC 4 is the value notation defined in ISO/IEC 8824.

The syntax of this value notation is a sequence of node specifiers enclosed in braces (curly brackets) and separated by spaces. The syntax of each node specifier is one of the following:

a number;
a symbol;
a symbol followed by a number in parentheses.

Whichever syntax is chosen, the resulting value must reduce to a sequence of integers. These choices are described below.

A number. This syntax is self-identifying; the value of the node identifier is the value of the number. Any object identifier can optionally be written as a series of numbers. See the example below.
A symbol. This syntax can be used only for the first or second node, and the only symbols that may be used are those defined in the annexes of ISO/IEC 8824-1. For our purposes, this restriction means that the first part of all object identifiers must be

{ iso standard }

which is equivalent to the object identifier

{1 0}

A symbol followed by a number in parentheses. If this form is used, the value of the node is the value of the number. The symbol is a local variable that is automatically assigned the value of the number. Because there are no other uses for this symbol in the syntax, the only utility of this form is to give a human readable idea of the meaning of the node. Thus, we will use “version(1)” to indicate that we are dealing with the first version of something. We can equally refer to this node as “1”. Both forms evaluate to 1; the first form associates the semantics of “version” with this value.

The lexical syntax of terms in the object identifier is similar to that of EXPRESS, except that occurrences of underscore (_) shall be replaced by hyphen (-).

In the annexes, ISO/IEC 8824-1 defines the four topmost levels of all object identifiers. In particular, it defines the form

{ 1 0 n nn }

to be an object identifier that identifies an ISO standard number "n", part "nn". It then provides for the committee or subcommittee that wrote the standard to assign other nodes beneath this for identifying information objects related to this standard. Note that this identifier can also be written

{ iso standard n part(nn) }

which is closer to the form we normally use. In this notation, the defined symbol "iso" has the value 1, and the defined symbol "standard" has the value 0.

To repeat, ISO (in ISO/IEC 8824-1) defines the interpretation of the nodes at the first four positions of these object identifiers. The subcommittee writing the standard (in this case, SC 4) controls the semantics of nodes at lower positions.

SC 4 has decided to associate the fifth node with the version of the information object being identified. This decision means that a standard form of the object identifier for the (part of the) standard considered as a whole is

{ iso standard 10303 part(nn) version(v) }

SC 4 has adopted the convention that the sixth node identifies an object type, and the succeeding node or nodes identify a specific instance of that object type. The initial release defined only a single value for the object type; object(1) indicates that the object being identified is a schema. In the future, SC 4 may define other values for this node to cover other information objects such as entities, defined types, conformance classes, or parts libraries. Object values of 2 and greater are available for this purpose when the need arises. As of today, however, the only valid value of the sixth node is 1.

As corrigenda or amendments to the standards or new editions of the standards are published, the version number of the total content of the standard shall be increased by 1 to reflect the new content. It may be that in adopting a new edition of some standard, some information objects (schemas) within the standard will be unchanged from the previous versions. In this case the object identifier that identifies that information object (schema) should be the same as in the previous version of the standard, indicating that that particular item (schema) is unchanged.
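The node positions described in the tutorial above can be sketched in Ruby as follows. This is only an illustration: `describe_oid` and its role names are hypothetical, not part of expressir or of N3554, and it assumes the identifier has already been reduced to integers in the N3554 order (version at the fifth node, object type at the sixth). The 10303-11 example earlier in this thread lists object(1) before version(8), so a purely positional reading like this one would mislabel it.

```ruby
# Hypothetical helper: labels the nodes of a numeric object identifier
# according to the positions described in the N3554 tutorial.
def describe_oid(numbers)
  # Nodes 1-4 are defined by ISO/IEC 8824-1, nodes 5-6 by SC 4 convention.
  roles = %i[organization type standard_number part version object_type]
  described = roles.zip(numbers).to_h
  # Any nodes after the sixth identify a specific instance (e.g. a schema).
  described[:instance] = numbers.drop(roles.length)
  described
end

# {iso standard 10303 part(42) version(10) object(1) topology-schema(2)}
described = describe_oid([1, 0, 10303, 42, 10, 1, 2])
puts described[:part]      # 42
puts described[:version]   # 10
```

Missing trailing nodes come back as nil, so a document-level identifier such as { iso standard 10303 part(nn) version(v) } simply has no object_type and an empty instance list.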

ronaldtse commented 3 years ago

@zakjan it would definitely be easier to just directly parse this string without an ASN1 library.

zakjan commented 3 years ago

Ok, I understand that schema version string is ASN1 object identifier, because of its tree structure and using reserved tree node symbols, ~but the string encoding is custom~.

Parse result for '{ISO standard 10303 part(41) object(1) version(8)}' could be an array of objects:

oid:
- name: ISO
- name: standard
- name: 10303
- name: part
  value: 41
- name: object
  value: 1
- name: version
  value: 8
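
A minimal Ruby sketch of how that array could be produced with custom code, without an ASN.1 library. The method name `parse_version_oid` is hypothetical and not an existing expressir API. It assumes nodes are separated by whitespace and that there is no space between a symbol and its parenthesized number. It accepts both underscores and hyphens in symbols, since the STEPmod example above uses `action_schema(1)`.

```ruby
# Hypothetical helper: splits a brace-delimited version string into nodes.
# Returns nil for plain version strings such as 'version_1'.
def parse_version_oid(string)
  braces = string.strip.match(/\A\{(.*)\}\z/m)
  return nil unless braces

  symbol_pattern = /\A([A-Za-z][A-Za-z0-9_-]*)(?:\((\d+)\))?\z/

  braces[1].split.map do |token|
    if token.match?(/\A\d+\z/)
      # Bare number, e.g. 10303: kept as the name, as in the structure above.
      { name: token }
    elsif (m = symbol_pattern.match(token))
      # Symbol, optionally followed by a number in parentheses.
      node = { name: m[1] }
      node[:value] = m[2].to_i if m[2]
      node
    else
      raise ArgumentError, "unrecognized node: #{token.inspect}"
    end
  end
end

nodes = parse_version_oid("{ISO standard 10303 part(41) object(1) version(8)}")
puts nodes[3][:name]   # part
puts nodes[3][:value]  # 41
```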

zakjan commented 3 years ago

Actually, it is ASN.1 notation for object identifiers, see https://luca.ntop.org/Teaching/Appunti/asn1.html section 5.9.

We could use an existing ANTLR4 grammar for parsing it: https://github.com/tysonite/asn1-compiler/blob/master/compiler/src/main/antlr4/ASN1.g4#L305-L311 or https://github.com/antlr/grammars-v4/blob/master/asn/asn/ASN.g4#L461-L470

ronaldtse commented 3 years ago

@zakjan Using the official grammar would be good (but maybe it belongs to another gem?).

oid:
- name: ISO
- name: standard
- name: 10303

This part is not consistent with the rest, since it is already described that those symbols represent an integer (1st position "iso" => 1, 2nd position "standard" => 0, 3rd position is "standard number" here 10303).

Is it possible to actually have a hash for this? i.e.:

oid:
  organization:
    symbol: iso
    number: 1
  type:
    symbol: standard
    number: 0
  standard_number:
    symbol: standard
    number: 10303
  standard_part:
    symbol: part
    number: 41
  object:
    symbol: object
    number: 1
  version:
    symbol: version
    number: 8

zakjan commented 3 years ago

String names have their mapping to integers registered. You can try it at http://www.oid-info.com/; our sample version string shows that the first two items are registered (iso = 1, standard = 0). The rest of the items are custom.


A hash would work for easier access to specific items by name, but an object identifier is a tree, so the position in the tree is significant.

ronaldtse commented 3 years ago

I see, then please update the structure so that we differentiate between registered and assignable values. Thanks!

zakjan commented 3 years ago

Considering that `iso` and `standard` are the only registered values we expect, I think we can support recognizing only these two, right?

oid:
- name: iso
  value: 1
- name: standard
  value: 0
- name: nil
  value: 10303
- name: part
  value: 41
- name: object
  value: 1
- name: version
  value: 8
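
A rough Ruby sketch of that normalization step, assuming the nodes arrive as an array of hashes with `:name` and an optional `:value`. The names `resolve_registered` and `dotted` are hypothetical, and only the two registered symbols discussed here are recognized, each at its own position.

```ruby
# Hypothetical helper: fills in values for the registered symbols
# (iso = 1 at the first node, standard = 0 at the second) and turns
# bare numbers into unnamed nodes.
def resolve_registered(nodes)
  registered = { 0 => { "iso" => 1 }, 1 => { "standard" => 0 } }

  nodes.each_with_index.map do |node, index|
    name = node[:name]
    if node[:value]
      # symbol(number): the number is already the node value.
      { name: name, value: node[:value] }
    elsif name.to_s.match?(/\A\d+\z/)
      # Bare number: no symbol, the number is the value.
      { name: nil, value: name.to_i }
    else
      value = registered.dig(index, name.to_s.downcase)
      raise ArgumentError, "unregistered symbol #{name.inspect} at position #{index}" if value.nil?
      { name: name.to_s.downcase, value: value }
    end
  end
end

# Joins the node values into the all-numeric form of the identifier.
def dotted(nodes)
  nodes.map { |node| node[:value] }.join(".")
end

nodes = [
  { name: "ISO" }, { name: "standard" }, { name: "10303" },
  { name: "part", value: 41 }, { name: "object", value: 1 },
  { name: "version", value: 8 }
]
puts dotted(resolve_registered(nodes))  # 1.0.10303.41.1.8
```

Keeping the result as an ordered array preserves the tree position of each node, and the lookup is keyed by position because a symbol is only registered for a specific node.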

ronaldtse commented 3 years ago

Yes this works. Thanks!

TRThurman commented 3 years ago

One note: each instance of a schema string refers only to the document that contains it. That is necessary because the 'version' value is the version of the document in which the schema was last modified.
