ISO 10303-11 embedded remark grammar lacks support for blanks and Unicode

ronaldtse commented 3 years ago

As reported here: https://github.com/metanorma/stepmod-utils/issues/62

The current grammar for embedded remarks in ISO 10303-11 lacks support for the blank space (' ') and does not support Unicode characters, or any character with diacritics outside 7-bit ASCII; that is, ISO/IEC 8859 characters are not supported either.

The lack of the space character must be a mistake, because existing remark tags already require spaces. The lack of support for wider characters can limit expressibility of annotations.

Thoughts @TRThurman @brandonsapp ?

zakjan commented 3 years ago

Considering that spaces are ignored everywhere between tokens in regular source code, it seems that this rule applies to remark content as well. However, that would effectively mean that remark content is supposed to be parsed into tokens like the rest of the source.

Currently Expressir doesn't parse remark content into tokens. Simpler rules are used for matching remark content: "anything except *)" for embedded remarks (nesting is allowed), and "anything except \n" for tail remarks. This also allows UTF-8 characters inside remark content. Can we stay with this implementation?
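To make the two matching rules concrete, here is a minimal sketch in Python. It is not Expressir's actual code: the function name `scan_remarks` is hypothetical, and the sketch assumes the source contains no string literals (a `(*` or `--` inside a string literal would be misread as a remark).

```python
def scan_remarks(source: str):
    """Return (kind, text) pairs for each remark in an EXPRESS source string.

    Illustrative only: string literals are not handled.
    """
    remarks = []
    i = 0
    n = len(source)
    while i < n:
        if source.startswith("(*", i):
            # Embedded remark: consume anything up to the matching "*)",
            # tracking depth so that nested remarks stay inside the outer one.
            depth = 1
            start = i + 2
            i = start
            while i < n and depth > 0:
                if source.startswith("(*", i):
                    depth += 1
                    i += 2
                elif source.startswith("*)", i):
                    depth -= 1
                    i += 2
                else:
                    i += 1  # any character, including spaces and non-ASCII
            if depth != 0:
                raise ValueError("unterminated embedded remark")
            remarks.append(("embedded", source[start:i - 2]))
        elif source.startswith("--", i):
            # Tail remark: consume anything up to (not including) the newline.
            end = source.find("\n", i)
            if end == -1:
                end = n
            remarks.append(("tail", source[i + 2:end]))
            i = end
        else:
            i += 1
    return remarks
```

Because the content is never tokenized, spaces and UTF-8 characters pass through unchanged, which is the behaviour described above.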

TRThurman commented 3 years ago

Currently Expressir doesn't parse remark content into tokens. Simpler rules are used for matching remarks content, "anything except *)" for embedded remarks, and "anything except \n" for tail remarks. This also allows for UTF-8 characters. Can we stay with this implementation?

Does this impact the design of the workflow using the annotated express?

ronaldtse commented 3 years ago

Here are the relevant sections from ISO 10303-11.

7.1.5 Whitespace
Whitespace is defined by the following sub-clauses and by 7.1.6. Whitespace shall be used to separate the tokens of a schema written in EXPRESS.
NOTE Liberal, and consistent, use of whitespace can improve the structure and readability of a schema.

7.1.5.1 Space character
One or more spaces (cell 20 of the EXPRESS character set) can appear between two tokens. The notation \s may be used to represent a blank space character in the syntax of the language.

7.1.5.2 Newline
A newline marks the physical end of a line within a formal specification written in EXPRESS. Newline is normally treated as a space but is significant when it terminates a tail remark or abnormally terminates a string literal. A newline is represented by the notation \n in the syntax of the language.

The representation of a newline is implementation specific.

Whitespace is allowed between tokens.

7.1.6 Remarks
A remark is used for documentation and shall be interpreted by an EXPRESS language parser as whitespace. There are two forms of remark, embedded remark and tail remark. Both forms of remark may be associated with an identified construct using a remark tag.

7.1.6.1 Embedded remark
The character pair (* denotes the start of an embedded remark and the character pair *) denotes its end. An embedded remark may appear between any two tokens.

Any character within the EXPRESS character set may occur between the start and end of an embedded remark including the newline character; therefore, embedded remarks can span several physical lines.

It is specified that the whole remark is considered "whitespace".

It is not specified whether the content inside the remark is parsed into tokens.
Interestingly, the newline character (7.1.5.2) is expressly allowed inside embedded remarks.
Since it says "Any character within the EXPRESS character set may occur", and the character set defined in 7.1 "Character set" includes the space character (7.1.5.1), a space is also allowed inside a remark, even though the grammar doesn't state so.

Re: whitespace, I don't think we need to make any changes here.

Re: Unicode -- it is possible to create Metanorma text using a plain-ASCII markup without Unicode, so it is not a problem for Annotated Express now. That is, until someone wishes to explicitly insert Unicode characters, e.g. Japanese あ instead of \u3042.
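As a rough illustration of the plain-ASCII point, the sketch below lists the characters in a remark that would need an escape. The function name `non_ascii_characters` is hypothetical, and "plain ASCII" is taken here to mean 0x20 to 0x7E plus newline, which is an assumption for the example rather than the standard's definition of the EXPRESS character set.

```python
def non_ascii_characters(remark: str):
    """Return (position, character, escape) for each character outside plain ASCII.

    Assumption: plain ASCII means 0x20..0x7E plus newline. The escape uses the
    \\uXXXX form mentioned above; characters outside the Basic Multilingual
    Plane would print with more than four hex digits.
    """
    found = []
    for pos, ch in enumerate(remark):
        code = ord(ch)
        if ch == "\n" or 0x20 <= code <= 0x7E:
            continue  # allowed as-is
        found.append((pos, ch, f"\\u{code:04x}"))
    return found
```

For the example above, `non_ascii_characters("Japanese あ")` reports the character at position 9 together with the escape `\u3042`.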

zakjan commented 3 years ago

Any character within the EXPRESS character set may occur between the start and end of an embedded remark

This explains it. Thanks for confirming!

lutaml / expressir

ISO 10303-11 embedded remark grammar lacks support for blanks and Unicode #69