metanorma / stepmod-utils

Tools for working on the STEPmod repository.

(URGENT) stepmod-annotate-all shall generate characters only allowed in EXPRESS language #261

Open TRThurman opened 4 months ago

TRThurman commented 4 months ago

Problem: Auto-generated annotated EXPRESS files contain non-ASCII (UTF-8) characters, e.g., the degree symbol or the less-than-or-equal symbol. While these are permitted in, e.g., descriptions.xml, they are not permitted in EXPRESS files. EXPRESS defines the "EXPRESS character set" to denote the characters permitted in EXPRESS files. From ISO 10303-11:2004:

A schema written in EXPRESS shall use only the characters in the following character set: characters allocated to cells 09, 0A, 0D, the graphic characters lying in the range 20 to 7E of ISO/IEC 10646, and the special character \n signifying the newline. This set of characters is called the EXPRESS character set.
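
The character set quoted above can be checked mechanically. The following is a minimal sketch, not part of stepmod-utils: the function name `find_disallowed` is hypothetical, and it assumes the file has already been decoded to text (for example as UTF-8).

```python
import re

# Anything outside the EXPRESS character set as quoted above:
# tab (09), line feed (0A), carriage return (0D), and the range 20 to 7E.
_DISALLOWED = re.compile(r"[^\x09\x0A\x0D\x20-\x7E]")


def find_disallowed(text):
    """Return (line, column, character) for each character outside the set.

    Hypothetical helper for illustration; line and column are 1-based.
    """
    hits = []
    # Split on "\n" only, so that other control characters are reported
    # rather than silently treated as line breaks.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in _DISALLOWED.finditer(line):
            hits.append((line_no, match.start() + 1, match.group()))
    return hits


if __name__ == "__main__":
    sample = "a : REAL; -- 90° angle\nb ≤ c"
    for line_no, col, char in find_disallowed(sample):
        print(f"line {line_no}, column {col}: U+{ord(char):04X}")
```

A check like this could run after annotation to report the offending positions before the files reach an EXPRESS tool.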

Note: STEP developers are familiar with using HTML encoding for special symbols, but typically only use those in mapping specifications.

ronaldtse commented 3 months ago

@TRThurman Given this information, it means that even in the "remarks" (i.e. Annotated EXPRESS remarks) only ASCII characters are allowed, right?

This means that we cannot have unicode etc inside the EXPRESS remarks?

Ultimately, we need to revise this, right? Imagine CJK characters as content?

TRThurman commented 3 months ago

@ronaldtse We shall not have unicode etc. in EXPRESS at all, anywhere.

ronaldtse commented 3 months ago

During the last Seoul meeting, several Japanese members wanted to use Annotated EXPRESS for their own work in Japan, which involves Japanese content. Not a reasonable use case?

TRThurman commented 3 months ago

Let's deal with that separately on ELF. Recall that annotated EXPRESS shall be processable by current EXPRESS tools and we can't change eengine and eep at this time.

TRThurman commented 3 months ago

We should ask our Japanese colleagues if an alternate encoding would be acceptable.

ronaldtse commented 3 months ago

I don't think there is a Japanese ASCII encoding...

TRThurman commented 3 months ago

From my favorite search engine:

  1. Romaji: This is a method of writing Japanese using the Latin alphabet. It's not a true encoding, but a transliteration system.

    Example: "こんにちは" (Konnichiwa) in romaji is simply "konnichiwa"

  2. Kunrei-shiki and Hepburn systems: These are specific standardized forms of romaji with slightly different rules.

  3. ASCII-JIS (ISO-2022-JP): This is an actual encoding that uses escape sequences and 7-bit ASCII bytes to represent Japanese characters. The kana and kanji it encodes come from the JIS X 0208 standard.

    Example: "こんにちは" might be represented as "\x1B$B$3$s$K$A$O\x1B(B"

  4. Base64 encoding: While not specifically for Japanese, you can encode UTF-8 Japanese text into Base64, which uses only ASCII characters.

    Example: "こんにちは" in Base64 is "44GT44KT44Gr44Gh44Gv"

  5. Numeric character references: You can represent Unicode characters using their numeric values in ASCII.

    Example: "こんにちは" as HTML numeric character references: "&#12371;&#12435;&#12395;&#12385;&#12399;"
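
For comparison, the three machine encodings in the list (items 3 to 5) can be reproduced with the Python standard library. This is only a sketch; the function names are hypothetical. Note that the ISO-2022-JP form contains the ESC control character (0x1B), which lies outside the EXPRESS character set quoted earlier in the thread, so of the three only Base64 and numeric character references stay within it.

```python
import base64
import html


def to_base64(text):
    """Base64 over the UTF-8 bytes of the text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def to_ncr(text):
    """Replace every non-ASCII character with a decimal &#NNNN; reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_iso2022jp(text):
    """7-bit ISO-2022-JP bytes, including the ESC escape sequences."""
    return text.encode("iso2022_jp")


if __name__ == "__main__":
    greeting = "こんにちは"
    print(to_base64(greeting))
    print(to_ncr(greeting))
    print(to_iso2022jp(greeting))
    # Numeric character references can be reversed with html.unescape.
    print(html.unescape(to_ncr(greeting)) == greeting)
```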

ronaldtse commented 3 months ago

😓 No one will use these as textual encodings. There is a reason Unicode exists...

TRThurman commented 3 months ago

No argument there. I just want something to give to ISO for the work we have committed to do, which requires using existing EXPRESS. The Japanese can help campaign for an update to EXPRESS to use UTF-8 if that is their documented requirement. It should come from the Japanese directly to WG11.

stuartgalt commented 3 months ago

Being severely under-caffeinated today, I will put on my math hat and ask: if I can't solve the problem, can I turn it into one that I know the answer to?

Given: eengine/eep can only use ASCII. Annotated EXPRESS combines 10303-11 and "documentation". There is a reasonable use case for needing non-ASCII characters in the documentation portions of annotated EXPRESS.

Would it be possible/feasible to add a stop code that eengine/eep would use to stop processing the file before it gets to EOF? Or add a wrapper script that extracts the annotated express into the 10303-11 part and sends that to the express tool?
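
The wrapper idea could look roughly like the sketch below, which removes remarks before the text is handed to an EXPRESS tool. It is an illustration under assumptions, not an existing stepmod-utils or eengine/eep feature: the function name `strip_remarks` is hypothetical, it relies on the ISO 10303-11 remark syntax (nestable `(* ... *)` embedded remarks and `--` tail remarks), it handles only simple single-quoted string literals, and how the result is passed to the tool is left out.

```python
def strip_remarks(source):
    """Return EXPRESS source with embedded and tail remarks removed.

    Hypothetical helper for illustration only.
    """
    out = []
    i = 0
    depth = 0  # nesting depth of (* ... *) remarks
    n = len(source)
    while i < n:
        two = source[i:i + 2]
        if depth > 0:
            if two == "(*":
                depth += 1
                i += 2
            elif two == "*)":
                depth -= 1
                i += 2
            else:
                # Keep newlines so line numbers in tool diagnostics still match.
                if source[i] == "\n":
                    out.append("\n")
                i += 1
        elif two == "(*":
            depth = 1
            i += 2
        elif two == "--":
            # Tail remark: skip to the end of the line, keeping the newline.
            end = source.find("\n", i)
            i = n if end == -1 else end
        elif source[i] == "'":
            # Simple string literal: copy verbatim up to the closing quote.
            # A doubled '' is handled as one literal closing and another opening.
            end = source.find("'", i + 1)
            end = n - 1 if end == -1 else end
            out.append(source[i:end + 1])
            i = end + 1
        else:
            out.append(source[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    annotated = "ENTITY e; -- angle in ° \n  n : STRING := 'a--b';\nEND_ENTITY;\n"
    print(strip_remarks(annotated))
```

The trade-off is that the compiler sees no documentation at all, so any tool that depends on tagged remarks would still need the annotated file.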

TRThurman commented 3 months ago

I don't really care once we get the validation report published. The critical code base is easyEXPRESS. We have to modify easyEXPRESS and I don't want to do that just yet. UTF-8 characters would be a 'pop-up' in the interface, but those 'foreign' interfaces aren't working yet. So for the short-term solution, Ron, please replace the UTF-8 characters (they are symbols) with an equivalent ASCII encoding (HTML?). To make this work with embedded UTF-8, easyEXPRESS would need an output function with two targets: compiler and publication.
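
A minimal sketch of the short-term replacement, assuming HTML numeric character references are the chosen ASCII form. The two function names are hypothetical and merely mirror the "compiler" and "publication" targets mentioned above; this is not the easyEXPRESS or stepmod-utils implementation.

```python
import html


def for_compiler(annotated):
    """ASCII-only rendering: each non-ASCII character becomes &#NNNN;."""
    return annotated.encode("ascii", "xmlcharrefreplace").decode("ascii")


def for_publication(ascii_text):
    """Restore the original characters from the character references."""
    return html.unescape(ascii_text)


if __name__ == "__main__":
    remark = '(*"s.angle" in the range 0° ≤ x ≤ 90° *)'
    ascii_remark = for_compiler(remark)
    print(ascii_remark)
    print(for_publication(ascii_remark) == remark)
```

One caveat with this design: `html.unescape` also converts references that were already present in the source, such as `&lt;` or `&amp;`, so the round trip is exact only when the original text contains no reference-like sequences of its own.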

TRThurman commented 3 months ago

To re-iterate: We need the UTF-8 characters removed so we can publish the validation report. The validation report is due in July.

TRThurman commented 3 months ago

@ronaldtse @stuartgalt I requested some input from Sylvere Krima on supporting UTF-8 characters in tagged remarks. That won't help near-term.