Open TRThurman opened 5 months ago
@TRThurman Given this information, it means that even in the "remarks" (i.e. Annotated EXPRESS remarks) only ASCII characters are allowed, right?
This means that we cannot have Unicode etc. inside the EXPRESS remarks?
Ultimately, we need to revise this, right? Imagine CJK characters as content?
@ronaldtse We shall not have unicode etc. in EXPRESS at all, anywhere.
During the last Seoul meeting, several Japanese members wanted to use Annotated EXPRESS for their own work in Japan, which involves Japanese content. Not a reasonable use case?
Let's deal with that separately on ELF. Recall that annotated EXPRESS shall be processable by current EXPRESS tools and we can't change eengine and eep at this time.
We should ask our Japanese colleagues if an alternate encoding would be acceptable.
I don't think there is a Japanese ASCII encoding...
From my favorite search engine:
Romaji: This is a method of writing Japanese using the Latin alphabet. It's not a true encoding, but a transliteration system.
Example: "こんにちは" (Konnichiwa) in romaji is simply "konnichiwa"
Kunrei-shiki and Hepburn systems: These are specific standardized forms of romaji with slightly different rules.
ASCII-JIS (ISO-2022-JP): This is an actual encoding that uses only 7-bit ASCII bytes, with escape sequences, to represent Japanese characters. The kana and kanji it switches to come from the JIS X 0208 character set.
Example: "こんにちは" might be represented as "\x1B$B$3$s$K$A$O\x1B(B"
Base64 encoding: While not specifically for Japanese, you can encode UTF-8 Japanese text into Base64, which uses only ASCII characters.
Example: "こんにちは" in Base64 is "44GT44KT44Gr44Gh44Gv"
Numeric character references: You can represent Unicode characters using their numeric values in ASCII.
Example: "こんにちは" as HTML numeric character references: "&#12371;&#12435;&#12395;&#12385;&#12399;"
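The ASCII-only representations listed above can be reproduced with the Python standard library. This is only an illustrative sketch: it assumes Python's `iso2022_jp` codec corresponds to the escape-sequence example, and the variable names are made up for this snippet.

```python
import base64

text = "こんにちは"

# Escape-sequence encoding: ESC $ B switches to JIS X 0208, ESC ( B back to ASCII.
iso2022 = text.encode("iso2022_jp")

# Base64 applied to the UTF-8 bytes of the text.
b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")

# Decimal numeric character references, as used in HTML/XML.
ncr = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

for label, value in [("ISO-2022-JP", iso2022), ("Base64", b64), ("NCR", ncr)]:
    print(f"{label}: {value}")
```

All three results contain only ASCII characters and can be decoded back to the original text, but none of them is readable as Japanese in the raw file.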
😓 No one will use these as textual encodings. There is a reason Unicode exists...
no argument there. I just want something to give to ISO for the work we have committed to do, which requires using existing EXPRESS. The Japanese can help campaign for an update to EXPRESS to use UTF-8 if that is their documented requirement. It should be coming from the Japanese directly to WG11.
Being severely under-caffeinated today, I will put on my math hat and ask: if I can't solve the problem, can I turn it into one that I know the answer to?
Given: eengine/eep can only use ASCII. Annotated EXPRESS combines 10303-11 and "documentation". There is a reasonable use case for non-ASCII in the documentation portions of the Annotated EXPRESS.
Would it be possible/feasible to add a stop code that eengine/eep would use to stop processing the file before it gets to EOF? Or add a wrapper script that extracts the 10303-11 part from the Annotated EXPRESS and sends that to the EXPRESS tool?
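A rough sketch of the wrapper idea, in Python: strip the remarks before handing the file to the EXPRESS tool, so that any non-ASCII documentation never reaches it. The function name `strip_remarks` is hypothetical, and the scanner assumes that embedded remarks `(* ... *)` may nest, that tail remarks start with `--` and run to the end of the line, and that simple string literals are delimited by apostrophes. How the result is passed to eengine or eep is left out, since that depends on the tool invocation.

```python
def strip_remarks(source: str) -> str:
    """Return EXPRESS source with embedded and tail remarks removed.

    Newlines inside removed remarks are kept so that line numbers in
    tool diagnostics still match the original file.
    """
    out = []
    i, n = 0, len(source)
    depth = 0          # nesting depth of (* ... *) remarks
    in_string = False  # inside an apostrophe-delimited string literal
    while i < n:
        ch = source[i]
        two = source[i:i + 2]
        if depth:
            if two == "(*":
                depth += 1
                i += 2
            elif two == "*)":
                depth -= 1
                i += 2
            else:
                if ch == "\n":
                    out.append(ch)
                i += 1
        elif in_string:
            out.append(ch)
            if ch == "'":
                in_string = False
            i += 1
        elif ch == "'":
            in_string = True
            out.append(ch)
            i += 1
        elif two == "(*":
            depth = 1
            i += 2
        elif two == "--":
            # Skip to end of line; the newline itself is kept by the next pass.
            while i < n and source[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Removing the remarks also removes the annotations, so this would only serve the compiler target; the publication target would still read the original file.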
I don't really care once we get the validation report published. The critical code base is easyEXPRESS. We have to modify easyEXPRESS and I don't want to do that just yet. UTF-8 characters would be a 'pop-up' in the interface but those 'foreign' interfaces aren't working yet. So for the short-term solution, Ron, please replace the UTF-8 characters (they are symbols) with an equivalent (HTML?) ASCII encoding. To make this work with embedded UTF-8, easyEXPRESS would need an output function with two targets: compiler and publication.
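As a possible short-term approach, the replacement could be done mechanically with HTML numeric character references. The sketch below uses only the Python standard library; the function name `to_ascii_with_char_refs` and the sample text are made up for illustration and are not part of easyEXPRESS.

```python
import html


def to_ascii_with_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal reference like &#176;."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "(* angle in ° where a ≤ b *)"
ascii_only = to_ascii_with_char_refs(sample)
print(ascii_only)

# A publication step could restore the original symbols.
restored = html.unescape(ascii_only)
print(restored == sample)
```

One caveat with this choice: `html.unescape` also expands named entities such as `&amp;`, so the round trip is only exact if the remarks contain no other entity-like text.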
To re-iterate: We need the UTF-8 characters removed so we can publish the validation report. The validation report is due in July.
@ronaldtse @stuartgalt I requested some input from Sylvere Krima on supporting UTF-8 characters in tagged remarks. That won't help near-term.
Problem: Auto-generated Annotated EXPRESS files contain UTF-8 characters, e.g., the degree symbol or the less-than-or-equal symbol. While these are permitted in, e.g., descriptions.xml, they are not permitted in EXPRESS files. EXPRESS defines the "EXPRESS character set" to denote the characters permitted in EXPRESS files. From ISO 10303-11:2004:
Note: STEP developers are familiar with using HTML encoding for special symbols, but typically only use those in mapping specifications.
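To find the offending characters in a generated file, a small check along these lines might help. It is a sketch that assumes, as this thread does, that anything outside 7-bit ASCII is disallowed; it does not check the full "EXPRESS character set" definition from ISO 10303-11, and the function name `find_non_ascii` is hypothetical.

```python
def find_non_ascii(text: str):
    """Return (line, column, character, code point) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch, f"U+{ord(ch):04X}"))
    return hits


sample = "a : REAL; (* 90° *)\nb (* x ≤ y *)"
for line_no, col, ch, code in find_non_ascii(sample):
    print(f"line {line_no}, column {col}: {ch!r} ({code})")
```

Line and column numbers are 1-based and counted in characters, not bytes.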