modelica / ModelicaSpecification

Specification of the Modelica Language
https://specification.modelica.org
Creative Commons Attribution Share Alike 4.0 International

Proposal: Extended UTF-8 symbols in Modelica Classes, Operators and Instances #2357

Open christiankral opened 5 years ago

christiankral commented 5 years ago

Motivation

Modelica is a language for modeling physical systems, so physical variables are used in equations. Many Modelica libraries, including the Modelica Standard Library, use Greek symbols. Some examples are:

When I came in touch with Julia some time ago, I really liked how variables, operators and functions can be named. One could just use the variable α instead of writing alpha. This is much closer to how scientists and engineers write variables. Why not use this spelling in Modelica equations?

Julia also allows some very useful characters. One that I really like and use extensively is the prime symbol ´. This symbol is used a lot in electrical engineering to indicate transformed quantities. However, this or a similar symbol may also be used to indicate time derivatives in mathematics.

In order to write the Greek alpha in Julia one types \alpha and hits the tab key to auto-convert the letter. So it is pretty easy to access non-ASCII characters through a LaTeX kind of typing.

Proposal of Unicode Identifiers

As Modelica code is already stored in UTF-8 format, it would certainly be possible to extend the set of characters allowed in Modelica code. My proposal is to allow further UTF-8 characters in Modelica classes, operators and instances.

I believe that if we do not introduce Unicode class names now, it will most likely never happen. As the Modelica Standard Library 4.0.0 will process conversion scripts, this is a window of opportunity to implement the proposed extension.

A list of allowed variable names in Julia is provided in the Julia documentation. I am sure we do not need to support all the Unicode character categories of Julia, but some of them definitely make sense.

I propose to support at least the following characters:

More possible characters may make sense, depending on further discussions.

HansOlsson commented 5 years ago

This idea of Unicode characters precedes Julia and was already used in Plan 9 from Bell Labs (which allegedly introduced UTF-8).

However, the issue is with entering them, since they are normally missing from keyboards (APL solved that by having special keyboards). Requiring tools for entering models seems like a step backwards - and I'm not sure how that will work with GitHub.

One possibility is that people write "Alpha", "\Alpha", or something similar, and there is a standardized mapping to a visual representation.

HansOlsson commented 5 years ago

Language group:

A bit of discussion, including general mathematical markdown, whether the lexer handles \alpha -> alpha-character, and quoted identifiers, but no conclusion.

christoff-buerger commented 5 years ago

Another use-case for Unicode identifiers, besides mathematics/physics, is Asian languages. Developers are more or less forced to use Western, i.e. English, names; but maybe they would like to use their native language with its own alphabet for naming models and components.

christiankral commented 5 years ago

Modelica is based on physical and mathematical equations. So to me it makes a lot of sense to overcome the 20th century limitations of US-ASCII code variables. Particularly when thinking of styling and naming guidelines for Modelica code (modelica:#2931), it would be very beneficial to extend the current character set for variables.

I agree that entering UTF-8 characters is an issue. Personally, I like the idea of a mathematical markdown such that e.g. \alpha creates the character α. I think that for most of the usual mathematical symbols and Greek characters it would be quite a reasonable effort to implement such a markdown. Otherwise I would take a rather pragmatic view: if I have no fancy keyboard to create fancy characters, I either do not use them or have to find a workaround with copy and paste or character map applications.

In variables and classes we could just allow the same character set as in Julia. Operators such as × or ∥ may either be not allowed at all or restricted to operator overloading.

Personally, I dislike quoted identifiers as they are awkward to read.

I wonder how we can move forward on this issue in case we agree that we move forward with it. To me it makes a lot of sense to create a proposal which can be applied to MSL 4.0.0, otherwise the next chance is in ten or 20 years from now to make it really good, coherent and consistent...

sjoelund commented 5 years ago

Honestly you might not even need to specify any mapping for \alpha to α. We could make it a tool convenience that typing \alpha replaces that \alpha with α (like the Julia CLI does).

AHaumer commented 5 years ago

I have an additional use case:

How to represent a variable from a textbook with an underscore as a Modelica variable? Widely used for Complex variables. [image]

How to represent a variable with a line as accent ("upperscore", I don't know the right term) as a Modelica variable? Widely used for the arithmetic mean of a variable or as the negated variable. [image]

HansOlsson commented 5 years ago

It's not clear that allowing Unicode for this is the ideal solution.

It's possible: there are combining characters for them (for below we have U+0332 "combining low line", or possibly U+0331 "combining macron below"; and for above U+0305 "combining overline", or possibly U+0304 "combining macron"); but I don't see how people will easily enter them.

Note that if we want to handle sub/super-scripts Unicode is quite restricted; and similarly if we want some over/under for multi-character variables. (The under/over-line should connect - but it seems less clear if we want tilde or similar.)

In addition, at least "i" and "o" have existing characters for the combination, creating an ambiguity. Obviously we could say that Unicode strings should be normalized in some way to avoid that ambiguity - but then we need to define how to normalize.
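A minimal Python sketch of this ambiguity, assuming the combination meant is the combining macron U+0304 (the comment does not say which mark). It uses only the standard unicodedata module; the helper name compose is made up for illustration. Under NFC, "o" and "i" collapse into existing precomposed characters, while "x" stays as two code points:

```python
import unicodedata

COMBINING_MACRON = "\u0304"  # assumption: the accent discussed is the macron


def compose(base: str, mark: str = COMBINING_MACRON) -> str:
    """Return the NFC form of a base letter followed by a combining mark."""
    return unicodedata.normalize("NFC", base + mark)


for base in ("o", "i", "x"):
    result = compose(base)
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in result)
    print(f"{base} + U+0304 -> {code_points}")
```

So whether an accented variable is one code point or two depends on the base letter, which is why a normalization rule would have to be fixed before identifiers can be compared.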

christoff-buerger commented 5 years ago

In general, I think it is a good idea to go for Unicode identifiers, all UTF-8 encoded. I just like to add a few comments on the proposals so far.

Using escape sequences for convenient encoding of common mathematical and physical symbols, like \alpha for α: It is a requirement to have an easy way of entering common non-keyboard glyphs, but not in terms of LaTeX-like escape sequences defined in the standard. We either (1) leave this as an IDE problem (thus, each tool vendor can have their own way to ease entering the respective common glyphs), like the proposed "when a user enters \alpha followed by <tabulator>, the glyph actually entered is the codepoint for α"; or (2) define in the Modelica standard a third special Unicode escape sequence beside the common \uXXXX and \UXXXXXXXX.

A proposal for (2) would be something like \u{C*}, where C* can be any sequence of characters from a set restricted in the Modelica standard, for which the standard defines a mapping to a Unicode codepoint. For example, \u{alpha} could be defined to map to α and \u{sum} to ∑. Naturally, none of the valid C* sequences we define in the standard may contain { or }; any undefined sequence C* within a \u{C*} is then an error. We could even support the actual character names of Unicode (cf. https://unicode.org/charts/charindex.html) by supporting a capital \U{C*} beside \u; for example, \u{alpha}\U{ABOVE, COMBINING RIGHT ARROW} would be an arrow above α (GitHub markdown renders such Combining Diacritical Marks for Symbols very ugly and shows the single glyph α⃗, i.e., α followed by the upper arrow; nevertheless it is a single glyph, as one can see when marking it for copying, for example).

The rationale for the new third special Unicode escape sequence is simply that just writing \napla, for example, is in fact a newline followed by apla. There is no sane way to make such LaTeX-like escapes work properly in the universe of Unicode; in LaTeX one is often forced to escape the escapes, for example by embracing them in { and }, because LaTeX is a pre-Unicode design and in fact not a good ideal.
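Option (2) could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the table SHORT_NAMES and the function expand_escapes are hypothetical, the table holds just the two examples given above, and \U{...} is resolved with Python's unicodedata.lookup, which expects the official character name (COMBINING RIGHT ARROW ABOVE) rather than the index-style spelling used in the comment.

```python
import re
import unicodedata

# Hypothetical short-name table; a real standard would have to define it.
SHORT_NAMES = {"alpha": "\u03b1", "sum": "\u2211"}

# \u{...} or \U{...}; the name itself must not contain '{' or '}'.
ESCAPE = re.compile(r"\\(u|U)\{([^{}]*)\}")


def expand_escapes(text: str) -> str:
    """Replace \\u{short} and \\U{UNICODE NAME} escapes by their code points."""

    def replace(match: re.Match) -> str:
        kind, name = match.group(1), match.group(2)
        try:
            if kind == "u":
                return SHORT_NAMES[name]
            return unicodedata.lookup(name)  # raises KeyError if unknown
        except KeyError:
            # Undefined sequences are an error, as proposed in the comment.
            raise ValueError(f"undefined escape: {match.group(0)}") from None

    return ESCAPE.sub(replace, text)


arrow_alpha = expand_escapes(r"\u{alpha}\U{COMBINING RIGHT ARROW ABOVE}")
print([f"U+{ord(ch):04X}" for ch in arrow_alpha])
```

Because the braces delimit the name, there is no clash with existing escapes such as \n, which is the rationale given above for not using bare LaTeX-like sequences.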

On valid Unicode identifiers: We should by no means reinvent the wheel and come up with a Modelica-specific definition of what valid identifiers are, i.e., which Unicode sequences are permitted, such as restrictions on the kind of the first glyph of an identifier or on the glyphs following. There is a lot of knowledge on that in well-established languages and we should just take the definition of one of these. What I would like is a Modelica-Design group discussion where somebody presents the definitions of a bunch of commonly used languages supporting Unicode, comparing them and giving examples of the types of identifiers they permit; let's say of Java, C#, Python 3, Julia, C++.
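As one data point for such a comparison, Python 3's own rule can be probed with the built-in str.isidentifier(). The candidate list below is chosen for illustration only and says nothing about what Julia or the other languages accept:

```python
# Candidate identifiers, written with escapes so the code points are explicit.
candidates = {
    "greek alpha": "\u03b1",
    "a + combining ring above": "a\u030a",
    "x + combining overline": "x\u0305",
    "x + prime (U+2032)": "x\u2032",
    "multiplication sign": "\u00d7",
    "leading digit": "2x",
}

# str.isidentifier() applies Python's definition of a valid identifier.
results = {label: text.isidentifier() for label, text in candidates.items()}

for label, accepted in results.items():
    print(f"{label:28s} {'accepted' if accepted else 'rejected'}")
```

In Python, letters and combining marks after a letter are accepted, while the prime and operator-like symbols are rejected, so the prime notation requested in the proposal would need a more permissive definition than Python's.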

christoff-buerger commented 5 years ago

How to represent a variable from a textbook with an underscore as a Modelica variable? How to represent a variable with a line as accent ("upperscore", I don't know the right term) as a Modelica variable?

@AHaumer: Please consider the Combining Diacritical Marks for Symbols example in my post above. It explains what a proper Unicode treatment of your problem looks like. For example, one can use the Unicode code point 0x0305 (cf. https://unicode.org/charts/PDF/U0300.pdf) to add an upperscore to a glyph. I am not saying this is nice, but it is in the spirit of Unicode. Note that this is only to add upper- and underscores to single glyphs; it is no replacement for sub- or superscription, which is anyway not a Unicode problem, but a text-rendering layout problem.

beutlich commented 5 years ago

let's say of Java, C#, Python 3, Julia, C++.

Or Wolfram Language.

henrikt-ma commented 5 years ago

@AHaumer: Please consider the Combining Diacritical Marks for Symbols example in my post above. It explains what a proper Unicode treatment of your problem looks like. For example, one can use the Unicode code point 0x0305 (cf. https://unicode.org/charts/PDF/U0300.pdf) to add an upperscore to a glyph. I am not saying this is nice, but it is in the spirit of Unicode. Note that this is only to add upper- and underscores to single glyphs; it is no replacement for sub- or superscription, which is anyway not a Unicode problem, but a text-rendering layout problem.

But how are other languages dealing with the canonicalization issue? For example, take the Swedish letter 'å'. In Sweden, it is clear that this should be LATIN SMALL LETTER A WITH RING ABOVE (UTF-8: C3 A5), but in most parts of the world, it is probably just a Unicode sequence for an 'a' with a COMBINING RING ABOVE, å (UTF-8: 61 CC 8A). With the two appearing the same, code using these would be completely obfuscated unless they are treated as equivalent.
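The two spellings can be compared in a short Python sketch using the standard unicodedata module. The helper names utf8_hex and same_after_normalization are invented for this illustration, and NFC is just one possible choice of normalization form:

```python
import unicodedata

precomposed = "\u00e5"   # LATIN SMALL LETTER A WITH RING ABOVE
decomposed = "a\u030a"   # 'a' followed by COMBINING RING ABOVE


def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as a hex string."""
    return text.encode("utf-8").hex()


def same_after_normalization(left: str, right: str, form: str = "NFC") -> bool:
    """Compare two strings after applying the given normalization form."""
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)


print(utf8_hex(precomposed), utf8_hex(decomposed))
print("raw comparison:       ", precomposed == decomposed)
print("after NFC comparison: ", same_after_normalization(precomposed, decomposed))
```

A tool that compares identifiers byte by byte would treat these as two different variables; only a comparison after normalization treats them as the same name.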

sjoelund commented 5 years ago

In Julia, those are considered the same:

å = 1 # Using C3 A5
println(å) # Using 61 CC 8A

Sort of...

julia> eval(Symbol("\u00e5"))
1
julia> eval(Symbol("a\u030a"))
ERROR: UndefVarError: å not defined

So the parser will normalize the string, but if you do some metaprogramming and don't normalize your code points, you may end up with unexpected results :)

HansOlsson commented 5 years ago

So the parser will normalize the string,

Which begs the question: which normalization?

And the answer seems to be that Julia used NFC and now uses NFC with some custom additions, whereas Python 3 uses NFKC. Since some of the benefits are for people not using Latin alphabets, I think we need to involve them: in general, internationalization shows that cultures differ in unexpected ways and one generally has to rethink more than anticipated.
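The practical difference between the two forms can be seen in a small Python sketch. The sample strings are chosen here for illustration (the micro sign and a superscript two are plausible in physical models) and do not come from the thread; the helper normal_forms is hypothetical:

```python
import unicodedata

samples = {
    "micro sign": "\u00b5",          # MICRO SIGN, distinct from Greek mu U+03BC
    "x superscript two": "x\u00b2",
    "a + combining ring": "a\u030a",
}


def normal_forms(text: str) -> tuple[str, str]:
    """Return the (NFC, NFKC) normalizations of text."""
    return unicodedata.normalize("NFC", text), unicodedata.normalize("NFKC", text)


for label, text in samples.items():
    nfc, nfkc = normal_forms(text)
    show = lambda s: " ".join(f"U+{ord(ch):04X}" for ch in s)
    print(f"{label:20s} NFC: {show(nfc):15s} NFKC: {show(nfkc)}")
```

NFC only merges canonically equivalent sequences, whereas NFKC additionally folds compatibility characters, so under NFKC the micro sign becomes the Greek letter mu and x² becomes x2. Which of these is desirable for Modelica identifiers is exactly the open question.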

I also believe we need to have multilingual support #302 in place first; as both of these allow us to translate a library to another natural language - but multilingual support is better for re-use and collaboration; so we don't want people to make a translated copy of MSL due to lack of multilingual support. (I have seen translations of parts of Modelica.Fluid - so it will happen.)

christoff-buerger commented 5 years ago

I also believe we need to have multilingual support #302 in place first [..]

Note that if we want to handle sub/super-scripts Unicode is quite restricted [..]

To me, both are off topic; these are completely orthogonal issues. We can decide on the supported encoding for identifiers regardless of (1) additional translation support or (2) text rendering support.

Unicode is about glyphs; the individual symbols of an alphabet and their symbolic meaning. It is not about text or natural language per se -- it is not about the meaning of sentences, the relation of text parts either grammatical or due to layout (i.e., spatial relations). It is not a text rendering engine. Problems like sub- and superscription, underscoring, overscoring or striking through, sentences or paragraphs, lists, floating text and image embedding are not at all handled by Unicode and are a completely different topic. Any future decision on text rendering support we agree on can be combined with Unicode support and should be taken separately. At most, we can consider how well different alphabets are supported by common fonts; but just to be crystal clear: how to render a glyph or combinations of glyphs is completely up to the font developer and the text rendering engine -- fonts are an artistic and text rendering problem, not a Unicode issue.

Likewise, translation support is completely orthogonal. Again, translation is about text meaning, which is not the topic of Unicode.

We should not obfuscate these problems with the problem Unicode treats: unique, universal encoding of glyphs (individual symbols of different alphabets).

christoff-buerger commented 5 years ago

But how are other languages dealing with the canonicalization issue? For example, take the Swedish letter 'å'. In Sweden, it is clear that this should be LATIN SMALL LETTER A WITH RING ABOVE (UTF-8: C3 A5), but in most parts of the world, it is probably just a Unicode sequence for an 'a' with a COMBINING RING ABOVE, å (UTF-8: 61 CC 8A). With the two appearing the same, code using these would be completely obfuscated unless they are treated as equivalent.

I think the problem you describe is on the user's side. It is about trying to use a glyph without knowing its meaning. The user who tries to render å by writing a followed by a circle above in terms of a Combining Diacritical Mark clearly does not know å in the first place. He starts using it out of its semantic context (alphabetic misuse). We could try to attenuate such misuses by automatically applying some kind of normalization/canonicalization, which can help a lot. But if we go Unicode, such issues will always remain.

If one doesn't know what one is doing, one should not expect to succeed: the two å are not the same, even if they look the same. If one does not understand why, then the contextual knowledge is missing. I am also not starting to try to compose Chinese glyphs out of Combining Diacritical Marks because I don't know anything about the Chinese alphabet.

If we would like to avoid this issue altogether, the only solution is to not support Unicode and stay with ASCII -- or train everybody to a level where they know all kinds of alphabets :)

As said, normalization can help with this, but won't catch all cases. It also has the problem that the text of users who know what they are doing is changed to some canonical form, which may then be wrong considering the alphabet used. For example, a followed by the circle Combining Diacritical Mark may in fact be correct, because the context was not the Swedish å but an a with a circle above (let's say some funny math notation). Anyway, I consider normalization more useful even considering this issue.

christiankral commented 5 years ago

If we would like to avoid this issue altogether, the only solution is to not support Unicode and stay with ASCII -- or train everybody to a level where they know all kinds of alphabets :)

I am very much in favor of supporting Unicode over keeping ASCII. I guess we can never prevent someone from stumbling into some issue from time to time. Yet we can avoid such issues as far as possible with proper documentation and guidelines.

HansOlsson commented 1 year ago

Push back this one? The normalization issue has not been clarified yet.