modelica / ModelicaStandardLibrary

Free (standard conforming) library to model mechanical (1D/3D), electrical (analog, digital, machines), magnetic, thermal, fluid, control systems and hierarchical state machines. Also numerical functions and functions for strings, files and streams are included.
https://doc.modelica.org
BSD 3-Clause "New" or "Revised" License

String length of UTF-8 strings #3789

Open beutlich opened 3 years ago

beutlich commented 3 years ago

What is the string length of UTF-8 strings? Is it the number of characters or the number of UTF-8 code points?

Example:

model M
  Integer len = Modelica.Utilities.Strings.length("日本語!")
    "Length: Is it number of characters or number of UTF-8 code points?";
end M;

might be converted to

#include <string.h>
#include <wchar.h> // needed for wcslen

int main(void) {
    int len_characters = (int)wcslen(L"\u65e5\u672c\u8a9e\u0021"); // -> 4
    int len_code_points = (int)strlen("\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e\x21"); // -> 10
    return 0;
}
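
For comparison, a small sketch (my addition, not part of the original issue) that counts a third quantity discussed later in this thread: the number of Unicode code points, obtained by skipping UTF-8 continuation bytes (bytes of the form 10xxxxxx). For this particular string the code-point count coincides with the character count (4), while strlen counts UTF-8 code units, i.e. bytes (10).

#include <stdio.h>
#include <string.h>

// Count Unicode code points in a UTF-8 string by skipping continuation bytes.
static size_t utf8_codepoints(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) { // lead byte or ASCII byte
            ++n;
        }
    }
    return n;
}

int main(void) {
    const char *s = "\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e\x21"; // 日本語! in UTF-8
    printf("bytes: %zu\n", strlen(s));                // -> 10
    printf("code points: %zu\n", utf8_codepoints(s)); // -> 4
    return 0;
}
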
HansOlsson commented 3 years ago

What is the string length of UTF-8 strings? Is it the number of characters or the number of UTF-8 code points?

There are more choices: https://www.unicode.org/faq/char_combmark.html#17 indicates that another variant is counting grapheme clusters, where one grapheme नि can be built from two code points (or more). (Confusingly, a "grapheme" is also called a "character".)

That is important if we plan to use the length for sub-string handling; and it is also more consistent with regard to different canonical variants (e.g., this matters to ensure consistent handling of ö regardless of whether it is stored as one code point or as two code points: o followed by the combining diaeresis ̈ , i.e. the joining variant of ¨).
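
To make the canonical-variant point concrete, here is a small sketch (my addition, not from the original comment): the precomposed ö (U+00F6) occupies 2 bytes in UTF-8, while the decomposed spelling o + combining diaeresis (U+0308) occupies 3 bytes, so a byte count (and also a code-point count) differs between two canonically equivalent spellings, whereas a grapheme count would not.

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *precomposed = "\xc3\xb6";  // "ö" as U+00F6 (1 code point, 2 bytes)
    const char *decomposed  = "o\xcc\x88"; // "o" + U+0308  (2 code points, 3 bytes)
    printf("%zu %zu\n", strlen(precomposed), strlen(decomposed)); // -> 2 3
    return 0;
}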

Personally I would prefer that we consistently use one of the extremes: byte count or grapheme count; and since we don't have that much string manipulation, I believe byte count is simplest, provided we document it clearly and keep the possibility of later adding grapheme-count-based handling.

henrikt-ma commented 3 years ago

Yes, byte count seems simplest, and it is crucial that all the string functions operating on or returning "character indices" use the same concept.

I still have conceptual concerns regarding counting bytes instead of something more semantically meaningful for a string of characters, but I don't see a realistic way to introduce such semantics for the existing functions. I fully agree with @HansOlsson that this would require a new set of functions.

What we can do now would be to go through all documentation and rewrite it to clearly operate on bytes rather than characters of a string. In an ideal world, we'd also rename the package Strings to ByteStrings.

Isn't it also a big problem that the character encoding isn't specified for the external language interface? All I can see is that a String is null-terminated in both external "C" and external "FORTRAN 77". That Modelica source code is stored as UTF-8 on file would make UTF-8 the natural choice in the external language interfaces as well, but this isn't specified anywhere as far as I can see.
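
To illustrate what the null-terminated interface looks like on the C side (a hedged sketch with a made-up function name; the encoding of the bytes behind the pointer is exactly what remains unspecified):

#include <string.h>

// Hypothetical external "C" implementation: the Modelica String arrives as a
// null-terminated char pointer; strlen counts bytes up to '\0', whatever the
// (unspecified) encoding of those bytes may be.
int MyPackage_byteLength(const char *s) {
    return (int)strlen(s);
}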

christoff-buerger commented 3 years ago

another variant is counting grapheme clusters, where one grapheme नि can be built from two code points (or more).

That is an important point; graphemes are a kind of annotation to codepoints that could also be standalone glyphs without being annotated. A good example is the mathematical prime grapheme ', which can be used to denote priming of mathematical variables as is typical in mathematical notation, for example a' or a'' or a''' and so on (I didn't copy and paste the actual grapheme -- most tools break at some point w.r.t. graphemes, so just stay away). Thus there are infinitely many combinations. Graphemes are glued to the codepoint they annotate/enhance/modify (use your choice of word to denote that the math prime, for example, makes no sense without a mathematical variable/symbol being primed); graphemes cannot exist as such, they are related to a codepoint they annotate. In principle every codepoint can be annotated this way -- but of course it only makes sense for ones that have a human interpretation, like priming math variables or stressing/emphasizing certain aspects in some natural languages (where, for example, several pronunciations of the same word exist, with different cultural implications; I cannot express this any better since I do not speak any such language ... it is hard for Europeans to grasp the idea).

The point is: it makes no sense to count graphemes as unique entities, because they are not; they are shipped with the codepoint they annotate and do not exist without it. If you really need to talk about them, a good API would be something along the lines of "get me the (number of) graphemes annotated to this codepoint". But this does not change the number of glyphs in the whole string, nor is it related to it (please note, a grapheme is a Unicode codepoint, but not an independently existing one; it amends another codepoint, forming a single glyph)!

I am saying all of this because the following is wrong:

That is important if we plan to use the length for sub-string handling; and it is also more consistent with regard to different canonical variants (e.g., this matters to ensure consistent handling of ö regardless of whether it is stored as one code point or as two code points: o followed by the combining diaeresis ̈ , i.e. the joining variant of ¨).

Graphemes cannot be sub-stringed/split apart from the codepoint they annotate. This is wrong! The codepoint and its graphemes form one logical thing; in the best case one can strip the grapheme off, like unpriming a' to just a, but one certainly cannot split a' into a string a and a string ', because the grapheme ' cannot exist as such (it has no human interpretation, and that is all Unicode -- as a universal encoding for any possible language, i.e., something with a meaning for humans -- is about).

This also means that a codepoint and its related graphemes that happen to look the same as another codepoint (are rendered the same, have a glyph indistinguishable from it) are nevertheless still not the same!

ö is ö and nothing else is ö but ö! Annotating o with the grapheme ¨ is something completely different. Just because it looks the same -- or should I rather say some tools render it the same way, because I don't know what an o with an annotated ¨ is supposed to mean; which language is that? -- doesn't mean it is the same. This is a typical misuse of Unicode. Think of it as kind of the following scenario: "an explorer around the 15th century encounters an unknown civilization; he tries to communicate by testing some phrases he thinks make universal sense; he happens to hit a word that has some meaning in the other language (same pronunciation, same writing, etc.); well, it unfortunately means something different". o plus the grapheme ¨ vs. ö is that case when it comes to a single character/glyph. Just seeing the grapheme ¨ standing around alone, without any useful context, is simply wrong.
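
To make the "standing around alone" problem concrete, a small sketch (my own addition): splitting the decomposed spelling of ö at a code-point boundary is perfectly legal at the byte/code-point level, yet it produces a lone combining diaeresis that has no meaning on its own.

#include <stdio.h>

int main(void) {
    const char decomposed[] = "o\xcc\x88"; // U+006F followed by U+0308
    char base[] = { decomposed[0], '\0' };                // "o"
    char mark[] = { decomposed[1], decomposed[2], '\0' }; // lone U+0308
    // Both pieces are valid UTF-8, but the second one is a combining mark
    // with nothing left to combine with.
    printf("'%s' and '%s'\n", base, mark);
    return 0;
}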

To make a long story short: (1) count bytes and access bytes for a low-level API, and (2) also support counting and accessing codepoints for a high-level API, without any special distinction between graphemes and other ordinary codepoints (count graphemes just like codepoints). These are API functions along the lines of

/*
    Note: For simplicity, I assume strings are \0 terminated.
      If not, the string length also has to be passed as an argument, and the function get_number_of_bytes makes no sense.
      Also note that get and set functions may fail if pos is out of bounds; a proper API should return error codes.
      Likewise, set may fail to reallocate.
*/
#include <stddef.h>
#include <stdint.h>

// (2):
size_t get_number_of_codepoints(
    const uint8_t* string);
uint32_t get_codepoint(
    const uint8_t* string,
    size_t pos);
uint8_t* set_codepoint( // Returns new string (may be newly allocated if reallocation is required)
    uint8_t* string,
    size_t pos,
    uint32_t new_value);
// (1):
size_t get_number_of_bytes(
    const uint8_t* string);
uint8_t get_byte(
    const uint8_t* string,
    size_t pos);
void set_byte(
    uint8_t* string,
    size_t pos,
    uint8_t new_value);
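
As an indication of what this would take, here is a minimal sketch (my addition) of how get_codepoint from the outline above could look for UTF-8 input, assuming the string is valid UTF-8 and pos is in bounds; as noted above, a real implementation should return error codes instead.

#include <stddef.h>
#include <stdint.h>

uint32_t get_codepoint(const uint8_t* string, size_t pos) {
    // Skip pos complete code points (continuation bytes have the form 10xxxxxx).
    while (pos > 0 && *string != 0) {
        ++string;
        if ((*string & 0xC0) != 0x80) {
            --pos; // reached the next lead byte (or the terminating zero)
        }
    }
    // Decode the code point starting at the current lead byte.
    uint8_t b = *string++;
    uint32_t cp;
    int continuation_bytes;
    if (b < 0x80)      { cp = b;        continuation_bytes = 0; }
    else if (b < 0xE0) { cp = b & 0x1F; continuation_bytes = 1; }
    else if (b < 0xF0) { cp = b & 0x0F; continuation_bytes = 2; }
    else               { cp = b & 0x07; continuation_bytes = 3; }
    while (continuation_bytes-- > 0) {
        cp = (cp << 6) | (uint32_t)(*string++ & 0x3F);
    }
    return cp;
}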

Note that this API is kind of critical, since one can set/break strings apart in between an ordinary codepoint and its annotated graphemes. To stay with the above example, we can split a' into a and ', or replace a by some alphabet character that has no meaning when annotated with '.

If you want to go crazy and avoid such breaking off of graphemes, you can instead access codepoints and their annotated graphemes (i.e., whole glyphs) by means of multi-dimensional values for the codepoint and its arbitrary number of graphemes. This API is along the lines of

// (2) for masochists:
size_t get_number_of_glyphs(
    const uint8_t* string); // Returns for any string the same as (2) in the above listing since graphemes do NOT increase the number of glyphs.
void get_glyph(
    const uint8_t* string,
    size_t pos,
    uint32_t* glyph, // Array of codepoint + its graphemes
    size_t* glyph_size); // 1 + number of graphemes
uint8_t* set_glyph( // Returns new string (may be newly allocated if reallocation is required)
    uint8_t* string,
    size_t pos,
    const uint32_t* new_value, // Array of codepoint + its graphemes
    size_t glyph_size); // 1 + number of graphemes

As a last note: (a) I do not know of any algorithm to decide whether a sequence of graphemes annotated to a codepoint is valid -- i.e., has an interpretation in the alphabet the codepoint is part of -- and (b) I do not know of any framework that guarantees it can render any valid combination of graphemes properly. W.r.t. (b) it would, for example, be required to render proper glyphs composed of many different graphemes, for example a math symbol that is primed, underlined and overlined. (a) would be required to decide whether a string is erroneous (because if it has no interpretation in its encoded alphabets, it is nonsense -- if you want to transmit arbitrary data, then use byte arrays, not Unicode).

christoff-buerger commented 3 years ago

I edited my above post to better stress the difference between codepoint and glyph. Graphemes are codepoints; but they are not standalone glyphs; they must be annotated to a codepoint that could be an unmodified standalone glyph, forming a new valid glyph in the annotated codepoint's alphabet.

I can only propose: do just a byte-level API. More is not supported, bad luck -- find some meaning in life :)

henrikt-ma commented 3 years ago

I'm not sure I buy the role of glyphs here, but it doesn't seem to have a major impact on the discussion. I think the important concept we are looking for is a grapheme cluster, as opposed to code points (which are very easy to count and find the boundaries of).

Anyway, to show the relevance of going into these details, I think one also needs to show that the suggested entity to be counted is something that is practically feasible (not requiring things like Unicode normalization or detailed knowledge about all the individual code points). Hoping to find a simple answer, I came across this highly relevant FAQ http://unicode.org/faq/char_combmark.html:

How are characters counted when measuring the length or position of a character in a string?

The grapheme cluster answer to this question contains links to two other pages, where the mere amount of information makes me lose faith in the approach. It makes me think that operating on code points would be the more realistic alternative to operating on the bytes ("code units").

christoff-buerger commented 3 years ago

I'm not sure I buy the role of glyphs here, but it doesn't seem to have a major impact on the discussion. I think the important concept we are looking for is a grapheme cluster, as opposed to code points (which are very easy to count and find the boundaries of).

The glyph is just the correct representation of a grapheme cluster for end-user consumption (typically correct rendering). It is a shorter word for essentially the same thing, emphasizing the main point of grapheme clusters and of Unicode as a whole: encoding something with a meaning in some alphabet. Or, in other words, citing the Unicode spec:

It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.

This means that if you want to support proper string processing -- which means processing text written in some alphabet, obeying the rules of that alphabet -- you must consider glyphs. But that is insane!

Maybe reread what I wrote under that umbrella, because I think it explains a lot. The glyph-based API I proposed as "(2) for masochists" is essentially such string processing considering grapheme clusters.

But in my opinion, the grapheme stuff is completely useless to consider regarding any reasonable effort and what can be done by the MSL or Modelica tools. As I wrote:

(a) I do not know of any algorithm to decide whether a sequence of graphemes annotated to a codepoint is valid -- i.e., has an interpretation in the alphabet the codepoint is part of -- and (b) I do not know of any framework that guarantees it can render any valid combination of graphemes properly. W.r.t. (b) it would, for example, be required to render proper glyphs composed of many different graphemes, for example a math symbol that is primed, underlined and overlined. (a) would be required to decide whether a string is erroneous (because if it has no interpretation in its encoded alphabets, it is nonsense -- if you want to transmit arbitrary data, then use byte arrays, not Unicode).

I think we should support byte- and codepoint-wise access, but just ignore the grapheme blabla and live with that:

Note the bold-face last point: breaking grapheme clusters wrongly will yield garbage -- invalid/unknown glyphs -- which is really very similar to what you get in case of an encoding error in most renderings, since encoding errors are also typically handled just by rewinding until the next valid codepoint in UTF-8/16; such rewinding typically also results in rendered garbage, similar to what you get if you improperly break grapheme clusters.

To explain all of this by an example: the usage of ¨ in this sentence is rubbish, since the purpose of the grapheme ¨ is to modify a codepoint to render a certain glyph (user-perceived character) in some alphabet. But since the text engine of GitHub does not know what to do with a space followed by ¨ (and I assume nobody knows what that's supposed to mean), it just places the ¨ as a separate glyph (although this violates its Unicode-intended purpose). If you would now split the string " ¨" into " " and "¨" and then concatenate the latter to some string ending with a codepoint whose amendment with the grapheme ¨ makes sense (i.e., the GitHub render engine knows what to do), you might be surprised by the result: a single character is formed and you can, for example, no longer mark the ¨ as such, just as you cannot mark the parts of नि (though in GitHub you can delete them step-wise), which is a codepoint with amended graphemes. But I honestly see no difference to an encoding error from a user perspective. Because if we have some, let's say UTF-8, encoding error, we typically get a ? in a diamond shown to denote the unknown string place. If you split that string at the encoding error and concatenate it with other broken strings, a valid UTF-8 encoding can be formed -- and surprise, surprise, something magic is shown. This of course assumes that your tooling does not break just because of a UTF-8 encoding error, as we assume the GitHub render engine does not die just because we write ¨.

henrikt-ma commented 3 years ago

Good, then we want the same two flavors of string indexing; we just arrived at that conclusion in slightly different ways. Your view does appear more masochistic than mine, as I don't see that context-dependent ingredient in the identification of grapheme clusters. As I see it, it is irrelevant whether the application and font at hand are able to render the cluster, or whether a human is able to make sense of it in some given context. Actually, your Unicode spec quote contains the claim I was looking for:

… a grapheme cluster, which can be determined programmatically.

If this claim was backed up by an implementation available to the MSL, I'd even consider (this approximation of) user-perceived characters a competitive alternative to code points. When judging the complexity of the implementation, I think we should also be prepared to allow for more complex solutions as long as they are provided through the MSL. This means that the current separation between the Modelica specification and the MSL when it comes to string handling is a really good thing that we should take care to preserve.

One thing one could actually do on the language side to better support Unicode character strings would be to have separate types for byte strings and character strings. This would add a rather nice degree of type safety; it should be hard – ideally impossible given correctly implemented external functions – to create a String that isn't a sequence of valid grapheme clusters. Another nice effect would be that a ByteString could have a separately stored length, instead of being null-terminated.
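
A rough sketch (my own addition, in C for concreteness, with made-up type names) of how the two types could differ at the external interface level:

#include <stddef.h>
#include <stdint.h>

// Hypothetical character string: guaranteed (by construction through the
// external functions) to be valid UTF-8, null-terminated.
typedef struct {
    const char *utf8;
} CharacterString;

// Hypothetical byte string: arbitrary bytes, embedded zeros allowed,
// with an explicitly stored length instead of null termination.
typedef struct {
    const uint8_t *data;
    size_t length;
} ByteString;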

henrikt-ma commented 3 years ago

By the way, I'd welcome an informed opinion on the choice of encoding for the external language interface:

- What does the choice of encoding mean for the current dual use as both character strings and byte strings?
- Is UTF-8 really the best option, considering that the external functions performing string operations should be able to do their jobs efficiently?
- What does the choice of encoding mean for handling of string literals in the external code?

christoff-buerger commented 3 years ago

What does the choice of encoding mean for the current dual use as both character strings and byte strings?

If you want to encode binary data (byte streams), don't use Unicode. This is a very bad misuse. In general, binary data will violate UTF-X encoding requirements and require marshalling to encode it as valid UTF-8/UTF-16/UTF-32.

Unicode is not meant for binary data. That the UTF-8 encoding scheme for Unicode is a byte-based protocol doesn't mean it is meant to encode arbitrary binary data byte-wise.

Note: Be aware of the difference between Unicode and Unicode Transformation Formats (i.e., technical encoding schemes) -- but the latter are still Unicode protocols, not intended for arbitrary binary data.

Is UTF-8 really the best option, considering that the external functions performing string operations should be able to do their jobs efficiently?

UTF-8 is the most used encoding scheme -- it basically is the standard; other encodings are close to becoming negligible (cf. for example https://w3techs.com/technologies/details/en-utf8).

To go for another encoding is not an option in my opinion -- it's swimming against the ocean, whatever good reasons one might find/there might be.

What does the choice of encoding mean for handling of string literals in the external code?

Considering the previous point, I see no choice.

If you mean the choice of string manipulation API, I think the implications are all already summarised in the API examples I gave above for byte-, codepoint- and glyph-wise APIs.

christoff-buerger commented 3 years ago

Actually, your Unicode spec quote contains the claim I was looking for: "a grapheme cluster, which can be determined programmatically." If this claim was backed up by an implementation available to the MSL, I'd even consider (this approximation of) user-perceived characters a competitive alternative to code points. When judging the complexity of the implementation, I think we should also be prepared to allow for more complex solutions as long as they are provided through the MSL.

As explained above, the issue is a very complex API when it comes to memory management. I don't think the MSL can manage that; I am not even sure the MSL is the place where this can or should be done. At least the requirements on Modelica tools to do good garbage collection of dynamic strings would be very high, and I think Modelica tools are not ready for this. Modelica typically only requires statically sized (statically allocated) entities; the current string processing is already pushing the limits here. To now add the need to take care of even more small, dynamic entities (because every single grapheme cluster access will create a new little dynamic buffer) is pushing the boundaries.

henrikt-ma commented 3 years ago

What does the choice of encoding mean for the current dual use as both character strings and byte strings?

What I concluded myself here was that as long as we have a definition of String being passed as a null-terminated char array over the external language interface, using anything but 1 byte code units would seem like a bad idea. That is, UTF-8 seems like the only alternative as long as String is also used for null-terminated but otherwise arbitrary byte data.

If you want to encode binary data (byte streams), don't use Unicode. This is a very bad misuse. In general, binary data will violate UTF-X encoding requirements and require marshalling to encode it as valid UTF-8/UTF-16/UTF-32.

Unicode is not meant for binary data. That the UTF-8 encoding scheme for Unicode is a byte-based protocol doesn't mean it is meant to encode arbitrary binary data byte-wise.

So you are saying that one shouldn't use String for binary encoded data? Sounds great to me -- if only we didn't have to consider both binary encoded data and Unicode strings shoehorned into the same Modelica type!

henrikt-ma commented 3 years ago

Is UTF-8 really the best option, considering that the external functions performing string operations should be able to do their jobs efficiently?

UTF-8 is the most used encoding scheme -- it basically is the standard; other encodings are close to becoming negligible (cf. for example https://w3techs.com/technologies/details/en-utf8).

I know it's the standard for storing text, but that's not what the Modelica external function interface is about. The external function interface is about the in-memory representation of string data (assuming we're not mixing Unicode strings and binary data in the same type). What I'm thinking of is that a lot of the MSL Strings functions operate on character indices, and if we were to go for the code point indexing alternative, this becomes a constant-time addressing operation using UTF-32, but requires quite inefficient counting from the beginning of the string with UTF-8 or UTF-16. (With grapheme cluster indexing, even UTF-32 doesn't avoid the need to scan from the beginning of the string.)
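
A brief sketch (my addition) of the indexing difference being described: with UTF-32 the i-th code point is a direct array access, while with UTF-8 one has to scan from the start of the string to find where the i-th code point begins.

#include <stddef.h>
#include <stdint.h>

// O(1): UTF-32 stores exactly one code point per 32-bit unit.
uint32_t codepoint_at_utf32(const uint32_t *s, size_t i) {
    return s[i];
}

// O(i): with UTF-8 the byte offset of code point i must be found by scanning.
size_t codepoint_offset_utf8(const uint8_t *s, size_t i) {
    size_t offset = 0;
    while (i > 0 && s[offset] != 0) {
        ++offset;
        if ((s[offset] & 0xC0) != 0x80) { // next lead byte (or terminator) reached
            --i;
        }
    }
    return offset;
}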

The gain in indexing performance would also have to be weighed against the associated costs, like substantially increased memory requirements…

To go for another encoding is not an option in my opinion -- it's swimming against the ocean, whatever good reasons one might find/there might be.

Note that this particular item was focused on the efficiency of the external functions, but you'd still have a valid point here if it would be the case that there are some good UTF-8 processing libraries out there available to us, but no good libraries for other encodings.

When considering the efficiency of the external functions, one should also keep in mind that the answer could depend on whether the internal Modelica environment would use the same encoding or not. Perhaps one shouldn't assume that tools would align their internal representations with the external function interface, and then it would be fair to also add the costs for converting to/from the internal representation when determining the efficiency of the external string processing functions.

In summary, I think UTF-32 might have its advantages, but only under several assumptions that can all be debated:

So far, I can only agree that UTF-8 appears to be the better-sounding option.

henrikt-ma commented 3 years ago

What does the choice of encoding mean for handling of string literals in the external code?

Considering the previous point, I see no choice.

What I had in mind was the availability of things like the C++11 char32_t string literals: https://en.cppreference.com/w/cpp/language/string_literal

I am not aware of something similar for C, which would be another strong argument for sticking to UTF-8 (designed to work in this kind of legacy context).

henrikt-ma commented 3 years ago

If you mean the choice of string manipulation API, I think the implications are all already summarised in the API examples I gave above for byte-, codepoint- and glyph-wise APIs.

Actually, your Unicode spec quote contains the claim I was looking for: "a grapheme cluster, which can be determined programmatically." If this claim was backed up by an implementation available to the MSL, I'd even consider (this approximation of) user-perceived characters a competitive alternative to code points. When judging the complexity of the implementation, I think we should also be prepared to allow for more complex solutions as long as they are provided through the MSL.

As explained above, the issue is a very complex API when it comes to memory management. I don't think the MSL can manage that; I am not even sure the MSL is the place where this can or should be done. At least the requirements on Modelica tools to do good garbage collection of dynamic strings would be very high, and I think Modelica tools are not ready for this. Modelica typically only requires statically sized (statically allocated) entities; the current string processing is already pushing the limits here. To now add the need to take care of even more small, dynamic entities (because every single grapheme cluster access will create a new little dynamic buffer) is pushing the boundaries.

Aren't you assuming a destructive string API here? For example, there is no function for destructively replacing one character by another. More importantly, since strings passed to external functions won't be modified, and any new strings will be constructed from scratch, why would one need a more complicated memory model than what we have today?

HansOlsson commented 3 years ago

UTF-8 is the most used encoding scheme -- it basically is the standard; other encodings are close to becoming negligible (cf. for example https://w3techs.com/technologies/details/en-utf8).

And to clarify: after UTF-8 you find two variants of ISO-8859-1 and the Windows code page for Cyrillic, and at 0.2% a character encoding for Chinese; so there are basically no general character encoding alternatives to UTF-8.

However, to me the conclusion is clear: use UTF-8, and return the length in bytes, as the alternatives would require too many decisions and do not provide a clear benefit for realistic use-cases.

(Missed negation.)

henrikt-ma commented 2 years ago

It's interesting that we've had all this discussion from the MSL perspective without even noticing that all of this is a non-issue as long as Modelica String values are restricted to ASCII. This is why https://github.com/modelica/ModelicaSpecification/pull/3068 should be merged now even though we're trying to lift the restriction to ASCII in https://github.com/modelica/ModelicaSpecification/pull/3079.

Thanks to @beutlich for pointing out the connection.

henrikt-ma commented 2 years ago

It seems like we should pick up this discussion again, but this time with the understanding that we're discussing a potential future solution for a situation where Modelica String values contain Unicode data. Even though the character encoding used in the external language interface is ultimately a language design responsibility, it seems rather clear in this case that it is the current discussion about how to provide good string handling functions in the MSL that should drive the language specification, not the other way around.

HansOlsson commented 2 years ago

That isn't clear, and Unicode-strings are already sneaking into models - since models depend on external resources and those external resources may be in directories with Unicode-characters. That is important and has already happened, and we should try to stop progress that has already happened. Sub-string handling in models is less important.

henrikt-ma commented 2 years ago

I'm sure you mean we shouldn't try.

henrikt-ma commented 2 years ago

I want Unicode too, but I think we need to be careful about how it is introduced. We have a golden opportunity now to make things right, thanks to only having allowed ASCII so far.

While using UTF-8 encoding and handling data as null-terminated byte sequences might serve as a quick fix for dealing with non-ASCII file paths in loadResource etc, we should at least be careful not to make this the normal way of operating on String values. This means that the byte-oriented functions shouldn't have the same names as the current string functions.

Considering how hard it has been to agree on how to count the length of a string, maybe every position-related string function should be explicit about what unit of counting it is using, meaning that there shouldn't be a Modelica.Utilities.Strings.length at all, but only clearly named variants such as Modelica.Utilities.Strings.Byte.length and Modelica.Utilities.Strings.Codepoint.length.

Then, if we find out later that counting graphemes would be better than counting code points, we can introduce Modelica.Utilities.Strings.Grapheme.length instead of changing the semantics of one of the existing functions.

The following structure of Modelica.Utilities.Strings would then emerge:

Strings
├─ isEmpty
├─ isEqual
├─ repeat
├─ hashString
├─ Byte
│  ├─ length
│  ├─ substring
│  ├─ compare
│  ├─ count
│  ├─ find
│  ├─ findLast
│  ├─ replace
│  └─ sort
├─ Codepoint
│  ├─ length
│  ├─ substring
│  └─ …
└─ Scanning
   ├─ scanToken
   ├─ …
   └─ Advanced
      ├─ scanReal
      └─ …

In a first release, the Codepoint package wouldn't need to be feature complete, but the important thing would be to mark that the user has to make a choice of which family of string functions to use. When a choice has been made, the user can then use an import clause to select that family of functions so that, for example, Modelica.Utilities.Strings.Byte doesn't have to be repeated all over the code.