In the version string sections we have (in various places with emphasis added)
In Version String 2 sections
It provides a regular expression target for determining a serialized field map’s serialization format and size (character count) of its enclosing field map.
This length is the total number of characters in the serialization of the field map. The maximum length of a given field map serialization is thereby constrained to be 64^4 = 2^24 = 16,777,216 characters in length.
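As a concrete illustration of the Version String 2 case, here is a minimal sketch (my own, not keripy code) that encodes a serialization size as a fixed-width, 4-character Base64 length field; the URL-safe alphabet and most-significant-digit-first ordering are assumptions on my part.

```python
# Sketch only: assumes the length field is 4 Base64URL characters
# encoding the serialization size, most significant digit first.
B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_"
)

def encode_size_b64(size: int, digits: int = 4) -> str:
    """Encode an integer size as fixed-width Base64 digits."""
    if not 0 <= size < 64 ** digits:   # 64**4 == 2**24 == 16,777,216
        raise ValueError("size out of range for the length field")
    chars = []
    for _ in range(digits):
        size, rem = divmod(size, 64)
        chars.append(B64_ALPHABET[rem])
    return "".join(reversed(chars))

print(encode_size_b64(384))   # 'AAGA'
```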
In Version String 1 sections
This length is the total number of characters in the serialization of the field map. The maximum length of a given field map serialization is thereby constrained to be 16^6 = 2^24 = 16,777,216 characters in length. For example, when the length of serialization is 384 decimal characters/bytes, the length part of the Version String has the value 000180.
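The 384-character example can be checked with the same kind of sketch for Version String 1, where the length field is 6 hexadecimal characters; whether the spec intends lowercase or uppercase hex digits isn't shown above, so lowercase here is an assumption.

```python
def encode_size_hex(size: int, digits: int = 6) -> str:
    """Format an integer size as a fixed-width hexadecimal length field."""
    if not 0 <= size < 16 ** digits:   # 16**6 == 2**24 == 16,777,216
        raise ValueError("size out of range for the length field")
    return format(size, f"0{digits}x")   # lowercase hex is an assumption

print(encode_size_hex(384))   # '000180', matching the example above
```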
In Unicode, regardless of the encoding, bytes make up code points and code points combine to make graphemes. In the UTF-8 encoding scheme (implied for JSON and CBOR, although counting characters in MGPK, which is a binary encoding, doesn't make much sense at all) most code points are 1 byte for 1 grapheme, but code points at or above U+0080 (decimal 128) are encoded as sequences of 2, 3, or 4 bytes. So although most Western text will be 1 byte per character, text in other languages might not be. Consider this Tamil word:
```python
>>> len('வணக்கம')
6
```
This issue will be resolved in the spec with what I think is the keripy approach (and a correct one, as far as I can think it through): these size calculations should be the result of counting the bytes of a fully encoded serialization of the various field maps, rather than the code points or graphemes. This is in accordance with the larger TLV scheme. See the Python Unicode HOWTO for background: https://docs.python.org/3/howto/unicode.html
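To make the byte-versus-code-point distinction concrete, the sketch below contrasts len() on a str (code points) with the length of its UTF-8 encoding (bytes), and then counts the bytes of a fully encoded JSON field map, which is how I read the keripy approach; the field map here is made up purely for illustration.

```python
import json

word = 'வணக்கம'
print(len(word))                   # 6 code points
print(len(word.encode('utf-8')))   # 18 bytes in the UTF-8 encoding

# Hypothetical field map, for illustration only (not a real KERI message)
field_map = {"v": "", "greeting": word}

# The size the length field should carry: bytes of the encoded
# serialization, not code points or graphemes of the str
raw = json.dumps(field_map, separators=(",", ":"), ensure_ascii=False).encode("utf-8")
print(len(raw))                    # byte count of the serialization
```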