Clarify whether length is in bytes or characters

w3c / baggage

Propagation format for distributed context: Baggage

https://w3c.github.io/baggage/


Clarify whether length is in bytes or characters #55

Closed SergeyKanzhelev closed 4 months ago

SergeyKanzhelev commented 3 years ago

See comment: https://github.com/w3c/baggage/pull/52#pullrequestreview-630477059

We declare limits in bytes and say ASCII when we list which symbols are allowed, but if somebody decodes the value into Unicode, we may want to make sure the limits are treated as character limits. This is similar to the note in the cookie spec:

NOTE: Despite its name, the cookie-string is actually a sequence of
   octets, not a sequence of characters.  To convert the cookie-string
   (or components thereof) into a sequence of characters (e.g., for
   presentation to the user), the user agent might wish to try using the
   UTF-8 character encoding [RFC3629] to decode the octet sequence.
   This decoding might fail, however, because not every sequence of
   octets is valid UTF-8.

dyladan commented 3 years ago

What does "if somebody will read it into the Unicode" mean?

There are several words which may be used.

characters - In my opinion, character is ambiguous without clarification. Does it refer to a single visible symbol which may be made up of multiple octets, to a single byte, or to a single octet?
byte - a byte is typically 8 bits, but in some architecture-specific cases it may be some other size
octet - an octet is always 8 bits. This is the least ambiguous word choice in my opinion.

kalyanaj commented 2 years ago

The limits were rewritten in #89, which removes this ambiguity. @SergeyKanzhelev, can you please check whether that addresses your concern above?

kalyanaj commented 1 year ago

Re-assigning per the discussion in the WG meeting.

aphillips commented 5 months ago

Note that the text in #89 uses the term "character" in one location when it probably should stick with saying bytes:

greater than 8192 bytes, some list-members MAY be dropped until the resulting baggage-string is 8192 characters or less.

I agree with @dyladan's comment. Care is needed because the term "character" is overloaded. We (I18N) generally use the specialized term "code point" to refer to Unicode characters.

The limit here is given in bytes (octets). While UTF-8 is a variable-width encoding, the relationship of ASCII to UTF-8 is that any 7-bit ASCII byte is encoded as the same single byte in UTF-8. This means that the length of an ASCII string in bytes is equal to its length in UTF-8 bytes (and to its length in Unicode code points).
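To illustrate the point about lengths, here is a minimal Python sketch. The helper name `lengths` and the sample values are made up for this example and do not come from the spec. For an ASCII-only string the code point count and the UTF-8 byte count agree; once a non-ASCII code point appears, they diverge.

```python
def lengths(text):
    """Return (code point count, UTF-8 byte count) for text."""
    return len(text), len(text.encode("utf-8"))


# Hypothetical sample values, chosen only for illustration.
ascii_value = "userId=alice"
non_ascii_value = "userId=\u00e5lice"  # U+00E5 takes two bytes in UTF-8

ascii_points, ascii_bytes = lengths(ascii_value)
other_points, other_bytes = lengths(non_ascii_value)

print(ascii_points == ascii_bytes)   # ASCII only: the counts match
print(other_points, other_bytes)     # non-ASCII: the byte count is larger
```

So a limit stated in bytes and a limit stated in code points only coincide as long as the value stays within ASCII.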

Decoding a sequence of non-ASCII bytes using UTF-8 would not fail, but might generate replacement characters (U+FFFD) for non-UTF-8 (and therefore non-ASCII) bytes.
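Whether decoding "fails" or produces replacement characters depends on the decoder's error handling, which may explain the difference from the cookie spec note quoted earlier. As a rough sketch in Python (the sample bytes are made up): the default strict mode raises an error, while `errors="replace"` substitutes U+FFFD for each invalid byte.

```python
raw = b"key=\xff\xfe"  # hypothetical input; 0xFF and 0xFE never occur in valid UTF-8

# Lenient decoding: invalid bytes are replaced with U+FFFD.
lenient = raw.decode("utf-8", errors="replace")

# Strict decoding (Python's default): invalid bytes raise an exception.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(lenient))
print(strict_failed)
```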

In any case, it's possible to over-clarify the limit here. Measuring the limit in bytes is specific.
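As a sketch of what measuring the limit in bytes could look like in practice, the following Python function counts UTF-8 bytes rather than characters. The name `trim_baggage`, the whitespace handling, and the choice to drop members from the end are all assumptions made for this example; the quoted text only says that some list-members MAY be dropped and does not say which ones.

```python
MAX_BAGGAGE_BYTES = 8192  # the limit quoted in the text above


def trim_baggage(baggage, limit=MAX_BAGGAGE_BYTES):
    """Drop trailing list-members until the string fits within limit bytes.

    Assumption: members are dropped from the end. The quoted spec text
    does not prescribe which members to drop.
    """
    members = [m.strip() for m in baggage.split(",") if m.strip()]
    # Measure the encoded byte length, not the number of characters.
    while members and len(",".join(members).encode("utf-8")) > limit:
        members.pop()
    return ",".join(members)


print(trim_baggage("a=1,b=2,c=3", limit=7))
```

With a small limit of 7 bytes, the third member is dropped because `a=1,b=2,c=3` is 11 bytes long while `a=1,b=2` is exactly 7.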

dyladan commented 4 months ago

The reference to the word "character" in the limits section was removed in #113. @aphillips @SergeyKanzhelev do you believe the current wording is sufficient to close this discussion?

aphillips commented 4 months ago

The PR #113 looks good to me.

SergeyKanzhelev commented 4 months ago

I think this can be closed