Consider creating a rule regarding rejecting invalid surrogate pairs

w3c / bp-i18n-specdev

Internationalization Best Practices for Spec Developers

https://w3c.github.io/bp-i18n-specdev/


Consider creating a rule regarding rejecting invalid surrogate pairs #10

Closed stevenatkin closed 2 years ago

stevenatkin commented 8 years ago

I did not seem to find any rule about rejecting invalid surrogate pairs. We have a rule that says you should accept surrogates, but nothing that talks about whether one should reject malformed surrogates.

aphillips commented 8 years ago

I'm not sure what this refers to? #char_surrogate refers to disallowing unpaired surrogate code points. I'm not sure what an "invalid surrogate pair" is in this context (other than unpaired)??

stevenatkin commented 8 years ago

I was referring to unpaired surrogates.

r12a commented 8 years ago

actually, http://w3c.github.io/bp-i18n-specdev/#char_surrogate is saying exactly that.

if you follow the more link to the character model, it says:

Unicode contains some code points for internal use (such as noncharacters) or special functions (such as surrogate code points).

we could, of course, make the wording clearer, if needed, rather than just using the charmod wording.

stevenatkin commented 8 years ago

Maybe we can simply add a few words to make it clearer.

r12a commented 2 years ago

[from Addison]

The requested rule already existed, but there was no text provided to explain it. I added the note [4] shown here:

r12a commented 2 years ago

Suggestion:

A "surrogate code point" refers here to the use of code points in the range U+D800 through U+DFFF, inclusive. These code points only exist to allow the UTF-16 encoding to address supplementary characters, and are always used in pairs. A single surrogate code point is referred to as an "unpaired surrogate" and should never be used.

I'm not sure it needs to be in a note. It's just an explanation like many others of a piece of mustard.

I think it would also improve understanding (since the explanation is not always alongside the mustard) to change the guideline to say:

Specifications MUST NOT allow the use of unpaired surrogate code points.
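
As a rough illustration of what enforcing this guideline might look like in an implementation, here is a minimal Python sketch. The function name `contains_surrogate_code_point` is hypothetical and not taken from the spec. It assumes the input is a Python `str`, which is a sequence of code points, so any surrogate found in it is necessarily unpaired.

```python
def contains_surrogate_code_point(text: str) -> bool:
    """Return True if any code point falls in the surrogate range U+D800..U+DFFF."""
    # A Python str holds code points, not UTF-16 code units, so a properly
    # encoded supplementary character never shows up here as two surrogates.
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def validate(text: str) -> str:
    """Reject input containing surrogate code points (hypothetical policy)."""
    if contains_surrogate_code_point(text):
        raise ValueError("input contains an unpaired surrogate code point")
    return text
```

A supplementary character such as U+1F600 passes this check, while a string containing a lone U+D800 is rejected.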

aphillips commented 2 years ago

I removed the "note" marker.

I think your edits make the text better, but I wanted to clarify the code points vs. code units thing here (i.e. we don't mean to ban UTF-16). Perhaps:

A "surrogate code point" refers here to the use of character values in the range U+D800 through U+DFFF inclusive. These code points are reserved to allow the UTF-16 character encoding to address supplementary characters. Surrogates are always used in pairs and only appear when the UTF-16 encoding is being used. A single surrogate code point is referred to as an "unpaired surrogate" and should never be used.
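
To make the code point vs. code unit distinction concrete, here is a small Python sketch that scans a sequence of UTF-16 code units and reports which ones are unpaired surrogates. The function name `find_unpaired_surrogates` and the list-of-integers input are assumptions made for illustration; a well-formed high/low pair is accepted, which is the sense in which UTF-16 itself is not banned.

```python
from typing import List

HIGH_START, HIGH_END = 0xD800, 0xDBFF  # high (leading) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF    # low (trailing) surrogates


def find_unpaired_surrogates(code_units: List[int]) -> List[int]:
    """Return the indexes of UTF-16 code units that are unpaired surrogates."""
    unpaired = []
    i = 0
    n = len(code_units)
    while i < n:
        unit = code_units[i]
        if HIGH_START <= unit <= HIGH_END:
            # A high surrogate is valid only when a low surrogate follows it.
            if i + 1 < n and LOW_START <= code_units[i + 1] <= LOW_END:
                i += 2  # skip the whole well-formed pair
                continue
            unpaired.append(i)
        elif LOW_START <= unit <= LOW_END:
            # A low surrogate not consumed as part of a pair is unpaired.
            unpaired.append(i)
        i += 1
    return unpaired
```

For example, the pair `[0xD83D, 0xDE00]` (U+1F600 in UTF-16) yields no unpaired surrogates, while `[0x41, 0xD83D, 0x42]` reports index 1.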

r12a commented 2 years ago

works for me

aphillips commented 2 years ago

Fixed