sg16-unicode / sg16

SG16 overview and general information

Deprecate `std::ctype`, `std::ctype_byname`, `std::isupper()`, and `std::toupper()` #2

Open tahonermann opened 6 years ago

tahonermann commented 6 years ago

The standard library specifies a number of interfaces that cannot be made to work reasonably well for Unicode. For example, from <locale>:

Such interfaces are candidates for deprecation, replacement, and eventual removal.

cubbimew commented 6 years ago

To be fair, isupper could be trivially implemented as a test for Unicode's General_Category Lu. Of course, other C/POSIX character classes don't map to Unicode categories that well. There is ISO TR 30112:2014 (draft), which defines what POSIX classes and conversions should do for every Unicode code point, but I agree it isn't what a forward-looking library spec should be considering. I'd like a ctype (or a replacement code point classifier) that can tell me whether a code point has General_Category Cc rather than whether it is "cntrl as interpreted by TR 30112" (which doesn't actually match Cc).

rmartinho commented 6 years ago

General_Category is the wrong property, I think. Maybe it's ok for Cc (if what you want to test really is C0&C1 control characters), but definitely wrong for isupper. isupper should check Uppercase, which doesn't match gc. Always doubt yourself when you think what you need is General_Category.
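To make the distinction concrete, here is a minimal sketch that contrasts a General_Category test with the Uppercase property. The table is a hand-written excerpt for illustration; the names `CodePointInfo`, `kSample`, `is_gc_lu`, and `has_uppercase_property` are hypothetical, and a real implementation would generate its data from the Unicode Character Database.

```cpp
#include <cassert>

// Illustrative subset of General_Category values.
enum class GeneralCategory { Lu, Ll, Nl, So, Other };

struct CodePointInfo {
    char32_t cp;
    GeneralCategory gc;
    bool uppercase;  // the derived Uppercase property
};

// Hand-written excerpt (assumption: illustration only, not generated data).
constexpr CodePointInfo kSample[] = {
    {0x0041, GeneralCategory::Lu, true},   // A
    {0x0061, GeneralCategory::Ll, false},  // a
    {0x0393, GeneralCategory::Lu, true},   // GREEK CAPITAL LETTER GAMMA
    {0x2160, GeneralCategory::Nl, true},   // ROMAN NUMERAL ONE
    {0x24B6, GeneralCategory::So, true},   // CIRCLED LATIN CAPITAL LETTER A
};

// Returns the table entry for cp, or nullptr if cp is not in the excerpt.
inline const CodePointInfo* lookup(char32_t cp) {
    for (const auto& entry : kSample) {
        if (entry.cp == cp) return &entry;
    }
    return nullptr;
}

inline bool is_gc_lu(char32_t cp) {
    const CodePointInfo* e = lookup(cp);
    return e != nullptr && e->gc == GeneralCategory::Lu;
}

inline bool has_uppercase_property(char32_t cp) {
    const CodePointInfo* e = lookup(cp);
    return e != nullptr && e->uppercase;
}

inline void demo_gc_vs_uppercase() {
    // The circled capital A is uppercase but is not in category Lu.
    assert(has_uppercase_property(0x24B6));
    assert(!is_gc_lu(0x24B6));
}
```

For the two letters A and Γ both tests agree; they diverge for the Roman numeral and the circled letter.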

cubbimew commented 6 years ago

Fair point, @rmartinho: TR 30112's definition of isupper includes non-letters with a case, such as Ⓐ. Anyway, to make my comment clearer:

  1. it may be argued that a definition of those things in Unicode terms exists (will exist if TR becomes IS)
  2. a ctype/isxyz/toxyz replacement would be something that checks the category, and, as pointed out, other (all?) character properties that can be defined for a code point

dimztimz commented 6 years ago

Can't deprecate this, it's used by iostreams.

tahonermann commented 6 years ago

> Can't deprecate this, it's used by iostreams.

I'm not sure what you are referring to by "this", but deprecation is not removal. We can deprecate features that are still in use.

tahonermann commented 6 years ago

Changed title to limit scope. Focus on issues currently identified and described in this issue.

cor3ntin commented 5 years ago

Here is a potential replacement: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1628r0.pdf

Note that this is low level (which doesn't mean it shouldn't be provided, as it is useful/necessary for lexers among other things), but in general Unicode recommends that this kind of thing be done on strings rather than code points, both in locale-independent and tailored fashion.
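A small sketch of why case conversion is better expressed on strings than on single code points: the full uppercase mapping of U+00DF (ß) is the two-character sequence "SS", which a function returning one code point cannot produce. The function name `to_upper_sketch` is hypothetical, and it only handles ASCII letters plus this one special case.

```cpp
#include <cassert>
#include <string>

// Toy string-level uppercasing (assumption: ASCII plus one special case only).
inline std::u32string to_upper_sketch(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        if (cp == 0x00DF) {
            out += U"SS";  // one code point expands to two
        } else if (cp >= U'a' && cp <= U'z') {
            out += static_cast<char32_t>(cp - U'a' + U'A');
        } else {
            out += cp;
        }
    }
    return out;
}

inline void demo_string_casing() {
    // The result is one code point longer than the input.
    assert(to_upper_sketch(U"stra\u00DFe") == U"STRASSE");
}
```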

Exhaustive list of functions to deprecate

| Name | Description |
| --- | --- |
| isspace(std::locale) | checks if a character is classified as whitespace by a locale (function template) |
| isblank(std::locale) (C++11) | checks if a character is classified as a blank character by a locale (function template) |
| iscntrl(std::locale) | checks if a character is classified as a control character by a locale (function template) |
| isupper(std::locale) | checks if a character is classified as uppercase by a locale (function template) |
| islower(std::locale) | checks if a character is classified as lowercase by a locale (function template) |
| isalpha(std::locale) | checks if a character is classified as alphabetic by a locale (function template) |
| isdigit(std::locale) | checks if a character is classified as a digit by a locale (function template) |
| ispunct(std::locale) | checks if a character is classified as punctuation by a locale (function template) |
| isxdigit(std::locale) | checks if a character is classified as a hexadecimal digit by a locale (function template) |
| isalnum(std::locale) | checks if a character is classified as alphanumeric by a locale (function template) |
| isprint(std::locale) | checks if a character is classified as printable by a locale (function template) |
| isgraph(std::locale) | checks if a character is classified as graphical by a locale (function template) |
| toupper(std::locale) | converts a character to uppercase using the ctype facet of a locale (function template) |
| tolower(std::locale) | converts a character to lowercase using the ctype facet of a locale (function template) |

tahonermann commented 5 years ago

> Exhaustive list of functions to deprecate

Just the variants that take a std::locale argument? I think we want to deprecate them all, but the other variants are defined by C. Deprecating them will require specifying suitable replacements for both C and C++.
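For reference, a short sketch of the two families being discussed: the C-derived function from `<cctype>`, which consults the global C locale, and the `<locale>` function template, which takes an explicit `std::locale`. The wrapper names `is_upper_c` and `is_upper_loc` are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <locale>

// C-derived form: the argument must be representable as unsigned char (or be
// EOF), hence the cast before the call.
inline bool is_upper_c(char c) {
    return std::isupper(static_cast<unsigned char>(c)) != 0;
}

// <locale> form: a template over the character type that uses the ctype
// facet of the locale passed in.
template <class CharT>
bool is_upper_loc(CharT c, const std::locale& loc) {
    return std::isupper(c, loc);
}

inline void demo_two_families() {
    assert(is_upper_c('A'));
    assert(is_upper_loc('A', std::locale::classic()));
    assert(is_upper_loc(L'A', std::locale::classic()));
}
```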

cor3ntin commented 5 years ago

> Just the variants that take a std::locale argument? I think we want to deprecate them all, but the other variants are defined by C. Deprecating them will require specifying suitable replacements for both C and C++.

Good question. An alternative might be to add deprecated (or deleted, though that might be a hard sell?) overloads for char8_t, char16_t, and char32_t.
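A minimal sketch of the deleted-overload idea, written against a hypothetical wrapper `sketch::is_upper` rather than the standard functions themselves. The trait `is_upper_callable` is also hypothetical and exists only to show that the Unicode character types are rejected at compile time.

```cpp
#include <cassert>
#include <cctype>
#include <type_traits>
#include <utility>

namespace sketch {

// Narrow-character classification, forwarding to the C function.
inline bool is_upper(char c) {
    return std::isupper(static_cast<unsigned char>(c)) != 0;
}

// Deleted overloads: calls with Unicode character types fail to compile
// instead of silently converting to another character type.
bool is_upper(char16_t) = delete;
bool is_upper(char32_t) = delete;
#ifdef __cpp_char8_t
bool is_upper(char8_t) = delete;
#endif

}  // namespace sketch

// Detects whether sketch::is_upper(T) is a well-formed call.
template <class T, class = void>
struct is_upper_callable : std::false_type {};

template <class T>
struct is_upper_callable<
    T, decltype(void(sketch::is_upper(std::declval<T>())))> : std::true_type {};

inline void demo_deleted_overloads() {
    assert(sketch::is_upper('A'));
    static_assert(is_upper_callable<char>::value, "char is accepted");
    static_assert(!is_upper_callable<char32_t>::value, "char32_t is rejected");
}
```

Marking the overloads `[[deprecated]]` instead of deleting them would produce a warning rather than a hard error.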

tahonermann commented 5 years ago

> An alternative might be to add deprecated (or deleted, though that might be a hard sell?) overloads for char8_t, char16_t, and char32_t.

I think we should focus more on what an appropriate C replacement would look like first.

cor3ntin commented 5 years ago

Would C be interested in supporting Unicode character properties?

tahonermann commented 5 years ago

> Would C be interested in supporting Unicode character properties?

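No idea. My guess is that they would require any replacements to work with (wide) execution encoding and thus existing (non-Unicode) encodings. I see the motivation for replacement being:

  • improved error handling; no EOF value handling.
  • no UB on values not representable in unsigned char.
  • not code unit value based so that variable length encodings can be supported.

The point is more that we can’t deprecate these (in C) without replacements (in C).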
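Two of the problems with the existing C interface can be sketched briefly. Passing a negative `char` to `std::isupper` is undefined behavior, so callers must cast to `unsigned char` first, and because classification works per code unit, a multi-byte character is never seen as a whole. The helper names `safe_isupper` and `count_upper_code_units` are hypothetical, and the expectations in the demo assume the default "C" locale.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Avoids undefined behavior when char is signed and c is negative.
inline bool safe_isupper(char c) {
    return std::isupper(static_cast<unsigned char>(c)) != 0;
}

// Classifies each code unit on its own; there is no notion of a character
// that spans several code units.
inline std::size_t count_upper_code_units(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if (safe_isupper(c)) ++n;
    }
    return n;
}

inline void demo_code_unit_limits() {
    assert(count_upper_code_units("AbC") == 2);
    // UTF-8 encoding of GREEK CAPITAL LETTER GAMMA: neither byte is
    // classified as uppercase in the "C" locale.
    assert(count_upper_code_units("\xCE\x93") == 0);
}
```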

cor3ntin commented 5 years ago

The thing is - I'm pretty sure Unicode character properties are NOT a replacement.

Unicode character properties should NOT be locale dependent in any way; cp_isupper(U'Γ') should always be true, regardless of the execution encoding, platform, etc.

Ignoring the fact that isupper(foo) (for example) does not support anything but the first 255 values of a given character set, a negative answer means either

  • foo is not an uppercase letter
  • foo is not part of this non-Unicode character set

Is that useful information? Is a replacement useful? If it is, we still need two APIs, and maybe we can fix the existing one by fixing your second and third bullet points.

On Sat, 3 Aug 2019 at 16:56, Tom Honermann notifications@github.com wrote:

Would C be interested in supporting unicode character properties?

No idea. My guess is that they would require any replacements to work with (wide) execution encoding and thus existing (non-Unicode) encodings. I see the motivation for replacement being:

  • improved error handling; no EOF value handling.
  • no UB on values not representable in unsigned char.
  • not code unit value based so that variable length encodings can be supported.

The point is more that we can’t deprecate these (in C) without replacements (in C).


tahonermann commented 5 years ago

> The thing is - I'm pretty sure Unicode character properties are NOT a replacement.

I agree. They might be used in the implementation of a replacement though.

> Unicode character properties should NOT be locale dependent in any way; cp_isupper(U'Γ') should always be true, regardless of the execution encoding, platform, etc.

I agree, but the provided example is specifically passing a Unicode code point, so I don't think anyone would expect a locale dependency (this is not true for case mapping algorithms in general, but is for Unicode code point properties).

> Ignoring the fact that isupper(foo) (for example) does not support anything but the first 255 values of a given character set

Technically, it supports all values that fit in a value of unsigned char (which is usually 8-bit in practice).

> a negative answer means either
>
>   • foo is not an uppercase letter
>   • foo is not part of this non-Unicode character set

Or foo isn't a code point at all (e.g., a trailing code unit value).

A code point based interface would solve all three of the bullet points I listed. (I would be fine with passing an invalid code point, errm, scalar value being a precondition violation; long live Contracts 2.0!)
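A rough sketch of what such a code point based interface could look like, with the scalar value requirement expressed as a precondition. `cp_isupper` is the hypothetical name used earlier in this thread; the classifier body is a toy that knows only ASCII capitals and Γ, and `assert` stands in for a contract check.

```cpp
#include <cassert>

// A Unicode scalar value is any code point except the surrogate range.
constexpr bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Toy classifier (assumption: real data would come from the Unicode
// Character Database). The result does not depend on any locale.
inline bool cp_isupper(char32_t cp) {
    assert(is_scalar_value(cp) && "precondition: cp is a Unicode scalar value");
    return (cp >= U'A' && cp <= U'Z') || cp == 0x0393;
}

inline void demo_code_point_interface() {
    assert(cp_isupper(U'\u0393'));
    assert(!cp_isupper(U'a'));
    assert(!is_scalar_value(static_cast<char32_t>(0xD800)));
}
```

Because the argument is a whole code point, there is no EOF sentinel, no dependence on `unsigned char`, and no ambiguity about trailing code units.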