sg16-unicode / sg16

SG16 overview and general information

WG21 P1949: Improve support for Unicode characters in identifiers #48

Closed tahonermann closed 2 years ago

tahonermann commented 5 years ago

JF raised this issue on the SG16 mailing list.

Briefly, the standard allows Unicode characters outside the basic source character set to be used in identifiers, as specified by [lex.name]p1. The standard does not provide a rationale for the ranges of allowed characters that it specifies. It is likely that the specified ranges are not being maintained as new characters are added in new Unicode releases.

The Unicode Consortium has published UAX#31, a technical report covering naming of identifiers. This document may give the C++ standard a better basis for deciding which Unicode characters outside the basic source character set are allowed in identifier names.

tahonermann commented 5 years ago

cppreference.com has a more informative list of the ranges of allowed characters in identifiers.

ThePhD commented 5 years ago

WG14 paper: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm

strega-nil commented 5 years ago

It's unlikely to happen now, but if at all possible it'd be really good to NFC identifiers.

cor3ntin commented 5 years ago

@ubsan agreed, it is strongly encouraged by UAX#31

However, it is putting the cart before the horse. We only have Unicode identifiers portably if the physical character set is able to represent all Unicode code points. So before any improvement can be made in this area we need a way to ensure the compiler will treat the file in which such an identifier is used as utf-8 (or some other sensible Unicode encoding, such as utf8).

tahonermann commented 5 years ago

We only have Unicode identifier portably if the physical character set is able to represent all Unicode code-points.

That is not strictly correct as identifiers can contain \u1234 escape sequences. I will not comment on the utility of such escape sequences in identifiers other than to say I've never used one outside of a test :)
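A minimal sketch of what such an escape looks like in practice, assuming a compiler that accepts universal-character-names in identifiers (recent GCC, Clang and MSVC releases do). The names `caf\u00E9` and `call_via_ucn` are made up for illustration. Because the identifier is spelled with `\u00E9` rather than a literal é, the file stays pure ASCII and its meaning does not depend on how the compiler decodes the source file.

```cpp
// Hypothetical example: an identifier containing U+00E9 (é),
// written as a universal-character-name so the source is ASCII-only.
int caf\u00E9() {
    return 42;
}

// The same identifier, spelled the same way, refers to the same function.
int call_via_ucn() {
    return caf\u00E9();
}
```

This is portable in the sense discussed here, but hardly readable, which is presumably why such escapes rarely appear outside of tests.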

So before any improvement can be made in this area we need a way to ensure the compiler will treat the file in which such identifier is used as utf-8 (or some other sensible Unicode encoding, such as utf8)

I don't agree with this conclusion. The standard is clear regarding how physical source file characters are mapped to the compiler's internal encoding. Source files are portable so long as the compilers used with them 1) support the actual source file encoding, and 2) are correctly informed about the source file encoding. In my opinion, it is that latter case that we need to improve.

cor3ntin commented 5 years ago

Source files are portable so long as the compilers used with them 1) support the actual source file encoding

That's the definition of not portable

Interestingly, Microsoft solves that particular problem by always parsing identifiers as utf8 regardless of the actual encoding of the file. That falls apart if you add reflection to the mix. At that point identifiers are text and the conversion needs to be deterministic and lossless.

The standard is not clear. It is completely implementation-defined, aka not portable. Agreed about 2), but having to specify utf8 is a terrible default.

As a data point, I learned today that vcpkg builds every package on Windows with /utf8.

tahonermann commented 5 years ago

That's the definition of not portable

You’ll have to walk me through to that conclusion.

Interestingly Microsoft solves that particular problem by always parsing identifiers as utf8 regardless of the actual encoding of the file.

I’m not sure what you mean by that. Perhaps you mean that identifiers are transcoded from the source file encoding to UTF-8 and then used in that form? Microsoft uses UTF-8 as the internal encoding, so that doesn’t seem surprising.

The standard is not clear.

What isn’t clear?

but having to specify utf8 it's a terrible default.

I don’t disagree, but that doesn’t make it the wrong choice from a backward compatibility and migration perspective.

As a data point, I learned today that vcpkg builds every package on Windows with /utf8.

Yes, I’ve discussed this with Robert previously. If I recall, he had done some scans and found little use of non-ASCII characters. I don’t find that at all surprising within the Windows ecosystem though since the default source file encoding for the Microsoft compiler is locale sensitive. Programmers on Windows that distribute source files have never been able to assume an encoding other than ASCII (and even that breaks with Shift-JIS). I don’t think the vcpkg experience generalizes particularly well.

cor3ntin commented 5 years ago

I’m not sure what you mean by that. Perhaps you mean that identifiers are transcoded from the source file encoding to UTF-8 and then used in that form? Microsoft uses UTF-8 as the internal encoding, so that doesn’t seem surprising.

no, they are NOT transcoded, the sequence of bytes making the identifier seems not to be parsed using the same encoding as the rest of the file

example provided by @ubsan https://gcc.godbolt.org/z/O0309o

cor3ntin commented 5 years ago

That's the definition of not portable

What isn’t clear?

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set

That works in a pre-internet, single-platform environment. I cannot trust that. Compilers do not interpret source in a consistent fashion.

[And as you mentioned that forces people to live in an ASCII only world - solution currently is to compile everything with /utf8]

cor3ntin commented 5 years ago

If we want Unicode identifiers, notwithstanding escape sequences, we need to ensure that:

[é is an example, I'm not suggesting that it should be a valid variable name, I haven't studied UAX#31 enough yet]

Note that this presents an interesting issue: in the TS, name_of is specified to return an NTBS in the execution encoding.

cor3ntin commented 5 years ago

WG14 paper: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm

Interesting paper but this bit

I think the C and C++ standards should be silent on this whole topic. An implementer should be able to decide whether his implementation should normalize or not, and if so which normalization form should be used, based on his understanding of the needs of his customers. The implication of that would be that users should never name different things using identifiers that would normalize to the same string, nor attempt to reference something using anything but its exact name (for example, by using a name that would normalize to the same string as the original name)

is a deal breaker for me. This makes Unicode identifiers unusable with reflection, ABI, etc. https://unicode.org/reports/tr31/#normalization_and_case

I'm not saying people should start putting non ASCII identifiers in their interfaces but if we want to give that ability, it needs to be reliable

tahonermann commented 5 years ago

no, they are NOT transcoded, the sequence of bytes making the identifier seems not to be parsed using the same encoding as the rest of the file

I think this conclusion is incorrect. I think what you are seeing is typical encoding confusion. In the example you provided, UTF-8 source code is being provided to the compiler, but the compiler is being told to interpret it as Windows 1252. The character in question, 🚙 (U+1F699 RECREATIONAL VEHICLE) has a UTF-8 representation of F0 9F 9A 99. In Windows 1252, this corresponds to "ðŸš™" (U+00F0, U+0178, U+0161, U+2122). Microsoft's documentation for allowed identifiers (https://docs.microsoft.com/en-us/cpp/cpp/identifiers-cpp?view=vs-2019) lists which Unicode code points are allowed. If you cross check that list with the Unicode code points for those characters, you'll see that each one is allowed in identifiers. As for Godbolt then displaying the original Unicode character in the disassembly window, I believe that is technically a bug in Godbolt. The disassembly output is very likely Windows 1252, but is being interpreted as UTF-8.
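The byte-level claim above can be checked with a short sketch. It encodes U+1F699 as UTF-8 and then reads each byte back as Windows 1252. The function names (`encode_utf8`, `cp1252_to_unicode`, `misread_as_cp1252`) are hypothetical helpers written for this illustration, not part of any compiler or library, and the five byte values that Windows 1252 leaves undefined are simply mapped to the corresponding C1 control code points here.

```cpp
#include <string>
#include <vector>

// Encode one Unicode scalar value as UTF-8 (no validation; sketch only).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Map a single Windows 1252 byte to its Unicode code point.
// Bytes outside 0x80..0x9F coincide with Latin-1 / Unicode.
char32_t cp1252_to_unicode(unsigned char byte) {
    static const char32_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};
    if (byte >= 0x80 && byte <= 0x9F) {
        return high[byte - 0x80];
    }
    return byte;
}

// Interpret every byte of a string as one Windows 1252 character.
std::vector<char32_t> misread_as_cp1252(const std::string& bytes) {
    std::vector<char32_t> out;
    for (unsigned char b : bytes) {
        out.push_back(cp1252_to_unicode(b));
    }
    return out;
}
```

Calling `misread_as_cp1252(encode_utf8(0x1F699))` yields the four code points U+00F0, U+0178, U+0161, U+2122, matching the sequence listed above.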

tahonermann commented 5 years ago

That works in a pre-internet, single-platform environment. I cannot trust that. Compilers do not interpret source in a consistent fashion.

I don't see how that is relevant. The claim I made is that "Source files are portable so long as the compilers used with them 1) support the actual source file encoding, and 2) are correctly informed about the source file encoding". All compilers don't have to have the same default behavior for source files to be portable.

[And as you mentioned that forces people to live in an ASCII only world - solution currently is to compile everything with /utf8]

Please don't tell people to use /utf-8. Tell them to use /source-charset:utf-8. Otherwise, their literals will be incorrectly encoded for the run-time execution encoding. I do think programmers should be using /source-charset:utf-8 if they are using the Microsoft compiler and don't have explicit reasons not to use it, but they should not be using /utf-8!
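The distinction matters because `/utf-8` is shorthand for setting both `/source-charset:utf-8` and `/execution-charset:utf-8`, so it changes the bytes stored in ordinary string literals as well as how the source is read. A small sketch of the observable effect follows. The constant names are hypothetical, and the sizes mentioned for the ordinary literal are what one would expect under the stated execution encodings rather than a guarantee.

```cpp
#include <cstddef>

// A u8 literal is always UTF-8: U+00E9 is C3 A9, plus the terminating NUL.
constexpr std::size_t utf8_literal_size = sizeof(u8"\u00E9");

// An ordinary literal is encoded in the execution character set.
// Expected: 3 with a UTF-8 execution charset, 2 with Windows 1252
// (a single byte E9 plus the terminating NUL).
constexpr std::size_t narrow_literal_size = sizeof("\u00E9");
```

Code that passes ordinary literals to APIs expecting the system's run-time encoding would therefore see different bytes after switching from `/source-charset:utf-8` to `/utf-8`.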

tahonermann commented 5 years ago

If we want Unicode identifiers, not withstanding escape sequence we need to ensure that:

These are ABI issues and outside our purview.

struct é; static_assert(is_same_v<unqualid(u8"é"), é>);

should be a valid program (unqualid is a utility that transforms a string into an identifier, part of the ongoing metaclasses work)

It isn't at all clear to me that unqualid should accept a u8 string.

Note that this presents an interesting issue: name_of is in the ts specified to return a NTBS in the execution encoding

I think that is probably what is desired almost all of the time.

tahonermann commented 5 years ago

WG14 paper: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm

That is the paper that brought UAX#31 into the standard wording. See [lex.name]p1. Note that the paper is simultaneously WG21 N3146.

Interesting paper but this bit

The cited text pretty much matches what we just decided for file names for P1689.

this make Unicode identifiers unusable with reflection, abi, etc

I don't agree with that conclusion.

I'm not saying people should start putting non ASCII identifiers in their interfaces but if we want to give that ability, it needs to be reliable

I do want to give programmers that ability and I agree it needs to be reliable. But I think there are multiple approaches to the problem with various pros and cons and it isn't evident to me that all implementors need to solve problems the same way.

cor3ntin commented 5 years ago

On Sat, Aug 3, 2019, 11:36 PM Tom Honermann notifications@github.com wrote:

no, they are NOT transcoded, the sequence of bytes making the identifier seems not to be parsed using the same encoding as the rest of the file

I think this conclusion is incorrect. I think what you are seeing is typical encoding confusion. In the example you provided, UTF-8 source code is being provided to the compiler, but the compiler is being told to interpret it as Windows 1252. The character in question, 🚙 (U+1F699 RECREATIONAL VEHICLE) has a UTF-8 representation of F0 9F 9A 99. In Windows 1252, this corresponds to "ðŸš™" (U+00F0, U+0178, U+0161, U+2122). Microsoft's documentation for allowed identifiers (https://docs.microsoft.com/en-us/cpp/cpp/identifiers-cpp?view=vs-2019) lists which Unicode code points are allowed. If you cross check that list with the Unicode code points for those characters, you'll see that each one is allowed in identifiers. As for Godbolt then displaying the original Unicode character in the disassembly window, I believe that is technically a bug in Godbolt. The disassembly output is very likely Windows 1252, but is being interpreted as UTF-8.

You might be right, and it would make more sense. We would need to run more tests, because it seemed that the Microsoft implementation correctly filters out some Unicode whitespace characters (but not all).


cor3ntin commented 5 years ago

On Sat, Aug 3, 2019, 11:53 PM Tom Honermann notifications@github.com wrote:

If we want Unicode identifiers, not withstanding escape sequence we need to ensure that:

These are ABI issues and outside our purview.

struct é; static_assert(is_same_v<unqualid(u8"é"), é>);

should be a valid program (unqualid is an utility that transforms a string into an identifier, part of the ongoing metaclasses work)

It isn't at all clear to me that unqualid should accept a u8 string.

Sure, it's not necessary - given that identifiers and string literals are interpreted similarly

Note that this presents an interesting issue: name_of is in the ts specified to return a NTBS in the execution encoding

I think that is probably what is desired almost all of the time.

Not in the presence of Unicode identifiers, because name_of would give you gibberish.


cor3ntin commented 5 years ago

I do want to give programmers that ability and I agree it needs to be reliable. But I think there are multiple approaches to the problem with various pros and cons and it isn't evident to me that all implementors need to solve problems the same way.

Implementation-specific behaviors should be a last resort. I routinely work on 3 compilers on many platforms and I need to trust my tools. Failing to provide portable solutions leads to people restricting themselves to the portable subset, which is one of the reasons why nobody currently uses Unicode identifiers. A lot of people support a lot more platforms than I do.

One problem is that input methods vary greatly between platforms and text editors, so people have very little way to ensure that they write code in a consistent normalization form. That calls for normalization. I am also a bit uncomfortable (well, a lot) going directly against the Unicode recommendations.
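To make the problem concrete, here is a minimal sketch of the two canonically equivalent spellings of é and how a plain byte comparison sees them. The names are hypothetical, and the comparison stands in for a compiler that matches identifiers without normalizing, which is an assumption for illustration rather than a description of any particular implementation.

```cpp
#include <string>

// Precomposed form (NFC): U+00E9, UTF-8 bytes C3 A9.
const std::string e_acute_nfc = "\xC3\xA9";

// Decomposed form (NFD): U+0065 followed by U+0301, UTF-8 bytes 65 CC 81.
const std::string e_acute_nfd = "e\xCC\x81";

// Compares two identifier spellings byte for byte, with no normalization.
bool same_spelling(const std::string& a, const std::string& b) {
    return a == b;
}
```

Both strings render identically in most editors, yet `same_spelling(e_acute_nfc, e_acute_nfd)` is false, so two visually identical identifiers typed on different systems could name different entities.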

For info, at a glance:

Perl 6 and Python 3 normalize.

Perl 5 has a guideline asking people to normalize.

Go does not support combining characters but they are considering normalizing for go 2 https://github.com/golang/go/issues/27896

Rust does not support Unicode identifiers but is considering NFC for when it does.

Swift does not normalize, with some issues and desire to fix which is for them technically an API break https://forums.swift.org/t/pitch-unicode-equivalence-for-swift-source/21576

Julia normalizes https://github.com/JuliaLang/julia/issues/5434

D does not normalize, but Walter Bright notes that people avoid using Unicode identifiers https://news.ycombinator.com/item?id=20320151

C# kinda requires normalization

An identifier in a conforming program must be in the canonical format defined by Unicode Normalization Form C, as defined by Unicode Standard Annex 15. The behavior when encountering an identifier not in Normalization Form C is implementation-defined; however, a diagnostic is not required.

I don't know if I missed anything relevant.

tahonermann commented 5 years ago

Implementation-specific behaviors should be a last resort. I routinely work on 3 compilers on many platforms and I need to trust my tools. Failing to provide portable solutions leads to people restricting themselves to the portable subset, which is one of the reasons why nobody currently uses Unicode identifiers. A lot of people support a lot more platforms than I do.

I think we're on the same page here. Implementation defined behavior doesn't preclude portability; sometimes it just affects the level of abstraction required.

Thanks for doing that research; that is good information. There appears to be a clear trend towards normalization, particularly in languages that didn't start off with a normalizing implementation.

tahonermann commented 4 years ago

P1949 now tracks a solution for this issue.

tahonermann commented 4 years ago

This issue is now tracked by https://github.com/cplusplus/papers/issues/688.

peter-b commented 2 years ago

This is done!

rurban commented 2 years ago

I've prepared a report and library for "C/C++ Identifier Security using Unicode Standard Annex 39", a massive improvement over TR31 alone. See https://github.com/rurban/libu8ident/blob/master/c23%2B%2Bproposal.pdf

How do I file this officially for WG21/WG14? How do I get a P number?

tahonermann commented 2 years ago

Hi @rurban. Please send a link to your proposal to the SG16 mailing list. I or someone else will reply with instructions for how to request a P-number and submit your proposal.