sg16-unicode / sg16

SG16 overview and general information

P1949 negative impact on math heavy code #77

Closed termi-official closed 2 years ago

termi-official commented 2 years ago

If I have understood everything correctly, SG16 is responsible for this document, so let me raise an open issue with it. If SG16 is the wrong group, please feel free to move the issue to the correct place.

P1949 has quite a few side effects on code as it is written by parts of the numerical community. While I appreciate the work to improve the standard, for us numerics guys this change causes more harm than good. Let me elaborate. Before P1949 it was easier to write code that is readable in the sense that it stays close to the underlying theory, which takes away some mental overhead. As a quick example, take a simple time step controller, where the theory states

$$ \Delta t_{n+1} = \varepsilon_{n+1}^{\beta_1/k} \cdot \varepsilon_{n}^{\beta_2/k} \cdot \varepsilon_{n-1}^{\beta_3/k} \cdot \Delta t_{n} $$

Before P1949, GCC and Clang accepted the following code:

const auto Δtₙ₊₁ = std::pow(εₙ₊₁, β₁/k) * std::pow(εₙ, β₂/k) * std::pow(εₙ₋₁, β₃/k) * Δtₙ; 

Having this close correspondence takes away some mental overhead when reading such code, because we can directly relate the symbols back to the theory. Please also note that this is just a quick example, and numerical code can get far more complicated than this piece. In our internal projects such constructs are quite common, and I am pretty sure that there is more code out there following such practices (see e.g. https://github.com/llvm/llvm-project/issues/54732).
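For comparison, a minimal ASCII-only sketch of the same update, which does not depend on which Unicode characters a compiler accepts in identifiers. The function name `next_time_step`, the parameter names, and the positivity precondition are assumptions made for illustration and are not part of the original code; the comment maps each name back to the symbol in the formula.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical ASCII spelling of the controller above.
//   dt_n    -> Δtₙ     eps_np1 -> εₙ₊₁    eps_n -> εₙ    eps_nm1 -> εₙ₋₁
//   beta1, beta2, beta3 -> β₁, β₂, β₃     k -> k
double next_time_step(double dt_n,
                      double eps_np1, double eps_n, double eps_nm1,
                      double beta1, double beta2, double beta3,
                      double k)
{
    // Assumed precondition (not stated in the original post):
    // the current step and all error terms are positive.
    assert(dt_n > 0.0 && eps_np1 > 0.0 && eps_n > 0.0 && eps_nm1 > 0.0);

    return std::pow(eps_np1, beta1 / k)
         * std::pow(eps_n,   beta2 / k)
         * std::pow(eps_nm1, beta3 / k)
         * dt_n;
}
```

The mapping comment now has to carry the link to the theory that the Unicode identifiers expressed directly, which is the kind of overhead described above.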

I appreciate the time spent working on the standard, but I think this proposal is the exact opposite of what we developers in the numerical community need. I also do not like the direction this is going, because if I see it correctly, some hard problems with identifiers are not addressed yet. Now back to P1949: I am probably missing something, but wasn't the original point of P1949 to remove invisible and control characters? I do not really see how superscript and subscript numerical indices are related to that. Reading through the linked issue above, their removal from identifiers appears to have been intentional rather than accidental, although I could not find detailed information. I also noticed that super-/subscript letters and some super-/subscript symbols still seem to be valid, which causes some weird inconsistency. Can you please elaborate?

The current direction also raises more questions from my side:

  1. Are there plans to restrict allowed characters further, especially the ones used in the computational sciences/basic math notation?
  2. Is there a possibility to bring back the numerical super-/subscripts?
  3. Related to this, and I know that the Unicode Consortium does not want to hear this, but since this is really useful for the numerical community: is there any possibility to at least allow the very basic standard letters (Greek and Latin) in super-/subscripts, either directly via Unicode or via some extra mechanism in the language/editors? Yes, I read the opinions on this (and I am absolutely not a fan of them, much less do I agree), but from a user perspective, having just some of the characters available is really weird.

Quick link to the code above on Godbolt: https://godbolt.org/z/nG7o5K141

Thank you for taking the time.

jensmaurer commented 2 years ago

This paper was adopted by C++ last year and by C this year; the normative changes will be part of C++23 and C23, respectively. If you feel this is in error, please make sure to raise your concerns using a National Body comment via your National Standardization Body participating in ISO/IEC JTC1 SC22 WG21 or WG14.

Note that this paper defers the definition of which characters may appear in identifiers to the Unicode Consortium, which has published Unicode Standard Annex 31 with such a definition. To my understanding, the C++ committee felt that having text experts from Unicode define the set of allowable identifier characters was an improvement compared to the previous situation, where some (but not all) emoji could have appeared in identifiers.
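A rough sketch of what that rule means for the identifiers in the example above: under UAX #31, an identifier is one XID_Start character followed by any number of XID_Continue characters (C++ additionally allows a leading underscore). The tables below are a small hand-picked excerpt, not the real Unicode data, and the function names are hypothetical; the sketch also ignores the normalization requirement in P1949. It assumes the Latin subscript letters are modifier letters with XID_Start, while the subscript digits and signs are not XID_Continue.

```cpp
#include <string_view>

// Hand-picked excerpt (assumption); real implementations use the full
// XID_Start / XID_Continue tables from the Unicode Character Database.
constexpr bool is_xid_start(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z')
        || (c >= U'\u0391' && c <= U'\u03A1')   // Greek capitals Alpha..Rho
        || (c >= U'\u03A3' && c <= U'\u03A9')   // Greek capitals Sigma..Omega
        || (c >= U'\u03B1' && c <= U'\u03C9')   // Greek small alpha..omega
        || (c >= U'\u2090' && c <= U'\u209C');  // Latin subscript small letters
}

constexpr bool is_xid_continue(char32_t c) {
    // Subscript digits (U+2080..U+2089) and subscript plus/minus
    // (U+208A, U+208B) are deliberately absent from this excerpt.
    return is_xid_start(c) || (c >= U'0' && c <= U'9') || c == U'_';
}

constexpr bool is_identifier(std::u32string_view s) {
    if (s.empty()) return false;
    if (!(is_xid_start(s[0]) || s[0] == U'_')) return false;
    for (char32_t c : s.substr(1)) {
        if (!is_xid_continue(c)) return false;
    }
    return true;
}

// "Δtₙ" passes under this excerpt, "Δtₙ₊₁" and "β₁" do not.
static_assert(is_identifier(U"\u0394t\u2099"));
static_assert(!is_identifier(U"\u0394t\u2099\u208A\u2081"));
static_assert(!is_identifier(U"\u03B2\u2081"));
```

This mirrors the inconsistency reported in the issue: subscript letters pass, while subscript digits and signs are rejected.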

If you feel the specific set of characters selected by Unicode Standard Annex 31 is inadequate, the best course of action is to contact the Unicode Consortium and engage in a discussion with them.

Thanks, Jens

tahonermann commented 2 years ago

Thank you for sharing your experience with us. Your experience is not unique; we have heard from a couple of other people that maintain projects that have been impacted, including from a member of the Unicode Consortium who is currently working on improvements to Unicode to support source code as text.

As P1949 explains, the previous character allowances resulted in surprising inconsistencies and lacked a principled approach for what characters were and were not allowed in identifiers. Jens' characterization of our goals is correct; we (SG16 and WG21) do not have the expertise or resources to audit the Unicode character set in order to make our own determination of what characters should and should not be allowed in identifiers. So, we chose to defer to the Unicode Consortium and UAX #31.

We don't use the GitHub issue tracker as a medium for discussion. I encourage you to resend your post to the SG16 mailing list. Doing so will reach more people, including some members of the Unicode Consortium. I'll respond there to put you in touch with people within the Unicode Consortium for further follow up.

termi-official commented 2 years ago

Thanks for the detailed elaboration, Tom and Jens. I posted to the mailing list: https://lists.isocpp.org/sg16/2022/08/3340.php . Since this might affect several groups in the numerical community (and probably will once they upgrade their compilers), should I leave this issue open for now for visibility, and as a signal for when this has been dealt with?

tahonermann commented 2 years ago

Thank you for the post to the SG16 mailing list. I'm going to close this issue for now. Since the concerns you raise are not C++ specific (other languages that follow Unicode guidance and UAX #31 may have similar inconsistencies or allowance variations), my preference is that the Unicode Consortium address the question of whether the characters you reported (as well as other characters that are similarly questionable) should be allowed in identifiers. I'll put you in touch with the Unicode group working on this.