Closed griesemer closed 5 years ago
What would be the reason for allowing the underscore to appear anywhere instead of limiting it to (multiples of) commonly used places? E.g.,
1_000_000_000
0xDEAD_BEEF
0b0101_1111_0001
ISTM being too permissive only degrades readability.
@ericlagergren It seems more Go-like to leave the grammar more free-form with regard to placement of _, but leave the enforcement of style (how many digits in each group) to tooling and code review.
I agree. Though, it’s always possible to make the language more permissive in the future without breaking existing programs. The reverse isn’t true.
@ericlagergren What's a commonly used placement of _? That's really difficult to impossible to answer and shouldn't be decided by the language. Also, note that other languages also don't try to pin that down.
Well, if you asked me, I’d say:
decimals: 3
hex: 2, 4, ...
octal: ??
binary: 2, 4, 8, ...
But, I get your point. :)
I think this is an excellent idea if we are to have binary literals (which otherwise become unreadable after one or two bytes) and also helps with the readability of other long numbers.
Several other languages have a similar feature.
As far as placement is concerned, we'll just have to leave that to the good taste of gophers :)
I don’t think this improves readability. For example, which is more readable?
x := 100000000
y := 100_000_000
const million = 1000000
z := 100 * million
The latter is, IMO, the more descriptive, leverages the qualities of untyped constants, the lineage of the time package, and exists today without new syntax.
Well, exact powers of 10 can always be dealt with in some other manner. One could also do this:
z2 := int(1e8)
But what about these:
// binary literals
a := 0b0101110110011011
b := 0b01011101_10011011 // bytes
c := 0b0101_1101_1001_1011 // nybbles
// decimal literals
d := 2861945612983
e := 2_861_945_612_983 // thousand separators
To my eyes, the underscored literals are much easier to read.
Are there any concrete examples of existing code where this would help readability significantly? Adding a separator does make sense for big numbers but are they common enough to warrant changing the language for?
I've spent a bit of time working with Swift code and in my limited experience I've seen that underscores tend to get used inconsistently. I think they end up making it harder (for me anyway) to parse the code.
Go already has quite a few ways to write constants, I'm not sure we need more.
When programming in other languages which have digit separators (C#, Java, Kotlin), I generally split binary literals into nybbles and add thousand separators for decimal literals >= 1,000. That more or less mimics how I'd write them down on paper (with space and comma as respective separators).
Of course, that doesn't mean I find longer numbers without separators unreadable - I'm OK up to about twice those lengths - but, when you have something, you tend to use it.
Very long decimal literals don't crop up very often (except perhaps powers of 10) but when they do it really is good to have the digit separator in the toolbox :)
If this proposal is accepted, then no doubt there will be people who abuse it or use it inconsistently but I don't really think it is practical to limit placement in some way. Other languages don't seem to bother either.
I think it would be better to rule out adjacent or trailing underscores, and to always want at least one digit between the base indicator and the first underscore, e.g. 0_7 is invalid, effectively ruling out a leading underscore on the value's digits.
int_lit = decimal_lit | octal_lit | hex_lit .
decimal_lit = ( "1" … "9" ) { decimal_digit | "_" decimal_digit } .
octal_lit = "0" [ octal_digit { octal_digit | "_" octal_digit } ] .
hex_lit = "0" ( "x" | "X" ) hex_digit { hex_digit | "_" hex_digit } .
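To make the grammar above concrete, here is a minimal sketch of a checker that transcribes those three productions by hand. The names isDigit, digits, and validIntLit are hypothetical helpers made up for this illustration; this is not how the Go scanner is actually written.

```go
package main

import "fmt"

// isDigit reports whether c is a valid digit in the given base (8, 10, or 16).
func isDigit(c byte, base int) bool {
	switch {
	case '0' <= c && c <= '9':
		return int(c-'0') < base
	case 'a' <= c && c <= 'f', 'A' <= c && c <= 'F':
		return base == 16
	}
	return false
}

// digits reports whether s matches: digit { digit | "_" digit }.
// Every underscore must be followed directly by a digit, which rules out
// leading, adjacent, and trailing underscores.
func digits(s string, base int) bool {
	if s == "" || !isDigit(s[0], base) {
		return false
	}
	for i := 1; i < len(s); i++ {
		if s[i] == '_' {
			i++ // step onto the character that must be a digit
			if i >= len(s) {
				return false
			}
		}
		if !isDigit(s[i], base) {
			return false
		}
	}
	return true
}

// validIntLit (hypothetical) checks s against the int_lit grammar quoted above.
func validIntLit(s string) bool {
	switch {
	case len(s) >= 2 && s[0] == '0' && (s[1] == 'x' || s[1] == 'X'):
		return digits(s[2:], 16) // hex_lit
	case len(s) >= 1 && s[0] == '0':
		return len(s) == 1 || digits(s[1:], 8) // octal_lit
	case len(s) >= 1 && '1' <= s[0] && s[0] <= '9':
		return digits(s, 10) // decimal_lit
	}
	return false
}

func main() {
	for _, lit := range []string{"1_000_000", "0xDEAD_BEEF", "01_2", "0_7", "1__2", "42_", "0x_1"} {
		fmt.Println(lit, validIntLit(lit))
	}
}
```

Under this grammar 0_7, adjacent underscores, and trailing underscores are all rejected, because an underscore is only ever consumed together with the digit that follows it.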
@RalphCorderoy Note that the proposed syntax already requires at least one digit after the base indicator before the first blank (_). Based on that rule, 0_7 would be decimal 7; but for this specific case of octals, I agree it should be invalid.
I'm not convinced about ruling out adjacent separators. For one, other languages permit it, and not permitting it would truly make the literal syntax quite a bit more complex. As is, it simply follows the same pattern we already have for identifiers.
Also, consider as a counter-example: 0b0110_0111_1010_1100__0011_0100_0101_1100.
Hi @griesemer, the double underscore for a 16-bit boundary is a good example of why it should be permitted.
With leading underscores, including octal, ruled out, is there a reason to permit trailing ones? It's allowed in identifiers, but this change is about separating digits for readability. The lexing of 42_____x can complain about the trailing _ on peeking the x because lexing as 42 and _____x instead, after much backtracking, wouldn't be valid for later parsing anyway AFAIK.
0_77 // 77 decimal, or perhaps not permitted (TBD)
Definitely should not be permitted. It looks like the proposed syntax prevents this though.
[...] is there a reason to permit trailing ones? It's allowed in identifiers [...]
I can't think of any reasons. Also, identifiers allow leading underscores as well, and we've ruled those out so I'm not sure if that logic applies
This is a fully backward-compatible proposal for permitting the use of a blank (_) as a separator in number literals.
I don't think this is the case. Currently 123_456 forms three valid tokens while under the proposal it would become just one.
I'm aware the compatibility guarantee is narrowed to "continues to compile", but that's not the same as "fully backwards compatible", as that should IMHO include not breaking also any of the existing tools.
@cznic As you say, the compatibility guarantee only says that valid code will continue to work; it makes no particular guarantees about invalid code.
The cost of updating tools to support new language features is something that has to be considered with any language change, but that is a separate issue from backward compatibility.
While doing embedded programming, I often use hexadecimal constants and I feel that 0x0f00_0000 is a big readability improvement over 0x0f000000, as it's too easy to miss a digit.
+1. This feature was implemented recently in Kotlin and has been of great help. Would love to see it in Golang!
I don’t think this improves readability. For example, which is more readable?
x := 100000000
y := 100_000_000
const million = 1000000
z := 100 * million
The latter is, IMO, the more descriptive, leverages the qualities of untyped constants, the lineage of the time package, and exists today without new syntax. @davecheney
For simple, small(er) numbers, this argument holds. But it starts breaking down once we start talking about numbers that are larger than a billion. I'd rather not have to look back and count 9 zeroes just to verify that 100000000 is actually a billion so that I can do 5 * billion.
The argument also starts to break down when we start to talk about numbers which aren't as simple as 100 million, for instance, a large prime number 10107476689
x := 10_107_476_689
// alternatively
const thousand = 1000
const million = thousand * 1000
const billion = million * 1000
y := 10 * billion + 107 * million + 476 * thousand + 689
The argument also breaks down when you're not talking about decimal notation, which has already been brought up so I won't go into depth, but it's very useful to be able to break up binary/hex into nybbles/bytes.
PS - the 100000000 in the first paragraph only contains 8 zeroes, which was done to illustrate that your solution starts to break down a bit once you have larger numbers
I don't believe it is a good idea to enforce restrictions on where the underscores must go, as I can imagine some people preferring 1234_567_890 to 1_234_567_890. It should be down to the coder to write nice-looking code.
It looks like this proposal is very similar to Python's PEP-515
Ideally, we would also like to have rules for fmt also. For example:
a := 10107476689
fmt.Printf("%#_d", a) // 10_107_476_689
fmt.Printf("%#_x", a) // '2_5a73_dad1'
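For illustration only, the output shown above can be produced today without any new fmt verb. The sketch below uses a hypothetical helper, groupDigits, that inserts an underscore every n characters counting from the right; the name and signature are assumptions made up for this example and are not part of fmt or any proposal.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// groupDigits (hypothetical) inserts "_" between groups of n characters,
// counting from the right. It expects a plain digit string with no sign
// or base prefix.
func groupDigits(digits string, n int) string {
	if n <= 0 || len(digits) <= n {
		return digits
	}
	// The leftmost group may be shorter than n.
	first := len(digits) % n
	if first == 0 {
		first = n
	}
	var b strings.Builder
	b.WriteString(digits[:first])
	for i := first; i < len(digits); i += n {
		b.WriteByte('_')
		b.WriteString(digits[i : i+n])
	}
	return b.String()
}

func main() {
	a := int64(10107476689)
	fmt.Println(groupDigits(strconv.FormatInt(a, 10), 3)) // 10_107_476_689
	fmt.Println(groupDigits(strconv.FormatInt(a, 16), 4)) // 2_5a73_dad1
}
```

The group size is a parameter precisely because, as discussed in this thread, there is no single grouping that suits every base or every locale.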
@theodesp again I bring up international number formatting
Is there any reason to put it in fmt? We do not have comma separators in fmt today, and I think for good reason. Number formats are localized, so I'm not sure if it's a good idea for fmt. Maybe it'd be good for the text package, but we already have comma separation there, so I don't see much use
Also, the # parameter is already taken by "alternate form". For instance, it adds a leading "0x" for "%x". Not sure if you knew this or not. If you did, you forgot to put it in your outputs, and there is no "%#d". If you didn't, well I hope you learned something new 😃
I knew it, this was just an example extension to the existing flags. I use fmt because it's the most obvious choice and I don't think the underscores are part of any sort of internationalization as the scope of this proposal is totally different.
The main point that I was making was that separating by 3s is localized - the Indian number format does not do this. Go forces the "." separator in fmt because it's part of the language, but separating numbers by 3s is not part of the language, so I'm not sure if that's in the scope of fmt
Ideally, we should not involve any locale-specific rules and keep it simple and relevant for only one thing - separating digits of numbers using underscore for readability.
The format rules come in handy when you want a default but not locale-aware string representation of a value based on some flags and that should be the scope of fmt.
If you want a more localizable approach then https://godoc.org/golang.org/x/text/message#hdr-Localized_Formatting should also be able to handle rules (if any) for underscores with commas or separators or whatever. Ideally, it should ignore them, for simplicity and to avoid confusion, or enable them only where they have meaning.
Here's related Python PEP 515 (Underscores in Numeric Literals) with their proposed syntax plus a discussion of pros/cons: https://www.python.org/dev/peps/pep-0515/ .
@theodesp I was saying that we should not include it in fmt because the "#_###_###" format is locale specific, where fmt should be locale-independent ideally. I think if people want number separation, they should look in the text package.
On a separate note - The Kotlin proposal brings up some other good use cases that we have not discussed (translated to Go):
const creditCard = 1234_5678_9012_3456
const socialSecurity = 999_99_9999
This would reduce the probability of success when grepping code for a constant (grep -Rnw 100000 .), which may lead to wrong conclusions by those analyzing the codebase in a hurry.
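One possible mitigation, sketched here under the assumption that underscores may appear between any two digits: build a search pattern that tolerates separators. The helper literalPattern is hypothetical and only handles a plain digit string; it is an illustration, not an existing tool.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// literalPattern (hypothetical) returns a regexp that matches the digit
// string with any number of underscores allowed between adjacent digits.
// The \b anchors treat digits and underscores as word characters, so the
// pattern does not match inside a longer number.
func literalPattern(digits string) *regexp.Regexp {
	parts := make([]string, len(digits))
	for i := 0; i < len(digits); i++ {
		parts[i] = regexp.QuoteMeta(digits[i : i+1])
	}
	return regexp.MustCompile(`\b` + strings.Join(parts, `_*`) + `\b`)
}

func main() {
	re := literalPattern("100000")
	for _, line := range []string{
		"const limit = 100000",
		"const limit = 100_000",
		"const limit = 1_00_000",
		"const limit = 1000000",
		"const limit = 1_100_000",
	} {
		fmt.Printf("%-26s %v\n", line, re.MatchString(line))
	}
}
```

This only addresses literals written in the same base; a constant written as 0x186a0 or 1e5 is already invisible to a plain grep for 100000 today.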
A tool (go vet) might impose restrictions on how the separator is used for readability.
@griesemer can you expand on how you imagine this might work? Considering the vast number of projects I've interacted with which equate go vet with law, I question the value of furthering the gap between what the compiler accepts and what linters find suspicious.
A tool (go vet) might impose restrictions on how the separator is used for readability.
In addition to the Kotlin examples @deanveloper mentioned, I've also used the separator for:
const cents = 123_45
The linters for other languages would yell at me. So I'd rather not have a vet rule and allow people to use the feature in whatever way they wish.
Credit cards are an example of unusual, but not uncommon, number separation. I have often needed 4111_1111_1111_1111, 5105_1051_0510_5100, and 3714_496353_98431 in my test suites.
Credit cards have already been given as an example: https://github.com/golang/go/issues/28493#issuecomment-435999989 Though long, please read all the comments before posting to avoid duplicates. It also gives opportunity to thumb-up or thumb-down comments. https://github.com/golang/go/wiki/NoPlusOne
There are situations (especially with hex and binary literals) where arbitrary positioning of _ would be beneficial. I have some pieces of code that interact with hardware or binary formats and use _ (in the D language) in some strange positions, just to better indicate the encoding. Another example could be a UUID.
As for decimals, good examples would be currencies (bitcoin = 1_000_000_00, cents = 123_45), numbers as written in some non-Western languages/countries (most notably China, 12_3456_7890_2345, and India, 10_00_00_00_00_000), postal codes, and structured serial numbers. Another is numbers used to describe groups: when I decompose a problem to run on multiple shards, I leave the last few digits to indicate a shard/partition, and that group doesn't need to be 3 digits, e.g. id = 11_123_33. I use decimals because they are easy to work with, easy for a human to track progress with, easy to type manually, and sort correctly (e.g. as file/directory names).
There are plenty of good arguments as to why an implementation should not require a specific spacing of _. That said, the original proposal's syntax permits a _ anywhere after the first digit, and this seems sufficiently undesirable that we shouldn't leave it to go vet to complain, but make it invalid in the first place.
How do people feel about Python's approach ( https://www.python.org/dev/peps/pep-0515/ )? It basically allows _ between numbers. There can't be consecutive _'s, and they can't be trailing. So things like this are allowed:
123_456_789
1_000_000_000
0xdead_beef
0x_dead_beef // _ allowed after 0x, 0b
0b0110_1101_1101_0010 // if we had binary integer literals
3.1415_9265
but things like this wouldn't be:
0___
1_2__3___
0x1__23_0
0__.0__e-0__
There is a question regarding octal numbers: Should 0_1 be permitted, and if so, is it an octal number? (Probably, analogous to 0x_dead_beef and 0b_1100_0011 where it's nice to be able to separate the prefix from the number). Note that in Python, octals use the 0o prefix.
Adjacent separators do seem useful in some cases, as per https://github.com/golang/go/issues/28493#issuecomment-434754745: 0b0110_0111_1010_1100__0011_0100_0101_1100.
I don't see any reason for leading or trailing underscores in numeric literals.
As 012 is currently octal, I think that 0_12 should also be octal. That amounts to removing the separators before determining the base, which keeps the rules for separators and base prefixes orthogonal.
@nathany Yes, that's an important point you're making: The value of a literal should be the same with or without the _.
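That principle can be checked mechanically. The sketch below assumes Go 1.13 or later, where strconv.ParseInt with base 0 accepts underscores and the 0b prefix following Go's integer literal syntax; sameValue is a hypothetical helper written for this illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sameValue (hypothetical) parses lit twice, once as written and once with
// all underscores removed, and reports whether both parses succeed and agree.
// Base 0 lets ParseInt pick the base from the prefix, as the compiler would.
func sameValue(lit string) (int64, bool) {
	with, err1 := strconv.ParseInt(lit, 0, 64)
	without, err2 := strconv.ParseInt(strings.ReplaceAll(lit, "_", ""), 0, 64)
	return with, err1 == nil && err2 == nil && with == without
}

func main() {
	for _, lit := range []string{
		"0b0101_1101_1001_1011",
		"2_861_945_612_983",
		"0x_dead_beef",
	} {
		v, ok := sameValue(lit)
		fmt.Println(lit, v, ok)
	}
}
```

Because the base is determined from the prefix alone, stripping the separators first or last makes no difference to the result.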
I'd be comfortable with disallowing leading and trailing underscores as I can't think of any use cases for them.
As for adjacent underscores, I'd be inclined to disallow them too as they're ugly and I can't think of any use cases apart from the one mentioned by @nathany which I don't think is particularly important.
As for octal literals, one could argue that 0_1 should not be allowed because '0' is the base indicator - not really a digit as such - and therefore the first digit is '1'. Of course 01_2 would be fine.
However, if it is decided to use '0o' as an alternative to (or instead of) the present leading zero, then there would be a potential inconsistency with the above argument in that 0o0_1 should be allowed (because '0' is a digit here not a base indicator) as well as 0o1. So maybe it's best to allow 0_1 after all.
Whatever is decided about octal, I agree with the important principle that the value of a literal should be the same with or without underscores.
I've dug out Java's rules on underscores for numeric literals for comparison with those of Python.
Notice that the octal literal 0_1 is allowed and so are adjacent underscores but leading and trailing underscores are disallowed including after 0x or 0b.
Notice also that you can't place an underscore next to a decimal point which I think would be a sensible rule for Go as I can't think of any good reason why anyone would want to do that.
There's a trend in this discussion that I'd like to argue against. A few examples plucked at random from recent comments:
[snip] as I can't think of any use cases for them.
[snip] as I can't think of any good reason why anyone would want to do that
I don't see any reason for [snip]
There are lots of Go programmers, some working on unusual projects. And machine-generation of code often is easier when you do things that there's no good reason for a human to do. Unnecessary restrictions punish such cases disproportionately.
I think the formulation of the rules should be whatever is simple, clear, and general. If that means a few weird warts are accepted, that's ok. There are better places to put effort and spec length than into trying to spell out a bunch of exceptions on the grounds that we can't currently imagine a good use for them.
@josharian
So, just to be clear, is your suggestion then that we should simply allow underscores (single or multiple) to be placed anywhere within numeric literals except, of course, as the first character (when it would become an identifier) or within a base specifier (0_x etc.)?
I think allowing the underscore to be anywhere and to be repeated more than once, with the exception of the first digit or base specifier, also makes the parser simpler.
As for the octals. I am fine with not supporting underscores in octals. They can be removed from the language as far as I am concerned.
@baryluk Please consider the Python spec. It basically allows an optional underscore before each digit, except for the first one. That is very regular, easy to explain, trivial to parse, and does exclude patterns that one might want to disallow as a matter of style in the first place. The value of a literal remains unchanged with the _'s removed.
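As a rough illustration of how regular that rule is, the integer part of it fits in one regular expression. This is a sketch under assumptions: it covers only integer literals, allows an underscore directly after a 0x/0b prefix as in the Python examples earlier in the thread, and does not check that leading-zero literals use only octal digits. The name intLit is made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// intLit encodes "an optional single underscore before each digit except
// the first". After a base prefix, the underscore is also allowed before
// the first digit (0x_dead_beef).
var intLit = regexp.MustCompile(
	`^(0[xX](_?[0-9a-fA-F])+|0[bB](_?[01])+|[0-9](_?[0-9])*)$`)

func main() {
	allowed := []string{"123_456_789", "1_000_000_000", "0xdead_beef", "0x_dead_beef", "0b0110_1101_1101_0010"}
	rejected := []string{"0___", "1_2__3___", "0x1__23_0", "1_"}
	for _, s := range allowed {
		fmt.Println(s, intLit.MatchString(s))
	}
	for _, s := range rejected {
		fmt.Println(s, intLit.MatchString(s))
	}
}
```

Note that this pattern accepts 0_1, which is exactly the octal question raised above; whether that should be valid is a separate decision from the separator rule itself.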
If the purpose of _ is for readability, why not allow spaces as well (or instead)? E.g.:
123 456 789
1 000 000 000
0b0110 1101 1101 0010 // if we had binary integer literals
0xdead beef
3.1415 9265
I realize it conflicts with the spec as-written:
White space, formed from spaces (U+0020), horizontal tabs (U+0009), carriage returns (U+000D), and newlines (U+000A), is ignored except as it separates tokens that would otherwise combine into a single token.
@lrewega
I realize it conflicts with the spec as-written:
And that is why we can't use spaces as separators. One could probably make it work, but it would make the language much more fragile: Is f(3.1415 9265) a function call with a single number or is a comma missing (f(3.1415, 9265))? It would be harder to read, too. There's a reason why no other major language has pursued this avenue.
Change https://golang.org/cl/152377 mentions this issue: spec: permit underscores for grouping in numeric literals (tentative)
@griesemer
And that is why we can't use spaces as separators.
This proposal will require changing the spec already. I don't see how that invalidates supporting spaces.
One could probably make it work, but it would make the language much more fragile: Is f(3.1415 9265) a function call with a single number or is a comma missing (f(3.1415, 9265))? It would be harder to read, too. There's a reason why no other major language has pursued this avenue.
I... don't quite believe you. Several major languages support spaces between string literals:
foo = "hello" "world"
f("3.1415" "9265")
What's fundamentally different with numbers? I agree that no other major language has pursued supporting whitespace in numeric literals, and if this change is merely for the sake of being more like other languages, then I can accept that. If it is for readability, then I have a hard time accepting that underscores are easier to read than spaces. Whitespace can be, and is already, used widely in Go to align for the sake of readability:
const (
small = - 1
middle = + 0
big = +100000000
)
I don't see why spaces are any worse than introducing _, compounding: _ greatly reduces the grep-ability of magic numbers.
@lrewega
Several major languages support spaces between string literals
Python is one such language, and many consider it to be a serious wart. It is super easy to accidentally leave a comma out of a list of strings and silently end up with the wrong data. I'm glad Go requires an explicit + to concatenate string literals.
numbers = [
"one"
"two",
"three",
]
Is ASCII space the only space we would allow?
C is another language which allows whitespace to be used to concatenate two string constants. This fact has been abused in the underhanded C competition to hide an exploit: http://www.underhanded-c.org/_page_id_22.html
That was a bit more of a fun fact than an actual argument, but it does show how using spaces can cause subtle bugs that can be hard to spot.
@lrewega I did say that one probably could make it work... :-)
There's a difference between _ and white space: One of the first steps during tokenization is to eliminate all white space. Note that this happens on the lexical level, for terminal grammar rules, implemented in the scanner/lexer of a compiler.
That is, 3.1415 9265 will become two tokens 3.1415 and 9265. (In contrast, 3.1415_9265 continues to be recognized as a single token.)
Now, to make 3.1415 9265 "work", we need to change the rules for operands (which we could do). But those are non-terminal grammar rules, implemented in the parser of a compiler. A (Go) Operand may now accept a sequence of BasicLits (such as 3.1415 9265 or "foo" "bar", etc.). Again possible, but it becomes fragile (see below). And then we'd have to come up with rules how to interpret such sequences. For strings it's easy, there's just concatenation. But for numbers we want scaling, etc. We'd have new kinds of errors, and so forth. And what about 0xbad f00d? How does the compiler know you meant 0xbadf00d rather than 0xbad followed by the identifier (!) f00d? What if it's written across two lines? Now we'd have to distinguish between blanks and newlines to make sense of that. And so forth. Now that's fragile.
It becomes very quickly seriously complicated for really no good reason.
The nice thing about using _ for separation is that it can be done absolutely trivially at the lexical level; and the rest of the compiler doesn't have to see it. You're right, we do change the spec. But not all spec changes are equal. Changes to terminal grammar productions (the ones starting with a lower-case letter in the spec) mostly affect the scanner/lexer and tend to be much easier to explain and implement than changes at the higher level. Both this proposal and binary integer literals are essentially changes at this lowest lexical level. Otherwise we probably wouldn't consider them in the first place.
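The token-level difference described above can be observed with the standard go/scanner package. This is a sketch that assumes a toolchain recent enough (Go 1.13 or later) for the scanner to accept underscores in number literals; tokens is a hypothetical helper written for this illustration.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// tokens (hypothetical) scans src and returns each token as "KIND:literal",
// skipping the semicolon the scanner inserts automatically at end of input.
func tokens(src string) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0)
	var out []string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		if tok == token.SEMICOLON && lit == "\n" {
			continue // automatically inserted, not present in the source
		}
		out = append(out, fmt.Sprintf("%s:%s", tok, lit))
	}
	return out
}

func main() {
	fmt.Println(tokens("3.1415 9265")) // white space separates two tokens
	fmt.Println(tokens("3.1415_9265")) // the underscore stays inside one token
	fmt.Println(tokens("0xbad f00d"))  // a number followed by an identifier
}
```

With spaces, the parser would have to reassemble the pieces and guess whether f00d was meant as an identifier; with underscores, the scanner hands over a single literal and nothing downstream needs to change.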
We're not trying to address a major shortcoming of the language here; allowing separators is a minor convenience that hopefully is used sparingly in real code. The respective implementation and documentation effort should be in line with the goal.
This has come up before, specifically during discussions of #19308 and #28256. I am writing this down so we have a place for discussing this independently. I have no strong feelings about this proposal either way.
Proposal
This is a fully backward-compatible proposal for permitting the use of a blank (_) as a separator in number literals. Specifically, we change the integer literal syntax such that we also allow a "_" after the first digit (the change is the extra "_"):
And we change the floating-point number literal syntax correspondingly (the change is the extra "_"):
For complex number literals the change is implied by the change to floating-point literals.
Examples:
but also:
Discussion
The notation follows more or less the syntax used in other languages (e.g., Swift). The implementation is straight-forward (a minor change to the number scanner). As the examples show, the separator, if judiciously used, may improve readability; or it may degrade it significantly. A tool (go vet) might impose restrictions on how the separator is used for readability.