golang / go

proposal: permit blank (_) separator in integer number literals #28493

Closed griesemer closed 5 years ago

griesemer commented 5 years ago

This has come up before, specifically during discussions of #19308 and #28256. I am writing this down so we have a place for discussing this independently. I have no strong feelings about this proposal either way.

Proposal

This is a fully backward-compatible proposal for permitting the use of a blank (_) as a separator in number literals. Specifically, we change the integer literal syntax such that we also allow a "_" after the first digit (the change is the extra "_"):

int_lit     = decimal_lit | octal_lit | hex_lit .
decimal_lit = ( "1" … "9" ) { decimal_digit | "_" } .
octal_lit   = "0" { octal_digit | "_" } .
hex_lit     = "0" ( "x" | "X" ) hex_digit { hex_digit | "_" } .

And we change the floating-point number literal syntax correspondingly (the change is the extra "_"):

float_lit = decimals "." [ decimals ] [ exponent ] |
            decimals exponent |
            "." decimals [ exponent ] .
decimals  = decimal_digit { decimal_digit | "_" } .
exponent  = ( "e" | "E" ) [ "+" | "-" ] decimals .

For complex number literals the change is implied by the change to floating-point literals.

Examples:

123_456_789
1_000_000_000
0b0110_1101_1101_0010  // if we had binary integer literals
0xdead_beef
3.1415_9265

but also:

0___         // 0
0_77         // 77 decimal, or perhaps not permitted (TBD)
01_10        // 0110 (octal)
1_2__3___    // 123
0x1_23_0     // 0x1230
0__.0__e-0__ // 0.0e-0

Discussion

The notation more or less follows the syntax used in other languages (e.g., Swift). The implementation is straightforward (a minor change to the number scanner). As the examples show, the separator, if judiciously used, may improve readability; or it may degrade it significantly. A tool (go vet) might impose restrictions on how the separator is used for readability.

ericlagergren commented 5 years ago

What would be the reason for allowing the underscore to appear anywhere instead of limiting it to (multiples of) commonly used places? E.g.,

1_000_000_000
0xDEAD_BEEF
0b0101_1111_0001

ISTM being too permissive only degrades readability.

dgryski commented 5 years ago

@ericlagergren It seems more Go-like to leave the grammar more free-form with regards to placement of _, but leave the enforcement of style (how many digits in each group) to tooling and code review.

ericlagergren commented 5 years ago

I agree. Though, it’s always possible to make the language more permissive in the future without breaking existing programs. The reverse isn’t true.

griesemer commented 5 years ago

@ericlagergren What's a commonly used placement of _? That's really difficult to impossible to answer and shouldn't be decided by the language. Also, note that other languages also don't try to pin that down.

ericlagergren commented 5 years ago

Well, if you asked me, I’d say:

decimals: 3
hex: 2, 4, ...
octal: ??
binary: 2, 4, 8, ...

But, I get your point. :)

alanfo commented 5 years ago

I think this is an excellent idea if we are to have binary literals (which otherwise become unreadable after one or two bytes) and also helps with the readability of other long numbers.

Several other languages have a similar feature.

As far as placement is concerned, we'll just have to leave that to the good taste of gophers :)

davecheney commented 5 years ago

I don’t think this improves readability. For example, which is more readable?

x := 100000000
y := 100_000_000

const million = 1000000
z := 100 * million

The latter is, IMO, the more descriptive, leverages the qualities of untyped constants, the lineage of the time package, and exists today without new syntax.

alanfo commented 5 years ago

Well, exact powers of 10 can always be dealt with in some other manner. One could also do this:

z2 := int(1e8)

But what about these:

// binary literals
a := 0b0101110110011011
b := 0b01011101_10011011   // bytes
c := 0b0101_1101_1001_1011 // nybbles

// decimal literals
d := 2861945612983
e := 2_861_945_612_983 // thousand separators

To my eyes, the underscored literals are much easier to read.

mundaym commented 5 years ago

Are there any concrete examples of existing code where this would help readability significantly? Adding a separator does make sense for big numbers but are they common enough to warrant changing the language for?

I've spent a bit of time working with Swift code and in my limited experience I've seen that underscores tend to get used inconsistently. I think they end up making it harder (for me anyway) to parse the code.

Go already has quite a few ways to write constants, I'm not sure we need more.

alanfo commented 5 years ago

When programming in other languages which have digit separators (C#, Java, Kotlin), I generally split binary literals into nybbles and add thousand separators for decimal literals >= 1,000. That more or less mimics how I'd write them down on paper (with space and comma as respective separators).

Of course, that doesn't mean I find longer numbers without separators unreadable - I'm OK up to about twice those lengths - but, when you have something, you tend to use it.

Very long decimal literals don't crop up very often (except perhaps powers of 10) but when they do it really is good to have the digit separator in the toolbox :)

If this proposal is accepted, then no doubt there will be people who abuse it or use it inconsistently but I don't really think it is practical to limit placement in some way. Other languages don't seem to bother either.

deanveloper commented 5 years ago

decimals: 3

international number formatting may be something to consider

RalphCorderoy commented 5 years ago

I think it would be better to rule out adjacent or trailing underscores, and to always want at least one digit between the base indicator and the first underscore, e.g. 0_7 is invalid, effectively ruling out a leading underscore on the value's digits.

int_lit = decimal_lit | octal_lit | hex_lit .
decimal_lit = ( "1" … "9" ) { decimal_digit | "_" decimal_digit } .
octal_lit = "0" [ octal_digit { octal_digit | "_" octal_digit } ] .
hex_lit = "0" ( "x" | "X" ) hex_digit { hex_digit | "_" hex_digit } .
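
The difference between the original proposal and the stricter grammar above can be tried out with regular expressions. This is only a sketch: the names `permissive` and `strict` are made up here, and the patterns are a hand translation of the two EBNF grammars for integer literals (floats are left out), not anything taken from the Go scanner.

```go
package main

import (
	"fmt"
	"regexp"
)

// permissive mirrors the original proposal: after the first digit,
// any mix of digits and underscores is accepted.
var permissive = regexp.MustCompile(
	`^(?:[1-9][0-9_]*|0[0-7_]*|0[xX][0-9a-fA-F][0-9a-fA-F_]*)$`)

// strict mirrors the grammar above: every underscore must be followed
// by a digit, so adjacent and trailing underscores are rejected.
var strict = regexp.MustCompile(
	`^(?:[1-9](?:_?[0-9])*|0(?:[0-7](?:_?[0-7])*)?|0[xX][0-9a-fA-F](?:_?[0-9a-fA-F])*)$`)

func main() {
	lits := []string{"123_456_789", "0xdead_beef", "0_7", "0___", "1_2__3___"}
	for _, lit := range lits {
		fmt.Printf("%-12s permissive=%-5v strict=%v\n",
			lit, permissive.MatchString(lit), strict.MatchString(lit))
	}
}
```

Under these patterns the first two literals pass both grammars, while `0_7`, `0___` and `1_2__3___` pass only the permissive one.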

griesemer commented 5 years ago

@RalphCorderoy Note that the proposed syntax already requires at least one digit after the base indicator before the first blank (_). Based on that rule, 0_7 would be decimal 7; but for this specific case of octals, I agree it should be invalid.

I'm not convinced about ruling out adjacent separators. For one, other languages permit it, and not permitting it would truly make the literal syntax quite a bit more complex. As is, it simply follows the same pattern we already have for identifiers.

Also, consider as a counter-example: 0b0110_0111_1010_1100__0011_0100_0101_1100 .

RalphCorderoy commented 5 years ago

Hi @griesemer, the double underscore for a 16-bit boundary is a good example of why it should be permitted.

With leading underscores, including octal, ruled out, is there a reason to permit trailing ones? It's allowed in identifiers, but this change is about separating digits for readability. The lexing of 42_____x can complain about the trailing _ on peeking the x because lexing as 42 and _____x instead, after much backtracking, wouldn't be valid for later parsing anyway AFAIK.

deanveloper commented 5 years ago

0_77 // 77 decimal, or perhaps not permitted (TBD)

Definitely should not be permitted. It looks like the proposed syntax prevents this though.

[...] is there a reason to permit trailing ones? It's allowed in identifiers [...]

I can't think of any reasons. Also, identifiers allow leading underscores as well, and we've ruled those out, so I'm not sure that logic applies.

cznic commented 5 years ago

This is a fully backward-compatible proposal for permitting the use of a blank (_) as a separator in number literals.

I don't think this is the case. Currently 123_456 forms three valid tokens while under the proposal it would become just one.

I'm aware the compatibility guarantee is narrowed to "continues to compile", but that's not the same as "fully backwards compatible", as that should IMHO include not breaking also any of the existing tools.

magical commented 5 years ago

@cznic As you say, the compatibility guarantee only says that valid code will continue to work; it makes no particular guarantees about invalid code.

The cost of updating tools to support new language features is something that has to be considered with any language change, but that is a separate issue from backward compatibility.

rasky commented 5 years ago

While doing embedded programming, I often use hexadecimal constants and I feel that 0x0f00_0000 is a big readability improvement over 0x0f000000, as it's too easy to miss a digit.

seaskyways commented 5 years ago

+1 This feature has been implemented lately in Kotlin and has been of great help. Would love to see it in Golang!

deanveloper commented 5 years ago

I don’t think this improves readability. For example, which is more readable?

x := 100000000
y := 100_000_000

const million = 1000000
z := 100 * million

The latter is, IMO, the more descriptive, leverages the qualities of untyped constants, the lineage of the time package, and exists today without new syntax. @davecheney

For simple, small(er) numbers, this argument holds. But it starts breaking down once we start talking about numbers that are larger than a billion. I'd rather not have to look back and count 9 zeroes just to verify that 100000000 is actually a billion so that I can do 5 * billion.

The argument also starts to break down when we start to talk about numbers which aren't as simple as 100 million, for instance, a large prime number 10107476689

x := 10_107_476_689

// alternatively
const thousand = 1000
const million = thousand * 1000
const billion = million * 1000
y := 10 * billion + 107 * million + 476 * thousand + 689

The argument also breaks down when you're not talking about decimal notation, which has already been brought up so I won't go into depth, but it's very useful to be able to break up binary/hex into nybbles/bytes.

PS - the 100000000 in the first paragraph only contains 8 zeroes, which was done to illustrate that your solution starts to break down a bit once you have larger numbers

themeeman commented 5 years ago

I don't believe it is a good idea to enforce restrictions on where the underscores must go, as I can imagine some people preferring 1234_567_890 to 1_234_567_890. It should be down to the coder to write nice-looking code.

theodesp commented 5 years ago

It looks like this proposal is very similar to Python's PEP-515

Ideally, we would also like to have rules for fmt also. For example:

a := 10107476689
fmt.Printf("%#_d", a) // 10_107_476_689
fmt.Printf("%#_x", a) // '2_5a73_dad1'

deanveloper commented 5 years ago

@theodesp again I bring up international number formatting

Is there any reason to put it in fmt? We do not have comma separators in fmt today, and I think for good reason. Number formats are localized, so I'm not sure if it's a good idea for fmt. Maybe it'd be good for the text package, but we already have comma separation there, so I don't see much use

Also, the # parameter is already taken by "alternate form". For instance, it adds a leading "0x" for "%x". Not sure if you knew this or not. If you did, you forgot to put it in your outputs, and there is no "%#d". If you didn't, well I hope you learned something new 😃

theodesp commented 5 years ago

I knew it; this was just an example extension to the existing flags. I use fmt because it's the most obvious choice, and I don't think the underscores are part of any sort of internationalization, as the scope of this proposal is totally different.

deanveloper commented 5 years ago

The main point that I was making was that separating by 3s is localized - the Indian number format does not do this. Go forces the "." separator in fmt because it's part of the language, but separating numbers by 3s is not part of the language, so I'm not sure if that's in the scope of fmt

theodesp commented 5 years ago

Ideally, we should not involve any locale-specific rules and keep it simple and relevant for only one thing - separating digits of numbers using underscore for readability.

The format rules come in handy when you want a default but not locale-aware string representation of a value based on some flags and that should be the scope of fmt.

If you want a more localizable approach, then https://godoc.org/golang.org/x/text/message#hdr-Localized_Formatting should also be able to handle rules (if any) for underscores with commas or separators or whatever. Ideally, it should ignore them, for simplicity and to avoid confusion, or enable them only where they have meaning.
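
A locale-independent grouping of the kind discussed here does not need a new fmt flag to experiment with. The following is a minimal sketch under stated assumptions: `groupDigits` is a hypothetical helper (it is not part of fmt or any Go package), it expects an unsigned digit string, and the group size is chosen by the caller rather than by any locale.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// groupDigits is a hypothetical helper: it inserts an underscore every
// n digits, counting from the right. It assumes digits holds only the
// digits of a non-negative number (no sign, no base prefix).
func groupDigits(digits string, n int) string {
	if n <= 0 || len(digits) <= n {
		return digits
	}
	first := len(digits) % n // length of the leftmost, possibly shorter, group
	if first == 0 {
		first = n
	}
	var b strings.Builder
	b.WriteString(digits[:first])
	for i := first; i < len(digits); i += n {
		b.WriteByte('_')
		b.WriteString(digits[i : i+n])
	}
	return b.String()
}

func main() {
	const a = 10107476689
	fmt.Println(groupDigits(strconv.FormatInt(a, 10), 3)) // 10_107_476_689
	fmt.Println(groupDigits(strconv.FormatInt(a, 16), 4)) // 2_5a73_dad1
}
```

Because the caller picks the group size, the same helper covers thousands, nybbles, or any other grouping without fmt having to take a position on which is "correct".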

griesemer commented 5 years ago

Here's related Python PEP 515 (Underscores in Numeric Literals) with their proposed syntax plus a discussion of pros/cons: https://www.python.org/dev/peps/pep-0515/ .

deanveloper commented 5 years ago

@theodesp I was saying that we should not include in fmt because the "#_###_###" format is locale specific, where fmt should be locale-independent ideally. I think if people want number separation, they should look in the text package.


On a separate note: the Kotlin proposal brings up some other good use cases that we have not discussed (translated to Go):

const creditCard = 1234_5678_9012_3456
const socialSecurity = 999_99_9999

masiulaniec commented 5 years ago

This would reduce probability of success when grepping code for a constant (grep -Rnw 100000 .), which may lead to wrong conclusions by those analyzing the codebase in a hurry.

lrewega commented 5 years ago

A tool (go vet) might impose restrictions on how the separator is used for readability.

@griesemer can you expand on how you imagine this might work? Considering the vast number of projects I've interacted with which equate go vet with law, I question the value of furthering the gap between what the compiler accepts and what linters find suspicious.

nathany commented 5 years ago

A tool (go vet) might impose restrictions on how the separator is used for readability.

In addition to the Kotlin examples @deanveloper mentioned, I've also used the separator for:

const cents = 123_45

The linters for other languages would yell at me. So I'd rather not have a vet rule and allow people use the feature in whatever way they wish.

directionless commented 5 years ago

Credit Cards are an example of unusual, but not uncommon, number separation. I have often needed 4111_1111_1111_1111, 5105_1051_0510_5100, and 3714_496353_98431 in my test suites.

RalphCorderoy commented 5 years ago

Credit cards have already been given as an example: https://github.com/golang/go/issues/28493#issuecomment-435999989 Though long, please read all the comments before posting to avoid duplicates. It also gives opportunity to thumb-up or thumb-down comments. https://github.com/golang/go/wiki/NoPlusOne

baryluk commented 5 years ago

There are situations (especially with hex and binary literals) where arbitrary positioning of _ would be beneficial. I do have some pieces of code that interact with hardware or binary formats that use _ (in the D language) in some strange positions, just to better indicate the encoding. Another example could be UUIDs.

As for decimals, currencies (bitcoin = 1_000_000_00, cents = 123_45) would be a good example, as would numbers in some non-Western languages/countries (most notably China, 12_3456_7890_2345, and India, 10_00_00_00_00_000), postal codes, and structured serial numbers. Another use is numbers that describe groups: when I decompose some problems to run on multiple shards, I leave the last few digits to indicate a shard/partition, and that group doesn't need to be 3 digits, e.g. id = 11_123_33. I use decimals because they are easy to work with, easy for a human to track and type manually, and they sort correctly (e.g. in file/directory names).

griesemer commented 5 years ago

There are plenty of good arguments as to why an implementation should not require a specific spacing of _. That said, the original proposal's syntax permits a _ anywhere after the first digit, and this seems sufficiently undesirable that we shouldn't leave it to go vet to complain, but make it invalid in the first place.

How do people feel about Python's approach ( https://www.python.org/dev/peps/pep-0515/ )? It basically allows _ between numbers. There can't be consecutive _'s, and they can't be trailing. So things like this are allowed:

123_456_789
1_000_000_000
0xdead_beef
0x_dead_beef           // _ allowed after 0x, 0b
0b0110_1101_1101_0010  // if we had binary integer literals
3.1415_9265

but things like this wouldn't be:

0___
1_2__3___
0x1__23_0
0__.0__e-0__

There is a question regarding octal numbers: Should 0_1 be permitted, and if so, is it an octal number? (Probably, analogous to 0x_dead_beef and 0b_1100_0011 where it's nice to be able to separate the prefix from the number). Note that in Python, octals use the 0o prefix.
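
These placement rules can be exercised without a compiler by leaning on strconv, whose documentation says that with base 0 underscores are permitted as defined by the Go syntax for integer literals. The sketch assumes a Go toolchain recent enough to have that behavior (Go 1.13 or later); `accepted` is a name invented here, and the float example from the list is left out because this only covers integers.

```go
package main

import (
	"fmt"
	"strconv"
)

// accepted reports whether strconv parses lit as an integer when asked
// to infer the base from the prefix (base 0), which is the only mode in
// which it permits underscores.
func accepted(lit string) bool {
	_, err := strconv.ParseInt(lit, 0, 64)
	return err == nil
}

func main() {
	lits := []string{
		"123_456_789", "0xdead_beef", "0x_dead_beef", "0_1",
		"0___", "1_2__3___", "0x1__23_0",
	}
	for _, lit := range lits {
		fmt.Printf("%-14s %v\n", lit, accepted(lit))
	}
}
```

On such a toolchain the first four strings parse and the last three are rejected, which matches the allowed and disallowed lists above.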

nathany commented 5 years ago

Adjacent separators do seem useful in some cases, as per https://github.com/golang/go/issues/28493#issuecomment-434754745 0b0110_0111_1010_1100__0011_0100_0101_1100.

I don't see any reason for leading or trailing underscores in numeric literals.

As 012 is currently octal, I think that 0_12 should also be octal. Removing the separators before determining the base -- keeping the rules for separators and base prefixes orthogonal.

griesemer commented 5 years ago

@nathany Yes, that's an important point you're making: The value of a literal should be the same with or without the _.
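
The principle that separators are removed before the base is determined can be written down directly. In this sketch `valueOf` is a hypothetical helper; it does no placement checking at all, it only shows that stripping underscores first makes `0_12` and `012` the same octal number.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// valueOf is a hypothetical helper: it deletes every underscore first
// and only then lets strconv infer the base from the prefix (base 0).
func valueOf(lit string) (int64, error) {
	return strconv.ParseInt(strings.ReplaceAll(lit, "_", ""), 0, 64)
}

func main() {
	for _, lit := range []string{"012", "0_12", "0x0f00_0000", "1_000_000"} {
		v, err := valueOf(lit)
		fmt.Println(lit, v, err)
	}
}
```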

alanfo commented 5 years ago

I'd be comfortable with disallowing leading and trailing underscores as I can't think of any use cases for them.

As for adjacent underscores, I'd be inclined to disallow them too as they're ugly and I can't think of any use cases apart from the one mentioned by @nathany which I don't think is particularly important.

As for octal literals, one could argue that 0_1 should not be allowed because '0' is the base indicator - not really a digit as such - and therefore the first digit is '1'. Of course 01_2 would be fine.

However, if it is decided to use '0o' as an alternative to (or instead of) the present leading zero, then there would be a potential inconsistency with the above argument in that 0o0_1 should be allowed (because '0' is a digit here not a base indicator) as well as 0o1. So maybe it's best to allow 0_1 after all.

Whatever is decided about octal, I agree with the important principle that the value of a literal should be the same with or without underscores.

alanfo commented 5 years ago

I've dug out Java's rules on underscores for numeric literals for comparison with those of Python.

Notice that the octal literal 0_1 is allowed and so are adjacent underscores but leading and trailing underscores are disallowed including after 0x or 0b.

Notice also that you can't place an underscore next to a decimal point which I think would be a sensible rule for Go as I can't think of any good reason why anyone would want to do that.

josharian commented 5 years ago

There's a trend in this discussion that I'd like to argue against. A few examples plucked at random from recent comments:

[snip] as I can't think of any use cases for them.

[snip] as I can't think of any good reason why anyone would want to do that

I don't see any reason for [snip]

There are lots of Go programmers, some working on unusual projects. And machine-generation of code often is easier when you do things that there's no good reason for a human to do. Unnecessary restrictions punish such cases disproportionately.

I think the formulation of the rules should be whatever is simple, clear, and general. If that means a few weird warts are accepted, that's ok. There are better places to put effort and spec length than into trying to spell out a bunch of exceptions on the grounds that we can't currently imagine a good use for them.

alanfo commented 5 years ago

@josharian

So, just to be clear, is your suggestion then that we should simply allow underscores (single or multiple) to be placed anywhere within numeric literals except, of course, as the first character (when it would become an identifier) or within a base specifier (0_x etc.)?

baryluk commented 5 years ago

I think allowing the underscore to appear anywhere and to be repeated more than once, with the exception of the first digit or base specifier, also makes the parser easier.

As for the octals. I am fine with not supporting underscores in octals. They can be removed from the language as far as I am concerned.

griesemer commented 5 years ago

@baryluk Please consider the Python spec. It basically allows an optional underscore before each digit, except for the first one. That is very regular, easy to explain, trivial to parse, and does exclude patterns that one might want to disallow as a matter of style in the first place. The value of a literal remains unchanged with the _'s removed.

lrewega commented 5 years ago

If the purpose of _ is for readability, why not allow spaces as well (or instead)? E.g.:

123 456 789
1 000 000 000
0b0110 1101 1101 0010  // if we had binary integer literals
0xdead beef
3.1415 9265

I realize it conflicts with the spec as-written:

White space, formed from spaces (U+0020), horizontal tabs (U+0009), carriage returns (U+000D), and newlines (U+000A), is ignored except as it separates tokens that would otherwise combine into a single token.

griesemer commented 5 years ago

@lrewega

I realize it conflicts with the spec as-written:

And that is why we can't use spaces as separators. One could probably make it work, but it would make the language much more fragile: Is f(3.1415 9265) a function call with a single number or is a comma missing (f(3.1415, 9265))? It would be harder to read, too. There's a reason why no other major language has pursued this avenue.

gopherbot commented 5 years ago

Change https://golang.org/cl/152377 mentions this issue: spec: permit underscores for grouping in numeric literals (tentative)

lrewega commented 5 years ago

@griesemer

And that is why we can't use spaces as separators.

This proposal will require changing the spec already. I don't see how that invalidates supporting spaces.

One could probably make it work, but it would make the language much more fragile: Is f(3.1415 9265) a function call with a single number or is a comma missing (f(3.1415, 9265))? It would be harder to read, too. There's a reason why no other major language has pursued this avenue.

I... don't quite believe you. Several major languages support spaces between string literals:

foo = "hello" "world"
f("3.1415" "9265")

What's fundamentally different with numbers? I agree that no other major language has pursued supporting whitespace in numeric literals, and if this change is merely for the sake of being more like other languages, then I can accept that. If it is for readability, then I have a hard time accepting that underscores are easier to read than spaces. Whitespace can be, and is already, used widely in Go to align for the sake of readability:

const (
    small  = -        1
    middle = +        0
    big    = +100000000
)

I don't see why spaces are any worse than introducing _, compounding:

magical commented 5 years ago

@lrewega

Several major languages support spaces between string literals

Python is one such language, and many consider it to be a serious wart. It is super easy to accidentally leave a comma out of a list of strings and silently end up with the wrong data. I'm glad Go requires an explicit + to concatenate string literals.

numbers = [
    "one"
    "two",
    "three",
]

deanveloper commented 5 years ago

Is ASCII space the only space we would allow?

C is another language which allows whitespace to be used to concatenate two string constants. This fact has been abused in the underhanded C competition to hide an exploit: http://www.underhanded-c.org/_page_id_22.html

That was a bit more of a fun fact than an actual argument, but it does show how using spaces can cause subtle bugs that can be hard to spot.

griesemer commented 5 years ago

@lrewega I did say that one probably could make it work... :-)

There's a difference between _ and white space: One of the first steps during tokenization is to eliminate all white space. Note that this happens on the lexical level, for terminal grammar rules, implemented in the scanner/lexer of a compiler.

That is, 3.1415 9265 will become two tokens 3.1415 and 9265. (In contrast, 3.1415_9265 continues to be recognized as a single token.)

Now, to make 3.1415 9265 "work", we need to change the rules for operands (which we could do). But those are non-terminal grammar rules, implemented in the parser of a compiler. A (Go) Operand may now accept a sequence of BasicLits (such as 3.1415 9265 or "foo" "bar", etc.). Again possible, but it becomes fragile (see below). And then we'd have to come up with rules for how to interpret such sequences. For strings it's easy, there's just concatenation. But for numbers we want scaling, etc. We'd have new kinds of errors, and so forth. And what about 0xbad f00d? How does the compiler know you meant 0xbadf00d rather than 0xbad followed by the identifier (!) f00d? What if it's written across two lines? Now we'd have to distinguish between blanks and newlines to make sense of that. And so forth. Now that's fragile.
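
The token-level difference described above can be observed with the standard go/scanner package. This is a sketch that assumes a Go toolchain in which underscore separators are already part of the language (Go 1.13 or later); `tokens` is a helper name made up for the example.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// tokens scans src with go/scanner and returns "KIND(literal)" strings,
// skipping the semicolon that the scanner inserts automatically at the
// end of the input.
func tokens(src string) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0)

	var out []string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		if tok == token.SEMICOLON {
			continue
		}
		out = append(out, fmt.Sprintf("%s(%s)", tok, lit))
	}
	return out
}

func main() {
	fmt.Println(tokens("3.1415_9265")) // [FLOAT(3.1415_9265)]
	fmt.Println(tokens("3.1415 9265")) // [FLOAT(3.1415) INT(9265)]
	fmt.Println(tokens("0xbad f00d"))  // [INT(0xbad) IDENT(f00d)]
}
```

With the underscore, the whole number reaches the parser as one token; with a space, the parser would have to reassemble two tokens, one of which (`f00d`) is not even a number.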

It becomes very quickly seriously complicated for really no good reason.

The nice thing about using _ for separation is that it can be done absolutely trivially at the lexical level, and the rest of the compiler doesn't have to see it. You're right, we do change the spec. But not all spec changes are equal. Changes to terminal grammar productions (the ones starting with a lower-case letter in the spec) mostly affect the scanner/lexer and tend to be much easier to explain and implement than changes at the higher level. Both this proposal and binary integer literals are essentially changes at this lowest lexical level. Otherwise we probably wouldn't consider them in the first place.

We're not trying to address a major shortcoming of the language here; allowing separators is a minor convenience that hopefully is used sparingly in real code. The respective implementation and documentation effort should be in line with the goal.