golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

proposal: spec: find a way to export uncased identifiers #22188

Open rsc opened 6 years ago

rsc commented 6 years ago

https://github.com/golang/go/issues/5763#issue-51284151 observes “It is very strange to use, say Z成本 or Jぶつける as identifiers.” In that issue we discussed potentially changing the default export rule, but as of https://github.com/golang/go/issues/5763#issuecomment-333669811, which seemed to have general agreement, we decided against that.

Even so we do want to find a way to export uncased identifiers, or at least consider ways, in order to address the original observation.

This issue is for discussion of non-breaking ways to export uncased identifiers.
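
As a point of reference for the discussion, here is a minimal sketch of the current rule, assuming it can be restated as "the first rune is an upper-case letter". The helper name exportedByCurrentRule is made up for illustration; the sketch cross-checks it against go/token.IsExported from the standard library.

```go
package main

import (
	"fmt"
	"go/token"
	"unicode"
	"unicode/utf8"
)

// exportedByCurrentRule is a hypothetical helper that restates today's rule:
// an identifier is exported only if its first rune is an upper-case letter.
func exportedByCurrentRule(name string) bool {
	r, _ := utf8.DecodeRuneInString(name)
	return unicode.IsUpper(r)
}

func main() {
	names := []string{"Cost", "cost", "成本", "Z成本", "ぶつける", "Jぶつける"}
	for _, name := range names {
		// token.IsExported is the standard library's purely lexical check.
		fmt.Printf("%s\tsketch=%v\tgo/token=%v\n",
			name, exportedByCurrentRule(name), token.IsExported(name))
	}
}
```

Uncased names such as 成本 and ぶつける report false, which is why the Z and J prefixes mentioned above are currently needed to export them.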

ianlancetaylor commented 6 years ago

This has been discussed before, but to note it on this issue: one possible approach would be to designate a specific Unicode character, not otherwise available for use in identifiers, to designate the identifier as exported.

For example, if we use the character $, then people would write $成本 to indicate that this identifier is exported. $ is just an example; it could instead be pretty much any other character.

This then leads to another decision point. We could treat the $ as a required part of the identifier, which then means that people will write p.$成本 to refer to the symbol from a different package. The effect is that we will see .$ everywhere; it almost becomes a token in itself. Or, we could say that the $ is only required for references within the package where the symbol is defined, and that for references from a different package the $ is implied. After all, if the $ were not there, the symbol could not be referenced from a different package anyhow. We would then have to consider the interaction with method names and interface satisfaction; there is an obvious set of rules but is it clear enough for the reader?
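
The second variant (sigil written in the defining package, implied elsewhere) can be modeled roughly as below. This is only a sketch of the naming rule, not of any real compiler behavior: the function resolve and the constant sigil are hypothetical, and $ is used only because the comment uses it as an example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// sigil is a placeholder export marker; the thread uses $ only as an example.
const sigil = "$"

// resolve models the variant where the sigil is written in the defining
// package but implied in references from other packages. It returns the
// name an importing package would write and whether the name is exported.
func resolve(declared string) (external string, exported bool) {
	if strings.HasPrefix(declared, sigil) {
		return strings.TrimPrefix(declared, sigil), true
	}
	// Without the sigil, fall back to the existing upper-case rule.
	r, _ := utf8.DecodeRuneInString(declared)
	return declared, unicode.IsUpper(r)
}

func main() {
	for _, d := range []string{"$成本", "成本", "Cost", "cost"} {
		if ext, ok := resolve(d); ok {
			fmt.Printf("%s: exported, other packages write p.%s\n", d, ext)
		} else {
			fmt.Printf("%s: not exported\n", d)
		}
	}
}
```

Under the first variant (sigil always required), resolve would simply return the declared name unchanged, so importers would write p.$成本.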

griesemer commented 6 years ago

In addition to what @ianlancetaylor said: There's also a third design choice (besides requiring the special Unicode character always for exported identifiers, or only inside the package that exports the identifier). The third choice is to only require the special Unicode character at the declaration; basically a marker indicating that the following (or perhaps preceding) identifier is exported. There's also an obvious drawback with this choice, which is that one won't be able to tell (at a use site) whether an identifier is exported or not by simply looking at it. (In the past we have eschewed this idea.)

Regarding the choice of the special Unicode character: One could choose . (period). Inside a package p that declares the exported identifier, say 成本, one would have to write .成本. Inside a package that imports package p one would continue to write p.成本. This would be pretty natural in imported packages, and fairly unobtrusive (but still visible) in the declaring package.

bcmills commented 6 years ago

If we are contemplating a change to export rules anyway, we should evaluate whether the same mechanism could be used for fields of cgo-imported structs, as described in #13467.

As far as I can tell, the constraints to address that use-case are:

  1. When used to export all of the fields of a struct type, aliases of that struct type defined in different packages must be identical.
  2. Within a package, the use of a struct field exported by this mechanism must not require a distinguished prefix.

A distinguished Unicode character only satisfies those constraints if it is not required for field references within the same package.

lijh8 commented 6 years ago

Maybe this can be solved by a project's own coding rules.

// if you don't understand the requirements, can't abstract the concepts well, and can't come up with good names

// E for Export
var E_成本1 float64
var E_成本2 float64
var E_成本3 float64

// or use a getter
func Get_成本() float64

Sometimes a coder can't come up with a good name in English; that has nothing to do with the programming language.

Variable and function names in Chinese are not used much in source code. The other parts of the source code are still not Chinese: if, for, func, return. You don't use Chinese variable or function names in C or C++ either.

Davidc2525 commented 5 years ago

If they end up using $ to declare a variable, I will hate Go.

kstenerud commented 4 years ago

Why not just make go 2.x be explicit, using keywords like public and private to specify visibility? "Clever" hacks upon existing systems not designed for that purpose (e.g. letter case) are what cause these kinds of messes in the first place. Adding further hacks on top of the broken ones just makes it worse.

To support legacy code, you could make fields of unspecified visibility use the old, broken behavior so that code still compiles and runs as expected.

ianlancetaylor commented 4 years ago

@kstenerud We believe that the fact that one can immediately tell by looking at an identifier whether it is exported or private is a feature. We don't consider it to be broken behavior.

In any case this is now so fundamental to Go code that it would not be feasible to change it at this point.

beoran commented 4 years ago

In Oberon * or - were used as export marks, so using, for example . as the export mark is not unprecedented. (https://cseweb.ucsd.edu/~wgg/CSE131B/oberon2.htm) I feel that the go package system as well as the imports and exports are inspired by Oberon, but with the advantage that upper case identifiers are exported "automatically". Still, I think it is an omission that there can't be an explicit export mark as well. Furthermore, for interoperation between Go and other languages, it might be desirable to export a lower case identifier as well.

The fact that one can immediately tell by looking at an identifier whether it is exported or private is only of limited use, since that is only so at the local package level. If the identifier comes from another package, normally, if not using dot imports, it is immediately clear that an identifier is imported, thanks to the package prefix. Therefore I think that the '.' prefix to signify an exported identifier like @griesemer proposes is the best solution for this problem.

dreamgonfly commented 3 years ago

Why not a keyword "export"? It's readable, explicit, and understandable even for newcomers. Please do not give more special characters special meanings, which are incomprehensible at first glance.

I also support @kstenerud 's opinion.

beoran commented 3 years ago

Go is partially inspired by Oberon-2, but it has a more C-like syntax. Otherwise we would write begin instead of { and end instead of }, and pointer instead of *. So special characters with special meaning are somewhat normal for Go.

mdempsky commented 3 years ago

With the Go language version in go.mod, I think one option here is we could just make uncased identifiers exported iff they're in a package that uses Go 1.17 (or whatever -- I'm going to use 1.17 for concreteness). That is, 成本 would continue being non-exported for packages using Go 1.16 or older (including being inaccessible from newer packages compiled using Go 1.17+), but would be exported from packages compiled using Go 1.17+ (but would similarly continue being inaccessible from older packages using Go 1.16 or older).

This would mean users upgrading from Go 1.16 to Go 1.17 might need to rename their identifiers to prevent them from being exported (e.g., rename 成本 to x成本). We'd probably want to give a substantial heads up to developers about this planned change.

I'm pretty sure the compilers and reflect API can handle this fine. I'm a little worried about go/token.IsExported; e.g., I see go/doc and net/rpc use it, but maybe they can be handled some other way.
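
A rough sketch of what a version-dependent check could look like, assuming the cutoff is Go 1.17 as in the comment above. Both function names are hypothetical, "uncased" is approximated here as "a letter that is neither upper, lower, nor title case", and the version string is assumed to have the form "go1.N".

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// langAtLeast reports whether version (assumed form "go1.N") is Go 1.minor
// or newer. Malformed versions are treated as older.
func langAtLeast(version string, minor int) bool {
	var major, m int
	if _, err := fmt.Sscanf(version, "go%d.%d", &major, &m); err != nil {
		return false
	}
	return major > 1 || (major == 1 && m >= minor)
}

// isExported is a hypothetical version-aware variant of the export check:
// cased identifiers keep today's rule, and uncased identifiers are exported
// only in packages using Go 1.17 or newer (the version is illustrative).
func isExported(name, goVersion string) bool {
	r, _ := utf8.DecodeRuneInString(name)
	if unicode.IsUpper(r) {
		return true
	}
	uncased := unicode.IsLetter(r) && !unicode.IsLower(r) && !unicode.IsTitle(r)
	return uncased && langAtLeast(goVersion, 17)
}

func main() {
	for _, v := range []string{"go1.16", "go1.17"} {
		for _, name := range []string{"成本", "x成本", "Cost", "cost"} {
			fmt.Printf("%s %s exported=%v\n", v, name, isExported(name, v))
		}
	}
}
```

The x成本 case shows the rename mentioned above: prefixing a lower-case letter keeps the identifier unexported under the newer rule.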

dsnet commented 3 years ago

With the Go language version in go.mod

As someone who writes lots of tooling that parses Go source files, there's a strong benefit to being able to understand what the source code for a single Go file means without needing to consult some other piece of information (i.e., the go.mod file).

For example, ast.Ident.IsExported would be broken since it has no concept of whether it is operating under pre-Go1.17 semantics or not.

crvv commented 3 years ago

That was proposed in 2013 but rejected. I think it is a good idea. But if that won't happen, I think it's better to keep the rule. https://github.com/golang/go/issues/5763#issuecomment-66081539

95% of Chinese programmers won't use Chinese variable names. For the remaining 5%, I am very sure they won't use Chinese variable names in 95% of their code. It is not worth adding a marker in the variable declaration. https://github.com/golang/go/issues/5763#issuecomment-316421809

beoran commented 3 years ago

@crvv, I can see that not many Chinese Go programmers have wanted to use Chinese function or variable names up to now, but perhaps this is like the story of the fox that tries to eat grapes from a vine but cannot reach them? If it becomes possible, then I think we are likely to see more people who will want to use Go because now they are able to use it, as is suggested by @lych77 in #5763.

Furthermore, there is also the problem of interoperating Go code with other programming languages. In some of those programming languages, the convention is that function or method names should be all lowercase. Therefore it would be convenient for that use case to allow certain non-exported identifiers to become exported, preferably by a . marker as was suggested by @griesemer.

crvv commented 3 years ago

Yes, there are some use cases where Chinese variable names are very helpful. https://github.com/golang/go/issues/5763#issuecomment-245828791 And I think it can be solved by just making uncased identifiers exported.

https://github.com/golang/go/issues/30572#issuecomment-469381004 I agree that using case to distinguish export status is a great feature of Go. If I can write var .lower_case_variable int to make it public, the great feature will be broken.

var Upper int // public
var lower int // private
var .lower int // public
var 成本 int // private
var .成本 int // public

If the great feature is broken, why not just accept #30572 ?

jimmyfrasche commented 3 years ago

So:

Export all uncased identifiers

Export sigil only used at declaration, like .成本

Export keyword

Export sigil used only in defining package

Export sigil always used, like $成本

Would anything other than lexical checks for exported-ness break in a user-facing manner?

IsExported would only break in token/ast: the versions in go/types and reflect (soon to be added #41563) would continue to be precise. Would many tools be broken irreparably by this? Would it help to deprecate token/ast IsExported well before making a change?

Could an explicit notation be limited to uncased identifiers and cased identifiers continue to use the current rules? (:+1:)

MikkelHJuul commented 2 years ago

I don't have enough knowledge on compiler / language design to tell if this is possibly a terrible solution.

An option that has not been suggested before is, "in line" with go:noinline (pun intended), to use a compiler directive (if that is the correct word for them), e.g. go:export-uncased (to make it clear that it only accepts uncased letters), or go:export with a compile-time error if it is applied to a cased identifier. It's not built into the language, but neither is it lightweight.

//go:export-uncased
func 成本() {
     //...
}

it is a means to an end, no sigils, no keywords, no change in how the language handles exporting identifiers, more typing
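
To show that such a marker would be easy for tools to detect, here is a small sketch that finds functions carrying the directive using go/parser and go/ast. The directive name go:export-uncased is the hypothetical one from this comment (no Go toolchain recognizes it), and markedFuncs is an illustrative helper.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// src is a sample file; only 成本 carries the hypothetical directive.
const src = `package p

//go:export-uncased
func 成本() {}

func 私人() {}
`

// markedFuncs returns the names of top-level functions whose doc comment
// contains the given directive line.
func markedFuncs(src, directive string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "p.go", src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, d := range f.Decls {
		fn, ok := d.(*ast.FuncDecl)
		if !ok || fn.Doc == nil {
			continue
		}
		for _, c := range fn.Doc.List {
			if strings.TrimSpace(c.Text) == directive {
				names = append(names, fn.Name.Name)
			}
		}
	}
	return names, nil
}

func main() {
	names, err := markedFuncs(src, "//go:export-uncased")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(names)
}
```

Note that this requires the declaration site; at a use site the identifier alone would still not reveal whether it is exported.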

fumin commented 11 months ago

For CJK, I suggest the following rule that respects @rsc's "opt-in" philosophy. Like upper and lower case in English, CJK contains the distinction between "繁體" (cumbersome char) and "简体" (simplified char). I propose exporting stuff only when the first char is a "繁體" cumbersome char, just like English exporting capital letters. Forcing people to use "cumbersome" letters makes sure people think twice before exporting stuff, just like the effect capital letters have in English.

Below is a concrete example:

// 導出 means export, as in https://tour.go-zh.org/basics/3 .
// On the other hand, if you write 导出, since 导出 obviously has fewer strokes, it won't be exported.
var 導出 int = 0

// 私人 is "private".
// This is obvious even to an untrained American eye, as "私" has far fewer strokes and is less dense than "導".
var 私人 int = -1

crvv commented 11 months ago

For many Chinese characters, the simplified form is the same as the traditional one, such as 私, 人 and 繁.

And in some cases you can't tell whether a character is traditional Chinese or simplified Chinese. For example, in traditional Chinese, there are two characters 郁 and 鬱. These two are both written as 郁 in simplified Chinese.
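
As an illustration of why this is hard to mechanize, the sketch below prints what Go's standard unicode package reports for a few of the characters mentioned. It assumes only the script and case tables in that package, which has no simplified/traditional property; the type caseInfo and function describe are made up for this example.

```go
package main

import (
	"fmt"
	"unicode"
)

// caseInfo collects the properties the unicode package exposes that are
// relevant here. There is no field for simplified vs. traditional because
// the package offers no such table.
type caseInfo struct {
	Han, Upper, Lower bool
}

// describe reports the script and case properties of r.
func describe(r rune) caseInfo {
	return caseInfo{
		Han:   unicode.Is(unicode.Han, r),
		Upper: unicode.IsUpper(r),
		Lower: unicode.IsLower(r),
	}
}

func main() {
	for _, r := range "導导鬱郁私" {
		fmt.Printf("%c %U %+v\n", r, r, describe(r))
	}
}
```

Every character is reported as Han with no case, so a rule based on traditional versus simplified forms would need an additional data source beyond these tables.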

bjorndm commented 11 months ago

Interestingly, lower case Roman characters are actually a simplification of upper case Roman characters, for ease of writing, so the relation is actually opposite of what is suggested here for CJK. Therefore it doesn't sound very realistic.

fumin commented 11 months ago

@bjorndm your description above

lower case Roman characters are actually a simplification of upper case Roman characters, for ease of writing

is exactly what's going on in Chinese: 简体 characters are actually a simplification of 繁體 characters, for ease of writing. Do you see the 100% resemblance between Chinese and Roman in your own statement?

@crvv It's true that the relationship between traditional and simplified is a many-to-one one. Otherwise, how would you achieve a reduction in number of characters in simplified Chinese? : ) What this proposal is about is suggesting that only characters that are exclusively cumbersome Chinese are exported. You gave a perfect example with 鬱, which is one of the most absurdly cumbersome characters that even a lay American knows that it is traditional Chinese and thus exported. On the other hand, since 郁 can be used both in simplified and traditional Chinese, it is not exported. This rule is very simple and clear.

mdempsky commented 11 months ago

As someone who writes lots of tooling that parses Go source files, there's a strong benefit to being able to understand what the source code for a single Go file means without needing to consult some other piece of information (i.e., the go.mod file).

FWIW, I think we're leaving the era where you can analyze Go source files without version information. E.g., go/types will soon provide per-file Go version information to facilitate the changing semantics of for loops.

As for go/token, we could add an API like:

// A Vocab represents the token classification rules used by a particular Go language version.
type Vocab interface {
  IsExported(name string) bool
  IsIdentifier(name string) bool
  IsKeyword(name string) bool
  Lookup(ident string) Token
}

// Go implements the token classification rules for the version of Go
// used to build the running program.
// It's the same as GoVersion(runtime.Version()).
var Go = goVocab{}

// Go1 implements the token classification rules used in Go 1.0.
var Go1 = go1Vocab{}

// GoVersion returns the classifier for the specified Go language version.
// The string must start with a prefix of the form "go%d.%d".
func GoVersion(version string) Vocab

and change the current top-level functions into deprecated wrappers for Go1.IsExported, etc.

qiulaidongfeng commented 11 months ago

If an identifier starts with _ or a lowercase character, it is considered a non-exported symbol.

If the identifier does not start with lowercase characters, the identifier is considered an exported symbol.

Can we solve this problem?

crvv commented 11 months ago

@fumin

There are several problems with this approach.

1. How to export a variable like 成本, which doesn't have different characters in traditional Chinese. Other words: 成本/成本 收入/收入 工资/工資 利润/利潤 企业/企業 利率/利率. Of course I definitely won't write 誠本, 壽入 or 麗率.

2. How to tell whether a character is traditional or simplified? Is 騬 traditional? It is the same in both traditional and simplified Chinese. But it looks like a traditional one. And in Unicode 13, there is a new code point U+31162 马乘, which is the simplified form of 騬 https://commons.wikimedia.org/wiki/File:U31162.svg Looks like the new character is a Japanese kanji.

騬 looks like public. But it may be private before Unicode 13 and may become public after Unicode 13. I can easily find other examples.

3. How to export Japanese words like ぶつける? The issue is about all uncased identifiers, not just Chinese.

abiriadev commented 9 months ago

@fumin, thank you for your suggestion. I would really like to see this issue resolved in a way that everyone can agree upon.

For CJK, I suggest the following rule that respects @rsc 's "opt-in" philosophy. Like upper and lower cases in English, CJK contains the distinction between "繁體" (cumbersome char) and "简体" (simplified char).

FWIW, you are only discussing the C of CJK. The Korean writing system does not have any concept similar to upper or lower case. E.g., as you mentioned, 简体 is a simplified version of 簡體 in Chinese. But in Korean, it's written as '간체' and there is no alternative form.

In contrast, Japanese does not use simplified Chinese characters but has Shinjitai(新字体), which serves a similar purpose to 简体, but they are not interchangeable. In fact, most Japanese people do not use 简体 at all.

In addition to these, Taiwan does not use simplified Chinese. They have only one character set.

I propose exporting stuff only when the first char is a "繁體" cumbersome char, just like English exporting capital letters. Forcing people to use "cumbersome" letters makes sure people think twice before exporting stuff, just like the effect capital letters have in English.

In my opinion, mixing simplified and cumbersome characters does not make sense, and is actually hard to type. To do this, one might have to switch IMEs very often, because many people use only one character set at a time.