nim-lang / RFCs

A repository for your Nim proposals.
137 stars 23 forks source link

Remove case insensitivity by changing the approach towards style insensitivity #547

Open IgorGRBR opened 10 months ago

IgorGRBR commented 10 months ago

Abstract

Instead of removing underscores and forcing every letter to lowercase, replace every occurrence of underscore followed by any-case letter with uppercase letter in snake_case identifiers, and keep camelCase identifiers as-is.

Motivation

Current style insensitivity algorithm can produce a few problems internally when trying to represent compound words that can be split more than one way in identifier names, as the internal all-lowercase identifier names are indistinguishable in certain scenarios (gameStop vs gamesTop, index vs inDEX, superb_owl vs super_bowl, island vs isLand, etc).

Description

I propose for identifiers to store their names with camelCase (or snake_case, or any other way to store separate words inside of a identifier), as a case of a letter or an underscore in the name clearly delimits the words used inside that name. This should resolve the issue presented above by allowing separate words to be represented properly.

Conversion between styles would preserve the words by forcing any letter preceded by an underscore to be uppercase when converting from snake_case to camelCase. When converting from camelCase to snake_case, just reverse the process.

Code Examples

No response

Backwards Compatibility

Macros will break if they generate an all-lowercase identifiers, but other than that I'm not aware of any other backwards-incompatible changes this would introduce to Nim.

ASVIEST commented 10 months ago

As for me better to add (again) style skins (but more generalized):

# possible syntax
import strutils

proc processIdent(s: string): string =
  var afterWhitespace = false
  for i in 0..<s.len - 1:
    if s[i] == '_' and s[i + 1] != '_': result.add s[i + 1].toUpperAscii
    elif not afterWhitespace: result.add s[i]
    afterWhitespace = s[i] == '_' and s[i + 1] != '_'
  if not afterWhitespace: result.add s[^1]

proc sameIdent(a, b: string): bool =
  processIdent(a) == processIdent(b)

import std/syntaxskins # it contains also default skins
const mySkin = initSyntaxSkin(sameIdent = sameIdent)
{.syntaxSkin: mySkin.}

It also solves problem with C wrappers (you can just disable case insensitivity for wrapper module and use case sensitivity wrapper from other module in case sensitivity manner) And also you still have the benefits of case insensitivity (in other modules) but it realy hard to implement especially executing pragmas during parsing (for style skins in general) ideally the compiler should also have support for jit compilation for vm

about interaction between modules:

# gl.nim
import std/syntaxskins
{.syntaxSkin: caseSensive.} # or wrapperDefault, etc...
const GL_FLOAT = 0x1406
type GLfloat* = float32

# code.nim
import gl
# we have default syntax skin
let x = GL_FLOAT # external idents used with it's sameIdent logic
proc test(x: GLfloat) = discard
# but non external identifiers have normal logic
import snake_case_module
var x = weirdName # in snake_case_module it is weird_name
IgorGRBR commented 10 months ago

Style skins seems like an interesting idea, but a bit irrelevant here as this RFC tries to address the issue of having multiple identifiers with different compound words that are treated as the same identifier, because the algorithm that converts them to an IR can't handle such cases.

If your snake_case_module contains both weird_name and weirdname, Nim will still treat it as the same. Sure, you could just add {.syntaxSkin: caseSensive.} into snake_case_module, but I think it would be better to make Nim behave properly by default.

metagn commented 10 months ago

Macros will break if they generate an all-lowercase identifiers, but other than that I'm not aware of any other backwards-incompatible changes this would introduce to Nim.

Isn't the suggestion to remove case insensitivity? That is definitely breaking to code which uses different casings.

ASVIEST commented 10 months ago

Perhaps you still need to add a rule: letters between two capital letters must be capitalized GlFloat -> GLFloat this is needed for something like this Gl_float -> GlFloat -> GLFloat

It solves @metagn cases newHtmlParser (with originaly newHtmlParser) -> newHTMLParser new_html_parser -> newHtmlParser -> newHTMLParser so new_html_parser match newHTMLParser

ASVIEST commented 10 months ago

Macros will break if they generate an all-lowercase identifiers, but other than that I'm not aware of any other backwards-incompatible changes this would introduce to Nim.

Isn't the suggestion to remove case insensitivity? That is definitely breaking to code which uses different casings.

It is my understanding that it about changing the register insensitivity rules, not removing it.

metagn commented 10 months ago

Sorry, I was wrong about newHTMLParser, the suggestion is to transform from snake case to case sensitive camel case, we already accept that newHTMLParser and newHtmlParser are different in camel case so they should also be different in snake case.

ASVIEST commented 10 months ago

Sorry, I was wrong about newHTMLParser, the suggestion is to transform from snake case to case sensitive camel case, we already accept that newHTMLParser and newHtmlParser are different in camel case so they should also be different in snake case.

newHTMLParser and newHtmlParser are the same (when stylecheck off)

metagn commented 10 months ago

With this proposal, they no longer become the same.

ASVIEST commented 10 months ago

Perhaps you still need to add a rule: letters between two capital letters must be capitalized GlFloat -> GLFloat this is needed for something like this Gl_float -> GlFloat -> GLFloat

It solves @metagn cases newHtmlParser (with originaly newHtmlParser) -> newHTMLParser new_html_parser -> newHtmlParser -> newHTMLParser so new_html_parser match newHTMLParser

this rule lets it be the same and should maintain differences between proposed cases.

ASVIEST commented 10 months ago

import std/syntaxskins {.syntaxSkin: caseSensive.} # or wrapperDefault, etc...

BTW, now I think that sameIdent shouldn't be in syntaxSkin, it should be another pragma like {.cmpIdent: sth.} (because it's not just about one module, it's about all it's private/public idents). And this is easier to implement than it seems, because the idents are compared in sem via getIdent (https://github.com/nim-lang/Nim/blob/devel/compiler/idents.nim)

IgorGRBR commented 10 months ago

I think it would be better to convert every uppercase letter in a continuous sequence of uppercase letters to lowercase, except the first and the last ones, since a single identifier can have many words and (for me, at least) its not clear if compiler should alternate between upper and lower cases or make everything between first and last uppercase letter uppercase and what effects that may cause.

With this logic newHTMLParser becomes newHtmlParser, GLFloat -> GlFloat, Gl_float, etc.

What do you think?

ASVIEST commented 10 months ago

Maybe it's better, but I don't see much difference. It's worth noting that this resolves this case anyway:

getISLandFrame -> getIsLandFrame
get_is_land_frame -> getIsLandFrame # match but shouldn't
# must be valid only:
get_isl_and_frame -> getIslAndFrame # no match but should

NOTE: for uppercase variant result is same this can happen, although not often.

# l - lowercase letter
# u - uppercase letter
# (l or u)* - zero and more matches
# (l or u)+ - one and more matches

# string with pattern:
l*   u+  l+  u+  l* # in any case it will be the same

Idealy, and must be camel case (And), but we can't detect it

For getISLAndFrame it working as expected

Araq commented 10 months ago

I thought about making Nim case sensitive but allowing for a .insensitive pragma that applies for imported symbols. But it looks unreasonably hard to implement with unknown effects for template instantiations etc.

IgorGRBR commented 10 months ago
getISLandFrame -> getIsLandFrame
get_is_land_frame -> getIsLandFrame # match but shouldn't
# must be valid only:
get_isl_and_frame -> getIslAndFrame # no match but should

NOTE: for uppercase variant result is same this can happen, although not often.

I'm not sure I understand your examples correctly. get_is_land_frame and getIsLandFrame match and should match (both in current state of Nim and with proposed changes from this RFC). I don't see why get_isl_and_frame and getIslAndFrame woudn't match either (after both RFC changes and lowercasing letters between 2 capitals).

For getISLAndFrame it working as expected

Get getISLAndFrame is problematic if you want to convert letters between 2 capitals to uppercase, because both it and getIslandFrame will be converted to getISLANDFrame.

ASVIEST commented 10 months ago

all ident matching and case insensitivity works via getIdent proc, if we add rules object for ident, we can

getISLandFrame

getISLandFrame this is a crazy example: words: get, ISL, and, Frame not words: get, IS, Land, Frame but we can’t determine this in any way, so we can just leave it like this

getISLAndFrame will not be converted to getISLANDFrame it will be converted to getIslAndFrame because we scan I - uppercase, get A and see that next is underscore, then going to next and trying scanning but yes with underscores it should be simpler.

metagn commented 10 months ago

This discussion flared up again because some people want to do this.

type Foo = object
const FOO = 123

Before this, tons of people complained about not being able to use underscores on their own, like __init__ or _member. This concern has not been addressed by any of the suggestions in this thread.

For any kind of distinction that style insensitivity in Nim removes, there are people that use that distinction in other languages, to the point that it supposedly stops them from using Nim. This has not changed since 8 years ago when I found out about Nim, and basically nothing has been done about it (I think the Foo vs foo distinction came before). Honestly the idea that the convenience style insensitivity provides is worth losing users over is outrageous.

But it's still a feature that was promised to be part of Nim, hence 100+ people voted to keep it in #456. Removing it wouldn't be a development of Nim, it would create a new language called "Nim with complete identifier sensitivity". Many people would prefer or feel better about the original Nim over this new language. On the other end, it'll only be one thing potential new users/companies won't care about anymore while they find 20 other reasons not to use Nim. It's not a great compromise.

I understand the desire here to find middle ground, but changing style insensitivity means people who are used to it will have to learn the more complex new rules, and only some of the people who are against the current rules will be accommodated while the rest will be equally placated and driven away due to the extra complexity.

Instead, we can look at what exactly it is that the opponents of style insensitivity want that isn't possible in current Nim, and see if any of it is incompatible with what the proponents want that current Nim allows. If I had to say, it would be like:

Opponents:

Proponents:

Notice the lack of overlap: the opponents don't care if we access the symbols with a different style than the definition as they're used to it just erroring, the proponents (probably) don't care if we define different symbols that would be considered the same under style insensitivity as, again, this would error in current Nim.


Consider this design (ignore the complexity for the Nim compiler for now):

Allow defining style sensitive different symbols.

let foo = 1
let fO_o = 2

Allow referring to them with their exact definition style.

echo foo # 1
echo fO_o # 2

If we refer to them with a different style, error only if more than 1 symbol matches the current style, otherwise allow the reference.

echo fOo # error: ambiguous between foo and fO_o
let bar = 3
echo bA_r # 3

For routines defined with different styles (probably can be generalized to any overloaded symbol), if a style is used that exactly matches some of the definitions, only consider those as overloads. If a style is used that doesn't match any definition: if every definition has the same style, consider every overload, otherwise error due to ambiguous style.

proc foo(x: int): int = x * 2
proc fO_o(x: int): int = x * 3

echo foo(1) # 2
echo fO_o(1) # 3
echo fOo(1) # error: ambiguous style between foo and fO_o

proc bar(x: int): int = x + 1
proc bar[T](x: T): T = x

echo bar(1) # 2
echo bA_r(1) # 2
echo bAr("abc") # 2

This part compromises on the style insensitivity camp because it doesn't let us swap the usage with any definition and have it still work in every case. But considering --styleCheck:usages exists maybe this isn't such a big issue. There is an alternative that makes less sense but treats each camp equally, on ambiguous style we can match every overload.

proc foo(x: int): int = x * 2
proc fO_o(x: int): int = x * 3

echo foo(1) # 2
echo fO_o(1) # 3
echo fOo(1) # error: both foo and fO_o match

proc bar(x: int): int = x + 1
proc bA_r[T](x: T): T = x # notice weaker match but different style

echo bar(1) # 2
echo bA_r(1) # 1
echo bAr(1) # 2

If this design is sound, then we have something that can make everyone happy, and we can iterate on it with regards to addressing complexity in the Nim compiler; or we can formulate other designs in the same way. But again, we can't compromise too much since Nim already promises style insensitivity and has for a long time.

Araq commented 10 months ago

This discussion flared up again because some people want to do this.

No benefits are known for SCREAMING constants, so this particular case is completely irrelevant for me personally.

ASVIEST commented 10 months ago

I personally don't like the fact that foo and fO_o are both compiled. But I think that idea of this RFC is quite logical because it’s strange to make gameStop == gamesTop. As for me we need case insensitivity but restricted where they have different meanings. The purpose of this RFC is to bring some clarity to case insensitivity. Yes maybe it need also aditional rules for equality foo and fO_o and Foo, FOO or make other rules in general.

BTW I've never had a case where when using pure Nim, insensitivity gets in the way. However when wrapping C libraries it happens quite often, e.g. OpenGL, v4l2, libnx..... And I think you could add a pragma to specify case sensitivity: on | off, the same rules would apply for a module and identifiers from that module imported from other modules as case sensitivity in a module with a pragma. I find it quite useful, plus after these C-api wrappers there is likely to be a high-level api for nim.

IgorGRBR commented 10 months ago

I also don't like the idea of having foo and fO_o in the same codebase, but mostly because in the context of Nim these things would be the same identifier, probably referring to the same thing in the same context, which is a huge code smell (that current Nim doesn't forbid anyway). And I do think that having style insensitivity for identifiers be allowed only if user's chosen style doesn't create any ambiguity when resolving identifier names is an interesting idea, but it's a bit out of scope of this RFC and is probably too complex for both camps to be happy with.

The purpose of this RFC is to highlight a problem with current implementation of style insensitivity. People, when using snake_case or camelCase, don't place underscores and uppercase letters randomly, but tend to use them as delimiters for words inside of a single identifier name. This information is simply lost when Nim parses these names into IR, and in some cases this can create ambiguity amongst 2 different identifiers. And notice that people are mostly discussing snake_case and camelCase when talking about style insensitivity, and not styles like SCREAMINGCASE or nocase, so I personally don't find any value in a system that allows me to do C_R_A_Z_Y, sIlLy stufflikethis, but doesn't let me have new_freedom and new_free_dom in the same context. I consider that for styles that Nim realisitcally supports, case insensitivity is an improper tool to achieve insensitivity amongst them.

If people find genuine issues with proposed changes that would break their workflow, I'd be happy to hear about those and have a discussion. But from what I can tell, these changes shouldn't affect anyone's code (that doesn't smell already) and shouldn't prohibit people from using their favorite style.

metagn commented 10 months ago

People, when using snake_case or camelCase, don't place underscores and uppercase letters randomly

No matter what algorithm we come up with some code will still break. In this sense, people do place underscores and uppercase letters randomly. There is also the problem that the simpler the algorithm is, the more code will break.

If you don't want foo and fO_o to compile when defined in the same module you can easily add a --styleCheck:definitions and tune it further so differences like foo and fo_o are allowed but not foo and fOo, or whatever arbitrary casing scheme you come up with. Again we are dealing with things that are straight up not possible with the current compiler, we can easily restrict it further later.

There is a case which compiles in current Nim:

# a.nim
proc foo(x: int): int = x * 2
# b.nim
proc fO_o(x: int): int = x * 3
import a, b

Trying to use either foo will error here due to overload ambiguity but changing one of them to x: T would cause a subtle behavior difference.

For this case, we can add another warning, for whenever we use a symbol when a symbol with another style was available.

These warnings need to be opt in though because they will complain that people who prefer style sensitivity are using style sensitivity.

IgorGRBR commented 10 months ago

If you don't want foo and fO_o to compile when defined in the same module you can easily add a --styleCheck:definitions and tune it further so differences like foo and fo_o are allowed but not foo and fOo, or whatever arbitrary casing scheme you come up with.

From what I understand, --styleCheck simply disallows ambiguous styles for identifiers, but it doesn't change how identifiers represented internally, so it doesn't solve the problem. The problem is that I do want foo and fOo to compile (like it currently does in Nim), but I want them to represent 2 different identifiers (unlike what it currently does), the same way I want foo and f_oo to compile and be 2 different things. Just to reiterate - the issue here is identifier equality, not what style I want to restrict my codebase to.

I don't see how this is not possible to do in current compiler, since it requires a change in a compiler part that is already there and is responsible for current behavior in the first place (unless compiler, for some reason, is built around the fact that all identifiers internally are lowercase (except maybe for the first letter) and is now tightly coupled to this implementation detail).

There is also the problem that the simpler the algorithm is, the more code will break.

If someone wants to use my fooBar as foo_bar, I see no issue with that, but what value do we gain from someone writing my fooBar as f_o_o_BaR? Yes, the proposed algorithm and proposed changes to it will break someone's code, if all they do is SCREAMINTHEIRCODEBASES, orpretendlikeunderscoresdontexist, or ins_ert u_ndersc_ore_s rand_oml_y w_it_h no th_ought or re_as_on. Nim currently supports these cases, but the question is, should it? I would be very surprised if out of 100+ people that wanted Nim to keep style insensitivity, there would be anyone who writes fooBar as f_o_o_BaR. If they do, then once again, I'd love to hear their thoughs and reasoning. All I see is a genuine problem with current approach: just because someone decided they want to turn my GL_BYTE into Glbyte in their codebase, I'm not allowed to have more direct C lib bindings, like GL_BYTE and GLbyte and watch out for compounds (which, admittedly, not a huge issue, but is still a non issue in any other programming language). Thus I want you to maybe consider "No, Nim should have more strict rules as to how identifier names are translated amongst the styles we support" as the answer.

Once again, if I'm missing something, and there is a genuine case where the proposed changes would break an otherwise reasonable code that doesn't adhere to either snake or camel case (or some mixture of both), then I'd like to see such an example (that isn't foobarbaz).

There is a case which compiles in current Nim:

This will also compile with proposed changes, but will not be ambiguous, because you would be able to call foo with foo, and call fO_o with fO_o, fOO or f_o_o.

ZoomRmc commented 10 months ago

the idea that the convenience style insensitivity provides is worth losing users over is outrageous.

The idea that the convenience style insensitivity is critical for deciding on using or ditching the language is hilarious. :shrug:

Araq commented 10 months ago

I'm sure that's the rule that has been proposed here, but let me spell it out to see if there is disagreement:

"Two identifiers are equal if they have separators in the same positions. A separator is either the underscore or a invisible between a case switch (lowercase followed by uppercase or vice versa). The first character is always case sensitive and other characters are not. An underscore cannot be followed by an underscore."

This implies:

foo != foO
fo_o != foo
fo_o == foO
NoSmoking = No_smoking
GlFloat == Gl_float
AbCD = Ab_cd
HtmlParser = Html_Parser
HTMLParser = HTMLP_arser # wrong, but that's life

This is pretty intuitive, keeps style intensivity's benefits and tooling can easily provide a transition path.

kuchta commented 10 months ago

@Araq What about "A separator is either the underscore or a invisible between a case switch" (when lowercase followed by uppercase or one left if uppercase followed by lowercase). IMHO HTMLParser is more common style then HTMLparser so that

HtmlParser == Html_Parser
HTMLParser == HTML_Parser
metagn commented 10 months ago

From what I understand, --styleCheck simply disallows ambiguous styles for identifiers

I am suggesting a new option entirely that would go with my proposal, it would just go under the styleCheck umbrella.

it doesn't change how identifiers represented internally, so it doesn't solve the problem.

Yes, internally all identifiers are stored in a style insensitive way. You are suggesting to store them under a new scheme (no?), I am saying changing the scheme is going to cause more pain.

The problem is that I do want foo and fOo to compile ... I want them to represent 2 different identifiers ... I want foo and f_oo to compile and be 2 different things

I don't want to indulge misunderstandings further here. The compiler, on a technical level, should be able to respect any different strings. The same way the compiler would allow foo and fOo or foo and f_oo to coexist, it shouldn't suddenly complain when fooBar and foo_bar coexist; we should be able to let the user decide which pairs make sense (ideally). We can do this in the frontend without changing the semantics of the language, which is why I mentioned the possibility of an option under the styleCheck umbrella. We can 1. let the user know when they defined the same thing in camel case and snake case (who would do this?), this is what I meant by --styleCheck:definitions, again we can let the user specify that for example only ambiguity between snake case and camel case is bad, with some option name like --styleCheck:definitions:snakeCamelCase; 2. let the user know when they wrote something that matches the style of one identifier but could pass for another under style insensitivity, maybe under an option name like --styleCheck:alternatives. Sorry that I rambled here, I just wanted to clear up any misunderstandings, feel free to ignore if there isn't anything to discuss

I don't see how this is not possible to do in current compiler, since it requires a change in a compiler part that is already there

I meant that you cannot define both foo and fO_o using the current compiler, "the current language" might have been more appropriate.

If someone wants to use my fooBar as foo_bar, I see no issue with that, but what value do we gain from someone writing my fooBar as f_o_o_BaR?

I don't mean to be rude, but this is not an argument for your suggestion, it's a defense under the hypothetical argument "we should allow f_o_o_BaR". The point isn't that f_o_o_BaR makes sense, it's that what makes sense was decided years or decades ago, we would just be continuing the cycle by trying to attack f_o_o_BaR. Stuff like this has plagued discussion on this issue for so many years (haha Nim allows f_o_o_BaR, haha Python errors at runtime when you don't capitalize one letter).

just because someone decided they want to turn my GL_BYTE into Glbyte in their codebase, I'm not allowed to have more direct C lib bindings, like GL_BYTE and GLbyte

I mean, this is the user's fault, there's nothing honest about changing GL_BYTE to Glbyte.


The idea that ... style insensitivity is critical for deciding on using or ditching the language is hilarious.

Yeah but are we really in a position to bargain against people who think this? We are only so influential.


A separator is either the underscore or a invisible between a case switch

I did not interpret the proposal like this at all, it's not what "convert snake case to camel case" usually means. The comment:

I think it would be better to convert every uppercase letter in a continuous sequence of uppercase letters to lowercase, except the first and the last ones

Implies that HTMLParser becomes HtmlParser = Html_parser. What I thought the proposal was initially suggesting and what each algorithm that googling "convert camel case to snake case" gives does is interpret HTMLParser as H_t_m_l_parser.

It's interesting though, and I'm not saying these conversion schemes don't make sense (if this was not clear) but I can't say that changing to a new scheme is a path we should be pursuing. To repeat, changing the scheme would:

If we are at the point that these are acceptable compromises then we can go nuts but I still have doubts.


Final note: the majority use of style insensitivity is not the simultaneous use of camel case and snake case, most people adhere to NEP1 which says to use camel case. The majority use is cases like GC_ref or openArray or builtin where it's not clear in what configuration to separate the words.

Araq commented 10 months ago

probably not make people who hate style insensitivity feel better to a significant degree

I agree but that's a (good!) argument against any proposal like this altogether. For now I'm interested in exploring the proposal, we can always reject it later. Of course, you can also say "waste of time" already.

Zectbumo commented 10 months ago

I like where this is going. Does this proposal handle all uppercase identifiers like this? NOW_HERE = NowHere NOWHERE = Nowhere I'm happy to see that NOW_HERE != NOWHERE

Araq commented 10 months ago

@kuchta > What about "A separator is either the underscore or a invisible between a case switch" (when lowercase followed by uppercase or one left if uppercase followed by lowercase).

Is that the same as "separator between uppercase letters AB if B is followed by a lowercase letter?"

IgorGRBR commented 10 months ago

I'm sure that's the rule that has been proposed here, but let me spell it out to see if there is disagreement

The ruleset is almost correct. Its just that a separator is either an underscore or a case switch from lowercase to uppercase (and not vice-versa). Also, like @metagn pointed out, a continuous sequence of uppercase letters gets converted into a separate word, except for the last character in such sequence, so people can use newHTMLParser and newHtmlParser interchangeably (but not newHTMLparser, that would be equivalent to newHtmLparser!).

Zectbumo commented 10 months ago

So AF_INET would be AfIneT?

IgorGRBR commented 10 months ago

Perhaps there should be an exception would be made for the last character in sequence if that last character happens to be last character in identifier.

So AF_INET would be AfInet.

TTSKarlsson commented 10 months ago

Maybe we could let the user decide what names the variables have, instead of having each variable name have x^n number of name permutations? It would make it easier for everyone, no inconsistencies. More power to the user.

IgorGRBR commented 9 months ago

Maybe we could let the user decide what names the variables have, instead of having each variable name have x^n number of name permutations? It would make it easier for everyone, no inconsistencies. More power to the user.

Sorry, didn't notice your comment earlier. Can you elaborate more on this? People here have already suggested using "style skins", and we already discussed their issues.

TTSKarlsson commented 9 months ago

Sorry, didn't notice your comment earlier. Can you elaborate more on this? People here have already suggested using "style skins", and we already discussed their issues.

I should probably read up on these style skins first, maybe they solve the issue already.