morganstanley / hobbes

A language and an embedded JIT compiler
http://hobbes.readthedocs.io/
Apache License 2.0
1.16k stars 105 forks source link

Disallow redefine names in the same scope of do/let #441

Open mo-xiaoming opened 1 year ago

mo-xiaoming commented 1 year ago

for the moment, following code compiles

do {
  a = 3;
  a = 42L;
  print(a); // outputs 42L
}

This behavior is weird for a functional programming lang

(edited) and for let expression

let
  a = 3;
  a = 42L;
in
  print(a); // outputs 42L

This should be allowed neither

mo-xiaoming commented 1 year ago

There are other possibilities for it to go wrong

let
  a = (2, 3);
  (x, a) = a; // rebind a
in
  print(a); // outputs 3

and

let
  a = {x=3, y=4};
  {x=p, y=a} = a; // rebind a
in
  print(a); // outputs 4

Being a Haskell-alike lang, those behaviors should not be allowed.

But fixing them takes more work

smunix commented 1 year ago

Let bindings are allowed to shadow variables in the most functional of all FPs: Selected-Region 2022-10-06 11:36:52 However, there's a subtle (known) bug in pattern bindings that you may want to investigate further

mo-xiaoming commented 1 year ago

But in the examples I gave, they are in the same let clause let a = 3; a = 4 ... not nested ones like let a = 3 in let a = 4.... The latter is clearly name shadowing (keyword let and different scopes), the former one is rather confusing to say the least

smunix commented 1 year ago

Don't let do notations or let block déclarations fool you, they're eventually rewritten as a series of nested let expressions.

do { a = 1; a = 2; print(a); } is morally equivalent to let a = 1 in let a = 2 in let _ = print(a) in ()

Note, the variable _ can be the subject of multiple let bindings: do { print(2); print (3); } becomes let _ = print(2) in let _ = print(3) in (). There's nothing wrong in this

mo-xiaoming commented 1 year ago

Don't let do notations or let block déclarations fool you That's my beef with it, it fooled more than me

I was aware of this rewriting to nested let behavior, after going through yacc file. However, we cannot and should not expect other people to understand this by reading parser code

do { a = 3; a = 4; } looks like a reassign or rebind, and it turns out to be neither.

Based on the principle of least astonishment, we have some options,

  1. If this behavior is designed on purpose and well defined, not just a side affect, then at least we should document the design decision well somewhere, and keep fingers crossed that people will find it and read it
  2. Get the compiler panic on this, because it might do what the programmer wants it to do, but the reason could be wrong. Because normally it is a logic error somewhere in his/her code
smunix commented 1 year ago

Hobbes has no notion of rebind, so I'm not sure where you're pulling this from. The doc only mentions local variable introductions, which is what is happening here. My guess is, those getting fooled by this notation are expecting the semantics of the language to be the same as that of C++. This is wrong, and people should read the doc of the language instead of making assumptions on how it's supposed to work.

Does the following also look illegitimate to you? do { a = 1; (do { a = 2; print a;}); print a; }? It's no different that than the notations you're trying to block in the above. It is perfectly legit to have different environments reusing the same name. They are in different environments after all. In imperative languages, they call these blocks, and use { ... } to delimit them. In Hobbes, we'd use ( ... ) as shown above ^. This is done in this way, because there are no statements in hobbes, only expressions that evaluate to values (same as in Haskell, Rust, ...).

Also, note that your current change addresses this issue only from the perspective of the parser. There are more ways (especially using the C++ bindings) one can build let bindings and come with local definitions that'd bind the same name in nested environments as in the above. RPC can also be used to build these very same let bindings. In a sense, this change doesn't work.

One more thing, the title of the ticket implies that there's a notion of "the same scope on do or let" in Hobbes. This is not true and it is misleading as, otherwise, these scopes would be somehow destructive. Each introduction of a new binding creates a totally new scope, different from the previous one. It's described in the doc in the section "local variable definitions". It may be that the doc isn't clear enough, so it should be improved (do notations aren't documented, so they need to be added).

I did mention earlier a more subtle (serious bug) issue in hobbes with the handling of pattern bindings, I actually did open a few tickets, here's one of them:

The pattern matching engine is a critical piece of Hobbes, and the above 2 issues are the closest to what you're trying to fix here. I believe that, fixing them will also address the let (x,a) = a... issue you pointed at earlier

mo-xiaoming commented 1 year ago

Hobbes has no notion of rebind, so I'm not sure where you're pulling this from. That's the problem, in C++/Java/Rust/Haskell, a {} defines a scope, { a= x; a = y; } is a rebind/redefine/reassign/re..., doesn't matter language itself allows it or not. I put this in title because it looks like name rebind in the same scope (I was aware of nested scopes get created at the time, but I still chose this very title because that's what it looks like)

But how could a hobbes programmer expect names in a simple {} creates nested blocks? Hobbes has different rules, names in {} creates nested scope, not rebinding/re..., doesn't matter how much it looks like, that's creates surprises, surprises are bad unless for a really really good reason. in C++/C/Java, same name cannot be reused in {}, but it is ok in { { } }, Haskell is no exception. (Rust allows its, but hobbes is not rust-alike)

If there's a good reason to be so, then it should be better documented. If not, then this behavior should be changed

mo-xiaoming commented 1 year ago

As for the Local Variable section in doc. It uses an let example, we understand whatever names defines in let are only valid till the end of let .. in expression, because it has the same behavior as Haskell

However, the problem we're talking about is more like this, still in REPL

> a = 3
> a = 4
stdin:1,5-5: Variable already defined: a

I know REPL works differently, but people are expecting "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably a duck"

How do we expect others to spend time on understanding this subtlety for a simple =?

smunix commented 1 year ago

Hobbes has no notion of rebind, so I'm not sure where you're pulling this from. That's the problem, in C++/Java/Rust/Haskell, a {} defines a scope, { a= x; a = y; } is a rebind/redefine/reassign/re..., doesn't matter language itself allows it or not. I put this in title because it looks like name rebind in the same scope (I was aware of nested scopes get created at the time, but I still chose this very title because that's what it looks like)

But how could a hobbes programmer expect names in a simple {} creates nested blocks? Hobbes has different rules, names in {} creates nested scope, not rebinding/re..., doesn't matter how much it looks like, that's creates surprises, surprises are bad unless for a really really good reason. in C++/C/Java, same name cannot be reused in {}, but it is ok in { { } }, Haskell is no exception. (Rust allows its, but hobbes is not rust-alike)

If there's a good reason to be so, then it should be better documented. If not, then this behavior should be changed

What you say about Haskell isn't true! {...} in haskell delimits a syntactic block. Curly braces introduce no new variable bindings at all, let alone, they don't even define a scope in Haskell. So I'm unsure what makes you say "Haskell is no exception".

Hobbes is inspired from Haskell, not C, not Java, and not C++. Again, it's important to read documentation, and not make assumption that everything that uses curly braces quacks, then it's [like] C++.

smunix commented 1 year ago

As for the Local Variable section in doc. It uses an let example, we understand whatever names defines in let are only valid till the end of let .. in expression, because it has the same behavior as Haskell

Hopefully you do! And if so, then it should be easy to explain what happens in the REPL. When you let a = 1 in a + 1, you are introducing a new variable a and binding it to the value 1. At the end of the let expression, you can't use a as a replacement for 1 anymore. In a sense, the scope has starting point and an ending point with let expressions.

When in the REPL, what would that scope be? When you first write a = 3 in the repl, you are really saying "From now on, a is 3". In Hobbes REPL, this is true for the rest of your REPL session (that is, until you shut it down). Haskell GHCi allows rebinding a to a different value, but Hobbes doesn't. Two design choices that work, and I can understand both sides.

However, the problem we're talking about is more like this, still in REPL

> a = 3
> a = 4
stdin:1,5-5: Variable already defined: a

I know REPL works differently, but people are expecting "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably a duck"

How do we expect others to spend time on understanding this subtlety for a simple =?

Document it in simpler words that = is not assignment at all. It's equivalent to a proposition. They should read a = 1 as variable a binds to the value 1, or a is 1. Saying a is 1 and a is 2 is absurd, at least in propositional logic.

If you want to assign a new value, then you have the <- operator. The following should then work (if you disable checks for unsafe calls):

a = newPrim() :: int 
a <- 2
a <- 3
mo-xiaoming commented 1 year ago
foo = \_ -> let a = 3
                a = 4
            in a + a

reports

<source>:1:17: error:
    Conflicting definitions for `a'
    Bound at: <source>:1:17
              <source>:2:17
  |
1 | foo = \_ -> let a = 3
  |                 ^^^^^...

in Haskell. Even for people have c++/c/java background expect this behavior

In hobbes,

foo = \_.let a = 3;
             a = 4;
         in a + a

foo() gives 8

I might argue there are some similarities in those code snippets, but surprisingly differently behaviors.

I'm not saying hobbes IS haskell, despite multiple places in documentation mentions haskell. I'm saying hobbes should have the same behavior as haskell

But at least, we should limits language surprises as much as possible. Just like you said

Saying a is 1 and a is 2 is absurd, at least in propositional logic.

Then at least, we should let compiler panic and tell programmer his/her code probably has logic error, error message can be something like

name conflicts, rebinding not allowed in hobbes
1. if you want to reassign, please use <- instead of =
2. otherwise, please use different names to avoid confusion
smunix commented 1 year ago

The following works in Haskell, and that's the simplest behavior for let that Hobbes is also implementing

ghci> foo = \_ -> let a = 1 in let a = 4 in a + a
ghci> foo ()
8

Like I said earlier, Hobbes rewrite all let expressions in the above form, and there's nothing wrong with this. Haskell doesn't always do that, as it also supports let constructs for various cases (monadic, comprehensions, let blocks, ... all of them are different things in Haskell).

There are various ways to implement let expressions, and some of them are even used in monadic operations. Hobbes obviously doesn't support all of them (we don't have real monads there); it's a matter of choice:

Then at least, we should let compiler panic and tell programmer his/her code probably has logic error, error message can be something like

I believe you didn't follow my previous explanation; that's exactly what the compiler complained about when you tried it in the REPL:

> a = 3
> a = 4
stdin:1,5-5: Variable already defined: a

It's a matter of scope definition, that's the point I was making earlier. In a REPL, that scope is the whole session. In a compiled program, that scope is delimited by let expressions, so there's no possible conflict.

I'm not saying hobbes IS haskell, despite multiple places in documentation mentions haskell. I'm saying hobbes should have the same behavior as haskell

Haskell allows the following, which Hobbes doesn't allow. Your whole point is that we shouldn't allow binding variable names 2x in let expression; and Haskell does allow this in certain cases. And you want Hobbes to have the same behavior as Haskell, which is to allow binding variable names to new values as in Haskell.

ghci> let a = 1
ghci> let a = 4
ghci> a + a
8

The above is plain Haskell. I don't follow you anymore.

smunix commented 1 year ago

If you are so concerned about the possibility for Java developers to get confused about the above, then the issue is that they don't understand the rewrite rules around let blocks. Remove that rewrite rule around syntactic sugars such as do and series of let. Then let (pun unintended) these devs write ground expressions with the simpler syntax let x = value in expr. No more do blocks, no more let blocks in Hobbes. I have to warn you, these syntactic sugars are there for a reason: that is to increase expressiveness for Hobbes developers.

mo-xiaoming commented 1 year ago

Let me set the record straight on what I've been talking about

  1. I've never said we should stop rewriting code from let a =3; a =4;... to let a0=3; let a1=4... internally
  2. What I said is, this rewriting has corner cases confused programmers, doesn't matter they got this kind of code by accidently copy-past, forgot what's going on in the beginning of this block, or even genuinely believe this code is legit
  3. The two examples I gave in my previous post, they are not from REPL. Like I mentioned above, I do know hobbes REPL behaves differently , so it should not be used for comparison
  4. Normal programmers do not nor should not care about the complier design internals for a simple =

Everything boils down to, in your words, a=3; a=4 is "absurd", but compiler silently accept that code, is this a fact?

My whole point from my first post is simple: Should the compiler be more helpful that it should tell programmers that theirs code makes no sense?

smunix commented 1 year ago

Let me set the record straight on what I've been talking about

  1. I've never said we should stop rewriting code from let a =3; a =4;... to let a0=3; let a1=4... internally
  2. What I said is, this rewriting has corner cases confused programmers, doesn't matter they got this kind of code by accidently copy-past, forgot what's going on in the beginning of this block, or even genuinely believe this code is legit
  3. The two examples I gave in my previous post, they are not from REPL. Like I mentioned above, I do know hobbes REPL behaves differently , so it should not be used for comparison

Why not? the principles of let evaluations stay the same if you understand they always evaluate in a context of variable assignments. These principles are the same whether you're in the REPL or in a compiled program. I strongly recommend you read about propositional logic and their evaluation judgments if you haven't already, because this is no different.

  1. Normal programmers do not nor should not care about the complier design internals for a simple =

This is highly subjective. What's a normal programmer? A Java developer? A C++ dev? or one who reads about the semantics of a symbol such as = before applying it?

If that's not clear,then it is important you document/remind to your users the semantics of = and clarify that it's not the same as the one in C++ or Java (with both having changing semantics for =. There, it's either declaration or assignment).

In Hobbes,

In C++,

It's okay to be different. Hobbes is not a C++ replacement language, and it serves different purposes.

Everything boils down to, in your words, a=3; a=4 is "absurd", but compiler silently accept that code, is this a fact?

You are ignoring the context (the scope), so this isn't 100% correct. This is absurd if you consider a single scope (as is done in the REPL), but it is not absurd if each binding introduces different scope (as that's the case in let evaluation for compiled programs).

One can't make a statement, show it's True for a single instance, then consider it a proof of it being universally True like you do here.

My whole point from my first post is simple: Should the compiler be more helpful that it should tell programmers that theirs code makes no sense?

Good point, but remember: the example you are flagging as being universally absurd is no different than do { a = 3; (do { a = 4; print a;}); print a; }. Would you call this absurd too?

How about the following? Is this absurd to you as well?

  mo.C ✕  +
   1 auto main () -> int {
   1 │ auto a = 3;
   2 │ │    a = 4;
   3 │ return a+a;
   4 }
mo-xiaoming commented 1 year ago

I'm surprised to see your response to the word "absurd", that makes me realized that I might misunderstand what you've said

It's equivalent to a proposition. They should read a = 1 as variable a binds to the value 1, or a is 1. Saying a is 1 and a is 2 is absurd, at least in propositional logic.

say, in hobbes code, when you see something like

let
  ...
  a = ...
  ...
  a = ...
in ...

Current compiler is happy to compile this code, but are there any better(correct?) ways to do whatever the programmer wants it do?

From your previous post, I thought your answer is Yes, there are better ways to fulfill his/her purpose

Seems I got your answer wrong. So it really depends on your answer to this, then we might have a totally different discussion

My standing on this PR is simple, it is all about ergonomics

  1. Unlike some intermediate to advanced features, we might have to understand how compilers implement them and why they implement it in this or that way. = supports to be simple, IMO, it is unreasonable to expect programmers to read through papers and thesis before pressing the = key
  2. We're human, we make mistakes. If I see code like a=...;a=...; in hobbes, I believe there might be better/clearer ways to do whatever the author of this code wants to do. Being fit in theory is less important. Let the compiler to be programmers' friend is more important

Let's say we have a c++ code like this

struct X {
  T foo() const;
};

Any C++ compilers will be happy to compile this code, because it is legit C++. but clang-tidy still warns me that it might be a good idea to add [[nodiscard]], because it is a pure function, without using its return value, I probably just heat up cpus for no reason, I either forget to remove some dead code, or there are logic errors buried deep in my code. And C++ is infamous for lacking of proper tools as you're well aware

We write bugs with perfect grammar, tools can help

I couldn't figure out what kind of functionalities in reality can only be achieved by a=...;a=..., so I consider programmers are making mistakes when they write code like this

smunix commented 1 year ago

What you want is a way for the compiler to check for potential name shadowing and flag them for you. You don't want to prevent advanced users from writing these expressions. You want a way to lint your programs, so the compiler can warn you on potential name shadowing issues. Make it an option, then choose a default that works for the largest audience.

Guess what? We do have support for Optional flags and the ability to extend the compiler with new flags. @dawa79 (Dan Wang) can walk you through how he added the -o no-Safe option extension to Hobbes, and you could do something similar.

There's a macro expansion stage in the compiler pipeline, where you can run your checks. Even better, you could run your linter at the same place as where Dan wrote the checks for unsafe expressions.

Don't ever change the parser though, for no reasons. You hardly ever need to change the parser, except for compelling cases. See the parser as the contract Hobbes has with its users. You never want to break this contract.

I couldn't figure out what kind of functionalities in reality can only be achieved by a=...;a=..., so I consider programmers are making mistakes when they write code like this

I'm lazy and find it hard to invent new variable names each time I need one. Seriously, I don't think variable names are an infinite resource; they are not. The compiler is the buddy that figures names out for me, so I can focus on engineering solutions for harder problems.

Isn't that the whole point about Lambda calculus? Names don't really matter... But I'm digressing, I'll stop it here now.