nim-lang / RFCs

A repository for your Nim proposals.
136 stars 23 forks source link

Deprecate 'unsafeAddr' and make 'addr' available for all addressable locations #369

Closed Araq closed 2 years ago

Araq commented 3 years ago

As the title says, this is a simple proposal: We allow 'addr' for where 'unsafeAddr' used to be required (consts, let variables, for loop variables and parameters) and deprecate 'unsafeAddr'.

Reasons

PMunch commented 3 years ago

I always thought addr was used to get an address to something that is actually allocated on the heap and can be shared with C programs or whatnot (as long as I make sure the object stays alive). While unsafeAddr where things that typically would live on the stack and which I have much less control over in terms of life-cycle and should therefore be treated with a bit more special care, hence the unsafe name.

ghost commented 3 years ago

@PMunch addr is actually used to get address of stack-allocated things for C interop quite often in Nim :P

PMunch commented 3 years ago

What's the distinction between the two then?

Araq commented 3 years ago

unsafeAddr is required for consts and parameters, addr doesn't work on these. unsafeAddr works on everything that addr works. So if you always use unsafeAddr, nobody notices and the compiler doesn't warn you.

c-blake commented 3 years ago

Since Nim identifies its unsafe subset superset mostly by a small set of keyword-like things (ptr, pointer, addr, cast, cstring, UncheckedArray, unsafeAddr), having that set be smaller makes describing this identification simpler/faster by 1/7 =~ 15%. ;-) Also the other things don't really need as much description for anyone coming from more unsafe languages like C/C++. They probably have mostly valid intuitions about them.

Correct me if I'm wrong, but besides parameters and const, the a let as well as the x in for x in yaddayadda also needs unsafeAddr. I know those are both almost the same, but let is pretty frequent in my experience. Anyway, it's a complicated shopping list to explain/remember..which is the why behind people just do addr and having the compiler tell them to switch to unsafeAddr.

Araq commented 3 years ago

You're correct and I've updated my description.

c-blake commented 3 years ago

Also, @PMunch, to possibly answer your question a bit more, the primary way in which pointers from unsafeAddr are less safe than regular addr is that once you have a pointer you can write to the memory address/edit the value. Nim pointer types have no notion of "pointer to constant" vs. "pointer to varying".

You could, for example, accidentally reverse the order of copyMem arguments (which are in "looking like assignment order dst = src" to look more like the information flow flow "src -> dst"). The compiler would have no way to know this mistake meant were writing to memory which was originally constant because a pointer is a pointer is a pointer. It becomes up to the user not the compiler to maintain the "constant-ness" contract by not changing the data.

There may be some other way in which it is "more unsafe", but I think this is the primary way unsafeAddr users take on slightly more responsibility than addr-only users.

I am not sure this analogy will help, but this is sort of why TaintedString was not so helpful and got retired. TaintedString was more about validating untrusted input, but it relates to knowing what you have to be careful about. It's something that seems to be the cause of mistakes, but the mistakes probably wind up being more subtle than just "wrote to read-only" or "didn't check enough". Even if it could help sometimes, as with unsafeAddr people just workaround the guard and so it doesn't help. They did similarly for TaintedString by converting to string and "validating at a convenient point in the code, um, later on" rather than wherever proc type signatures told them..or not validating at all still, but just discarding the guard at many inconvenient points in the code which then, once it is a habit, nullifies the value of the distinction.

I think the reality there was that validating input is so intertwined with parsing that this is where it happens, but a stdlib marking data "right at the entry point" was a bit too early since a lot of people pass on the data to a few procs before actually validating or pass it to multiple procs and don't always declare proc parameters as TaintedString consistently, etc., but all in the context of doing the checks TaintedString was asking anyway. So, here the failure was, I think, more about code organization and factoring and consistency.

With Nim most pointer code is pretty small/confined/easy to reason about (in my experience). Most people using pointers already know (maybe instinctively) to be even more careful about writes than reads. Even otherwise, were the code more "distributed" with lots of proc signatures with pointer or ptr, then like TaintedString, the origin of the passing around is where this unsafeAddr/addr distinction happens. So, by the time/in the place it could help you to know read-only origins, the visual "unsafe" tag is remote/distant anyway. I believe this is why it has very limited bug catching powers and hurts in complexity more than helps in bug-finding.

The same thing tends to happen with const in C/C++ where the pointer system does have a notion of pointers to read-only, but that can also become a mess fast in large code bases where people just cast const-ness away to get stuff done.

It's actually really tricky to provide people with escape hatches to various guard rails keeping them from driving off the road while also preventing such workarounds from becoming "bad" (or maybe not) habits that undermine the reason for even having the guard in the first place. It tends to happen in every programming language. It's all really a subset of the extremely generic security-convenience trade-off axis. Answers come down to how often certain security mistakes are made in practice, and in programming that usually means "attention". Once you are writing to pointers not reading, many programmers are already paying extra attention already. That is where you can corrupt memory not just fail on a bad read. But even if they weren't the times the distinction could most help have the distinction remote from the mistake. And maybe pointers in Nim are already rare enough to hit attention critical mass anyway.

c-blake commented 3 years ago

Also, I should have said that, for compatibility with old code, I think unsafeAddr should probably stick around as an alias for the new, less restricted addr..maybe for 2..4 years, but we have a fine deprecation facility for evolving such.

timotheecour commented 3 years ago

I'm against this RFC, even if everyone else seems to think it's a good idea. addr is unsafe, in large part because it allows stack addresss escaping their stackframe. This is actually actually a solvable problem, which I solved in https://github.com/nim-lang/Nim/pull/14976 (see extensive tests).

But unsafeAddr is strictly less safe than addr, violating type safety guarantees in more severe ways (eg const-ness or even ROM).

example 1

nim r main

when true:
  let s = "abc".cstring
  # s[0].addr[] = 'x' # CT error
  s[0].unsafeAddr[] = 'x' # RT error: SIGBUS
  echo s

example 2

nim r -b:js main

when true:
  type Foo = object
    counter: int
  proc fn(a: Foo) =
    # let b = a.addr # CT error
    let b = a.unsafeAddr # RT error below in js: TypeError: Cannot read property 'counter' of undefined
    echo b[].counter
    echo a
  var f: Foo
  fn(f)

what's worse, the RT error depends on context, eg it disappears if you comment echo a

example 3

nim r main with addr, it'll give CT error with unsafeAddr, you'll get different values at CT vs RT: CT: prints 5 RT: prints 3

what worse, this depends on Foo.sizeof; eg with -d:case2a you'd get RT: prints 5 instead

when defined case2:
  type Foo = object
    counter: int
    when defined case2a:
      data: array[100, byte]
  proc fn(a: Foo) =
    # let b = a.addr # CT error
    let b = a.unsafeAddr
    b[].counter += 2
  proc main =
    var f = Foo(counter: 3)
    fn(f)
    echo f.counter
  static: main()
  main()

In practice, I think the distinction never found a bug, you use 'addr', the compiler complains, you use 'unsafeAddr' instead, "problem solved".

I just showed 3 simple examples above where addr would give CT error and unsafeAddr would give runtime errors or inconsistent results or would depend on seemingly unrelated context; each backend is affected, in different ways: c, js VM.

unsafeAddr is a useful escape hatch when you know what you're doing (including compiler implementation details such as sizeof thresholds for pass by value vs by reference), but at least the compiler tells you you're in uncharted territory. If we started to conflate addr and unsafeAddr, you'd lose this distinction and it would make addr less safe, turning some CT errors into RT errors or worse, UB.

In summary, conflating addr and unsafeAddr doesn't fix any bugs or enable any new features, and it does make the langauge less safe, for no tangible benefit.

c-blake commented 3 years ago

No one ever said unsafeAddr was badly named in being exactly as safe as addr. Quantifying this is hard because it is NOT about examples of "how it works" as you seem above to think are somehow persuasive. Rather, you want software lifecycle examples where the complicating distinction prevented many programmers from making important mistakes instead of just reflexively changing addr to unsafeAddr, as @Araq explained in his opening comment. In short, since you aren't really even arguing from the same perspective, disagreement is unsurprising. The benefit was, again explicitly, language simplicity which it is true may be subjective not "tangible", but many/most benefits are!

It may clarify to discuss the opposite direction of this RFC which relates to this benefit. If unsafeAddr is helpful, ptrs remembering how they were created might be even more helpful, no? By default blocking writes to memory in some far removed part of the call stack from ptrs that came from unsafeAddr is "safer". This move opens up a whole "const correctness" can of worms {Nim gets a shout out in today's version! }. Do we go all the way, even past FFI boundaries for the C/C++ backends that have const pointers? How many programmers just cast away constness in const-correct languages? Should they? Maybe not. Would Nim programmers? This is the "same perspective playing field" upon which you should argue. It is admittedly quicksand.

Since it is very hard to assess all this objectively (or even systematically) people mostly argue by bald assertion. Here's one - it's already a very bad sign that long-time Nim users need to ask about the distinction. Maybe that happens because pointers are needed so rarely and keeping up that good work however/wherever & making it more rare rather than adding new ones is what matters in the end?