nim-lang / RFCs

A repository for your Nim proposals.
135 stars 26 forks source link

Introduce `defects` / `forbidDefects` pragma to enable tracking defects #493

Open elcritch opened 1 year ago

elcritch commented 1 year ago

Abstract

Introduce a defects pragma to track defects similar to tracking exceptions via the raises pragma. This would include adding a parallel pragma to forbids like forbidDefects. Alternatively forbids may be able to extended to support both defects and exceptions since it's a mechanism for filtering and not declaration.

Goals would be to provide tracking of defects. It does include making the standard library consistent, though this would be useful tooling for that goal. It would also give end users more control over what error mechanisms they want to allow or to ban.

Motivation

It's somewhat confusing dealing with defects vs raises as an "end user". I didn't really know about defects for many months after learning and using Nim and thought raises:[] would cover all error cases. Little did I know! ;)

On the one hand raises:[] and forbids:[X] are helpful but can lead to missing important defects like IndexDefect when using a library. The standard library is a bit inconsistent between when exceptions and defects are used. I believe that this is partly due to not having the ability to usefully track the difference.

Prior to forbids:[XYZ] effects tracking weren't particularly useful for forbidding only a subset of effects. While forbids was a small change it made tracking and utilizing effects substantially more useful. Panics can be implemented using exceptions or not, so it makes since to keep them separate from raises, but now they don't benefit from this change.

Adding a defects: [] and forbidDefects: []/forbids:[defect] would enable more precise tracking. This would enable useful discussions on whether an exception or defect makes sense, while enabling panics:on and panics:off camps to do what they will with them.

Finally adding a defects effects pragma would – I believe – be a step toward making the standard library and other libraries more consistent.

Description

The idea of a defects: [] has been discussed before. However, this was before forbids:[XYZ] which I believe enhances the utility of both raises: [] and a possible defects: [].

The previous discussion in issue 180 propsed an alternative exceptions hierarchy. This seemed too complicated and would require a larger re-work with probably breaking changes. This proposal is more about refining the system as it exists to make things more consistent rather than replacing it.

One benefit of tracking defects and forbidDefects would be enabling users to know what error cases a library uses or can exhibit. One use case could be to avoid libraries that provide certain defects.

To be clear: this proposal isn't about fixing the defects/exceptions split or the standard library but providing basic tooling and annotations to help with those decisions. Ideally this could then be adopted in the standard library to make it more consistent.

There's lots of discussions on this topic. Here's a few I found:

Code Examples

Very short example to show a possible use cases. These aren't too well thought out but hopefully show some use cases:

type JsonLib*[T] = object
  data: cstring
  ...

proc get*[K, V](obj: JsonLib, key: K): Result[V] {.defects: [IndexDefect], raises: [].} =
  # ensure which defects are possible
  ...

proc`[]`*[K, V](obj: JsonLib, key: K): V {.forbidDefects: [IndexDefect], raises: [KeyError].} =
  # ensure that the index defect is handled and transformed into normal user exceptions
  ...

Alternatively forbids may be able to be extended to support both defects and exceptions while not causing conflicts between them:


proc get*[K, V](obj: JsonLib, key: K): Result[V] {.defects: [IndexDefect], raises: [].} =
  ...

proc`[]`*[K, V](obj: JsonLib, key: K): V {.forbids: [IndexDefect], raises: [KeyError].} =
  ...

Backwards Compatibility

As far as I know this wouldn't affect backwards compatibility.

Araq commented 1 year ago

We already have --panics:on though that turn Defects into panics. The problem that I prefer to solve is that Defects are not distinguished properly:

From the optimizer's perspective a panic that doesn't have to run destructors is highly preferred. And it's not just about performance either, interruptible panics are also very costly to implement.

beef331 commented 1 year ago

One of the biggest hidden defects is the RangeDefect. Might it make sense to do the cstring treatment and making them a warning on implicit conversions? Cause right now they're in a state that makes them very hard to use with confidence, there are no indications that you're possibly causing bugs.

elcritch commented 1 year ago

We already have --panics:on though that turn Defects into panics. The problem that I prefer to solve is that Defects are not distinguished properly:

You're right that for many users --panics:off would be useful as you outline. Though all my code uses --panics:off so that I can try and recover from as many situations and to report it. For my devices reporting any errors including all the ones you listed provides a lot of benefits as the devices can be installed in remote locations where I can't readily hookup a serial port to debug. :)

I know this use case is somewhat rare, but in my mind it's an important part of a "real" systems language. Though of course if you ask 8 programmers what "systems language" is you'll get 12.5 answers.

My position on panics is largely what Torvalds discussed in a linux mailing list due to the sorta problems I work on. Actually I've actually wanting to write a Linux driver in Nim – mostly for fun. Panics would be unacceptable in that scenario. This is part of why I dislike Rust programming -- handling all error cases with Result type means there's lots of panics and I never know which code will just kill the world.

However, that's a bit off topic of the PR. I believe that adding a defects:[] pragma would enable useful tooling now for both panics:off or panics:on cases. Essentially a defects:[] pragma would be a way to track any possible defects and ensure they're handled in some fashion. It doesn't seem like it'd get in the way of future changes.

For example, I could envision cases of using forbidDefects:[IndexDefect] to enforce that DrNim or similar is preventing such defects, etc. It can help focus developer time on certain key sections where handling any panics would be worthwhile. Say the core of the memory allocation logic where there's some casting between signed / unsigned ints. It'd be nice to have ways to leverage the compiler in that sorta code.

A array out of bounds access is a "bug" and bugs should be allowed to panic and it's reasonably easy to review the code and ensure these don't happen. This can be improved dramatically with static bounds checking that is provably safe. In other words, we can embrace DrNim's .requires and .ensures annotations and solve the problem at compile-time, eventually.

It would be awesome to prove index bounds safe. However, I'm skeptical we could do so in every case (inject some hand wavy pointing to the halting problem). IMHO there will always be some need for runtime bounds checking.

A bit further out there, but flips from cosmic rays are not only possible, but statistically guaranteed (see some stats). Obviously this isn't possible to be solved generally, but rather given that array indexing is such a key action, it seems prudent to be able to check it even it shouldn't be possible to incorrectly index something.

Integer overflows: These are very nasty and IMO it's dishonest to classify them as "bugs" in the sense that the programmer should have prevented them from happening in the first place. That's just too hard to do in practice.

Good points, though see above I don't trust even verified code to not run into these errors.

There's probably lots to learn from Ada/SPARK, which seems to still use a mix of compile and runtime checking:

Out of memory. Beyond my expertise but the best thing I can come up with is to map it to an abort mechanism. The OS/watchdog can restart.

What's an OS? ;) Yes you can reboot with a watchdog, but how to you record and log error/defects/etc if the world just dies?

Again I refer to Linus's reasoning in the LKML link. Ideally OOM should be treated in some cases as just another possible failure. You get a too large JSON to parse due to OOM, post it's often better to return an error and not kill the kernel, device, etc.

elcritch commented 1 year ago

@beef331 the RangeDefect would be a good use cases. It's one that'd be nice to check, as converting data from outside sources could result in a RangeDefect. With forbids:[] currently one could miss handling it.

beef331 commented 1 year ago

currently one could miss handling it

That's an understatement, I'd wager any library that uses subrange types has a range defect bug somewhere.

Araq commented 1 year ago

Conversions between subranges that are not value preserving at compile-time are explicit conversions. What's there to do about them, they are explicit.

beef331 commented 1 year ago
import std/times
echo parse("61", "mm")

Is the most poignant example I can provide that demonstrates the present issue with subrange types. To use the times.parse procedure one has to parse the time so they know the time is in the range of the time to not cause a defect. This bug is caused by lack of indicating that there is an issue since non range types convert to their range counterpart and there is no indication you need to program defensively.

Araq commented 1 year ago

That only demonstrates a problem with times.nim unless there is no explicit conversion code in times.nim.

beef331 commented 1 year ago

The conversion from base to subrange is implicit through the procedure call https://github.com/nim-lang/Nim/blob/version-1-6/lib/pure/times.nim#L2001 Which calls https://github.com/nim-lang/Nim/blob/version-1-6/lib/pure/times.nim#L1336-L1339

There is no visible explicit conversion. It's all done in procedure dispatch. Which I would like to believe hid the bug that the code presently does not ensure the parsed values are within range.

metagn commented 1 year ago

Does forbids even work for exceptions?

Also, I am uninformed about this discussion but if defects are so different from catchable exceptions (which you don't "catch", you "except") in their intended behavior but they still have to be exception types, why bother separating them in the type system and not how they're called, i.e. raise Exception() vs defect Exception(). That way you don't have to debate "oh are all overflow/range errors bugs or are some catchable", you can decide case by case.

Araq commented 1 year ago

There is no visible explicit conversion.

Bummer. That's bad. :-)

Araq commented 1 year ago

What's an OS? ;) Yes you can reboot with a watchdog, but how to you record and log error/defects/etc if the world just dies?

I don't know, but I assume that one can use some fixed size preallocated memory for communication with the watchdog.

Again I refer to Linus's reasoning in the LKML link.

Yes, I know Linus's "reasoning". But it makes little sense since in reality every function call can trigger an OOM just by the sake of being a call which needs stack space. And the stack size is 4KB or 8KB for the kernel.

Mapping both heap OOM and stack OOM to an exception/defect that can be turned into ENOMEM via try: logic() except OOM: errno = ENOMEM would be the real solution here yet this proposal is yet again more sophistry about how to do the impossible at compile-time. ;-)

We should embrace exceptions and forget about the idea of static checking runtime errors that are about arbitrary run-time resource constraints. But this would be against the current zeitgeist with .raises and .noMoreRaises and .reallyNoRaises and .raises the same set of exception that the callback raises.

elcritch commented 1 year ago

I don't know, but I assume that one can use some fixed size preallocated memory for communication with the watchdog.

It depends a lot of the device. At the simplest the watchdog is literally just a GPIO pin connected to a timer. Though you could leave a bit of memory for logging, though sometimes the watchdog will reset the whole power system and you'll loose the ram too. Generally you may be fine with loosing your networking as it restarts, but not a thread controlling hardware.

Hmmm, generally speaking I guess if a panic is limited to a thread that'd handle my concerns -- but I think panics kill a whole process? I actually sorta dig that concept...

Would changing panics to "thread-local" be possible or even help with the codegen optimization issues? If so then building on the task interface there as the boundary layer would be interesting. Nim on some of the RTOS'es might effectively already work this way since there's no signal handlers.

It reminds me of Erlang/Elixir's style of treating each thread (actor) as a separate process with exceptions but also "panics" of sorts. It requires architecting code differently but makes the overall system more predictable. If an actor dies, you can decide if it's parent actor dies too, or just gets an interrupt/signal to let it know. Divide by zeroes would effectively kill the current actor, and you can't catch it as a normal exception. https://learnyousomeerlang.com/errors-and-processes

Yes, I know Linus's "reasoning". But it makes little sense since in reality every function call can trigger an OOM just by the sake of being a call which needs stack space. And the stack size is 4KB or 8KB for the kernel.

That's a good point about the stack memory. I'm used to having a pre-allocated stack and a you-better-be-happy-with-it setup.

Hmmm, some googling seems to indicate that the kernel uses a static stack as well, so that 4kb/8kb seems to be all a kthread gets:

The size of the kernel stack is configured during compilation and remains fixed. This is usually two pages (8KB) for each thread. Moreover, additional per-CPU interrupt stacks are used to process external interrupts. While the process runs in user mode, these special stacks don’t have any useful data. https://www.baeldung.com/linux/kernel-stack-and-user-space-stack

Mapping both heap OOM and stack OOM to an exception/defect that can be turned into ENOMEM via try: logic() except OOM: errno = ENOMEM would be the real solution here yet this proposal is yet again more sophistry about how to do the impossible at compile-time. ;-)

That sounds like too much sophistry! ;) Though it sounds like linux kernel just requires the stack to be static, and you better not go over it. I'd be curious to know how Ada/SPARK treats it.

We should embrace exceptions and forget about the idea of static checking runtime errors that are about arbitrary run-time resource constraints. But this would be against the current zeitgeist with .raises and .noMoreRaises and .reallyNoRaises and .raises the same set of exception that the callback raises.

haha, fair point. Mainly I just like to know what things are possible.

Hence this PR. Even if I can't handle the case, it's nice to know what defects are possible.

Araq commented 1 year ago

It reminds me of Erlang/Elixir's style of treating each thread (actor) ...

Yes, Erlang gets this right.