near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

Guidelines for handling unexpected errors #10656

Open jancionear opened 8 months ago

jancionear commented 8 months ago

There are some errors which should never happen, unless there is a big bug in the codebase. A function might be called with invalid arguments, the database might be missing something that should absolutely be there, or some other invariant might be broken.

There is no consensus on how these situations should be handled in the code - some people prefer using an assert to make sure that the invariants are always true, while others argue that we should avoid panics at all costs.

The purpose of this issue is to discuss the pros and cons of different approaches and develop a set of guidelines that we could refer to. This should help cut short the endless discussions on how to handle these situations.

Here are some arguments that were presented in previous discussions:

1) assert gives a clear signal that something is broken. It will be noticed immediately and fixed swiftly. An error might just happen silently, causing trouble without anyone noticing that something is wrong.

2) assert ensures that the invariant holds no matter what. An unhandled error could cause the program to enter an invalid state, but with asserts there is rock-solid certainty that the invariant is always true and the program will never enter an invalid state.

3) Crashing is very undesirable - using asserts makes it much more likely that the program will crash on some strange input that the developer did not expect. This means that DoS vulnerabilities are more likely to exist, which could be a serious threat: an attacker who finds a DoS vulnerability could try to kick out some validators and achieve a majority of the stake. We should avoid introducing code that might lead to DoS vulnerabilities later on. assert is much riskier than an Error.

4) The code constantly evolves - what seems like an obvious thing that will never happen might later turn into a real corner case of some complex system. The code changes, which can easily make previous assumptions untrue. Developers use existing code without checking whether it contains any asserts, and they could accidentally trigger an assert somewhere. Expecting all code to be perfect is unrealistic; things should be as misuse-proof as possible, otherwise we'll end up with DoS vulnerabilities everywhere. asserts are not misuse-proof at all; Errors are much better.
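To make the trade-off concrete, here is a minimal sketch contrasting the two styles on the same invariant. The function names, the toy "database", and the error type are all hypothetical illustrations, not nearcore APIs:

```rust
// Hypothetical error type for the explicit-error style.
#[derive(Debug, PartialEq)]
enum StoreError {
    MissingBlock(u64),
}

// Style 1: assert the invariant. If the block is missing, the process
// crashes with a clear message - loud, but a potential DoS vector.
fn lookup_block_assert(db: &[u64], height: u64) -> u64 {
    db.iter()
        .copied()
        .find(|h| *h == height)
        .expect("invariant violated: block must exist in the database")
}

// Style 2: return an explicit error. The broken invariant becomes part of
// the function's type signature, and the caller decides what it means.
fn lookup_block_err(db: &[u64], height: u64) -> Result<u64, StoreError> {
    db.iter()
        .copied()
        .find(|h| *h == height)
        .ok_or(StoreError::MissingBlock(height))
}

fn main() {
    let db = vec![1, 2, 3];
    assert_eq!(lookup_block_assert(&db, 2), 2);
    // The error variant surfaces the problem without taking the node down.
    assert_eq!(lookup_block_err(&db, 9), Err(StoreError::MissingBlock(9)));
    println!("ok");
}
```

Note that argument 1 above favors style 1 and arguments 3 and 4 favor style 2; the signatures make the difference visible to every caller.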

And some initial ideas:

Please share your thoughts!

akhi3030 commented 8 months ago

I am a big fan of the crash early concept. Some thoughts:

nagisa commented 8 months ago

Crash early does work well in e.g. web applications where an esoteric request will at worst bring down a single server and something more reliable, like a load balancer, will reroute the rest of the traffic to the other instances.

For neard (in absence of ZK/partial shard tracking/etc.) most of the validators end up seeing all of the traffic from the network, have the same view of the database state, etc. so an unfortunate esoteric event is much more likely to bring down a large number – if not all – of the instances that have seen it. Furthermore, when they come back up, they are likely going to see that same construct again…

I do like the mechanism myself quite a bit too, but I fear that with how neard and the near protocol are structured, we cannot really take advantage of it in many places of our code base.

No piece of code should rely on graceful shutdown in order to function correctly.

While reliance on crash early is a good motivation for the code to not rely on graceful termination, external factors like SIGKILL or just somebody pulling the power plug are already there. So the code that's not meant to corrupt state should not assume graceful termination regardless.


In the Rust world specifically, explicit errors always end up being part of the type state, and thus at least somewhat documented. Panics, asserts and such, on the other hand, are meant to be ignored. In fact, you could not do anything but ignore them all the way until the 1.9.0 release!
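For reference, Rust 1.9.0 is the release that stabilized std::panic::catch_unwind, which is the only stable way to observe a panic from within the process. A minimal sketch:

```rust
use std::panic;

fn main() {
    // catch_unwind (stable since Rust 1.9.0) turns an unwind into a Result.
    // Before it existed, a panic could only propagate and kill the thread.
    let result = panic::catch_unwind(|| {
        assert!(false, "broken invariant");
    });
    // The panic was caught rather than ignored; the default panic hook
    // still prints the message to stderr.
    assert!(result.is_err());
    println!("caught the panic");
}
```

Even so, catch_unwind is a blunt recovery tool (it does not fire for aborts, and it carries no typed error information), which is part of why panics are not a substitute for explicit errors.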

This usually means that for most Rust projects, any unwinds being introduced into the code should be very carefully considered. The community has a guideline somewhere along the lines of:

Panics should only be used to inform about programming mistakes.

So it is perhaps somewhat unreasonable to expect our developers to swim against the flow here, when the rest of the ecosystem subscribes to a different ideology altogether.

I personally found it more straightforward to just always use Errors everywhere by default, even for said programming/invariant mistakes. At my previous workplace we crafted a green-field project where, to the best of my knowledge, we did not introduce unwinds of our own (especially not the implicit ones) and religiously followed these guidelines. Not only was the resulting software rock stable and its code easy to modify, but whenever there was an error condition, the reasons leading to it naturally ended up being really easy for the operators of the software to understand as well!

Unfortunately, applying the same strategy or guidelines to neard is a little too late (it is not a green-field project!), but I don't think there is any reason not to follow a similar style for the new code that we write. At worst the error can be explicitly propagated until it is no longer possible to do so. This propagation is still explicit, and whoever ends up dealing with anything in between will enjoy most of the benefits of the scheme.
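The "propagate explicitly until it is no longer possible" pattern might look like the following sketch. The names (InvariantError, inner, middle) are hypothetical, not nearcore code:

```rust
// Hypothetical error type for an invariant violation.
#[derive(Debug)]
struct InvariantError(String);

fn inner() -> Result<u32, InvariantError> {
    // Pretend a "can never happen" condition was hit.
    Err(InvariantError("chunk missing for an accepted block".to_string()))
}

fn middle() -> Result<u32, InvariantError> {
    // The ? operator keeps the propagation explicit and visible in the
    // source, and in the function's signature.
    let v = inner()?;
    Ok(v + 1)
}

fn main() {
    // At the top level there is no caller left to propagate to, so the
    // error is surfaced as loudly as an assert would have been.
    match middle() {
        Ok(v) => println!("value: {v}"),
        Err(e) => eprintln!("fatal invariant violation: {}", e.0),
    }
}
```

Every function between the failure site and the top level gets a documented, typed error in its signature, which is the benefit being described.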

wacban commented 8 months ago

here is my take, in order of priority

akhi3030 commented 8 months ago

I dislike pure asserts

I think this is a bit more nuanced. There are some cases where asserts (or their cousins, unwraps, etc.) can be a useful tool. There are always some types of corruption that the application is not designed to cope with. When such corruption is detected, there is no good reason to pass it around; just asserting can be fine. E.g. when you want to lock a mutex, should you unwrap or not? If you do not want to bother handling the edge case of poisoned locks, then unwrapping is fine. Or asserting that a certain invariant holds which, due to legacy code or the sheer amount of complexity, we decided not to properly capture in the types we have defined.
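The mutex example can be sketched as follows. Both styles shown are standard-library Rust; which one is appropriate depends on whether a poisoned lock is a corruption you are prepared to recover from:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Common style: .unwrap() on lock(). A poisoned mutex means another
    // thread panicked while holding the guard - a corruption we choose
    // not to handle, so crashing here is a deliberate decision.
    let counter = Arc::new(Mutex::new(0u32));
    {
        let mut guard = counter.lock().unwrap();
        *guard += 1;
    }

    // Alternative style: explicitly accept the possibly-inconsistent data
    // from a poisoned lock instead of panicking.
    let poisoned = Arc::new(Mutex::new(0u32));
    {
        let p = Arc::clone(&poisoned);
        // Deliberately poison the mutex by panicking while holding it.
        let _ = thread::spawn(move || {
            let _guard = p.lock().unwrap();
            panic!("poison the lock");
        })
        .join();
    }
    let value = match poisoned.lock() {
        Ok(guard) => *guard,
        // into_inner() on the PoisonError recovers the guard anyway.
        Err(poison_err) => *poison_err.into_inner(),
    };

    assert_eq!(*counter.lock().unwrap(), 1);
    assert_eq!(value, 0);
    println!("ok");
}
```

The unwrap version is shorter and documents (implicitly) that poisoning is out of scope; the match version trades verbosity for never bringing the process down on this path.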

For neard (in absence of ZK/partial shard tracking/etc.) most of the validators end up seeing all of the traffic from the network, have the same view of the database state, etc. so an unfortunate esoteric event is much more likely to bring down a large number – if not all – of the instances that have seen it. Furthermore, when they come back up, they are likely going to see that same construct again…

Yes, this is absolutely true. If we have a systematic corruption that impacts all, or a large number of, nodes, then the entire network will enter an infinite crash loop. But what is the alternative here? It is not so difficult to throw an error to your caller if you experience an esoteric error. But if the caller is not designed to handle the error, then you are just increasing code complexity for little gain. So my suggestion would be that if we do this, we need to handle it on a case-by-case basis, and we have to make sure that the error gets propagated properly all the way up to the appropriate level which is able to properly handle and recover from it. If we are not building in proper recovery logic, then there is little point in propagating the error.