near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

Guidelines for handling unexpected errors #10656

Open jancionear opened 8 months ago

jancionear commented 8 months ago

There are some errors which should never happen, unless there is a big bug in the codebase. A function might be called with invalid arguments, the database might be missing something that should absolutely be there, or some other invariant might be broken.

There is no consensus on how these situations should be handled in the code - some people prefer using an assert to make sure that the invariants are always true, while others argue that we should avoid panics at all costs.

The purpose of this issue is to discuss the pros and cons of different approaches and develop a set of guidelines that we could refer to. This should help cut short the endless discussions on how to handle these situations.

Here are some arguments that were presented in previous discussions:

1) assert gives a clear signal that something is broken. It will be noticed immediately and fixed swiftly. An error might just happen silently, causing trouble without anyone noticing that something is wrong.

2) assert ensures that the invariant holds no matter what. An unhandled error could cause the program to enter an invalid state, but with asserts there is rock-solid certainty that the invariant is always true and the program will never enter an invalid state.

3) Crashing is very undesirable - using asserts makes it much more likely that the program will crash on some strange input that the developer did not expect. This means that DoS vulnerabilities are more likely to exist, which could be a serious threat: an attacker who finds a DoS vulnerability could try to kick out some validators and achieve a majority of the stake. We should avoid introducing code that might lead to DoS vulnerabilities later on. assert is much riskier than an Error.

4) The code constantly evolves - what seems like an obvious thing that will never happen might later turn into a real corner case of some complex system. The code changes, which can easily make previous assumptions untrue. Developers use existing code without checking whether it contains any asserts, and they could accidentally trigger an assert somewhere. Expecting all code to be perfect is unrealistic; things should be as misuse-proof as possible, otherwise we'll end up with DoS vulnerabilities everywhere. asserts are not misuse-proof at all; Errors are much better.
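To make the trade-off concrete, here is a minimal sketch contrasting the two styles on the same invariant. The function names, the toy "database", and the error type are all hypothetical illustrations, not nearcore APIs:

```rust
// Hypothetical error type for the explicit-error style.
#[derive(Debug, PartialEq)]
enum StoreError {
    MissingBlock(u64),
}

// Style 1: assert the invariant. If the block is missing, the process
// crashes with a clear message - loud, but a potential DoS vector.
fn lookup_block_assert(db: &[u64], height: u64) -> u64 {
    db.iter()
        .copied()
        .find(|h| *h == height)
        .expect("invariant violated: block must exist in the database")
}

// Style 2: return an explicit error. The broken invariant becomes part of
// the function's type signature, and the caller decides what it means.
fn lookup_block_err(db: &[u64], height: u64) -> Result<u64, StoreError> {
    db.iter()
        .copied()
        .find(|h| *h == height)
        .ok_or(StoreError::MissingBlock(height))
}

fn main() {
    let db = vec![1, 2, 3];
    assert_eq!(lookup_block_assert(&db, 2), 2);
    // The error variant surfaces the problem without taking the node down.
    assert_eq!(lookup_block_err(&db, 9), Err(StoreError::MissingBlock(9)));
    println!("ok");
}
```

Note that argument 1 above favors style 1 and arguments 3 and 4 favor style 2; the signatures make the difference visible to every caller.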

And some initial ideas:

Please share your thoughts!

akhi3030 commented 8 months ago

I am a big fan of the crash early concept. Some thoughts:

nagisa commented 8 months ago

Crash early does work well in e.g. web applications where an esoteric request will at worst bring down a single server and something more reliable, like a load balancer, will reroute the rest of the traffic to the other instances.

For neard (in absence of ZK/partial shard tracking/etc.) most of the validators end up seeing all of the traffic from the network, have the same view of the database state, etc. so an unfortunate esoteric event is much more likely to bring down a large number – if not all – of the instances that have seen it. Furthermore, when they come back up, they are likely going to see that same construct again…

I do like the mechanism myself quite a bit too, but I fear that with how neard and the near protocol are structured, we cannot really take advantage of it in many places of our code base.

No piece of code should rely on graceful shutdown in order to function correctly.

While reliance on crash early is a good motivation for the code to not rely on graceful termination, external factors like SIGKILL or just somebody pulling the power plug are already there. So the code that's not meant to corrupt state should not assume graceful termination regardless.


In the Rust world specifically, explicit errors always end up being part of the type state, and thus at least somewhat documented. Panics, asserts and such, on the other hand, are meant to be ignored. In fact, you could not do anything but ignore them all the way until the 1.9.0 release!
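For reference, Rust 1.9.0 is the release that stabilized std::panic::catch_unwind, which is the only stable way to observe a panic from within the process. A minimal sketch:

```rust
use std::panic;

fn main() {
    // catch_unwind (stable since Rust 1.9.0) turns an unwind into a Result.
    // Before it existed, a panic could only propagate and kill the thread.
    let result = panic::catch_unwind(|| {
        assert!(false, "broken invariant");
    });
    // The panic was caught rather than ignored; the default panic hook
    // still prints the message to stderr.
    assert!(result.is_err());
    println!("caught the panic");
}
```

Even so, catch_unwind is a blunt recovery tool (it does not fire for aborts, and it carries no typed error information), which is part of why panics are not a substitute for explicit errors.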

This usually means that for most Rust projects, any unwinds being introduced into the code should be very carefully considered. The community has a guideline somewhere along the lines of:

Panics should only be used to inform about programming mistakes.

So it is perhaps somewhat unreasonable to expect our developers to swim against the flow here, when the rest of the ecosystem subscribes to a different ideology altogether.

I personally found it more straightforward to just always use Errors everywhere by default, even for said programming/invariant mistakes. At my previous workplace we crafted a green-field project where, to the best of my knowledge, we did not introduce unwinds of our own (especially not the implicit ones) and religiously followed these guidelines. Not only was the resulting software rock stable and its code easy to modify, but whenever there was an error condition, the reasons leading to it naturally ended up being really easy for the operators of the software to understand as well!

Unfortunately, applying the same strategy or guidelines to neard is a little too late (it is not a green-field project!), but I don't think there is any reason not to follow a similar style for the new code that we write. At worst the error can be explicitly propagated until it is no longer possible to do so. This propagation is still explicit, and whoever ends up dealing with anything in between will enjoy most of the benefits of the scheme.
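The "propagate explicitly until it is no longer possible" pattern might look like the following sketch. The names (InvariantError, inner, middle) are hypothetical, not nearcore code:

```rust
// Hypothetical error type for an invariant violation.
#[derive(Debug)]
struct InvariantError(String);

fn inner() -> Result<u32, InvariantError> {
    // Pretend a "can never happen" condition was hit.
    Err(InvariantError("chunk missing for an accepted block".to_string()))
}

fn middle() -> Result<u32, InvariantError> {
    // The ? operator keeps the propagation explicit and visible in the
    // source, and in the function's signature.
    let v = inner()?;
    Ok(v + 1)
}

fn main() {
    // At the top level there is no caller left to propagate to, so the
    // error is surfaced as loudly as an assert would have been.
    match middle() {
        Ok(v) => println!("value: {v}"),
        Err(e) => eprintln!("fatal invariant violation: {}", e.0),
    }
}
```

Every function between the failure site and the top level gets a documented, typed error in its signature, which is the benefit being described.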

wacban commented 8 months ago

here is my take, in order of priority

akhi3030 commented 8 months ago

I dislike pure asserts

I think this is a bit more nuanced. There are some cases where asserts (or their cousins, unwraps, etc.) can be a useful tool. There are always some types of corruption that the application is not designed to cope with. When such corruption is detected, there is no good reason to pass it around; just asserting can be fine. E.g. when you want to lock a mutex, should you unwrap or not? If you do not want to bother handling the edge case of poisoned locks, then unwrapping is fine. Or asserting that a certain invariant holds which, due to legacy code or the sheer amount of complexity, we decided not to properly capture in the types we have defined.
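The mutex example can be sketched as follows. Both styles shown are standard-library Rust; which one is appropriate depends on whether a poisoned lock is a corruption you are prepared to recover from:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Common style: .unwrap() on lock(). A poisoned mutex means another
    // thread panicked while holding the guard - a corruption we choose
    // not to handle, so crashing here is a deliberate decision.
    let counter = Arc::new(Mutex::new(0u32));
    {
        let mut guard = counter.lock().unwrap();
        *guard += 1;
    }

    // Alternative style: explicitly accept the possibly-inconsistent data
    // from a poisoned lock instead of panicking.
    let poisoned = Arc::new(Mutex::new(0u32));
    {
        let p = Arc::clone(&poisoned);
        // Deliberately poison the mutex by panicking while holding it.
        let _ = thread::spawn(move || {
            let _guard = p.lock().unwrap();
            panic!("poison the lock");
        })
        .join();
    }
    let value = match poisoned.lock() {
        Ok(guard) => *guard,
        // into_inner() on the PoisonError recovers the guard anyway.
        Err(poison_err) => *poison_err.into_inner(),
    };

    assert_eq!(*counter.lock().unwrap(), 1);
    assert_eq!(value, 0);
    println!("ok");
}
```

The unwrap version is shorter and documents (implicitly) that poisoning is out of scope; the match version trades verbosity for never bringing the process down on this path.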

For neard (in absence of ZK/partial shard tracking/etc.) most of the validators end up seeing all of the traffic from the network, have the same view of the database state, etc. so an unfortunate esoteric event is much more likely to bring down a large number – if not all – of the instances that have seen it. Furthermore, when they come back up, they are likely going to see that same construct again…

Yes, this is absolutely true. If we have a systematic corruption that impacts all, or a large number of, nodes, then the entire network will enter an infinite crash loop. But what is the alternative here? It is not so difficult to throw an error to your caller if you experience an esoteric error. But if the caller is not designed to handle the error, then you are just increasing code complexity for little gain. So my suggestion would be that if we do this, we need to handle it on a case-by-case basis, and we have to make sure that the error gets propagated properly all the way up to the appropriate level which is able to properly handle and recover from it. If we are not building in proper recovery logic, then there is little point in propagating the error.