hashgraph / hedera-services

Crypto, token, consensus, file, and smart contract services for the Hedera public ledger
Apache License 2.0

Halt transaction processing cleanly if the free disk space crosses a threshold #3073

Open rbair23 opened 2 years ago

rbair23 commented 2 years ago

Problem

It is possible that under some conditions (such as bad credentials for the AWS/Google buckets) a node is unable to upload record files for an extended period of time. If that happens, we need backpressure to prevent the consensus/executor main node from producing blocks that cannot be written to disk, which would cause a loss of record files. If the virtual merkle disk space fills up, the system should also stop writing records and processing transactions.

Solution

Have a configuration property for the amount of free disk space that must be available for continued operation. If we detect that free disk space has fallen below this threshold, we should halt handling transactions, participating in gossip, and accepting new transactions for gossip until more disk space is available. Today we crash messily. That isn't a serious problem, because the network can survive a node crashing messily just fine. But we should improve this to degrade cleanly and write a message to the logs (ERROR/FATAL level) indicating that the disk space was used up. We should also have a threshold at which we WARN the node operator of the pending disk-full condition. We don't want to actually fill up the entire disk, because that would make it harder for an operator to recover the system. So we want to monitor this ourselves.

It is probably sufficient to monitor this once per round.
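
A minimal sketch of what a once-per-round check could look like, assuming two configured thresholds (one for WARN, one for halting). The class name `DiskSpaceMonitor`, the `Status` enum, and both threshold parameters are hypothetical and are not taken from hedera-services; only `Files.getFileStore` and `FileStore.getUsableSpace` are standard JDK calls.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration; not part of hedera-services. */
final class DiskSpaceMonitor {
    enum Status { OK, WARN, HALT }

    private final Path dataDir;        // directory whose filesystem is monitored
    private final long warnFreeBytes;  // assumed config: warn at or below this much free space
    private final long haltFreeBytes;  // assumed config: halt at or below this much free space

    DiskSpaceMonitor(Path dataDir, long warnFreeBytes, long haltFreeBytes) {
        if (haltFreeBytes < 0 || warnFreeBytes < haltFreeBytes) {
            throw new IllegalArgumentException("expected 0 <= haltFreeBytes <= warnFreeBytes");
        }
        this.dataDir = dataDir;
        this.warnFreeBytes = warnFreeBytes;
        this.haltFreeBytes = haltFreeBytes;
    }

    /** Pure policy function, so the thresholds can be tested without a real disk. */
    static Status classify(long freeBytes, long warnFreeBytes, long haltFreeBytes) {
        if (freeBytes <= haltFreeBytes) {
            return Status.HALT;
        }
        if (freeBytes <= warnFreeBytes) {
            return Status.WARN;
        }
        return Status.OK;
    }

    /** Intended to be called once per round by whatever component drives rounds. */
    Status check() throws IOException {
        // Resolves the filesystem that actually backs dataDir, whatever its mount point is.
        FileStore store = Files.getFileStore(dataDir);
        return classify(store.getUsableSpace(), warnFreeBytes, haltFreeBytes);
    }
}
```

Keeping the halt threshold well above zero leaves the operator some room to recover, which matches the goal of never filling the disk completely. How `HALT` would translate into stopping transaction handling and gossip is left open here.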

Alternatives

No response

netopyr commented 4 months ago

In my understanding, the issue asks for three different improvements, which are only somewhat related:

  1. Stop processing if we cannot upload record files. -> I would like to drop this as it becomes obsolete with block streams. We should implement such a mechanism for block streams though.
  2. Stop processing if the virtual merkle disk space fills up. -> This makes sense. I wonder, though, if the application or platform should monitor this.
  3. Monitor and log warnings/error messages. -> I would like to drop this, too. Monitoring the system and sending out alerts should be the sole responsibility of Grafana (or whatever other system is used to monitor metrics). Also, I wonder if writing to the log is the right thing to do if you struggle with disk space. 🙂

rbair23 commented 4 months ago

Stop processing if we cannot upload record files. -> I would like to drop this as it becomes obsolete with block streams. We should implement such a mechanism for block streams though.

I agree, we don't need to do this for record streams, but we should definitely make sure we do for block streams.

Stop processing if the virtual merkle disk space fills up. -> This makes sense. I wonder, though, if the application or platform should monitor this.

I think it is at the application level. The application should have the config for such limits, provide the directory into which the database writes files, and also query the OS to figure out how much space remains.

Monitor and log warnings/error messages. -> I would like to drop this, too. Monitoring the system and sending out alerts should be the sole responsibility of Grafana (or whatever other system is used to monitor metrics). Also, I wonder if writing to the log is the right thing to do if you struggle with disk space. 🙂

I'm thinking that the configured limits might be different from the actual OS limits. For example, I might have a 10TB disk, but configure the node to use no more than 4TB because I have other requirements as well. If that is true, then we need to log or have a metric of some kind so Grafana can know what to do.
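
To illustrate the difference between a configured limit and the OS limit, here is a rough sketch that measures how much the node's own directory uses and compares it against a configured budget. The class `DirectoryBudget`, its method names, and the `budgetBytes` parameter are hypothetical. It also assumes that summing logical file sizes is a good enough measure, which may not hold for sparse files or for files deleted while the walk is running.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not part of hedera-services. */
final class DirectoryBudget {

    /** Sums the logical sizes of all regular files under dir (not allocated blocks). */
    static long usedBytes(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile)
                    .mapToLong(p -> {
                        try {
                            return Files.size(p);
                        } catch (IOException e) {
                            // A file may vanish mid-walk; a real implementation might skip it instead.
                            throw new UncheckedIOException(e);
                        }
                    })
                    .sum();
        }
    }

    /**
     * Space the node may still use: limited by the configured budget and by what the OS
     * reports as usable, whichever is smaller. Never negative.
     */
    static long remainingBytes(long budgetBytes, long usedBytes, long osUsableBytes) {
        return Math.max(0L, Math.min(budgetBytes - usedBytes, osUsableBytes));
    }
}
```

With a 4TB budget on a 10TB disk, `remainingBytes` reaches zero once the directory holds 4TB, even though the OS still reports free space. Exposing that value as a metric would give Grafana a number that already reflects the configured limit.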

I might have several mount points, and my database is on /opt or somewhere else. But other mount points might have a lot of free space. I'm not sure how to configure Grafana so it can know that it is alerting on Node A on the free disk space of one filesystem, and on Node B on the free disk space of a different filesystem.
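
One possible way around the mount-point question, sketched under assumptions: the node itself resolves the filesystem that backs its database directory and reports that filesystem's numbers under a single metric name, so the dashboard does not need to know the mount layout of each node. The class `FileStoreGauge` and its `Sample` type are hypothetical, and how the sample would be published as a metric is not shown.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration; not part of hedera-services. */
final class FileStoreGauge {

    /** Snapshot of the filesystem backing one directory. */
    static final class Sample {
        final String store;     // FileStore.name(), e.g. a device or volume name (platform-specific)
        final String type;      // FileStore.type(), e.g. the filesystem type (platform-specific)
        final long totalBytes;
        final long usableBytes;

        Sample(String store, String type, long totalBytes, long usableBytes) {
            this.store = store;
            this.type = type;
            this.totalBytes = totalBytes;
            this.usableBytes = usableBytes;
        }
    }

    /** Resolves whichever filesystem holds dir, regardless of where it is mounted. */
    static Sample sample(Path dir) throws IOException {
        FileStore fs = Files.getFileStore(dir);
        return new Sample(fs.name(), fs.type(), fs.getTotalSpace(), fs.getUsableSpace());
    }

    /** Fraction of the filesystem still usable, in [0, 1] for sane inputs; 0 if total is unknown. */
    static double usableFraction(long totalBytes, long usableBytes) {
        return totalBytes <= 0 ? 0.0 : (double) usableBytes / (double) totalBytes;
    }
}
```

Node A and Node B would then both emit the same metric, each computed from its own database directory, and the store name could be attached as a label for diagnostics.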