hyperledger / iroha

Iroha - A simple, enterprise-grade decentralized ledger
https://wiki.hyperledger.org/display/iroha
Apache License 2.0

Blockchain backup RFC #3885

Open 6r1d opened 10 months ago

6r1d commented 10 months ago

There are two groups of approaches for the backups with different advantages and pitfalls: "online" (using other peers) and "offline" (relying on data stored on a separate medium).


The online approaches are vulnerable to catastrophic events: if the peers run on physically nearby machines, something like a power failure puts all of them at risk at once.

The offline approach raises many questions about BFT and about which data we should trust. We should discuss this in detail because, despite all the issues, it guarantees that a significant part of the data is safe and would help our users. Block streaming may be a way to perform such backups, but this raises another question: which peer should we stream from?


According to our discussion today, we need to stabilize the API so that a previous version of the chain can be restored.

Erigara commented 10 months ago

I think we can use a combination of the "offline" and "online" approaches: we can choose some subset of peers in the network and make "offline" backups of their storage.

In case of a whole-network failure we can then recover the network (probably with the loss of some blocks) by loading the "offline" backups and gossiping.

Mingela commented 10 months ago

My personal vision of the issue:

Native backup mechanism shouldn't be considered before the major features release

Judging by experience with other services/applications, in most cases an infrastructure engineer is responsible for persisting the data and thus for maintaining a backup/recovery mechanism. Commonly they'd implement something you call the "offline" approach here, i.e. redundantly storing/mirroring the data directories somewhere else. I would strongly recommend inviting DevOps specialists to the discussion to make sure we don't allocate resources to something completely unnecessary.

@6r1d please elaborate on the motivation and the business goal you're addressing with this suggestion, and give an overview of how other blockchains usually implement this.

Imo, what is called "online" is not really related to a backup: it's an essential property of a blockchain that a node can catch up over p2p connections and perform block (re-)validation & verification.

Additionally, we might consider an improvement towards instantiating an empty node, currently it will download all blocks from the genesis. Some kind of network-trusted snapshots/checkpoints would reduce the synchronization time for such a node.

P.S. A node of another type, i.e. syncing/archival, may also be helpful for a 'backup'.
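To illustrate the checkpoint idea, here is a minimal sketch of a node that trusts a checkpoint (a height and a block hash) and verifies only the blocks after it instead of replaying from genesis. The `Block` layout, its hashing scheme, and the function names are assumptions made for the example; they are not Iroha's actual block format or API.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    # Hypothetical block layout, not Iroha's real one.
    height: int
    prev_hash: str
    payload: bytes

    def hash(self) -> str:
        # Assumed hashing scheme: SHA-256 over height, previous hash and payload.
        h = hashlib.sha256()
        h.update(str(self.height).encode())
        h.update(self.prev_hash.encode())
        h.update(self.payload)
        return h.hexdigest()


def build_chain(payloads: list[bytes]) -> list[Block]:
    """Build a toy hash-linked chain starting at height 1."""
    chain = []
    prev = "0" * 64
    for height, payload in enumerate(payloads, start=1):
        block = Block(height, prev, payload)
        chain.append(block)
        prev = block.hash()
    return chain


def verify_from_checkpoint(
    checkpoint_height: int, checkpoint_hash: str, blocks: list[Block]
) -> bool:
    """Check that `blocks` extend the checkpoint with consecutive heights and linked hashes."""
    prev_height, prev_hash = checkpoint_height, checkpoint_hash
    for block in blocks:
        if block.height != prev_height + 1 or block.prev_hash != prev_hash:
            return False
        prev_height, prev_hash = block.height, block.hash()
    return True
```

The sketch only checks hash linkage. How the network would come to trust a checkpoint in the first place is the open question and is not covered here.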

Upgradeability should be addressed as a priority

Currently, a live project that uses Iroha and wants to upgrade to a newer version may face major inconvenience because Iroha cannot perform native upgrades. For now there is only one option to achieve this, which I'd say violates decentralization principles and is even worse than a hard fork. I'd recommend addressing this aspect as a priority instead.

mversic commented 10 months ago

At some point we discussed archive nodes in #3527. These nodes do not participate in consensus, but they receive blocks from the network. Could these nodes be used as a backup?

6r1d commented 10 months ago

Commonly they'd implement something you call the "offline" approach here, i.e. redundantly storing/mirroring the data directories somewhere else

What I'm worried about is the data being safely backed up. It is not new, but it wasn't discussed enough, given the current situation.

I would strongly recommend inviting DevOps specialists to the discussion to make sure we don't allocate resources to something completely unnecessary.

I will ask for recommendations from the DevOps; at the same time, how aware is the DevOps of Iroha architecture? So far, backups in Iroha look like an open question to me.

Additionally, we might consider an improvement towards instantiating an empty node, currently it will download all blocks from the genesis. Some kind of network-trusted snapshots/checkpoints would reduce the synchronization time for such a node.

I would like to know whether it needs to be stopped at a certain point so the backup can proceed. I don't know Kura deeply enough to claim it's safe or isn't. Randomly copying the data of a database with a journal may lead to damage, for example.

Native backup mechanism shouldn't be considered before the major features release

While I believe this should be an architecture-related consideration, the decision is yours to make

Upgradeability should be addressed as a priority

Certainly, we've discussed upgradeability as a part of the workflow.


at some point we discussed archive nodes. These nodes do not participate in a consensus but they receive blocks from the network. Could these nodes be used as a backup?

I believe they could be. I am not sure when to stop the node and proceed with the backup.

pesterev commented 10 months ago

I'm not sure whether this issue should be solved at the network level or whether the blockchain should handle it. I mean, it should look like an off-chain service/tool/utility that is responsible for backing up all blocks using off-chain technologies (SQL databases or something similar).
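As a rough illustration of such an off-chain utility, the sketch below stores blocks in an SQLite database. The table schema and the function names (`open_archive`, `archive_block`, `latest_height`) are hypothetical, and the block is reduced to a height, a hash string and an opaque payload; how blocks would actually be obtained from a peer is left out.

```python
import sqlite3


def open_archive(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the archive database with a single `blocks` table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blocks ("
        " height INTEGER PRIMARY KEY,"
        " hash TEXT NOT NULL UNIQUE,"
        " payload BLOB NOT NULL)"
    )
    return conn


def archive_block(
    conn: sqlite3.Connection, height: int, block_hash: str, payload: bytes
) -> bool:
    """Store one block; return False if its height or hash is already archived."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO blocks (height, hash, payload) VALUES (?, ?, ?)",
            (height, block_hash, payload),
        )
    return cur.rowcount == 1


def latest_height(conn: sqlite3.Connection) -> int:
    """Return the highest archived height, or 0 if the archive is empty."""
    row = conn.execute("SELECT MAX(height) FROM blocks").fetchone()
    return row[0] if row[0] is not None else 0
```

A tool like this could call `latest_height` on startup to decide from which block to resume. Because duplicates are ignored rather than overwritten, re-sending a block is harmless.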

Mingela commented 10 months ago

What I'm worried about is the data being safely backed up. It is not new, but it wasn't discussed enough, given the current situation.

Please elaborate on the concern. What exactly determines 'safety'?

I will ask for recommendations from the DevOps; at the same time, how aware is the DevOps of Iroha architecture? So far, backups in Iroha look like an open question to me.

If you think the architecture knowledge is required for the discussion please provide as many useful references as possible for the context.

I would like to know whether it needs to be stopped at a certain point so the backup can proceed. I don't know Kura deeply enough to claim it's safe or isn't. Randomly copying the data of a database with a journal may lead to damage, for example.

Please elaborate on the examples/concerns related to that. Why should it be stopped at all? We could consider snapshotting a previous state in parallel with ongoing consensus participation. Researching a technical solution is the next step, and we should not limit ourselves at this point.

While I believe this should be an architecture-related consideration, the decision is yours to make

I wouldn't say we should complicate this process with another layer of consensus around snapshots/backups, or perhaps I just got the wrong impression of the statement. Please don't portray me as the only decision maker; I just want to have as much context as possible to conveniently communicate with all stakeholders and make further roadmap adjustments.

at some point we discussed archive nodes. These nodes do not participate in a consensus but they receive blocks from the network. Could these nodes be used as a backup?

Certainly.

6r1d commented 10 months ago

Please elaborate on the concern. What exactly determines 'safety'?

Iroha is a system with many parts that can influence the restore process and the amount of data restored. There's a BFT consensus in addition to the blockchain itself. How often data is written to disk hasn't been discussed much, and I'm not sure whether Kura can be damaged by an unexpected stop. In my opinion, we should determine the maximum amount of data that peers agree on and that can be restored, and proceed from there.
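One way to make "the maximum amount of data peers agree on" concrete is sketched below. Each backup is modelled as a list of block hashes ordered by height, and the function returns the longest prefix shared by at least `quorum` backups. Both the data model and the `quorum` parameter are assumptions for illustration; the sketch does not reflect how Iroha's consensus actually decides which blocks to trust.

```python
from __future__ import annotations

from collections import Counter


def agreed_prefix(backups: list[list[str]], quorum: int) -> list[str]:
    """Return the longest prefix of block hashes that at least `quorum` backups share."""
    prefix: list[str] = []
    candidates = list(backups)  # backups still consistent with the prefix so far
    height = 0
    while True:
        # Count which hash each remaining backup holds at this height.
        hashes = Counter(b[height] for b in candidates if len(b) > height)
        if not hashes:
            break
        best_hash, count = hashes.most_common(1)[0]
        if count < quorum:
            break
        prefix.append(best_hash)
        # Keep only the backups that agree with the chosen hash.
        candidates = [b for b in candidates if len(b) > height and b[height] == best_hash]
        height += 1
    return prefix


backups = [
    ["a", "b", "c", "d"],
    ["a", "b", "c"],
    ["a", "b", "x"],  # diverges at the third block
    ["a", "b"],
]
print(agreed_prefix(backups, quorum=3))  # ['a', 'b']
print(agreed_prefix(backups, quorum=2))  # ['a', 'b', 'c']
```

With a stricter quorum fewer blocks are restored, but more backups vouch for them. That trade-off is exactly the kind of decision that would need to be made explicitly.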

If you think the architecture knowledge is required for the discussion please provide as many useful references as possible for the context.

I don't think I'm the best person to do that, which is why I started discussing the architectural side with both @mversic and @Erigara, who contributed a lot of the recent code, and with you, since you're involved in Iroha 1 and other related projects. I may not be able to point out the architectural aspects of Iroha, but I can imagine a realistic data loss scenario.

Please elaborate on the examples/concerns related to that. Why should it be stopped at all?

As I said before, "I don't know Kura deeply enough to claim it's safe or isn't".

If there's something like a journal, part of Kura's state is kept in RAM, and the Kura data depends on both, I see a risk in simply copying the data: the copy may or may not be readable. I am not sure which risks exist, which is why I'm asking the people who have more information to decide.
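The copying risk can be partly detected with a generic technique that is not specific to Kura: hash the source directory before and after the copy and accept the copy only if nothing changed in between. The helper names below are hypothetical. The check says nothing about state that lives only in RAM, so it is a sketch of a sanity check, not a guarantee of a consistent backup.

```python
import hashlib
import shutil
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Hash the relative paths and contents of all files under `root`, in sorted order."""
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(str(path.relative_to(root)).encode())
        h.update(path.read_bytes())
    return h.hexdigest()


def copy_if_stable(src: Path, dst: Path, attempts: int = 3) -> bool:
    """Copy `src` to `dst`; accept the copy only if `src` did not change meanwhile."""
    for _ in range(attempts):
        before = tree_digest(src)
        if dst.exists():
            shutil.rmtree(dst)  # discard a previous, possibly inconsistent attempt
        shutil.copytree(src, dst)
        # The source must be unchanged and the copy must match it.
        if tree_digest(src) == before and tree_digest(dst) == before:
            return True
    return False
```

If the node writes continuously, this check may never succeed, which would itself be a hint that the node has to be paused or that the storage needs a snapshot mechanism of its own.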

BAStos525 commented 9 months ago

From the DevOps side, we need a more detailed explanation of which part of the application requires a backup. For now, we can support fault tolerance of the Iroha services through pod replication. If it's required to save, keep and restore application state data (perhaps this concerns the Kura subsystem), we can use external backups. We also have a successful case of backing up and restoring Iroha volumes and block storage after peer failures.