microsoft / CCF

Confidential Consortium Framework
https://microsoft.github.io/CCF/
Apache License 2.0

cchost crashes in case a corrupt ledger file is found on a node that is joining the network #6612

Open gaurav137 opened 3 weeks ago

gaurav137 commented 3 weeks ago

If the code path at https://github.com/microsoft/CCF/blob/f1bd349ba7de81fcc56fd23670ce49ff4dd42a52/src/host/ledger.h#L321 is hit, the malformed/corrupt ledger file is not ignored, even when the node starts from a later snapshot and merely has this older, uncommitted ledger file left over in its ledger directory.

2024-11-05T05:12:11.384963Z        100 [fail ] ../src/host/ledger.h:312             | Malformed incomplete ledger file /mnt/storage/ledger/ledger_19 at seqno 32 (expecting entry of size 54978, remaining 49144)
2024-11-05T05:12:11.415505Z        100 [debug] ../src/host/ledger.h:1107            | Recovering file from main ledger directory: ledger_19
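
To illustrate the kind of check behind the "Malformed incomplete ledger file" message above (an entry that declares more bytes than remain in the file), here is a minimal sketch. The framing is an assumption made for illustration only: a 4-byte size header in host byte order followed by the payload. It is not CCF's actual on-disk format, and `ScanResult` and `scan_entries` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical framing, NOT CCF's real ledger format: each entry is a
// 4-byte size header (host byte order) followed by that many payload bytes.
struct ScanResult
{
  size_t complete_entries = 0; // number of entries fully present
  bool truncated = false; // true if the trailing bytes are not a full entry
};

ScanResult scan_entries(const std::vector<uint8_t>& data)
{
  ScanResult result;
  size_t pos = 0;
  while (pos < data.size())
  {
    const size_t remaining = data.size() - pos;
    if (remaining < sizeof(uint32_t))
    {
      // Not even a full size header is left.
      result.truncated = true;
      break;
    }

    uint32_t entry_size = 0;
    std::memcpy(&entry_size, data.data() + pos, sizeof(entry_size));

    if (entry_size > remaining - sizeof(uint32_t))
    {
      // The header promises more bytes than the file still holds,
      // analogous to "expecting entry of size X, remaining Y" in the log.
      result.truncated = true;
      break;
    }

    pos += sizeof(uint32_t) + entry_size;
    ++result.complete_entries;
  }
  return result;
}
```

A caller could use `truncated` to decide whether to skip the file rather than carry on recovering it, which is the behaviour requested here.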

gaurav137 commented 3 weeks ago

More generally, if a node starts in join mode with uncommitted ledger files in its ledger directory that are further behind than the committed snapshot files in its snapshot directory, those uncommitted ledger files should be ignored and should not interfere with node start-up. After multiple scale-up, scale-down and recovery attempts, I eventually ended up in the situation below:

2024-11-04T14:40:27.733657Z -0.012 0   [info ][gov] ode/gov/handlers/recovery.h:170 | 1/1 recovery shares successfully submitted
End of recovery procedure initiated - initiating recovery
2024-11-04T14:40:27.741599Z -0.020 0   [info ][gov] /gov/gov_endpoint_registry.h:58 | RequestCompletedEvent: POST /recovery/members/{memberId}:recover 200 0ms 1 attempt(s)
2024-11-04T14:40:28.702008Z -0.004 0   [info ] ../src/node/node_state.h:2167        | Initiating end of recovery (primary)
2024-11-04T14:40:28.705587Z -0.008 0   [info ] ../src/node/snapshot_serdes.h:111    | Deserialising snapshot (size: 457616, public only: false)
2024-11-04T14:40:28.705679Z -0.008 0   [info ] ../src/node/snapshot_serdes.h:123    | Snapshot successfully deserialised at seqno 117
2024-11-04T14:40:28.705692Z        100 [fail ] ../src/host/ledger.h:489             | Cannot find entries: 118 - 31 in ledger file ledger_19
2024-11-04T14:40:28.705702Z        100 [debug] ../src/host/ledger.h:1435            | Ledger commit: 150/150
2024-11-04T14:40:28.761173Z        100 [fail ] ../src/host/main.cpp:779             | Exception in ccf::run: std::exception
2024-11-04T14:40:28.761947Z -0.064 0   [fail ] ../src/ds/messaging.h:170            | Exception while processing message <::consensus::ledger_no_entry_range:1107064419> of size 17
libc++abi: terminating due to uncaught exception of type std::exception: std::exception

My understanding of what happened: the node started up with the ledger_19 file still present alongside a committed snapshot at seqno 117, and the presence of ledger_19 caused cchost to crash.
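
The requested behaviour could look roughly like the sketch below: before recovering files from the ledger directory, drop uncommitted files that start at or before the snapshot's seqno. This is only an illustration of one possible policy, not CCF's implementation. `uncommitted_start_seqno` and `files_to_recover` are hypothetical helpers, and the sketch assumes uncommitted files are named `ledger_<start seqno>` (as with `ledger_19` above), while any other name, such as one with a `.committed` suffix, is left alone.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Parse the start seqno from a name of the assumed form "ledger_<digits>".
// Any other name (e.g. one carrying a ".committed" suffix) yields nullopt.
// Overflow of very long digit strings is not handled in this sketch.
std::optional<uint64_t> uncommitted_start_seqno(const std::string& name)
{
  const std::string prefix = "ledger_";
  if (
    name.size() <= prefix.size() ||
    name.compare(0, prefix.size(), prefix) != 0)
  {
    return std::nullopt;
  }

  uint64_t value = 0;
  for (size_t i = prefix.size(); i < name.size(); ++i)
  {
    const char c = name[i];
    if (c < '0' || c > '9')
    {
      return std::nullopt;
    }
    value = value * 10 + static_cast<uint64_t>(c - '0');
  }
  return value;
}

// Assumed policy: a joining node that starts from a snapshot at
// snapshot_seqno skips every uncommitted file starting at or before it.
std::vector<std::string> files_to_recover(
  const std::vector<std::string>& names, uint64_t snapshot_seqno)
{
  std::vector<std::string> kept;
  for (const auto& name : names)
  {
    const auto start = uncommitted_start_seqno(name);
    if (start.has_value() && *start <= snapshot_seqno)
    {
      continue; // stale uncommitted file, ignore it
    }
    kept.push_back(name);
  }
  return kept;
}
```

With a snapshot at seqno 117, `files_to_recover({"ledger_19", "ledger_118"}, 117)` keeps only `ledger_118`, so a stale `ledger_19` would never be asked for entries from 118 onwards.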