stellar / stellar-core

Reference implementation for the peer-to-peer agent that manages the Stellar network.
https://www.stellar.org

add a new mode to mitigate effect of a local node corruption #2925

Closed MonsieurNicolas closed 3 years ago

MonsieurNicolas commented 3 years ago

Something identified as part of hardening analysis.

Currently, in the event of a 1-ledger fork (bad quorum set configuration or, more likely, some bug due to local corruption of the core state), the core node gets stuck and notices that it's on a fork only after externalizing the bad state, as it won't be able to reach consensus with the network. When that happens, the Horizon node (or any other dependency) that consumes meta from that core instance will have ingested bad data from a ledger that does not match the overall network state, and in turn other applications in the ecosystem may perform tasks with the expectation of transaction finality. This is basically impossible to recover from in a sane way.

The solution is to adopt a new operating mode, driven by a configuration element UNSAFE_EMIT_META (with a default value of true for now), that limits the scope of such a 1-ledger fork to the core node and does not allow the information to "leak out" to other parts of the system. The downside is a delay of 1 ledger in downstream systems (so perceived latency is longer).
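As a rough, self-contained sketch of the intended gating (the struct and member names below are placeholders for illustration, not the actual stellar-core implementation):

#include <cstdint>
#include <iostream>
#include <optional>
#include <utility>

// Stand-in for the real XDR type; for illustration only.
struct LedgerCloseMeta
{
    uint32_t ledgerSeq = 0;
};

struct MetaEmitterSketch
{
    bool mUnsafeEmitMeta = true;                 // mirrors UNSAFE_EMIT_META
    std::optional<LedgerCloseMeta> mPendingMeta; // meta held back for the last closed ledger

    // Called once ledger `meta.ledgerSeq` has been applied locally.
    void
    onLedgerClosed(LedgerCloseMeta meta)
    {
        if (mUnsafeEmitMeta)
        {
            // Today's behavior: stream the meta immediately, before the
            // network has confirmed our hash of this ledger.
            emit(meta);
            return;
        }
        // Safe mode: the meta held back for the previous ledger is now
        // confirmed (the network externalized the ledger after it), so emit
        // it and hold back the new one. A 1-ledger local fork therefore
        // never leaks out to downstream consumers.
        if (mPendingMeta)
        {
            emit(*mPendingMeta);
        }
        mPendingMeta = std::move(meta);
    }

    void
    emit(LedgerCloseMeta const& m)
    {
        std::cout << "emitting meta for ledger " << m.ledgerSeq << "\n";
    }
};

The key property is that, with the safe mode enabled, meta for a ledger only leaves the process once the following ledger has externalized and thereby confirmed its hash.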

We probably need to talk to the Horizon team a bit to make sure the transition to "safe meta" works as transparently as possible.

In a nutshell, when that mode is enabled:

bartekn commented 3 years ago

I think it shouldn't require any changes to the Horizon codebase. However, I can see some affected areas (possibly there are more):

rokopt commented 3 years ago

@bartekn Might it require any changes to tests? I think I recall that there are some captive-core tests that wait to see meta from ledgers that they expect to have closed before they proceed to some further phase. Or has that changed since moving away from manualclose in favor of ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING? Or am I just confusing that with waiting for core to publish history (which I don't think we generally expect people to do with captive core)?

rokopt commented 3 years ago
  • note that "command line" modes may have to change a bit as we cannot emit the meta for the last ledger in a range
    • in many cases, we can just "add 1" internally to the last element in the range (when replaying historical data, we can close an extra ledger)

Perhaps this is necessary, but I feel like it'd be hard to look at code related to the CatchupConfiguration and immediately understand it when the target ledger might or might not have had 1 added to it, so I'm inclined to try to avoid this. Maybe:

Does that make sense?

  • for the latest ledger, a possibility to emit it anyway is to provide the hash of the following ledger (recovered offline, from a different core node for example) or just set UNSAFE_EMIT_META=true. I am not sure this is really needed, though, as trying to ingest the "latest" ledger outside of "online" mode probably doesn't make sense (but it may be needed for testing).

I'll ask Horizon whether they need this for testing.

rokopt commented 3 years ago

The solution is to adopt a new operating mode, driven by a configuration element UNSAFE_EMIT_META (with a default value of true for now)

I usually feel like an experimental feature would start out false by default and eventually become true, and I'd probably be surprised to find a parameter called UNSAFE defaulting to true. What about SAFE_EMIT_META defaulting at first to false? Or do you think it's important to call out the "un" part of UNSAFE?

rokopt commented 3 years ago

My plans so far:

Because we always emit earlier meta before allocating memory for new meta in closeLedger, we never hold multiple LedgerCloseMetas in memory at once. However, we do now retain a LedgerCloseMeta during the intervals when we're not in the middle of closeLedger. I suspect that this is alright because I don't think that anything that we're doing during those intervals is likely to require a lot more memory than we need during closeLedger itself, but if I'm wrong about that, then that could be an issue. (Avoiding it would probably require compacting the stashed meta in some way during those intervals?)
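A minimal sketch of that ordering, under the assumption of a single retained meta slot (the names gPendingMeta, emitMeta, and applyLedgerAndFillMeta are placeholders, not the real closeLedger code):

#include <cstdint>
#include <memory>
#include <utility>

struct LedgerCloseMeta // stand-in for the real XDR type
{
    uint32_t ledgerSeq = 0;
};

// Meta retained between closeLedger calls (a member variable in practice).
static std::unique_ptr<LedgerCloseMeta> gPendingMeta;

void emitMeta(LedgerCloseMeta const&) { /* write to the meta stream */ }
void applyLedgerAndFillMeta(LedgerCloseMeta&) { /* apply transactions */ }

// Ordering inside closeLedger for ledger `seq`.
void
closeLedgerSketch(uint32_t seq)
{
    // 1. Emit (and free) the meta stashed for ledger seq - 1 first...
    if (gPendingMeta)
    {
        emitMeta(*gPendingMeta);
        gPendingMeta.reset();
    }

    // 2. ...and only then allocate the structure for ledger seq, so two
    //    full LedgerCloseMetas are never alive at the same time.
    auto meta = std::make_unique<LedgerCloseMeta>();
    meta->ledgerSeq = seq;
    applyLedgerAndFillMeta(*meta);

    // 3. From here until the next closeLedger call, exactly one
    //    LedgerCloseMeta (the one for seq) stays resident.
    gPendingMeta = std::move(meta);
}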

rokopt commented 3 years ago

I think it shouldn't require any changes to the Horizon codebase. However, I can see some affected areas (possibly there are more):

  • Transaction submission - Horizon will see the final status of a transaction a few seconds later; the minimum HTTP response time of POST /transactions will be the duration between the last two ledgers' close times,
  • Alerts/metrics - alerts about ledger delay will now trigger sooner,
  • High frequency trading - traders connected to an UNSAFE_EMIT_META=true Stellar-Core have an advantage over other users (they can see trades from the last ledger faster),
  • Fees - as above, but regarding fee bidding.

Thanks for all these good points. I think these all fall into the category of "stuff someone is inevitably accepting by setting SAFE_EMIT_META, in return for the safety", but one interesting question is what we should document about this (for example, I'm guessing we probably won't encourage high-frequency trading with unsafe meta :) ).

MonsieurNicolas commented 3 years ago

Perhaps this is necessary, but I feel like it'd be hard to look at code related to the CatchupConfiguration and immediately understand it when the target ledger might or might not have had 1 added to it, so I'm inclined to try to avoid this. Maybe:

I said "may" because you need to think through what happens when Horizon restarts. Did you sketch out the sequence to make sure that we're not going to miss one ledger? On restart, we won't have mLastLedgerMeta set until the first ledger gets closed which means that when core closes the first ledger based on its initial state, it won't be able to emit the meta from the previous ledger (whatever that is).

The invariant that we're trying to maintain is that we emit metadata only for ledgers whose hashes have been validated by some source that we trust, such as consensus (when the network externalizes the next ledger, thus confirming that it agreed with our hash of the previous one, which we can now emit), or a history archive, or maybe a hash on the command line as I think you're discussing below.

Yes, this is correct.

When closeLedger begins on ledger N+1, before it allocates a structure to hold meta for ledger N+1, it should emit the meta that it has been retaining, if any, for ledger N. This could be factored out into a method (emitAnyStoredSafeMetaAfterLedgerValidation?). That method could be made public, and callers could use it whenever they somehow validate the hash of the last closed ledger. In particular, ApplyCheckpointWork::onRun could call it if/when it confirms that lm.getLastClosedLedgerHeader().hash == mHeaderHistoryEntry.hash, which I think addresses the case where we replay from history. (I'm not sure yet what the rest of the "many" cases are!)

Maybe? I guess: are there cases where we verify a ledger without calling closeLedger that would require calling this method separately? If not, we don't need this method at all, right?
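For reference, a minimal sketch of what the method described above might look like (the class name, member names, and emission call are placeholders, not the actual implementation):

#include <cstdint>
#include <memory>

struct LedgerCloseMeta // stand-in for the real XDR type
{
    uint32_t ledgerSeq = 0;
};

class LedgerManagerSketch
{
    std::unique_ptr<LedgerCloseMeta> mPendingMeta;

    void
    emitMeta(LedgerCloseMeta const&)
    {
        // write to the meta stream / pipe
    }

  public:
    // Proposed public entry point: callers invoke it only once the stashed
    // ledger's hash has been validated by a trusted source -- either
    // consensus externalizing the next ledger (the closeLedger path) or a
    // matching history-archive header (the ApplyCheckpointWork::onRun path
    // quoted further down in this thread).
    void
    emitAnyStoredSafeMetaAfterLedgerValidation()
    {
        if (mPendingMeta)
        {
            emitMeta(*mPendingMeta);
            mPendingMeta.reset();
        }
    }
};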

I usually feel like an experimental feature would start out false by default and eventually become true, and I'd probably be surprised to find a parameter called UNSAFE defaulting to true. What about SAFE_EMIT_META defaulting at first to false? Or do you think it's important to call out the "un" part of UNSAFE?

The existing behavior is unsafe, and when we're done with this we want to discourage people from enabling this flag, so yes, the "un" is important (we have other config elements like this that we mark as "unsafe"). I would prefer to avoid introducing a config element that we need to rename later (as this would be a breaking change for everybody).

bartekn commented 3 years ago

@bartekn Might it require any changes to tests? I think I recall that there are some captive-core tests that wait to see meta from ledgers that they expect to have closed before they proceed to some further phase. Or has that changed since moving away from manualclose in favor of ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING? Or am I just confusing that with waiting for core to publish history (which I don't think we generally expect people to do with captive core)?

I don't think it would require changes to tests. We can always enable UNSAFE_EMIT_META=true in integration tests.

As far as I remember, the problem in tests was connected to waiting for the first checkpoint ledger. But @2opremio will know better.

2opremio commented 3 years ago

I think it will just work transparently (but it will cause tests to be slower due to the extra synchronous submission delay).

rokopt commented 3 years ago

I said "may" because you need to think through what happens when Horizon restarts. Did you sketch out the sequence to make sure that we're not going to miss one ledger? On restart, we won't have mLastLedgerMeta set until the first ledger gets closed which means that when core closes the first ledger based on its initial state, it won't be able to emit the meta from the previous ledger (whatever that is).

Do you mean restarting after an offline catchup to some ledger N? It's true that stellar-core won't be able to emit meta for N after a restart, but it should have done so by the end of the offline catchup, just via the call to ApplyCheckpointWork::onRun after catchup as opposed to by having added 1 to the target ledger in the CatchupConfiguration. I might be misunderstanding your question, though.

I guess: are there cases where we verify a ledger without calling closeLedger that would require calling this method separately?

Right -- specifically after we close a ledger during a catchup, ApplyCheckpointWork::onRun checks the hash of that ledger against the one in the history archive. If they match, I think that means we can emit the meta even though we haven't closed the next ledger yet, because we've now validated the ledger in a different but equally-trusted way. Specifically:


            if (mConditionalWork->getState() == State::WORK_SUCCESS)
            {
                auto& lm = mApp.getLedgerManager();

                CLOG_DEBUG(History, "{}",
                           xdr_to_string(lm.getLastClosedLedgerHeader(),
                                         "LedgerManager LCL"));

                CLOG_DEBUG(History, "{}",
                           xdr_to_string(mHeaderHistoryEntry, "Replay header"));
                if (lm.getLastClosedLedgerHeader().hash !=
                    mHeaderHistoryEntry.hash)
                {
                    mApplyLedgerFailure.Mark();
                    throw std::runtime_error(fmt::format(
                        "replay of {:s} produced mismatched ledger hash {:s}",
                        LedgerManager::ledgerAbbrev(mHeaderHistoryEntry),
                        LedgerManager::ledgerAbbrev(
                            lm.getLastClosedLedgerHeader())));
                }

                lm.emitAnyStoredSafeMetaAfterLedgerValidation(); // <-- planned new call

                mApplyLedgerSuccess.Mark();
            }

The existing behavior is unsafe, and when we're done with this we want to discourage people from enabling this flag, so yes, the "un" is important (we have other config elements like this that we mark as "unsafe"). I would prefer to avoid introducing a config element that we need to rename later (as this would be a breaking change for everybody).

Aha, that makes sense. UNSAFE it is.

MonsieurNicolas commented 3 years ago

Do you mean restarting after an offline catchup to some ledger N? It's true that stellar-core won't be able to emit meta for N after a restart, but it should have done so by the end of the offline catchup, just via the call to ApplyCheckpointWork::onRun after catchup as opposed to by having added 1 to the target ledger in the CatchupConfiguration. I might be misunderstanding your question, though.

I am not talking about restarting after offline catchup. Captive core is an "all in memory" mode that relies on Horizon to know where to start from, so I am asking if there is an edge case here that we should be thinking about. I am not too familiar with what they pass in exactly (in terms of trusted ledger number) vs which meta they expect to be emitted.

Right -- specifically after we close a ledger during a catchup, ApplyCheckpointWork::onRun checks the hash of that ledger against the one in the history archive. If they match, I think that means we can emit the meta even though we haven't closed the next ledger yet, because we've now validated the ledger in a different but equally-trusted way

Oh I see, you're saying that because we have a verified LedgerHeader in that context (that's how we got the transaction set, etc.), we can emit the meta right away if the LCL hash matches the trusted one. I think this works, yes.

rokopt commented 3 years ago

I am not talking about restarting after offline catchup. Captive core is an "all in memory" mode that relies on Horizon to know where to start from, so I am asking if there is an edge case here that we should be thinking about. I am not too familiar with what they pass in exactly (in terms of trusted ledger number) vs which meta they expect to be emitted.

Aha, I see. I'm even less familiar with what Horizon passes in, so maybe @bartekn or @2opremio can explain. @tamirms has also been helping me get Horizon's CI going on my test branch, so I should be able to run a number of edge cases before I commit anything rash. ;)

Oh I see, you're saying that because we have a verified LedgerHeader in that context (that's how we got the transaction set, etc.), we can emit the meta right away if the LCL hash matches the trusted one. I think this works, yes.

Yes, exactly. Thanks!

rokopt commented 3 years ago

I must also figure out where/how else we might decide that a ledger can be trusted so that we can call emitAnyStoredSafeMetaAfterLedgerValidation.

I haven't actually found any other place besides the one I cited in ApplyCheckpointWork::onRun where calling emitAnyStoredSafeMetaAfterLedgerValidation would be both safe and not redundant. The continuation-style catchup work sequence appears to be nicely factored so that that's the one place where we validate the hash of a ledger when we're not planning to close another one as part of catchup (in the latter case it would be redundant, since we'll emit the currently-stashed meta when we close that next ledger). However, my familiarity with some of this code dates to as recently as a few minutes ago, so it's not difficult to imagine that I'm missing something.

rokopt commented 3 years ago

I said "may" because you need to think through what happens when Horizon restarts. Did you sketch out the sequence to make sure that we're not going to miss one ledger? On restart, we won't have mLastLedgerMeta set until the first ledger gets closed which means that when core closes the first ledger based on its initial state, it won't be able to emit the meta from the previous ledger (whatever that is).

My reading of https://github.com/stellar/go/blob/88805205b165b86d2daa9d5f48bd88181bafa3d5/ingest/ledgerbackend/captive_core_backend.go#L300-L356, and my understanding of previous captive-core discussions with Horizon, is that Horizon always launches a captive core using --start-at-ledger to tell it to catch up to the checkpoint before the last ledger that Horizon has in its database and then replay meta from there. But I'll re-check.