Closed MonsieurNicolas closed 3 years ago
I think it shouldn't require any changes to Horizon codebase. However I can see some affected areas (possibly there's more): the minimum HTTP response time of POST /transactions will be the duration between the last two ledgers' close times, and traders connected to an UNSAFE_EMIT_META=true Stellar-Core have an advantage over other users (can see trades from the last ledger faster).

@bartekn Might it require any changes to tests? I think I recall that there are some captive-core tests that wait to see meta from ledgers that they expect to have closed before they proceed to some further phase. Or has that changed since moving away from manualclose in favor of ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING? Or am I just confusing that with waiting for core to publish history (which I don't think we generally expect people to do with captive core)?
- note that "command line" modes may have to change a bit as we cannot emit the meta for the last ledger in a range
- in many cases, we can just "add 1" internally to the last element in the range (when replaying historical data, we can close an extra ledger)

Perhaps this is necessary, but I feel like it'd be hard to look at code related to the CatchupConfiguration and immediately understand it when it might or might not have had 1 added to the target ledger, so I'm inclined to try to avoid this. Maybe: when closeLedger begins on ledger N+1, before it allocates a structure to hold meta for ledger N+1, it should emit meta that it has been retaining, if any, for ledger N. This could be factored out into a method (emitAnyStoredSafeMetaAfterLedgerValidation?). ApplyCheckpointWork::onRun could call it if/when it confirms that lm.getLastClosedLedgerHeader().hash == mHeaderHistoryEntry.hash, which I think addresses the case where we replay from history. (I'm not sure yet what the others of the "many" cases are!) Does that make sense?

- for the latest ledger, a possibility to emit it anyways is to provide the hash of the following ledger (recovered offline, from a different core node for example) or just set UNSAFE_EMIT_META=true. I am not sure this is really needed though as trying to ingest the "latest" ledger outside of "online" mode probably doesn't make sense (but it may be needed for testing).

I'll ask Horizon whether they need this for testing.
The solution is to adopt a new operating mode, driven by a configuration element UNSAFE_EMIT_META (with a default value of true for now)

I usually feel like an experimental feature would start out false by default and eventually become true, and I'd probably be surprised to find a parameter called UNSAFE defaulting to true. What about SAFE_EMIT_META defaulting at first to false? Or do you think it's important to call out the "un" part of UNSAFE?
My plans so far:

- Add SAFE_EMIT_META, defaulting to false (though see question above; maybe this will go back to UNSAFE_EMIT_META defaulting to true after further discussion). It requires MODE_USES_IN_MEMORY_LEDGER and a non-empty METADATA_OUTPUT_STREAM.
- Add a unique_ptr<LedgerCloseMeta> mLastLedgerMeta to LedgerManagerImpl. In closeLedger, before we allocate a new local ledgerCloseMeta, we call a new emitAnyStoredSafeMetaAfterLedgerValidation to check whether we have any mLastLedgerMeta and, if so, flush it to mMetaStream.
- Late in closeLedger, once ledgerCloseMeta is populated, if SAFE_EMIT_META=true, then instead of flushing the meta, we stash it in mLastLedgerMeta.
- When ApplyCheckpointWork::onRun confirms lm.getLastClosedLedgerHeader().hash == mHeaderHistoryEntry.hash, it calls emitAnyStoredSafeMetaAfterLedgerValidation (which is public for this kind of purpose), as it has confirmed with a trusted source that the last closed ledger is not corrupted.
- Extend LedgerCloseMetaStreamTests to test watcher and replay configurations with SAFE_EMIT_META set.
- Figure out where/how else we might decide that a ledger can be trusted so that we can call emitAnyStoredSafeMetaAfterLedgerValidation. The required code seems suspiciously small so far, which makes me think I am missing some places. ;-)

Because we always emit earlier meta before allocating memory for new meta in closeLedger, we never hold multiple LedgerCloseMetas in memory at once. However, we do now retain a LedgerCloseMeta during the intervals when we're not in the middle of closeLedger. I suspect that this is alright because I don't think that anything that we're doing during those intervals is likely to require a lot more memory than we need during closeLedger itself, but if I'm wrong about that, then that could be an issue. (Avoiding it would probably require compacting the stashed meta in some way during those intervals?)
I think it shouldn't require any changes to Horizon codebase. However I can see some affected areas (possibly there's more):

- Transaction submission - Horizon will see the final status of a tx a few seconds slower; the minimum HTTP response time of POST /transactions will be the duration between the last two ledgers' close times,
- Alerts/metrics - alerts about ledger delay will now trigger faster,
- High frequency trading - traders connected to an UNSAFE_EMIT_META=true Stellar-Core have an advantage over other users (can see trades from the last ledger faster),
- Fees - as above but regarding fee bidding.
Thanks for all these good points. I think these all fall into the category of "stuff someone is inevitably accepting by setting SAFE_EMIT_META
, in return for the safety", but one interesting question is what we should document about this (for example, I'm guessing we probably won't encourage high-frequency trading with unsafe meta :) ).
Perhaps this is necessary, but I feel like it'd be hard to look at code related to the CatchupConfiguration and immediately understand it when it might or might not have had 1 added to the target ledger, so I'm inclined to try to avoid this. Maybe:
I said "may" because you need to think through what happens when Horizon restarts.
Did you sketch out the sequence to make sure that we're not going to miss one ledger? On restart, we won't have mLastLedgerMeta
set until the first ledger gets closed which means that when core closes the first ledger based on its initial state, it won't be able to emit the meta from the previous ledger (whatever that is).
The invariant that we're trying to maintain is that we emit metadata only for ledgers whose hashes have been validated by some source that we trust, such as consensus (when the network externalizes the next ledger, thus confirming that it agreed with our hash of the previous one, which we can now emit), or a history archive, or maybe a hash on the command line as I think you're discussing below.
yes this is correct
When closeLedger begins on ledger N+1, before it allocates a structure to hold meta for ledger N+1, it should emit meta that it has been retaining, if any, for ledger N. This could be factored out into a method (emitAnyStoredSafeMetaAfterLedgerValidation?). That method could be made public, and callers could use it whenever they somehow validate the hash of the last closed ledger. In particular, ApplyCheckpointWork::onRun could call it if/when it confirms that lm.getLastClosedLedgerHeader().hash == mHeaderHistoryEntry.hash, which I think addresses the case where we replay from history. (I'm not sure yet what the others of the "many" cases are!)
Maybe?
I guess: are there cases where we verify without calling closeLedger that require calling this method separately? If not, we don't need this method at all, right?
I usually feel like an experimental feature would start out false by default and eventually become true, and I'd probably be surprised to find a parameter called UNSAFE defaulting to true. What about SAFE_EMIT_META defaulting at first to false? Or do you think it's important to call out the "un" part of UNSAFE?
existing behavior is unsafe, and when we're done with this we want to discourage people from enabling this flag, so yes the "un" is important (we have other config elements like this that we mark as "unsafe"). I would prefer avoiding introducing a config element that we need to rename later (as this would be a breaking change for everybody).
@bartekn Might it require any changes to tests? I think I recall that there are some captive-core tests that wait to see meta from ledgers that they expect to have closed before they proceed to some further phase. Or has that changed since moving away from manualclose in favor of ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING? Or am I just confusing that with waiting for core to publish history (which I don't think we generally expect people to do with captive core)?
I don't think it would require changes to tests. We can always enable UNSAFE_EMIT_META=true
in integration tests.
As far as I remember, the problem in tests was connected to waiting for the first checkpoint ledger. But @2opremio will know better.
I think it will just work transparently, (but it will cause tests to be slower due to the extra synchronous submission delay).
I said "may" because you need to think through what happens when Horizon restarts. Did you sketch out the sequence to make sure that we're not going to miss one ledger? On restart, we won't have mLastLedgerMeta set until the first ledger gets closed which means that when core closes the first ledger based on its initial state, it won't be able to emit the meta from the previous ledger (whatever that is).
Do you mean restarting after an offline catchup to some ledger N
? It's true that stellar-core
won't be able to emit meta for N
after a restart, but it should have done so by the end of the offline catchup, just via the call to ApplyCheckpointWork::onRun
after catchup as opposed to by having added 1
to the target ledger in the CatchupConfiguration
. I might be misunderstanding your question, though.
I guess are there cases where we are verifying without calling
closeLedger
that requires calling this method separately?
Right -- specifically after we close a ledger during a catchup, ApplyCheckpointWork::onRun
checks the hash of that ledger against the one in the history archive. If they match, I think that means we can emit the meta even though we haven't closed the next ledger yet, because we've now validated the ledger in a different but equally-trusted way. Specifically:
if (mConditionalWork->getState() == State::WORK_SUCCESS)
{
    auto& lm = mApp.getLedgerManager();
    CLOG_DEBUG(History, "{}",
               xdr_to_string(lm.getLastClosedLedgerHeader(),
                             "LedgerManager LCL"));
    CLOG_DEBUG(History, "{}",
               xdr_to_string(mHeaderHistoryEntry, "Replay header"));
    if (lm.getLastClosedLedgerHeader().hash != mHeaderHistoryEntry.hash)
    {
        mApplyLedgerFailure.Mark();
        throw std::runtime_error(fmt::format(
            "replay of {:s} produced mismatched ledger hash {:s}",
            LedgerManager::ledgerAbbrev(mHeaderHistoryEntry),
            LedgerManager::ledgerAbbrev(lm.getLastClosedLedgerHeader())));
    }
    lm.emitAnyStoredSafeMetaAfterLedgerValidation(); // <-- planned new call
    mApplyLedgerSuccess.Mark();
}
existing behavior is unsafe, and when we're done with this we want to discourage people from enabling this flag, so yes the "un" is important (we have other config elements like this that we mark as "unsafe"). I would prefer avoiding introducing a config element that we need to rename later (as this would be a breaking change for everybody).
Aha, that makes sense. UNSAFE
it is.
Do you mean restarting after an offline catchup to some ledger N? It's true that stellar-core won't be able to emit meta for N after a restart, but it should have done so by the end of the offline catchup, just via the call to ApplyCheckpointWork::onRun after catchup as opposed to by having added 1 to the target ledger in the CatchupConfiguration. I might be misunderstanding your question, though.
I am not talking about restarting after offline catchup. Captive core is an "all in memory" mode that relies on Horizon to know where to start from, so I am asking if there is an edge case here that we should be thinking about. I am not too familiar with what they pass in exactly (in terms of trusted ledger number) vs which meta they expect to be emitted.
Right -- specifically after we close a ledger during a catchup, ApplyCheckpointWork::onRun checks the hash of that ledger against the one in the history archive. If they match, I think that means we can emit the meta even though we haven't closed the next ledger yet, because we've now validated the ledger in a different but equally-trusted way
Oh I see, you're saying that because we have a verified LedgerHeader
in that context (how we got the transaction set, etc), we can emit the meta right away in that context if the lcl hash matches the trusted one. I think this works yes.
I am not talking about restarting after offline catchup. Captive core is an "all in memory" mode that relies on Horizon to know where to start from, so I am asking if there is an edge case here that we should be thinking about. I am not too familiar with what they pass in exactly (in terms of trusted ledger number) vs which meta they expect to be emitted.
Aha, I see. I'm even less familiar with what Horizon passes in, so maybe @bartekn or @2opremio can explain, and @tamirms has also been helping me get Horizon's CI going on my test branch, so I should be able to get a number of edge cases run before I commit anything rash. ;)
Oh I see, you're saying that because we have a verified
LedgerHeader
in that context (how we got the transaction set, etc), we can emit the meta right away in that context if the lcl hash matches the trusted one. I think this works yes.
Yes, exactly. Thanks!
I must also figure out where/how else we might decide that a ledger can be trusted so that we can call
emitAnyStoredSafeMetaAfterLedgerValidation
.
I haven't actually found any other place besides the one that I cited in ApplyCheckpointWork::onRun
where I think that calling emitAnyStoredSafeMetaAfterLedgerValidation
would be safe and not redundant with ApplyCheckpointWork::onRun
. The continuation-style catchup work sequence appears to me to be nicely factored so that that's the one place where we validate the hash of a ledger when we're not planning to try to close another one as part of catchup (in the latter case it would be redundant as we'll emit the currently-stashed meta when we do close that next ledger). However, my familiarity with some of the code dates to as recently as a few minutes ago, so it is not difficult to imagine that I'm missing something.
I said "may" because you need to think through what happens when Horizon restarts. Did you sketch out the sequence to make sure that we're not going to miss one ledger? On restart, we won't have mLastLedgerMeta set until the first ledger gets closed which means that when core closes the first ledger based on its initial state, it won't be able to emit the meta from the previous ledger (whatever that is).
My reading of https://github.com/stellar/go/blob/88805205b165b86d2daa9d5f48bd88181bafa3d5/ingest/ledgerbackend/captive_core_backend.go#L300-L356 and understanding of previous captive-core discussions with Horizon is that Horizon always launches a captive core using --start-at-ledger
to tell it to catch up to the checkpoint before the last ledger that Horizon has in its database, then replay meta from there. But I'll re-check.
Something identified as part of hardening analysis.
Currently, in the event of a 1 ledger fork (bad quorum set configuration, or more likely some bug due to local corruption of the core state), the core node will get stuck and notice that it's on a fork after externalizing the bad state, as it won't be able to reach consensus with the network. When that happens, the Horizon node (or any dependency) that consumes meta from that core instance will have ingested the bad data from a ledger that does not match the overall network state, and in turn other applications in the ecosystem may perform certain tasks with the expectation of finality of transactions. This is basically impossible to recover from in a sane way.
The solution is to adopt a new operating mode, driven by a configuration element UNSAFE_EMIT_META (with a default value of true for now), that limits the scope of that 1 ledger fork to the core node and does not allow the information to "leak out" to other parts of the system. The downside is to have a delay of 1 ledger in downstream systems (so perceived latency is longer). We probably need to talk to the Horizon team a bit to make sure the transition to "safe meta" works as transparently as possible.

In a nutshell, when that mode is enabled:

- for the latest ledger, a possibility to emit it anyways is to provide the hash of the following ledger (recovered offline, from a different core node for example) or just set UNSAFE_EMIT_META=true. I am not sure this is really needed though as trying to ingest the "latest" ledger outside of "online" mode probably doesn't make sense (but it may be needed for testing).