graydon opened this issue 3 years ago (status: Open)
Apparently there is a fourth problem that Horizon has seen:
`Target ledger 1 is not newer than last closed ledger 1 - nothing to do`
I'm trying to work out which command line / config caused that. It's filed as https://github.com/stellar/go/issues/3157 .
A few questions: what does "Core starts from genesis with an upgrade to the most-recent protocol" mean? Via HTTP it's possible to arm a single upgrade (which can change multiple things) that is supposed to be applied at the next ledger close. This could be a bug where "manual close" for some reason eats upgrades, but as this is all coming from external test-harness code, I'd like to understand whether that is really the case (I remember bugs in stellar commander related to this: we were calling "upgrades" multiple times, but the problem is that we have a "last call wins" type of situation).
Before looking into the two other issues, which seem related to the use of "real time": I'd like to make sure this makes sense, as real time will cause races, weird behaviors, and non-determinism in general (especially if multiple processes are involved). I would also prefer to avoid adding special cases that override production behavior, since those require more testing both for core and for people outside of core. In the context of testing, I'd imagine the goal is 100% determinism with production-behavior fidelity, which can only be achieved in virtual time: for example, you can close ledgers and get an exact match with whatever happens, because everything (such as apply order and ledger hashes) will be deterministic.
There is apparently a fifth issue that turned up during this test:
https://github.com/stellar/go/issues/3155
It was after working around that one that they ran into "Target ledger 1 is not newer than last closed ledger 1", as filed in https://github.com/stellar/go/issues/3157.
> A few questions: what does "Core starts from genesis with an upgrade to the most-recent protocol" mean? [...] This could be a bug that "manual close" for some reason eats upgrades [...]
Right, this Horizon code is hitting the `upgrades` endpoint: https://github.com/stellar/docker-stellar-core-horizon/blob/master/start#L306-L321 (They're setting `upgradetime` to the Unix epoch.)
I'd expect the upgrade to take place only after the next (manual, in this case) ledger close, since `upgrades` just "arms" the upgrades. However, the apparent bug is that if they hit the `manualclose` endpoint (from another thread) before hitting the `upgrades` endpoint with the code above, then the upgrade to the latest protocol version would never happen, even after further (manual) ledger closes.
There might be some reason that that's expected, but I haven't found it yet. There is code in the core that removes old upgrades:
```cpp
Upgrades::UpgradeParameters
Upgrades::removeUpgrades(std::vector<UpgradeType>::const_iterator beginUpdates,
                         std::vector<UpgradeType>::const_iterator endUpdates,
                         uint64_t closeTime, bool& updated)
{
    updated = false;
    UpgradeParameters res = mParams;
    // If the upgrade time has been surpassed by more than X hours, then remove
    // all upgrades. This is done so nodes that come up with outdated upgrades
    // don't attempt to change the network
    if (res.mUpgradeTime + Upgrades::UPDGRADE_EXPIRATION_HOURS <=
        VirtualClock::from_time_t(closeTime))
    {
        auto resetParamIfSet = [&](optional<uint32>& o) {
            if (o)
            {
                o.reset();
                updated = true;
            }
        };
```
So that might be expected, but I'm not certain, because I haven't been able to reproduce the bug. If that is what's happening, then I think Horizon should just set the `upgradetime` to the current time (I suggested that they try that).
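To illustrate why an `upgradetime` of the Unix epoch would always be pruned, here's a minimal sketch of the expiration test above (the function name and the 12-hour window are assumptions for illustration, not quotes from stellar-core):

```cpp
#include <chrono>
#include <cstdint>

// Sketch of the expiration check: if the armed upgrade's time plus the
// expiration window is at or before the ledger's close time, the upgrade is
// considered stale and removed. An upgradetime of 0 (the Unix epoch) is
// therefore stale against any present-day closeTime.
bool isUpgradeExpired(std::uint64_t upgradeTime, std::uint64_t closeTime)
{
    auto const windowSecs = std::chrono::duration_cast<std::chrono::seconds>(
                                std::chrono::hours(12))
                                .count();
    return upgradeTime + static_cast<std::uint64_t>(windowSecs) <= closeTime;
}
```

With `upgradetime` set to the current time instead, the window would not have elapsed at the next manual close, so the upgrade would survive long enough to be applied.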
> Before looking into the two other issues, that seem related to the use of "real time" [...] I'd imagine that the goal would be to get 100% determinism and production behavior fidelity, which can only be achieved in virtual time [...]
I agree that `VIRTUAL_TIME` would be more deterministic -- though since they're building a network, we in Core would have to expose a way for them to get into it; currently the only way would be for them to configure `RUN_STANDALONE` as well as the already-configured `MANUAL_CLOSE`, and of course that would prevent them from connecting to any peers. Also, they want in general to make this test run under conditions as close to production as they can manage, so they might dislike `VIRTUAL_TIME` for that reason.
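For concreteness, the combination described above would look roughly like this in a stellar-core config file (both option names are real; whether this pair is enough to get into virtual time is exactly the open question here):

```toml
# Standalone node with no peer connections (RUN_STANDALONE), closing
# ledgers only when the manualclose HTTP command is issued (MANUAL_CLOSE).
RUN_STANDALONE=true
MANUAL_CLOSE=true
```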
I was not aware that it was a goal to have a network of core nodes.
In that case, to have one integration test that is as close as possible to production (with several core instances talking to each other), it's simple: `ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING` is the way to go.
Anything else is going to compromise too much on behavior and/or will require quite a bit of machinery on the test harness front to ensure that everything is working as expected.
If we're going to depart from normal behavior, I'd rather see us expose different options/commands to facilitate efficient integration testing by downstream systems. There is no reason to use multiple core nodes in most test cases; instead, just focus on the contracts (history archive and meta generation) -- we already do this in some of our own unit tests.
> I was not aware that it was a goal to have a network of core nodes. In that case [...] `ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING` is the way to go. [...]
Thanks, I was not aware of `ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING`! I'll suggest that to Horizon.
> Core starts from genesis with an upgrade to the most-recent protocol, and if the manual-close endpoint is hit before the upgrade succeeds the protocol remains locked at v0 indefinitely. [...]
I wonder if the cause of this is that there can be a race where:

1. `manualclose` is issued and triggers the next ledger
2. `upgrades` is issued (with `upgradetime==epoch`)
3. `removeUpgrades()` is called
4. `closeTime` is the current wall-clock time, so `removeUpgrades()` considers the current upgrades to be old and clears them

If so, then that might just be a natural race resulting from the way that Horizon is/was calling `manualclose` and `upgrades` from different processes without synchronization. I'm not sure of that, though; I still haven't been able to trigger the race (if it exists) on my own dev box.
> If so, then that might just be a natural race resulting from the way that Horizon is/was calling `manualclose` and `upgrades` from different processes without synchronization. [...]
Yes, there is a race between the `/upgrades` call and the first `GET /manualclose` that I didn't know about before this issue was created (the upgrade is issued by the `stellar/quickstart` container bootup machinery and `GET /manualclose` by the integration test itself). We should solve it by synchronizing both.
So we can forget about this problem, but the other two still stand AFAIU.
> In that case, to have one integration test that is as close as possible to production (with several core instances talking to each other), then it's simple: `ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING` is the way to go.
How much faster is this compared to real-time?
With `MANUAL_CLOSE` (without Captive Core, which is one of the scenarios to test and doesn't suffer from the issues discussed here) the testing time goes down from 13 minutes to 5 minutes (including the test setup). See https://github.com/stellar/go/pull/3167
> How much faster is this compared to real-time?
>
> With `MANUAL_CLOSE` [...] the testing time goes down from 13 min to 5 minutes (including the test setup). See stellar/go#3167
`ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING` drops the time between ledger closes from ~5s to ~1s, and the number of ledgers between checkpoints from 64 to 8.
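To make the checkpoint cadence concrete, here's an illustrative helper (a hypothetical function, not the actual stellar-core API) for deciding whether a ledger completes a checkpoint window; with a frequency of 64 the boundaries fall at ledgers 63, 127, 191, and so on:

```cpp
#include <cstdint>

// Illustrative checkpoint-boundary test. Ledger sequence numbers start at 1
// and the first checkpoint window also covers the genesis ledger, so with
// frequency 64 the boundaries are ledgers 63, 127, 191, ...; with the
// accelerated frequency of 8 they are 7, 15, 23, ...
bool isCheckpointLedger(std::uint32_t seq, std::uint32_t frequency)
{
    return (seq + 1) % frequency == 0;
}
```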
> Starting a second, captive core and attempting to sync the captive core to the non-captive requires pushing the first, non-captive core through to a checkpoint-publish event. This is expected. But what's not expected is that if the manual-close endpoint is hit repeatedly to push the non-captive core through to publishing a checkpoint, a pause must be inserted at ledger 63 in order to provoke the non-captive core to publish at all. If pushing straight past it at full speed, publishing evidently stalls or fails to occur. This might have to do with the queueing and asynchronous scheduling behaviour of the publishing system.
I think it is -- we update the `HISTORY_PUBLISH` status in `HistoryManagerImpl::logAndUpdatePublishStatus()`, which we call from `LedgerManagerImpl::closeLedger()`. So after a publication status change, in particular in a worker thread, the `info` endpoint's `"status"` won't reflect the new status until after the next ledger close. If we `manualclose` enough times in a row to push past a checkpoint, but the last few race in before publishing finishes, then we'll still have a `"publishing"` status until the next `manualclose`. Waiting at ledger 63 is probably allowing the publishing to complete, so that the close of ledger 64 updates the status to reflect that it has completed.
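The lag described above can be modeled with a toy function (hypothetical names, not stellar-core code): the status visible to `info` is whatever was computed at the last ledger close, regardless of whether the asynchronous publish has since finished:

```cpp
#include <string>

// Toy model of the status lag: the async publish may have finished
// (publishDone), but the reported status string is only recomputed at
// ledger close, so until another close happens (closedSinceFinish) the
// caller still sees the status computed at the previous close.
std::string reportedStatus(bool publishDone, bool closedSinceFinish)
{
    if (!closedSinceFinish)
    {
        return "publishing"; // stale: reflects the state at the last close
    }
    return publishDone ? "idle" : "publishing";
}
```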
Beyond that specific problem, it's not actually obvious to me why it's safe to use `mPublishWork` without any synchronization in `HistoryManagerImpl::logAndUpdatePublishStatus()` when it can be `reset()` in a worker thread (in `HistoryManagerImpl::historyPublished()`), but I am unfamiliar with the architecture, so I could easily be missing something.
> Finally, when pushing the non-captive core forwards, the captive core often falls out of sync altogether.
It has taken me an embarrassingly long time to realize that this is because `HerderImpl::broadcast()` is a no-op if `MANUAL_CLOSE` is set! So a manually-closed ledger will never update watchers to tell them to externalize new values. The only reason that the watcher node ever gets a new ledger is that it eventually loses track of consensus and sends an explicit `GET_SCP_STATE`, to which the `MANUAL_CLOSE` validator replies, thus bringing the watcher back into sync.
That's why the logs always show the next ledger closes on the watcher taking almost exactly 45 seconds -- that's 35 for `CONSENSUS_STUCK_TIMEOUT_SECONDS` followed by the 10 for the `OUT_OF_SYNC_RECOVERY_TIMER` which the loss of consensus triggers.
> It has taken me an embarrassingly long time to realize that this is because `HerderImpl::broadcast()` is a no-op if `MANUAL_CLOSE` is set!
I'm not sure whether that's actually necessary or just an optimization -- and I don't think that a slight optimization of a usually-local message in a `MANUAL_CLOSE` configuration is particularly important. If it isn't necessary, then we might be able to just take that out. Or take it out only if both `MANUAL_CLOSE` and `RUN_STANDALONE` are set, the same as with `HerderImpl::startOutOfSyncTimer()`.
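The alternative proposed here can be sketched as a predicate (illustrative names; the real check is an early return inside `HerderImpl::broadcast()`): suppress outgoing SCP messages only when there cannot be any peers to receive them:

```cpp
// Illustrative sketch of the proposed condition. Today broadcast() is
// effectively suppressed whenever manualClose is set; requiring
// runStandalone too would let a MANUAL_CLOSE validator with attached
// watchers still announce externalized values.
bool shouldSuppressBroadcast(bool manualClose, bool runStandalone)
{
    return manualClose && runStandalone;
}
```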
> Beyond that specific problem, it's not actually obvious to me why it's safe to use `mPublishWork` without any synchronization in `HistoryManagerImpl::logAndUpdatePublishStatus()` when it can be `reset()` in a worker thread (in `HistoryManagerImpl::historyPublished()`) [...]
Aha -- what I didn't realize is that `HistoryManagerImpl::historyPublished()` actually runs on the main thread, not a background thread. So I think this is okay.
> I'm not sure whether that's actually necessary, or just an optimization [...] Or take it out only if both `MANUAL_CLOSE` and `RUN_STANDALONE` are set, the same as with `HerderImpl::startOutOfSyncTimer()`.
Nice, I am eager to test this!
I did commit the `HerderImpl::broadcast()` change to a private branch of mine in https://github.com/rokopt/stellar-core/commit/fc78d620739f236f621238ce767ea5c9db60f2b4; it seems to work in my testing, at least.
Even if it gets merged, though, I suspect that `ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING` will allow much cleaner test code than `MANUAL_CLOSE`.
Another disadvantage of `MANUAL_CLOSE` is that the captive core will still fall out of sync if you don't issue a `manualclose` on it for 35 seconds (`CONSENSUS_STUCK_TIMEOUT_SECONDS`).
I think that's fine when testing.
> Apparently there is a fourth problem that Horizon has seen:
>
> `Target ledger 1 is not newer than last closed ledger 1 - nothing to do`
>
> I'm trying to work out which command line / config caused that. It's filed as stellar/go#3157.
It was `--start-at-ledger 1` (with a corresponding `start-at-hash`).
I think we can straightforwardly tolerate `--start-at-ledger 1` if we want to, as in https://github.com/rokopt/stellar-core/commit/3cea6ee9ec5bc08495347e0aebf14f9dd6acb9d4.
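As an illustration of the kind of tolerance meant here (a hypothetical sketch; the linked commit is the actual change), the target-ledger check could special-case starting at the genesis ledger:

```cpp
#include <cstdint>

// Hypothetical sketch: instead of rejecting "target ledger 1 is not newer
// than last closed ledger 1", treat a genesis target as trivially satisfied,
// since there is nothing to replay.
bool canStartAt(std::uint32_t targetLedger, std::uint32_t lastClosed)
{
    if (targetLedger == 1 && lastClosed == 1)
    {
        return true; // already at genesis; nothing to do, but not an error
    }
    return targetLedger > lastClosed;
}
```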
@rokopt are you going to create PRs with the fixes you found? (I am wondering when they will be released, if they are released)
> @rokopt are you going to create PRs with the fixes you found? (I am wondering when they will be released, if they are released)
I can do that -- I didn't know whether they'd actually worked for you. Have they resolved the issues that I hoped they would?
Also, how does the speed of using `manualclose` with those fixes compare to using `ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING`?
As I mentioned earlier, for setups with multiple nodes I would rather fix issues that come up while adopting `ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING` than fix those other problems, since we'd be diverging more and more from the behavior in production (which makes testing less interesting).
I think this issue can be closed: we're not going to invest in mechanisms that allow driving the clock on a distributed network, and for standalone I think none of this is relevant at this point.
Three separate issues emerged when Horizon (specifically @2opremio -- see also https://github.com/stellar/go/pull/3144#issuecomment-718078768) attempted to drive core in non-standalone manual-close mode (real time, but pushing it ahead manually to accelerate testing) on a test network:
Core starts from genesis with an upgrade to the most-recent protocol, and if the manual-close endpoint is hit before the upgrade succeeds the protocol remains locked at v0 indefinitely. This can be worked around by waiting in horizon for the observation of a nonzero protocol version, but it's evidently not ideal, they'd prefer to just manually close immediately.
Starting a second, captive core and attempting to sync the captive core to the non-captive requires pushing the first, non-captive core through to a checkpoint-publish event. This is expected. But what's not expected is that if the manual-close endpoint is hit repeatedly to push the non-captive core through to publishing a checkpoint, a pause must be inserted at ledger 63 in order to provoke the non-captive core to publish at all. If pushing straight past it at full speed, publishing evidently stalls or fails to occur. This might have to do with the queueing and asynchronous scheduling behaviour of the publishing system.
Finally, when pushing the non-captive core forwards, the captive core often falls out of sync altogether. We speculate that this is happening because, by repeatedly hitting the manual-close endpoint, ledgers are being generated with strictly increasing timestamps even though they all happen (in real time) within the same second, and these increasing-timestamp ledgers are judged to be invalid (relative to real time) by the receiving captive core, which then discards them and judges itself to be out of sync. We speculate this because when the captive core loses sync, it emits an SCP state consisting of consensus rounds that all have a close time exactly 1 second ahead of the previous close time, and are all marked `"validated": false`, as visible in the following loss-of-sync message (occurring on the captive core):