graydon opened this issue 3 years ago (status: Open)
Apparently there is a fourth problem that Horizon has seen:
`Target ledger 1 is not newer than last closed ledger 1 - nothing to do`
I'm trying to work out which command line / config caused that. It's filed as https://github.com/stellar/go/issues/3157 .
A few questions: what does "Core starts from genesis with an upgrade to the most-recent protocol" mean? Via HTTP it's possible to arm a single upgrade (which can change multiple things) that is supposed to be applied at the next ledger close. This could be a bug where "manual close" for some reason eats upgrades, but as this is all coming from external test-harness code, I'd like to understand whether that is really the case (I remember bugs in stellar commander related to this: we were calling "upgrades" multiple times, but the problem is that we have a "last call wins" type of situation).
Before looking into the two other issues, which seem related to the use of "real time": I'd like to make sure this makes sense, as real time will cause races, weird behaviors, and non-determinism in general (especially if multiple processes are involved). I would also prefer to avoid adding special cases that override production behavior, since those require more testing both for core and for people outside of core. In the context of testing, I'd imagine the goal is 100% determinism with production-behavior fidelity, which can only be achieved in virtual time: for example, you can close ledgers and get an exact match with whatever happens, because everything (such as apply order and ledger hashes) will be deterministic.
There is apparently a fifth issue that turned up during this test:
https://github.com/stellar/go/issues/3155
It was after working around that one that they ran into "Target ledger 1 is not newer than last closed ledger 1", as filed in https://github.com/stellar/go/issues/3157.
> A few questions: what does "Core starts from genesis with an upgrade to the most-recent protocol" mean? [...] This could be a bug that "manual close" for some reason eats upgrades [...]
Right, this Horizon code is hitting the `upgrades` endpoint: https://github.com/stellar/docker-stellar-core-horizon/blob/master/start#L306-L321 (They're setting `upgradetime` to the Unix epoch.)
I'd expect the upgrade to take place only after the next (manual, in this case) ledger close, since `upgrades` just "arms" the upgrades. However, the apparent bug is that if they hit the `manualclose` endpoint (from another thread) before hitting the `upgrades` endpoint with the code above, then the upgrade to the latest protocol version would never happen, even after further (manual) ledger closes.
There might be some reason that that's expected, but I haven't found it yet. There is code in the core that removes old upgrades:
```cpp
Upgrades::UpgradeParameters
Upgrades::removeUpgrades(std::vector<UpgradeType>::const_iterator beginUpdates,
                         std::vector<UpgradeType>::const_iterator endUpdates,
                         uint64_t closeTime, bool& updated)
{
    updated = false;
    UpgradeParameters res = mParams;
    // If the upgrade time has been surpassed by more than X hours, then remove
    // all upgrades. This is done so nodes that come up with outdated upgrades
    // don't attempt to change the network
    if (res.mUpgradeTime + Upgrades::UPDGRADE_EXPIRATION_HOURS <=
        VirtualClock::from_time_t(closeTime))
    {
        auto resetParamIfSet = [&](optional<uint32>& o) {
            if (o)
            {
                o.reset();
                updated = true;
            }
        };
```
So that might be expected, but I'm not certain, because I haven't been able to reproduce the bug. If that is what's happening, then I think Horizon should just set the `upgradetime` to the current time (I suggested that they try that).
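To illustrate why an `upgradetime` of the Unix epoch would always be pruned, here's a minimal sketch of the expiration test above (the function name and the 12-hour window are assumptions for illustration, not quotes from stellar-core):

```cpp
#include <chrono>
#include <cstdint>

// Sketch of the expiration check: if the armed upgrade's time plus the
// expiration window is at or before the ledger's close time, the upgrade is
// considered stale and removed. An upgradetime of 0 (the Unix epoch) is
// therefore stale against any present-day closeTime.
bool isUpgradeExpired(std::uint64_t upgradeTime, std::uint64_t closeTime)
{
    auto const windowSecs = std::chrono::duration_cast<std::chrono::seconds>(
                                std::chrono::hours(12))
                                .count();
    return upgradeTime + static_cast<std::uint64_t>(windowSecs) <= closeTime;
}
```

With `upgradetime` set to the current time instead, the window would not have elapsed at the next manual close, so the upgrade would survive long enough to be applied.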
> Before looking into the two other issues, that seem related to the use of "real time" [...] I'd imagine that the goal would be to get 100% determinism and production behavior fidelity, which can only be achieved in virtual time [...]
I agree that `VIRTUAL_TIME` would be more deterministic -- though since they're building a network, we in Core would have to expose a way for them to get into it; currently the only way would be for them to configure `RUN_STANDALONE` as well as the already-configured `MANUAL_CLOSE`, and of course that would prevent them from connecting to any peers. Also, they want in general to make this test run under conditions as close to production as they can manage, so they might dislike `VIRTUAL_TIME` for that reason.
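For concreteness, the combination described above would look roughly like this in a stellar-core config file (both option names are real; whether this pair is enough to get into virtual time is exactly the open question here):

```toml
# Standalone node with no peer connections (RUN_STANDALONE), closing
# ledgers only when the manualclose HTTP command is issued (MANUAL_CLOSE).
RUN_STANDALONE=true
MANUAL_CLOSE=true
```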
I was not aware that it was a goal to have a network of core nodes.
In that case, to have one integration test that is as close as possible to production (with several core instances talking to each other), it's simple: `ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING` is the way to go.
Anything else is going to compromise too much on behavior and/or will require quite a bit of machinery on the test harness front to ensure that everything is working as expected.
If we're going to depart from normal behavior, I'd rather see us expose different options/commands to facilitate efficient integration testing by downstream systems. There is no reason to use multiple core nodes in most test cases; instead, just focus on the contracts (history archive and meta generation) -- we already do this in some of our own unit tests.
> I was not aware that it was a goal to have a network of core nodes. In that case [...] `ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING` is the way to go. [...]
Thanks, I was not aware of `ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING`! I'll suggest that to Horizon.
> Core starts from genesis with an upgrade to the most-recent protocol, and if the manual-close endpoint is hit before the upgrade succeeds the protocol remains locked at v0 indefinitely. [...]
I wonder if the cause of this is that there can be a race where:

1. `manualclose` is issued and triggers the next ledger
2. `upgrades` is issued (with `upgradetime==epoch`)
3. `removeUpgrades()` is called
4. `closeTime` is the current wall-clock time, so `removeUpgrades()` considers the current upgrades to be old and clears them

If so, then that might just be a natural race resulting from the way that Horizon is/was calling `manualclose` and `upgrades` from different processes without synchronization. I'm not sure of that, though; I still haven't been able to trigger the race (if it exists) on my own dev box.
> If so, then that might just be a natural race resulting from the way that Horizon is/was calling `manualclose` and `upgrades` from different processes without synchronization. [...]
Yes, there is a race between the `/upgrades` call and the first `GET /manualclose` that I didn't know about before this issue was created (the upgrade is issued by the `stellar/quickstart` container bootup machinery and `GET /manualclose` by the integration test itself). We should solve it by synchronizing both.
So we can forget about this problem, but the other two still stand AFAIU.
> In that case, to have one integration test that is as close as possible to production (with several core instances talking to each other), then it's simple: `ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING` is the way to go.
How much faster is this compared to real-time?
With `MANUAL_CLOSE` (without Captive Core, which is one of the scenarios to test and doesn't suffer from the issues discussed here) the testing time goes down from 13 minutes to 5 minutes (including the test setup). See https://github.com/stellar/go/pull/3167
> How much faster is this compared to real-time?
>
> With `MANUAL_CLOSE` [...] the testing time goes down from 13 min to 5 minutes (including the test setup). See stellar/go#3167
`ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING` drops the time between ledger closes from ~5s to ~1s, and the number of ledgers between checkpoints from 64 to 8.
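To make the checkpoint cadence concrete, here's an illustrative helper (a hypothetical function, not the actual stellar-core API) for deciding whether a ledger completes a checkpoint window; with a frequency of 64 the boundaries fall at ledgers 63, 127, 191, and so on:

```cpp
#include <cstdint>

// Illustrative checkpoint-boundary test. Ledger sequence numbers start at 1
// and the first checkpoint window also covers the genesis ledger, so with
// frequency 64 the boundaries are ledgers 63, 127, 191, ...; with the
// accelerated frequency of 8 they are 7, 15, 23, ...
bool isCheckpointLedger(std::uint32_t seq, std::uint32_t frequency)
{
    return (seq + 1) % frequency == 0;
}
```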
> Starting a second, captive core and attempting to sync the captive core to the non-captive requires pushing the first, non-captive core through to a checkpoint-publish event. This is expected. But what's not expected is that if the manual-close endpoint is hit repeatedly to push the non-captive core through to publishing a checkpoint, a pause must be inserted at ledger 63 in order to provoke the non-captive core to publish at all. If pushing straight past it at full speed, publishing evidently stalls or fails to occur. This might have to do with the queueing and asynchronous scheduling behaviour of the publishing system.
I think it is -- we update the `HISTORY_PUBLISH` status in `HistoryManagerImpl::logAndUpdatePublishStatus()`, which we call from `LedgerManagerImpl::closeLedger()`. So after a publication status change, in particular in a worker thread, the `info` endpoint's `"status"` won't reflect the new status until after the next ledger close. If we `manualclose` enough times in a row to push past a checkpoint, but the last few race in before publishing finishes, then we'll still have a `"publishing"` status until the next `manualclose`. Waiting at ledger 63 is probably allowing the publishing to complete, so that the close of ledger 64 updates the status to reflect that it has completed.
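The lag described above can be modeled with a toy function (hypothetical names, not stellar-core code): the status visible to `info` is whatever was computed at the last ledger close, regardless of whether the asynchronous publish has since finished:

```cpp
#include <string>

// Toy model of the status lag: the async publish may have finished
// (publishDone), but the reported status string is only recomputed at
// ledger close, so until another close happens (closedSinceFinish) the
// caller still sees the status computed at the previous close.
std::string reportedStatus(bool publishDone, bool closedSinceFinish)
{
    if (!closedSinceFinish)
    {
        return "publishing"; // stale: reflects the state at the last close
    }
    return publishDone ? "idle" : "publishing";
}
```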
Beyond that specific problem, it's not actually obvious to me why it's safe to use `mPublishWork` without any synchronization in `HistoryManagerImpl::logAndUpdatePublishStatus()` when it can be `reset()` in a worker thread (in `HistoryManagerImpl::historyPublished()`), but I am unfamiliar with the architecture, so I could easily be missing something.
> Finally, when pushing the non-captive core forwards, the captive core often falls out of sync altogether.
It has taken me an embarrassingly long time to realize that this is because `HerderImpl::broadcast()` is a no-op if `MANUAL_CLOSE` is set! So a manually-closed ledger will never update watchers to tell them to externalize new values. The only reason that the watcher node ever gets a new ledger is that it eventually loses track of consensus and sends an explicit `GET_SCP_STATE`, to which the `MANUAL_CLOSE` validator replies, thus bringing the watcher back into sync.
That's why the logs always show the next ledger closes on the watcher taking almost exactly 45 seconds -- that's 35 for `CONSENSUS_STUCK_TIMEOUT_SECONDS` followed by the 10 for the `OUT_OF_SYNC_RECOVERY_TIMER` which the loss of consensus triggers.
> It has taken me an embarrassingly long time to realize that this is because `HerderImpl::broadcast()` is a no-op if `MANUAL_CLOSE` is set!
I'm not sure whether that's actually necessary or just an optimization -- and I don't think that a slight optimization of a usually-local message in a `MANUAL_CLOSE` configuration is particularly important. If it isn't necessary, then we might be able to just take that out. Or take it out only if both `MANUAL_CLOSE` and `RUN_STANDALONE` are set, the same as with `HerderImpl::startOutOfSyncTimer()`.
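The alternative proposed here can be sketched as a predicate (illustrative names; the real check is an early return inside `HerderImpl::broadcast()`): suppress outgoing SCP messages only when there cannot be any peers to receive them:

```cpp
// Illustrative sketch of the proposed condition. Today broadcast() is
// effectively suppressed whenever manualClose is set; requiring
// runStandalone too would let a MANUAL_CLOSE validator with attached
// watchers still announce externalized values.
bool shouldSuppressBroadcast(bool manualClose, bool runStandalone)
{
    return manualClose && runStandalone;
}
```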
> Beyond that specific problem, it's not actually obvious to me why it's safe to use `mPublishWork` without any synchronization in `HistoryManagerImpl::logAndUpdatePublishStatus()` when it can be `reset()` in a worker thread (in `HistoryManagerImpl::historyPublished()`) [...]
Aha -- what I didn't realize is that `HistoryManagerImpl::historyPublished()` actually runs on the main thread, not a background thread. So I think this is okay.
> I'm not sure whether that's actually necessary, or just an optimization [...] Or take it out only if both `MANUAL_CLOSE` and `RUN_STANDALONE` are set, the same as with `HerderImpl::startOutOfSyncTimer()`.
Nice, I am eager to test this!
I did commit the `HerderImpl::broadcast()` change to a private branch of mine in https://github.com/rokopt/stellar-core/commit/fc78d620739f236f621238ce767ea5c9db60f2b4; it seems to work in my testing, at least.
Even if it gets merged, though, I suspect that `ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING` will allow much cleaner test code than `MANUAL_CLOSE`.
Another disadvantage of `MANUAL_CLOSE` is that the captive core will still fall out of sync if you don't issue a `manualclose` on it for 35 seconds (`CONSENSUS_STUCK_TIMEOUT_SECONDS`).
I think that's fine when testing.
> Apparently there is a fourth problem that Horizon has seen:
>
> `Target ledger 1 is not newer than last closed ledger 1 - nothing to do`
>
> I'm trying to work out which command line / config caused that. It's filed as stellar/go#3157.
It was `--start-at-ledger 1` (with a corresponding `start-at-hash`).
I think we can straightforwardly tolerate `--start-at-ledger 1` if we want to, as in https://github.com/rokopt/stellar-core/commit/3cea6ee9ec5bc08495347e0aebf14f9dd6acb9d4.
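As an illustration of the kind of tolerance meant here (a hypothetical sketch; the linked commit is the actual change), the target-ledger check could special-case starting at the genesis ledger:

```cpp
#include <cstdint>

// Hypothetical sketch: instead of rejecting "target ledger 1 is not newer
// than last closed ledger 1", treat a genesis target as trivially satisfied,
// since there is nothing to replay.
bool canStartAt(std::uint32_t targetLedger, std::uint32_t lastClosed)
{
    if (targetLedger == 1 && lastClosed == 1)
    {
        return true; // already at genesis; nothing to do, but not an error
    }
    return targetLedger > lastClosed;
}
```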
@rokopt are you going to create PRs with the fixes you found? (I am wondering when they will be released, if they are released)
> @rokopt are you going to create PRs with the fixes you found? (I am wondering when they will be released, if they are released)
I can do that -- I didn't know whether they'd actually worked for you. Have they resolved the issues that I hoped they would?
Also, how does the speed of using `manualclose` with those fixes compare to using `ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING`?
As I mentioned earlier, for setups with multiple nodes I would rather fix issues that come up while adopting `ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING` than fix those other problems, since we'd be diverging more and more from the behavior in production (which makes testing less interesting).
I think this issue can be closed: we're not going to invest in mechanisms that allow driving the clock on a distributed network, and for standalone I think none of this is relevant at this point.
Three separate issues emerged when Horizon (specifically @2opremio -- see also https://github.com/stellar/go/pull/3144#issuecomment-718078768) attempted to drive core in non-standalone manual-close mode (real time, but pushing it ahead manually to accelerate testing) on a test network:
Core starts from genesis with an upgrade to the most-recent protocol, and if the manual-close endpoint is hit before the upgrade succeeds the protocol remains locked at v0 indefinitely. This can be worked around by waiting in horizon for the observation of a nonzero protocol version, but it's evidently not ideal, they'd prefer to just manually close immediately.
Starting a second, captive core and attempting to sync the captive core to the non-captive requires pushing the first, non-captive core through to a checkpoint-publish event. This is expected. But what's not expected is that if the manual-close endpoint is hit repeatedly to push the non-captive core through to publishing a checkpoint, a pause must be inserted at ledger 63 in order to provoke the non-captive core to publish at all. If pushing straight past it at full speed, publishing evidently stalls or fails to occur. This might have to do with the queueing and asynchronous scheduling behaviour of the publishing system.
Finally, when pushing the non-captive core forwards, the captive core often falls out of sync altogether. We speculate that this is happening because, by repeatedly hitting the manual-close endpoint, ledgers are being generated with strictly increasing timestamps even though they all happen (in real time) within the same second, and these increasing-timestamp ledgers are judged to be invalid (relative to real time) by the receiving captive core, which then discards them and judges itself to be out of sync. We speculate this because when the captive core loses sync, it emits an SCP state consisting of consensus rounds that all have a close time exactly 1 second ahead of the previous close time, and are all marked `"validated": false`, as visible in the following loss-of-sync message (occurring on the captive core):