Closed: 2opremio closed this 3 months ago
For context: I found this during an integration test that spends most of its time waiting for captive core to close.
Uhm, actually moving the cancel() out of the critical section didn't help.
For more context, core was trying to catch up when we closed it. It took 20 seconds to close.
time="2024-06-16T16:45:17.338+02:00" level=warning msg="Process: process 65163 exited 22: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:17.338+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:17.338+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.717+02:00" level=info msg="History: Downloading history archive state: history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.717+02:00" level=warning msg="Process: process 65166 exited 7: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.717+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.717+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.717+02:00" level=info msg="History: Downloading history archive state: history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.733+02:00" level=warning msg="Process: process 65170 exited 7: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.733+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.733+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:23.385+02:00" level=info msg="History: Downloading history archive state: history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:23.411+02:00" level=warning msg="Process: process 65173 exited 7: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:23.411+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:23.411+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:28.412+02:00" level=info msg="History: Downloading history archive state: history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:28.436+02:00" level=warning msg="Process: process 65176 exited 7: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:28.436+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:28.436+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:37.438+02:00" level=info msg="History: Downloading history archive state: history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:37.464+02:00" level=warning msg="Process: process 65179 exited 7: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:37.464+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:37.464+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:37.464+02:00" level=warning msg="History: Catchup failed" pid=65148 subservice=stellar-core
I think the issue is that there is a code path where we execute stellar-core catchup before stellar-core run.
In that code path we do not abort the stellar-core catchup command when the context is canceled. We can fix this by constructing the Command instance with https://pkg.go.dev/os/exec#CommandContext and also configuring the Cmd.WaitDelay and Cmd.Cancel properties. Basically, we need to use the same techniques described in https://github.com/stellar/go/issues/5347
I don't know if it's safe to cancel before acquiring the lock (I bet it is) or if we can reduce the critical sections around the lock.
Yes, I think we also need to move the cancel before acquiring the lock.
Great analysis, thanks!
We only cancel the running processes after acquiring the lock (which can be really slow in some cases).
I don't know if it's safe to cancel before acquiring the lock (I bet it is) or if we can reduce the critical sections around the lock.
Related to https://github.com/stellar/go/issues/5347 ?