stellar / go

Stellar's public monorepo of go code
https://stellar.org/developers
Apache License 2.0
1.29k stars 503 forks source link

ingest/ledgerbackend: Stopping captive core can be slow #5349

Closed 2opremio closed 3 months ago

2opremio commented 3 months ago

We only cancel the running processes after acquiring the lock (which can be really slow in some cases)

func (c *CaptiveStellarCore) Close() error {
    c.stellarCoreLock.RLock()
    defer c.stellarCoreLock.RUnlock()

    c.closed = true

    // after the CaptiveStellarCore context is canceled all subsequent calls to PrepareRange() will fail
    c.cancel()
    [...]
}

I don't know if it's safe to cancel before acquiring the lock (I bet it is) or if we can reduce the critical sections around the lock.

Related to https://github.com/stellar/go/issues/5347 ?

2opremio commented 3 months ago

For context: I found this during an integration test which spends most of the time waiting for captive core to close

2opremio commented 3 months ago

Uhm, actually moving the cancel() out of the criticial section didn't help.

For more context, core was trying to catchup when closing it. It took 20 seconds to close it.

time="2024-06-16T16:45:17.338+02:00" level=warning msg="Process: process 65163 exited 22: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:17.338+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:17.338+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.717+02:00" level=info msg="History: Downloading history archive state: history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.717+02:00" level=warning msg="Process: process 65166 exited 7: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.717+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.717+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.717+02:00" level=info msg="History: Downloading history archive state: history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.733+02:00" level=warning msg="Process: process 65170 exited 7: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.733+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:21.733+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:23.385+02:00" level=info msg="History: Downloading history archive state: history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:23.411+02:00" level=warning msg="Process: process 65173 exited 7: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:23.411+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:23.411+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:28.412+02:00" level=info msg="History: Downloading history archive state: history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:28.436+02:00" level=warning msg="Process: process 65176 exited 7: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:28.436+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:28.436+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:37.438+02:00" level=info msg="History: Downloading history archive state: history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:37.464+02:00" level=warning msg="Process: process 65179 exited 7: curl -sf http://localhost:57043/history/00/00/00/history-00000017.json -o buckets/tmp/history-2ea087fae299fb7b/4abab04a1c772310-stellar-history.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:37.464+02:00" level=error msg="History: Could not download file: archive h0 maybe missing file history/00/00/00/history-00000017.json" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:37.464+02:00" level=error msg="History: Missing HAS for ledger 23: maybe stale archive h0" pid=65148 subservice=stellar-core
time="2024-06-16T16:45:37.464+02:00" level=warning msg="History: Catchup failed" pid=65148 subservice=stellar-core
tamirms commented 3 months ago

I think the issue is that there is a code path where we execute stellar-core catchup before stellar-core run:

https://github.com/stellar/go/blob/a4e5a3fc6664f7cf6b26fb3196b366910563636f/ingest/ledgerbackend/stellar_core_runner.go#L400-L412

In that code path we do not abort the stellar-core catchup command in case the context is canceled. We can fix this issue by constructing the Command instance using https://pkg.go.dev/os/exec#CommandContext and also configuring the Cmd.WaitDelay and Cmd.Cancel properties. Basically, we need to use the same techniques described in https://github.com/stellar/go/issues/5347

tamirms commented 3 months ago

I don't know if it's safe to cancel before acquiring the lock (I bet it is) or if we can reduce the critical sections around the lock.

yes, I think we need to also move the cancel before acquiring the lock

2opremio commented 3 months ago

Great analysis, thanks!