Make captive-core friendlier to on-disk setup

MonsieurNicolas commented 2 years ago

As ledger size grows, captive core's original assumption on memory is being challenged, especially on cheaper SKUs.

The Horizon team has been experimenting with a version of captive core that uses an on disk sqlite setup instead of the in-memory one. The problem with that approach is that there is a pretty good chance that core closes a ledger that was not yet ingested by Horizon (for example, on restart), when this happens, the only recovery path is for Horizon to reset core's state which implies rebuilding the live ledger that is stored in sqlite. That operation on slower instance types can take ~1 hour currently which is problematic.

As we're investigating alternative ways to store the live ledger it's probably a good time to revisit what we're doing in core to ensure things work smoothly with both larger ledgers and larger meta due to the tps increase.

Here are a few ideas to get the discussion started:

require Horizon to "ack" on the meta stream.
- Right now core considers its part done as soon as it wrote the meta, instead it could perform a blocking read on the meta descriptor within emitNextMeta.
- pros:
  - simple
- cons:
  - adds "Horizon ingestion" on the critical path for core. Right now, Horizon has the full 5 seconds to ingest a ledger while the network is reaching consensus, so core and Horizon work in parallel. With this change, the overhead may cause core to lose track of consensus or see large delays from its peers.
    - A potential way to partially mitigate this would be to have Horizon write into fsynced flat files stored locally before acking (and do the actual ingestion in parallel - this is similar to the next idea but done outside of core).
  - (small) breaking change on the Horizon side
  - amplifies issues that come with larger meta
decouple meta generation from emission
- this is a hybrid between what we do with the old txhistory meta and the "captive-core" way
- main idea here is to keep from the old way the separation between "meta generation" from "meta emission" by only writing into a reliable queue the meta from core's critical path and allow Horizon to pick up items from that queue in some way.
  - the "old" txhistory based meta has the following problems:
    - stores the meta in SQL with multiple rows per transaction, causing scalability/maintenance issues.
    - consumers need to "pull" for the right data by performing adhoc SQL queries directly against core's database, causing tight coupling and performance issues.
- pushing to the reliable queue when generating meta
  - can be based off the "debug meta stream"
  - ensures that the N most recent ledgers worth of meta are available for streaming (maximum drift with Horizon)
  - would block core from closing ledgers if the queue has already N ledgers worth of meta
- emit meta
  - can be done from a background thread by directly pushing into the meta stream pipe
- deleting items from the queue
  - this could be done a number of ways, can be a new "trim" http command that Horizon can invoke (similar to an "ack"), this is similar to what cursors do.
- The "starting ledger" specified on the command line, would refer to the next ledger needed by Horizon, core uses the meta stream queue first before resetting its state if it can't "snap back" to the last closed ledger for some reason.
- pros
  - removes the potential block with Horizon as meta gets bigger
  - parallelizes well even on the ingestion side, allows to potentially batch even more (ie: can potentially ingest more than 1 ledger worth of meta)
- cons
  - more work
  - needs to "flush" more data to disk

MonsieurNicolas commented 2 years ago

@ire-and-curses I opened this issue to get ahead of some potential problems with "on disk captive core". Let us know if this makes sense and/or what you already did to mitigate some of those problems

graydon commented 2 years ago

I'm not sure I understand the issue here. Why does restarting core emit things horizon doesn't want? Can we tackle that problem directly? Eg. by having horizon ask core for offline info to learn what the next-to-close ledger will be, and not starting core until it's ready for that ledger?

I should also point out -- in terms of "getting ahead of problems" -- that once we're running "just on buckets" (no SQL database at all) that re-attaching to those buckets will be significantly faster, just a linear read to rebuild the fence-pointer indexes in memory. So it's possible even without modifying horizon that we'll be able to make that "1 hour" experience of bucket-apply a slow SKU go back down to "seconds or minutes" for bucket-attach. Even with a lousy SKU doing linear reads at (say) 100MB/s, attaching to 10GB of buckets should only take 100s.

Concerning suggested changes here: I'm very hesitant to change the synchronization contract of the meta-emitting interface. We made the meta-emitting interface synchronous (with the write-completion + flush as the synchronization event) for a reason: to avoid having to co-design an asynchronous buffering system that spanned both core and horizon.

Anyone reading from the meta pipe can buffer whatever they want, however they want; they just have to keep up with reading bytes synchronously -- the weakest synchronization requirement possible for both parties to the relationship. Any more-involved contract will almost certainly introduce constraints and failure modes for the liveness of both parties.

paulbellamy commented 2 years ago

Linking back to the referenced work: https://github.com/stellar/go/issues/4038

paulbellamy commented 2 years ago

Given there's (currently) no way to get the offline info from core, we reset the sqlite db by calling a new-db each time, before launching the normal captive-core process. Then we catchup to fast-forward to the last point horizon knew about. And finally, we do the normal stellar-core run.

https://github.com/stellar/go/blob/master/ingest/ledgerbackend/stellar_core_runner.go#L321-L338

sreuland commented 2 years ago

Hello, I worked with Paul on the horizon integration(https://github.com/stellar/go/pull/4092) to invoke captive core with on-disk sqlite, observed that sequence of stellar-core invocations he mentioned, afterwards, was curious on potential of other interfaces to tx meta and see some mentioned here.

wanted to contribute another idea under 'decouple meta generation from emission' for captive core use cases, what if there was a stellar-core service-daemon run mode, which starts a process that doesn't do anything initially, just exposes a TCP socket based service interface such as GRPC and clients then interact via request->response and request->stream.

Idea is rather than clients invoking stellar-core from command line 3 times new-db;catchup;run and watching a pipe and assuming it's output is associated to those commands, instead run core in service-daemon mode on o/s, and client process submits GRPC request over TCP like request_meta_stream(start_ledger=x, stop_ledger=y) and the GRPC request processing thread in core process that gets invoked becomes the async meta tx emission thread such as 'emit meta' suggestion references and it fires off the internal core framework for processing network side resources for the requested ledger range and then captures generated tx meta and pushes it back out the stream response.

GRPC provides some notion of back-pressure with the client via the tcp buffer, so this request thread will see a signal from GRPC if client has fallen too far behind reading the stream, at which point request thread can choose to drop further msgs or the connection.

Other existing cli params could potentially be exposed over service-daemon GRPC request-response routes, allowing clients to remotely interact with other aspects of core programatically if that were worthwhile, like offline-info or other actions.

MonsieurNicolas commented 2 years ago

I'm not sure I understand the issue here. Why does restarting core emit things horizon doesn't want? Can we tackle that problem directly? Eg. by having horizon ask core for offline info to learn what the next-to-close ledger will be, and not starting core until it's ready for that ledger?

The issue is that core closes ledgers independently of Horizon's ingestion and as a consequence can be ahead of what Horizon committed to its database when things got shutdown. When this happens the only way to recover from this situation is to clear core's state.

From @paulbellamy 's comment it looks like Horizon always clears core's database all the time.

We should merge https://github.com/stellar/stellar-core/pull/3326 to at least not get that hit when Horizon was shutdown "in sync" with core.

I also added https://github.com/stellar/stellar-core/issues/3353 to our backlog as calling new-db is extremely slow.

What will be left is what this issue is tracking: killing Horizon before it ingests the latest ledger. It's possible that this can be mostly mitigated from the Horizon side by killing core first before shutting down ingestion (so that Horizon gets to read and ingest the meta data that sits in the pipe). It will still be possible (like if Horizon crashes) to leave Horizon behind core but that might be rare enough.

I should also point out -- in terms of "getting ahead of problems" -- that once we're running "just on buckets" (no SQL database at all) that re-attaching to those buckets will be significantly faster, just a linear read to rebuild the fence-pointer indexes in memory. So it's possible even without modifying horizon that we'll be able to make that "1 hour" experience of bucket-apply a slow SKU go back down to "seconds or minutes" for bucket-attach. Even with a lousy SKU doing linear reads at (say) 100MB/s, attaching to 10GB of buckets should only take 100s.

I guess this depends on two things:

when is Horizon planning to go live with this "sqlite" version. Core's "buckets" version is still a few months out.
not addressing this issue still leaves core "ahead" of what Horizon wants, which requires both waiting for the next checkpoint and download (potentially large) buckets.

sreuland commented 2 years ago

when is Horizon planning to go live with this "sqlite" version. Core's "buckets" version is still a few months out.

Horizon 2.15.0 includes this as new option, --captive-core-use-db, disabled by default, but when enabled will invoke captive core with "sqlite" disk mode during ingestion, current Horizon prod/AWS deployment does not use the option yet and is still invoking captive core with --in-memory.

One observation we noticed while testing the horizon 2.15.0 release with "sqlite" disk mode of captive core during ingestion in horizon and deployed on AWS/EC2 staging environment, the storage device that has captive core's DATABASE(sqlite3 files), needs to support IOPS with a minimum Write Operations/second of 3k. Initially, EC2 that was running horizon ingestion with captive-core "sqlite" disk enabled and the sqlite3 files located on a cloud/network-attached volume with a smaller IOPS of around 100, during which we saw captive core output slow down, not much log output from it, it wasn't getting through catchup. @paulbellamy noticed the Write Ops on the volume was pegged in metrics, and @jacekn increased that volume to 3k(related slack chat) and then captive core resumed normal output.

We noted this in a recommendation when enabling the feature in 2.15.1 release notes

ire-and-curses commented 2 years ago

when is Horizon planning to go live with this "sqlite" version. Core's "buckets" version is still a few months out.

The reason for this feature in Horizon is to provide an "escape hatch" for installed clients at the point that they run of RAM due to Stellar's ongoing ledger growth. We aren't going to enable it by default in the immediate future, but we do want to be able to suggest this as a mitigation when users with 16GB machines run out of free memory.

stellar / stellar-core

Make captive-core friendlier to on-disk setup #3354