stellar / go

Stellar's public monorepo of Go code
https://stellar.org/developers
Apache License 2.0

horizon: enable faster catchup after large history gap or offline period #5292

Open mollykarcher opened 5 months ago

mollykarcher commented 5 months ago

What problem does your feature solve?

If you have a horizon instance that has been offline for a long period of time and you then try to turn it back on, you'll have a large gap between horizon's last ingested ledger and the network's LCL (last closed ledger). Horizon will attempt normal forward ingestion from that point, but this can take a while. We did this internally, and it took about 1 day to catch up on a 2-week gap; we've also seen a partner with a larger gap (~4 months) who wants to do the same. For these longer catchup periods, we'd like to recommend that people run reingest instead, since it's typically faster than live ingestion.

The problem is that reingestion assumes we won't reingest past the last ingested ledger in horizon and blocks reingestion past this point, as a protection mechanism against reingestion and live ingestion overwriting one another. Further, reingestion uses the value of the exp_ingest_last_ledger key in the key_value_store table to discern whether or not live ingestion is running. In this particular use case, live/forward ingestion could be off, but there would still be a non-zero value in this table. In that case it is "safe" to run reingestion, so we should allow that to happen.

It is possible to work around this by using the --force parameter, but that causes reingestion to grab an exclusive lock on the db, so that only a single parallel worker can be used, thus defeating the purpose of getting a "faster" catchup/ingestion.
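To make the cost of the single-worker restriction concrete, here is a rough sketch of how a catchup range might be divided among parallel workers. It is not Horizon's batching logic; `ledgerRange` and `splitRange` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// ledgerRange is an inclusive range of ledger sequence numbers (hypothetical type).
type ledgerRange struct{ from, to uint32 }

// splitRange divides [from, to] into at most `workers` contiguous sub-ranges
// whose sizes differ by at most one ledger.
func splitRange(from, to uint32, workers int) []ledgerRange {
	if workers < 1 || to < from {
		return nil
	}
	total := uint64(to) - uint64(from) + 1
	if uint64(workers) > total {
		workers = int(total) // never create empty sub-ranges
	}
	size := total / uint64(workers)
	rem := total % uint64(workers)

	ranges := make([]ledgerRange, 0, workers)
	start := uint64(from)
	for i := 0; i < workers; i++ {
		n := size
		if uint64(i) < rem {
			n++ // spread the remainder over the first sub-ranges
		}
		ranges = append(ranges, ledgerRange{uint32(start), uint32(start + n - 1)})
		start += n
	}
	return ranges
}

func main() {
	fmt.Println(splitRange(1, 10, 3)) // three sub-ranges that can run concurrently
	fmt.Println(splitRange(1, 10, 1)) // one worker must process the whole gap
}
```

With one worker the whole gap is processed serially, which is the situation --force leaves the operator in.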

What would you like to see?

A clear, documented mechanism that lets an operator run offline reingestion to catch back up to the LCL after their horizon has been offline for a while.

There are several different ways we could go about this, all requiring different effort and providing different UX. We should discuss here what is best. Some options are:

mollykarcher commented 4 months ago

Note for anyone encountering this issue, the manual workaround is as follows: