horizon: enable faster catchup after large history gap or offline period

What problem does your feature solve?

If a Horizon instance has been offline for a long time and you turn it back on, there will be a large gap between Horizon's last ingested ledger and the network's last closed ledger (LCL). Horizon will attempt to ingest forward normally from that point, but this can take a while. We did this internally, and it took about 1 day to catch up on a 2-week gap; we've also seen a partner with a larger gap (~4 months) who wants to do the same. For these longer catchup periods, we'd like to recommend that people run reingestion instead, since it's typically faster than live ingestion.

The problem is that reingestion assumes we won't reingest past Horizon's last ingested ledger, and it prevents reingestion past this point as a protection against reingestion and live ingestion overwriting one another. Further, reingestion uses the value of the exp_ingest_last_ledger key in the key_value_store table to discern whether live ingestion is running. In this particular use case, live/forward ingestion could be off while there is still a non-zero value in this table. If so, it is "safe" to run reingestion, and we should allow that to happen.
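To make the guard described above concrete, here is a minimal sketch in Go. It is not Horizon's actual code: `checkReingestRange`, `ErrLiveIngestionAssumed`, and the exact boundary comparison are assumptions made for illustration, and `lastIngested` simply stands in for the value stored under exp_ingest_last_ledger.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrLiveIngestionAssumed is a hypothetical error returned when the guard
// cannot rule out that live ingestion is running.
var ErrLiveIngestionAssumed = errors.New("reingest range passes the last ingested ledger")

// checkReingestRange sketches the protection described in the issue.
// lastIngested stands in for the exp_ingest_last_ledger value; 0 means unset.
// The boundary (to > lastIngested) is an assumption, not Horizon's exact rule.
func checkReingestRange(lastIngested, from, to uint32, force bool) error {
	if from > to {
		return fmt.Errorf("invalid range: from %d > to %d", from, to)
	}
	// With force, or with no recorded ledger, the guard lets the range through.
	if force || lastIngested == 0 {
		return nil
	}
	// A non-zero value is treated as "live ingestion may be running",
	// even if the Horizon process is actually stopped.
	if to > lastIngested {
		return ErrLiveIngestionAssumed
	}
	return nil
}

func main() {
	// Horizon stopped at ledger 1000; the operator wants to catch up to 5000.
	fmt.Println(checkReingestRange(1000, 1001, 5000, false))
}
```

The sketch shows why a stopped instance is still rejected: the guard only sees the stored value, not whether an ingesting process exists.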

It is possible to work around this with the --force parameter, but that causes reingestion to grab an exclusive lock on the db, so only a single parallel worker can be used, which defeats the purpose of getting a "faster" catchup/ingestion.
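The speedup from reingestion comes from dividing the gap among workers. As a rough illustration (the `ledgerRange` type and `splitRange` function are hypothetical and do not reflect how Horizon actually batches work), a range could be split like this:

```go
package main

import "fmt"

// ledgerRange is an inclusive range of ledger sequence numbers.
type ledgerRange struct{ From, To uint32 }

// splitRange divides [from, to] into at most `workers` contiguous,
// non-overlapping sub-ranges of near-equal size.
func splitRange(from, to uint32, workers int) []ledgerRange {
	if from > to || workers < 1 {
		return nil
	}
	total := uint64(to-from) + 1
	n := uint64(workers)
	if n > total {
		n = total // never create more ranges than ledgers
	}
	size := total / n
	extra := total % n // the first `extra` ranges get one more ledger
	ranges := make([]ledgerRange, 0, n)
	start := uint64(from)
	for i := uint64(0); i < n; i++ {
		length := size
		if i < extra {
			length++
		}
		end := start + length - 1
		ranges = append(ranges, ledgerRange{uint32(start), uint32(end)})
		start = end + 1
	}
	return ranges
}

func main() {
	for _, r := range splitRange(1001, 5000, 4) {
		fmt.Printf("%d-%d\n", r.From, r.To)
	}
}
```

With a single worker the whole gap collapses into one range, which is effectively the situation --force creates.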

What would you like to see?

A clear, documented mechanism that lets an operator run offline reingestion to catch back up to the LCL after their Horizon instance has been offline for a while.

There are several different ways we could go about this, all requiring different effort and providing different UX. We should discuss here what is best. Some options are:

Document the manual/hacky workaround for this specific issue (e.g. update key_value_store...) | effort: low
Add a new CLI command/option that wraps the above hacky workaround and makes sure ingestion is not running when it runs, to make it slightly more user-friendly
Make it "safe" to run live/forward ingestion and reingestion at the same time. How would we do this?
Create a different mechanism, more accurate than querying for the exp_ingest_last_ledger key, by which we can tell whether live/forward ingestion is running. This would allow reingest to proceed even if there is a non-zero value there
Something else?
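As one possible shape for the "more accurate mechanism" option above, the sketch below uses a heartbeat timestamp instead of the stored ledger number. This is purely an assumption offered for discussion: Horizon does not necessarily record such a heartbeat, and `ingestionLikelyRunning` and its parameters are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// ingestionLikelyRunning reports whether a live ingester appears active,
// based on a hypothetical heartbeat (Unix seconds) that the ingester would
// refresh periodically. A zero heartbeat means none was ever recorded.
func ingestionLikelyRunning(lastHeartbeatUnix, nowUnix, staleAfterSeconds int64) bool {
	if lastHeartbeatUnix == 0 {
		return false
	}
	// A heartbeat older than the staleness window means the ingester is
	// assumed stopped, regardless of the last ingested ledger value.
	return nowUnix-lastHeartbeatUnix < staleAfterSeconds
}

func main() {
	now := time.Now().Unix()
	twoWeeksAgo := now - int64((14 * 24 * time.Hour).Seconds())
	// An instance offline for two weeks would not block reingestion.
	fmt.Println(ingestionLikelyRunning(twoWeeksAgo, now, 60))
}
```

A check like this trades certainty for convenience: a paused or slow ingester could look stopped, so the staleness window would need to be chosen conservatively.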

stellar / go

horizon: enable faster catchup after large history gap or offline period #5292
