stellar / go

Stellar's public monorepo of Go code
https://stellar.org/developers
Apache License 2.0

services/horizon/ingest/filtering: Backfill low level design #4267

Open sreuland opened 2 years ago

sreuland commented 2 years ago

What problem does your feature solve?

Design proposal for backfill behavior of filters.

What would you like to see?

A design that defines the user experience for filter rule changes and backfill behavior in Horizon; a low-level design write-up for filtered data range behavior.

Acceptance Criteria:

Pre-requisites:

2opremio commented 2 years ago

As a Horizon user, I want to configure the range of ingested historical data that my filter rules will be effective within, aka the 'backfill' timeframe.

Is this really necessary?

Can't we piggyback on the retention count cmdline flag?

2opremio commented 2 years ago

Here's an idea of how we can implement backfilling:

  1. We create indices for the filtered-out data to backfill (a schema sketch follows at the end of this comment):

    - `history_ledger_participants` (similar to the already existing `history_operation_participants`), indicating in what ledgers an account participated
    - `history_ledger_assets`, an analogous index for assets

    Hopefully these tables won't occupy a lot of space (less than `history_operation_participants` does) since they are kept at the ledger level. But we should check experimentally. If they do, we could refine this further to only include information about the entities (accounts/assets) that we filtered out.

  2. We create a table to keep backfilling state (say backfill_queue) to indicate what accounts/ledgers need to be backfilled based on changes to the filter configurations. We need this to make sure backfilling works across restarts.

  3. We modify the ingestion system to walk over the backfill_queue and the history_ledger_* tables to perform the backfilling.

(1) could be replaced by a txmeta archive + index once we have it.

Note that (1) needs to happen even when filtering isn't enabled, since, otherwise, backfilling won't work if filtering is enabled later on.
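To make the shape of (1) and (2) concrete, here is a hypothetical sketch of the migrations, written as embedded SQL in Go. None of these tables exist in Horizon today; every table and column name is illustrative only.

```go
// Hypothetical schema sketch; all names are illustrative, not Horizon's.
package schema

// Ledger-level participation indices, populated during normal ingestion so
// that a later filter-rule change can be backfilled from them, plus a
// durable work queue so backfilling survives Horizon restarts.
const createBackfillTables = `
CREATE TABLE history_ledger_participants (
    history_account_id BIGINT NOT NULL, -- references history_accounts
    ledger_sequence    INT    NOT NULL,
    PRIMARY KEY (history_account_id, ledger_sequence)
);

CREATE TABLE history_ledger_assets (
    history_asset_id BIGINT NOT NULL,   -- references history_assets
    ledger_sequence  INT    NOT NULL,
    PRIMARY KEY (history_asset_id, ledger_sequence)
);

CREATE TABLE backfill_queue (
    id           BIGSERIAL PRIMARY KEY,
    filter_key   TEXT    NOT NULL, -- e.g. an account address or asset string
    ledger_start INT     NOT NULL,
    ledger_end   INT     NOT NULL,
    completed    BOOLEAN NOT NULL DEFAULT FALSE
);
`
```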

2opremio commented 2 years ago

@bartekn thoughts?

sreuland commented 2 years ago

Can't we piggyback on the retention count cmdline flag?

Yes, definitely. I updated the description to be clear about that usage.

sreuland commented 2 years ago

@2opremio, @bartekn, wondering if we could remove the backfill_queue aspect just to explore simplicity: the new backfill process invokes the ingestion engine regularly on scheduled intervals, triggering the latest filter rules to be executed by processors over a range based on --history-retention-count. The end result would be idempotent, as ingestion processors just upsert to history, correct?

the backfill_queue increases in complexity with the upcoming transitive rules, like 'all trustlines from account x' or 'sponsored accounts of account x', where the filters would need to compute the effective list dynamically and then convey that into the queue. Whereas if we just let the ingest processors run the filters, the effective scope of the filters is exercised by default through the existing ingest -> processor -> filter.FilterTransaction() path.
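A minimal sketch of this stateless alternative, assuming a hypothetical System interface over the ingestion engine; the names (System, CurrentLedger, ReingestRange, Run) are illustrative, not Horizon's actual ingest API:

```go
package backfill

import (
	"context"
	"log"
	"time"
)

// System abstracts the slice of the ingestion engine this sketch needs:
// re-ingesting a ledger range through the normal processor -> filter path.
type System interface {
	CurrentLedger(ctx context.Context) (uint32, error)
	ReingestRange(ctx context.Context, from, to uint32) error
}

// Run periodically re-ingests the retention window. Because processors
// upsert into history tables, repeating a range is idempotent, so no queue
// state has to be persisted across restarts.
func Run(ctx context.Context, s System, retentionCount uint32, every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			head, err := s.CurrentLedger(ctx)
			if err != nil {
				log.Printf("backfill: %v", err)
				continue
			}
			from := uint32(1)
			if head > retentionCount {
				from = head - retentionCount + 1 // window sized by --history-retention-count
			}
			if err := s.ReingestRange(ctx, from, head); err != nil {
				log.Printf("backfill: %v", err)
			}
		}
	}
}
```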

2opremio commented 2 years ago

The end result would be idempotent, as ingestion processors just upsert to history, correct?

Yes it would but I am not sure we can afford that (performance-wise). I also think that a stateless design is cleaner.

BTW, we may also need to think about garbage collection (what if a filter is updated to not include certain accounts anymore, should we care about the data left behind?)

In this case I think you do need state, otherwise you won't know what to garbage-collect if Horizon is restarted.
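One way to persist that state would be a small queue of scope removals, recorded whenever a filter rule shrinks, so a GC pass can find orphaned history rows even after a restart. Purely illustrative; Horizon has no such table:

```go
package schema

// Rows are enqueued when a filter rule's scope shrinks; a GC pass drains
// them, deleting history data for entities no longer covered by any rule.
const createFilterGCQueue = `
CREATE TABLE filter_gc_queue (
    id          BIGSERIAL PRIMARY KEY,
    kind        TEXT      NOT NULL, -- 'account' or 'asset'
    key         TEXT      NOT NULL, -- the entity removed from the filter
    enqueued_at TIMESTAMP NOT NULL DEFAULT NOW()
);
`
```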

2opremio commented 2 years ago

Note that (1) needs to happen even when filtering isn't enabled, since, otherwise, backfilling won't work if filtering is enabled later on.

Actually, it doesn't need to happen when filtering is disabled, because then everything is ingested anyway and thus there is nothing to backfill.

sreuland commented 2 years ago

BTW, we may also need to think about garbage collection (what if a filter is updated to not include certain accounts anymore, should we care about the data left behind?)

@jcx120, I think we discussed this 'stale' or 'garbage' state as part of the filter lifecycle earlier, but I can't find the reference. Can you confirm the product expectation for when a filter rule is changed in a way that reduces its scope? Any already-ingested history data that no longer fits within the rule's scope becomes 'stale' but is still left in the history db. Is that acceptable, or does the 'stale' data need to be purged?

jcx120 commented 2 years ago

@sreuland Purging of 'garbage' data (left behind due to a change in filter rules) can, for this current phase, be treated as a low priority / nice-to-have (something we can come back to after all top-priority features are implemented). The assumption is that for the majority of users who apply filters, the dataset will be sufficiently small that doing a clean re-ingest from scratch would be fast and cheap.

bartekn commented 2 years ago

`history_ledger_participants` (similar to the already existing `history_operation_participants`), indicating in what ledgers an account participated

This will be a very similar structure to the indexes we've been working on with @paulbellamy: they determine in which checkpoints a given account was active (also wrt payment/all operations and successful/all txs). So this is less granular than what you propose (ledger vs checkpoint), but from the Captive-Core perspective, if you want to ingest a specific ledger you need to catch up to the last checkpoint before it and then apply ledgers, so it should take the same amount of time to get to the ledger.

So when (if ever) these indexes are ready for production (they are built but not yet benchmarked/tested/checked for correctness), you won't need extra tables and the process will be simpler.
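For reference, a small sketch of the checkpoint math alluded to above. The 64-ledger cadence is the standard for Stellar's public history archives; the helper functions themselves are hypothetical:

```go
package backfill

// Stellar history archives publish a checkpoint every 64 ledgers, at
// sequences 63, 127, 191, ... (i.e. (seq+1) % 64 == 0).
const checkpointFrequency = 64

// IsCheckpoint reports whether seq is a checkpoint ledger.
func IsCheckpoint(seq uint32) bool {
	return (seq+1)%checkpointFrequency == 0
}

// PrevCheckpoint returns the last checkpoint ledger at or before seq.
// Captive Core has to catch up to this ledger and then replay forward to
// reach seq itself, which bounds how fast backfilling can jump around.
func PrevCheckpoint(seq uint32) uint32 {
	if seq < checkpointFrequency-1 {
		return 0 // no checkpoint exists before ledger 63
	}
	return (seq+1)/checkpointFrequency*checkpointFrequency - 1
}
```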

2opremio commented 2 years ago

So this is less granular than what you propose (ledger vs checkpoint), but from the Captive-Core perspective, if you want to ingest a specific ledger you need to catch up to the last checkpoint before it and then apply ledgers, so it should take the same amount of time to get to the ledger.

This is good, in the sense that we can share the solution.

However, I also think this may be a showstopper for the approach I proposed, since it may make backfilling too slow (due to how slow Captive Core is at jumping around).

So, we may need to wait for the implementation of a txmeta archive before backfilling can be sufficiently performant.