graphprotocol / graph-node

Graph Node indexes data from blockchains such as Ethereum and serves it over GraphQL
https://thegraph.com
Apache License 2.0

Historical Block Scanning For Dynamic Data Sources #902

Open dOrgJelli opened 5 years ago

dOrgJelli commented 5 years ago

Do you want to request a feature or report a bug? Feature.

What is the current behavior? From https://github.com/graphprotocol/graph-node/pull/832 : "Whenever we process a block and mappings request new data sources, these data sources are collected and, after having processed the block, are instantiated from templates. We then process the current block again but only with those new data sources. The entity operations from this are merged into the ones we already have for the block. After that, the dynamic data sources are persisted by adding the data sources to the subgraph instance and by adding entity operations to store them in the db (for 1.)."

What is the desired behavior? ".... We then process all blocks in the range of [contract creation -> current block] for each newly created data source."

More explanation. In the DAOstack protocol, new DAOs can be created in a number of different ways. In order to support this (and minimize spam), the desired behavior for the subgraph is to only index DAOs that are added to a universal registry of "accepted" DAOs. Given graph-node's current functionality, a lot of information about these newly registered DAOs would never be read/processed/stored, because it took place prior to the current block. In order to support this scenario (which IMO is bound to be commonplace for other projects), being able to process the entire block history for a given contract upon instantiating it as a data source would be ideal.

Thread here: https://github.com/daostack/subgraph/issues/197

leoyvens commented 5 years ago

This is a use case that's likely to come up again and we'll want to support it in some way; however, the design and implementation are not simple. It sounds like you can solve this in a different, less ideal way, so I recommend you take that route while we figure this one out.

Jannis commented 5 years ago

There are three main difficulties here:

  1. There's currently one block stream per subgraph. It only moves forward. When a data source template is instantiated, the block stream is restarted with the additional filters from the new data sources. Making this stream go back and process all past events of a new data source will be complicated. It may be better to spawn separate block streams for each data source but have them synced somehow.

  2. Processing past events would require accessing the store at different points in the past. This is not currently possible. Even if it was, it could introduce a time travel problem, where changes generated by new data source B on an older block would have changed the behavior of an already existing data source A.

  3. Chain reorgs. A block in which a dynamic data source is created may be removed from the chain again later. When that happens, the dynamic data source needs to be removed again and all its data from that block needs to be removed as well. Now, if we include all past events of such a new data source, we'd have to also remove all entity changes from those past events.

dOrgJelli commented 5 years ago

Thank you for this breakdown @Jannis, it was super insightful. We have a workaround we'll use for now @leodasvacas. Looking forward to seeing this spec progress. Godspeed!

leoyvens commented 5 years ago

Rationale / Use Cases

Data source templates are commonly used in subgraphs that track contracts which serve as registries for other contracts. In many cases the contract was already in use before being added to the registry, and the subgraph would like to process that past data, but currently that's not possible. See the original comment for a description of a use case.

A relatively simple way to solve this is to extend our logic that reprocesses a single block for the new data source to process any range of past blocks.

Requirements

Proposed User Experience

A new method is added to the generated data source class, createWithStartBlock(address: Address, startBlock: BigInt). The current create method will behave as if startBlock is the current block. When a data source is created, all of its triggers in the [startBlock, currentBlock] range are processed before indexing continues, generalizing the current behaviour of re-processing the currentBlock.

A caveat is that this will halt indexing of other data sources, and progress is not stored across restarts of graph-node, so this should only be used if the triggers can be processed in a few minutes.
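
For illustration, a mapping handling the registry use case from the original comment might call the new method like this. This is only a sketch: the DAOTemplate class, the RegisterDAO event, and its parameters are hypothetical names for this example, not part of the proposal.

// Hypothetical mapping handler; DAOTemplate, RegisterDAO, and the event
// parameters are illustrative names only.
import { DAOTemplate } from "../generated/templates"
import { RegisterDAO } from "../generated/Registry/Registry"

export function handleRegisterDAO(event: RegisterDAO): void {
  // Index the DAO's full history from its creation block onwards,
  // not just from the block in which it was added to the registry.
  DAOTemplate.createWithStartBlock(event.params.avatar, event.params.creationBlock)
}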

Proposed Implementation

Code generation in graph-cli will need to include createWithStartBlock, which will pass an additional string parameter, the start block, to dataSource.create. In graph-node, the dataSource.create host export will need to handle that additional parameter, defaulting to the current block if it is not present.
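
As a rough sketch, the generated template class might look something like the following, assuming the existing pattern where generated classes wrap DataSourceTemplate.create from @graphprotocol/graph-ts. Encoding the start block as a trailing string parameter is an assumption here, not a settled design.

import { Address, BigInt, DataSourceTemplate } from "@graphprotocol/graph-ts"

// Hypothetical generated code for a template named "Registry".
export class Registry extends DataSourceTemplate {
  static create(address: Address): void {
    DataSourceTemplate.create("Registry", [address.toHex()])
  }

  static createWithStartBlock(address: Address, startBlock: BigInt): void {
    // The start block rides along as an extra string parameter for the
    // dataSource.create host export to pick up.
    DataSourceTemplate.create("Registry", [address.toHex(), startBlock.toString()])
  }
}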

The implementation will generalize the logic we have for re-processing the current block in the instance manager, but we will need to better organize our trigger filtering logic so that it can be used both in the block stream for the initial filtering and in the instance manager for processing dynamic data sources. The goal is to refactor the current code so that we end up with a re-usable function that looks like:

fn triggers(
    from: BlockPtr,
    to: BlockPtr,
    current_block: BlockPtr,
    log_filter: LogFilter,
    call_filter: CallFilter,
    block_filter: BlockFilter,
) -> Vec<EthereumTrigger>

This should internally check the reorg threshold to decide how to safely retrieve the triggers.

Using this might simplify our current block stream, since what we do right now is basically:

  1. Get potentially relevant logs and calls.
  2. Get the block hashes in which they occur.
  3. Load those blocks.
  4. Look for relevant logs and calls in those blocks.
  5. Pass those as triggers to the data sources that match on them.

I don't know what our rationale was for doing things this way, but steps 2-4 seem unnecessary to me. I think we could just get the logs and calls and pass them to the data sources, which could also be a performance win.

Proposed Documentation Updates

The section 'Data Source Templates -> Instantiating a Data Source Template' should be updated to document createWithStartBlock, its target use cases, and its caveats.

Proposed Tests / Acceptance Criteria

Tasks

davekaj commented 5 years ago

Would be good to simplify Compound's subgraph and allow us to not upgrade it each time a new asset is added.

andsilver commented 2 years ago

Is this supported now?

leoyvens commented 2 years ago

@itopmoon This is still on the radar but it is a difficult one.

andsilver commented 2 years ago

Thanks for the update. Just wanted to know if it's available. It seems like a good feature to have.