dOrgJelli opened 5 years ago
This is a use case that's likely to come up again and we'll want to support it in some way, but the design and implementation are not simple. It sounds like you can solve this in a different, less ideal way, so I recommend you take that route while we figure this one out.
The three main parts that are difficult about this:
1. There's currently one block stream per subgraph. It only moves forward. When a data source template is instantiated, the block stream is restarted with the additional filters from the new data sources. Making this stream go back and process all past events of a new data source will be complicated. It may be better to spawn separate block streams for each data source but have them synced somehow.
2. Processing past events would require accessing the store at different points in the past. This is not currently possible. Even if it were, it could introduce a time travel problem, where changes generated by new data source B on an older block would have changed the behavior of an already existing data source A.
3. Chain reorgs. A block in which a dynamic data source is created may be removed from the chain again later. When that happens, the dynamic data source needs to be removed again and all its data from that block needs to be removed as well. Now, if we include all past events of such a new data source, we'd have to also remove all entity changes from those past events.
Thank you for this breakdown @Jannis, it was super insightful. We have a workaround we'll use for now @leodasvacas. Looking forward to seeing this spec progress. Godspeed!
Data source templates are commonly used in subgraphs that track contracts which serve as registries for other contracts. In many cases the contract was already in use before being added to the registry, and the subgraph would like to process that past data, but currently that's not possible. See the original comment for a description of a use case.
A relatively simple way to solve this is to extend our logic that reprocesses a single block for the new data source to process any range of past blocks.
How `store.get` should behave here is an open question. I'd suggest that all of this event processing be done on top of the current state of the store. A consequence is that triggers on the new data source can see entity changes made by blocks "from the future" in the parent data source. This may or may not be desirable depending on the use case, but I think it's a good default, because to do `store.get` against the historical store state we'd need to always store the complete history of entity changes for all subgraphs, increasing storage costs across the board.

A new method is added to the generated data source class, `createWithStartBlock(address: Address, startBlock: BigInt)`. The current `create` method will behave as if `startBlock` is the current block. When a data source is created, all of its triggers in the `[startBlock, currentBlock]` range are processed before indexing continues, generalizing the current behaviour of re-processing the `currentBlock`.
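As a rough sketch of how this could look from a mapping (the `Registry` contract, its `PairRegistered` event with a `deployBlock` parameter, and the `Exchange` template are all hypothetical; only `createWithStartBlock` itself is the method proposed above):

```typescript
import { PairRegistered } from "../generated/Registry/Registry"
import { Exchange } from "../generated/templates"

export function handlePairRegistered(event: PairRegistered): void {
  // `Exchange.create(event.params.pair)` would only see triggers from this
  // block onward. The proposed method first processes all of the pair's
  // triggers in [deployBlock, currentBlock], then indexing continues.
  Exchange.createWithStartBlock(event.params.pair, event.params.deployBlock)
}
```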
A caveat is that this will halt indexing of other data sources and the progress is not stored across restarts of the graph node, so this should only be used if the triggers can be processed in a few minutes.
Code generation in graph-cli will need to include `createWithStartBlock`, which will pass an additional string parameter, the start block, to `dataSource.create`. In graph-node, the `dataSource.create` host export will need to handle that second parameter, defaulting to the current block if it is not present.
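To illustrate, the generated template class might end up looking roughly like this (a sketch only; the actual generated code and the exact encoding of the extra parameter are up to graph-cli and the host export):

```typescript
import { Address, BigInt, DataSourceTemplate } from "@graphprotocol/graph-ts"

export class Exchange extends DataSourceTemplate {
  static create(address: Address): void {
    // Existing codegen: the template name plus the address parameter.
    DataSourceTemplate.create("Exchange", [address.toHex()])
  }

  static createWithStartBlock(address: Address, startBlock: BigInt): void {
    // One possible encoding: append the start block as an extra string
    // parameter for the dataSource.create host export to pick up.
    DataSourceTemplate.create("Exchange", [address.toHex(), startBlock.toString()])
  }
}
```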
The implementation will generalize the logic we have for re-processing the current block in the instance manager, but we will need to better organize our trigger filtering logic so that it can be used both in the block stream for the initial filtering and in the instance manager for processing dynamic data sources. The goal is to refactor the current code so we end up with a re-usable function that looks like:
```rust
fn triggers(from: BlockPtr, to: BlockPtr, current_block: BlockPtr,
            log_filter: LogFilter, call_filter: CallFilter,
            block_filter: BlockFilter) -> Vec<EthereumTrigger>
```
This should internally check the reorg threshold to decide how to safely retrieve the triggers.
Using this might simplify our current block stream, since what we do right now is basically:
I don't know what our rationale was to do things in this way, but steps 2-4 seem unnecessary to me; I think we could just get the logs and calls and pass them to the data sources, which could also be a performance win.
The section 'Data Source Templates -> Instantiating a Data Source Template' should be updated to document `createWithStartBlock`, its target use cases, and its caveats.
Codegen tests in graph-cli.
Manual testing of the graph-node changes and the new feature, on realistic subgraphs.
[ ] (3d) graph-node: Refactor how triggers are filtered for better code re-use. This might end up as a separate PR.
[ ] (1d) graph-cli: Codegen for `createWithStartBlock` and tests.
[ ] (1d) graph-node: Accept two params in `dataSource.create`, plumb that to the instance manager.
[ ] (1d) graph-node: Use new trigger filtering code to generalize the block re-processing logic to take a range.
[ ] (2d) graph-node: Manual testing.
[ ] (1d) docs: Write the docs.
It would be good to simplify the Compound subgraph and allow us to not upgrade it each time a new asset is added.
Is this supported now?
@itopmoon This is still on the radar but it is a difficult one.
Thanks for your update. Just wanted to know if it's available. It seems like a good feature to have.
Do you want to request a feature or report a bug? Feature.
What is the current behavior? From https://github.com/graphprotocol/graph-node/pull/832 : "Whenever we process a block and mappings request new data sources, these data sources are collected and, after having processed the block, are instantiated from templates. We then process the current block again but only with those new data sources. The entity operations from this are merged into the ones we already have for the block. After that, the dynamic data sources are persisted by adding the data sources to the subgraph instance and by adding entity operations to store them in the db (for 1.)."
What is the desired behavior? ".... We then process all blocks in the range of [contract creation -> current block] for each newly created data source."
More explanation. In the DAOstack protocol, new DAOs can be created in a number of different ways. In order to support this (and minimize spam), the desired behavior for the subgraph is to only index DAOs that are added to a universal registry of "accepted" DAOs. Given graph-node's current functionality, a lot of information about these newly registered DAOs would never be read/processed/stored because it took place prior to the current block. In order to support this scenario (which IMO is bound to be commonplace for other projects), being able to process the entire block history for a given contract upon instantiating it as a data source would be ideal. A sketch of such a registry-gated handler follows below.
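For concreteness, a registry-gated handler along these lines might look like the following (the contract, event, and template names are illustrative, not DAOstack's actual ones):

```typescript
import { DAORegistered } from "../generated/DAORegistry/DAORegistry"
import { DAO } from "../generated/templates"

export function handleDAORegistered(event: DAORegistered): void {
  // Begin indexing the newly accepted DAO. With current behaviour, its
  // triggers are only picked up from this block onward, so everything the
  // DAO did before being registered is never processed.
  DAO.create(event.params.avatar)
}
```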
Thread here: https://github.com/daostack/subgraph/issues/197