graphprotocol / graph-node

Graph Node indexes data from blockchains such as Ethereum and serves it over GraphQL
https://thegraph.com
Apache License 2.0

Multiple data sources #762

Open Jannis opened 5 years ago

Jannis commented 5 years ago

Summary

So far we only support Ethereum contracts as data sources and event handlers as mappings. As a consequence, the code that wires up subgraphs for indexing is kind of special-cased (e.g. everything is based on the block stream). We aim to extend subgraphs to support other data sources like IPFS files and make it easier to add more data source types over time.

This issue describes the requirements for supporting multiple data source types and the changes proposed to implement them.

Requirements

Proposed Changes

I propose that the implementation takes place in two phases. First, implement block streaming, state management, and progress information per data source rather than per subgraph. Then, extend the system to support generic (and potentially different) data sources.

Phase 1

  1. Make SubgraphInstanceManager and SubgraphInstance instantiate a DataSourceInstance for every data source.
  2. Add a DataSourceIndexer trait that is responsible for indexing a particular type of data source. Add an EthereumContractInstance version that operates based on a block stream (a sketch follows this list).
  3. Add a Store helper to read the state of a data source of a subgraph as a From<serde_json::Value> or similar.
  4. Change block pointers to be per data source instead of per subgraph.
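
As a rough illustration of steps 2 and 4, here is what the trait could look like; all names and signatures below are assumptions for illustration, not a settled API:

use std::error::Error;

// Hypothetical trait from step 2: one implementation per data source type,
// each responsible for driving the indexing of that kind of source.
trait DataSourceIndexer {
    // Runs until the data source is exhausted or the subgraph is removed.
    fn index(&mut self) -> Result<(), Box<dyn Error>>;
}

// Hypothetical Ethereum version that operates on a block stream.
struct EthereumContractInstance {
    // A block stream and a store handle would live here.
}

impl DataSourceIndexer for EthereumContractInstance {
    fn index(&mut self) -> Result<(), Box<dyn Error>> {
        // For each block from the stream: run the matching event handlers,
        // then persist this data source's own block pointer (step 4).
        Ok(())
    }
}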

Phase 2

  1. Extend the GraphQL schema for the subgraph of subgraphs to support data sources of different types, either through interfaces or unions.
  2. Turn data sources and mappings in SubgraphManifest into an enum, or come up with a builder-style pattern, for creating DataSourceInstances from data sources (a sketch of the enum option follows this list).
  3. Make SubgraphRegistrar write different types of data sources to the store.
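
A minimal sketch of the enum option from step 2; the variant and field names are assumptions:

// Hypothetical data source enum in SubgraphManifest: one variant per
// data source type, each carrying its type-specific configuration.
enum DataSource {
    EthereumContract {
        address: String, // contract address
        abi: String,     // name of the ABI declared in the manifest
    },
    IpfsFile {
        hash: String, // IPFS content hash of the file to index
    },
}
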
leoyvens commented 5 years ago

Does each EthereumContractInstance have its own block stream? I see two options, given the requirement that events are processed block by block, and within each block in a specific data source order.

  1. What we do right now, which can be described as:

    for each block in blockstream:
        for each datasource in blockstream.datasources:
            process(block, datasource)
  2. If each datasource holds a block stream, we might have to flip to:

    for each datasource in datasources:
        let block = datasource.blockstream.next();
        process(block, datasource)

What we have right now with the single block stream seems better.

Jannis commented 5 years ago

@Zerim What's your latest thinking about dependencies between data sources, like order and overlap in entities etc.?

I think even if entities are disjoint between data sources, it would still be very weird if data source A was at block 5,000,000 and data source B was at block 3,000,000. So you're right @leodasvacas, they can't be independent (like I was thinking). We can think of the block stream as a way of synchronizing data sources that originate from the same blockchain (kinda like a master clock).

The only tricky part here is that we need to make sure to re-scan the current block and handle new events when a dynamic data source is added (see #719). That's a problem we can solve, though.
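
A minimal sketch of that re-scan, with every name below assumed for illustration:

struct Block;
struct DataSource;

// Would match the block's triggers against this data source's handlers and
// return any dynamic data sources those handlers created (see #719).
fn run_handlers(_block: &Block, _ds: &DataSource) -> Vec<DataSource> {
    Vec::new()
}

// Processes one block, re-scanning it until no pass adds new data sources,
// so that dynamically added sources also see the block that created them.
fn process_block(block: &Block, data_sources: &mut Vec<DataSource>) {
    let mut next = 0;
    while next < data_sources.len() {
        let end = data_sources.len();
        for i in next..end {
            let new_sources = run_handlers(block, &data_sources[i]);
            data_sources.extend(new_sources);
        }
        next = end;
    }
}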

Jannis commented 5 years ago

I still propose DataSourceInstance to encapsulate the processing logic for each data source type in a dedicated type. I'd even go one step further and have the default DataSourceInstance trait be stateless, extending it with a StatefulDataSourceInstance. That way, some data sources can have their own state (like IPFS file streams) while others rely purely on their blockchain adapter's stream of information.

Jannis commented 5 years ago

Let me back out of the StatefulDataSourceInstance suggestion. I think it's abstracting too early.

Here's my current thinking of how we can rework the current codebase:

Subgraph of subgraphs

In the subgraph of subgraphs, I propose the following change: Instead of

subgraphDeployment(id: ...) {
  latestEthereumBlockNumber
  latestEthereumBlockHash
  totalEthereumBlockCount
}

organize the progress information per stream and add a dataSources field for progress information of data sources that have their own state:

subgraphDeployment(id: ...) {
  streams {
    ethereum {
      latestBlockNumber
      latestBlockHash
      totalBlockCount
    }
  }
  dataSources { # a [DataSourceState!]! value
    id
    ... on IPFSFileDataSource {
      bytesRead
      totalBytes
    }
  }
}

How to add new data source types

  1. Add a new variant to the data source enum under SubgraphManifest; implement writing this variant to the database as an entity.
  2. Implement a new DataSourceInstance enum variant for instances of the new data source type.
  3. If necessary, add a new BlockStreamBuilder for the new blockchain / data source type.
  4. In SubgraphInstanceManager, wire up data sources of the new type (a sketch follows this list).
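
A hedged sketch of step 4, reusing the hypothetical enum and trait from the earlier comments (stubbed here, with simplified signatures, so the snippet stands alone):

enum DataSource { EthereumContract, IpfsFile }
trait DataSourceIndexer { fn index(&mut self); }
struct EthereumContractInstance;
struct IpfsFileInstance;
impl DataSourceIndexer for EthereumContractInstance { fn index(&mut self) {} }
impl DataSourceIndexer for IpfsFileInstance { fn index(&mut self) {} }

// Hypothetical dispatch in SubgraphInstanceManager: pick the matching
// DataSourceInstance implementation for each data source in the manifest.
fn instantiate(ds: &DataSource) -> Box<dyn DataSourceIndexer> {
    match ds {
        DataSource::EthereumContract => Box::new(EthereumContractInstance),
        DataSource::IpfsFile => Box::new(IpfsFileInstance),
    }
}
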
leoyvens commented 5 years ago

As we were discussing on Discord, there must be a sequence of events across data sources so that a subgraph can behave deterministically. Right now this is implicitly Ethereum block numbers, but we'll need to abstract that. Before figuring out the code, it's worth abstracting the mental model.

What about this for basic principles:

Zerim commented 5 years ago

My preference would be to make data sources responsible for disjoint entity types when it comes to reading from/writing to the store inside of mappings.

I would not attempt to synchronize across data sources, nor would you need to if you restrict the mappings in the way I describe above. If the data sources represent two different blockchains, then the user should specify, at query time, as of which block they wish to query each data source.

One question is whether we want to support multiple transaction trigger types for the same blockchain (i.e. Solidity events, external transaction triggers, block triggers, and internal transaction triggers in Ethereum) within a single data source. Synchronization is not the challenge here, since they all happen within a single block; rather, it's that the ordering of these triggers relative to one another would need to be specified.
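
One way to pin that ordering down, as a hedged sketch (the names and the chosen order are assumptions):

// Hypothetical: an explicit total order over trigger kinds within a block;
// sorting a block's triggers by this key makes processing deterministic.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
enum TriggerKind {
    Block,               // block triggers first
    ExternalTransaction, // then top-level transaction triggers
    InternalTransaction, // then internal transaction triggers
    Event,               // Solidity event triggers last
}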

Zerim commented 5 years ago

Update based on our conversation in Discord:

In the future, we could introduce the notion of compound data sources, which require that an Indexing Node interact with multiple blockchains or storage networks, and which specify an interleaving strategy.

However, in the default case, an Indexing Node should not need to index every data source in a subgraph in order to run the mappings for a single data source.
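
Purely as an illustration of that future direction, a sketch with invented names:

// Hypothetical "compound" data source: it spans several networks and names
// an explicit strategy for interleaving their events.
enum InterleavingStrategy {
    ByBlockTimestamp, // merge the streams ordered by block timestamp
}

struct CompoundDataSource {
    networks: Vec<String>, // e.g. an Ethereum network plus a storage network
    interleaving: InterleavingStrategy,
}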