graphprotocol / graph-node

Graph Node indexes data from blockchains such as Ethereum and serves it over GraphQL
https://thegraph.com
Apache License 2.0

Multiple data sources #762

Open Jannis opened 5 years ago

Jannis commented 5 years ago

Summary

So far we only support Ethereum contracts as data sources and event handlers as mappings. As a consequence, the code that wires up subgraphs for indexing is kind of special-cased (e.g. everything is based on the block stream). We aim to extend subgraphs to support other data sources like IPFS files and make it easier to add more data source types over time.

This issue describes the requirements for supporting multiple data source types and the changes proposed to implement them.

Requirements

Proposed Changes

I propose that the implementation takes place in two phases. First, implement block streaming, state management, and progress information per data source rather than per subgraph. Then, extend the system to support generic (and potentially different) data sources.

Phase 1

  1. Make SubgraphInstanceManager and SubgraphInstance instantiate a DataSourceInstance for every data source.
  2. Add a DataSourceIndexer trait that is responsible for indexing a particular type of data source. Add an EthereumContractInstance version that operates based on a block stream (a sketch follows this list).
  3. Add a Store helper to read the state of a data source of a subgraph as a From<serde_json::Value> or similar.
  4. Change block pointers to be per data source instead of per subgraph.
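
As a rough illustration of steps 2 and 4, here is what the trait could look like; all names and signatures below are assumptions for illustration, not a settled API:

use std::error::Error;

// Hypothetical trait from step 2: one implementation per data source type,
// each responsible for driving the indexing of that kind of source.
trait DataSourceIndexer {
    // Runs until the data source is exhausted or the subgraph is removed.
    fn index(&mut self) -> Result<(), Box<dyn Error>>;
}

// Hypothetical Ethereum version that operates on a block stream.
struct EthereumContractInstance {
    // A block stream and a store handle would live here.
}

impl DataSourceIndexer for EthereumContractInstance {
    fn index(&mut self) -> Result<(), Box<dyn Error>> {
        // For each block from the stream: run the matching event handlers,
        // then persist this data source's own block pointer (step 4).
        Ok(())
    }
}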

Phase 2

  1. Extend the GraphQL schema for the subgraph of subgraphs to support data sources of different types, either through interfaces or unions.
  2. Turn data sources and mappings in SubgraphManifest into an enum, or come up with a builder-style pattern, for creating DataSourceInstances from data sources (a sketch of the enum option follows this list).
  3. Make SubgraphRegistrar write different types of data sources to the store.
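
A minimal sketch of the enum option from step 2; the variant and field names are assumptions:

// Hypothetical data source enum in SubgraphManifest: one variant per
// data source type, each carrying its type-specific configuration.
enum DataSource {
    EthereumContract {
        address: String, // contract address
        abi: String,     // name of the ABI declared in the manifest
    },
    IpfsFile {
        hash: String, // IPFS content hash of the file to index
    },
}
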
leoyvens commented 5 years ago

Does each EthereumContractInstance have its own block stream? I see two options, given the requirement that events are processed block by block, and within each block in a specific data source order.

  1. What we do right now, which can be described as:

    for each block in blockstream:
        for each datasource in blockstream.datasources:
            process(block, datasource)
  2. If each datasource holds a block stream, we might have to flip to:

    for each datasource in datasources:
        let block = datasource.blockstream.next();
        process(block, datasource)

What we have right now with the single block stream seems better.

Jannis commented 5 years ago

@Zerim What's your latest thinking about dependencies between data sources, like order and overlap in entities etc.?

I think even if entities are disjoint between data sources, it would still be very weird if data source A was at block 5,000,000 and data source B was at block 3,000,000. So you're right @leodasvacas, they can't be independent (like I was thinking). We can think of the block stream as a way of synchronizing data sources that originate from the same blockchain (kinda like a master clock).

The only tricky part here is that we need to make sure to re-scan the current block and handle new events when a dynamic data source is added (see #719). That's a problem we can solve, though.
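
A minimal sketch of that re-scan, with every name below assumed for illustration:

struct Block;
struct DataSource;

// Would match the block's triggers against this data source's handlers and
// return any dynamic data sources those handlers created (see #719).
fn run_handlers(_block: &Block, _ds: &DataSource) -> Vec<DataSource> {
    Vec::new()
}

// Processes one block, re-scanning it until no pass adds new data sources,
// so that dynamically added sources also see the block that created them.
fn process_block(block: &Block, data_sources: &mut Vec<DataSource>) {
    let mut next = 0;
    while next < data_sources.len() {
        let end = data_sources.len();
        for i in next..end {
            let new_sources = run_handlers(block, &data_sources[i]);
            data_sources.extend(new_sources);
        }
        next = end;
    }
}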

Jannis commented 5 years ago

I still propose DataSourceInstance to encapsulate the processing logic for each data source type in a dedicated type. I'd even go one step further and have the default DataSourceInstance trait be stateless, extending it with a StatefulDataSourceInstance. That way, some data sources can have their own state (like IPFS file streams) while others rely purely on their blockchain adapter's stream of information.

Jannis commented 5 years ago

Let me back out of the StatefulDataSourceInstance suggestion. I think it's abstracting too early.

Here's my current thinking of how we can rework the current codebase:

Subgraph of subgraphs

In the subgraph of subgraphs, I propose the following change: Instead of

subgraphDeployment(id: ...) {
  latestEthereumBlockNumber
  latestEthereumBlockHash
  totalEthereumBlockCount
}

organize the progress information per stream and add a dataSources field for progress information of data sources that have their own state:

subgraphDeployment(id: ...) {
  streams {
    ethereum {
      latestBlockNumber
      latestBlockHash
      totalBlockCount
    }
  }
  dataSources { # a [DataSourceState!]! value
    id
    ... on IPFSFileDataSource {
      bytesRead
      totalBytes
    }
  }
}

How to add new data source types

  1. Add a new variant to the data source enum under SubgraphManifest; implement writing this variant to the database as an entity.
  2. Implement a new DataSourceInstance enum variant for instances of the new data source type.
  3. If necessary, add a new BlockStreamBuilder for the new blockchain / data source type.
  4. In SubgraphInstanceManager, wire up data sources of the new type (a sketch follows this list).
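
A hedged sketch of step 4, reusing the hypothetical enum and trait from the earlier comments (stubbed here, with simplified signatures, so the snippet stands alone):

enum DataSource { EthereumContract, IpfsFile }
trait DataSourceIndexer { fn index(&mut self); }
struct EthereumContractInstance;
struct IpfsFileInstance;
impl DataSourceIndexer for EthereumContractInstance { fn index(&mut self) {} }
impl DataSourceIndexer for IpfsFileInstance { fn index(&mut self) {} }

// Hypothetical dispatch in SubgraphInstanceManager: pick the matching
// DataSourceInstance implementation for each data source in the manifest.
fn instantiate(ds: &DataSource) -> Box<dyn DataSourceIndexer> {
    match ds {
        DataSource::EthereumContract => Box::new(EthereumContractInstance),
        DataSource::IpfsFile => Box::new(IpfsFileInstance),
    }
}
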
leoyvens commented 5 years ago

As we were discussing on Discord, there must be a sequence of events across data sources so that a subgraph can behave deterministically. Right now this is implicitly Ethereum block numbers, but we'll need to abstract that. Before figuring out the code, it's worth abstracting the mental model.

What about this for basic principles:

Zerim commented 5 years ago

My preference would be to make data sources responsible for disjoint entity types when it comes to reading from/writing to the store inside of mappings.

I would not attempt to synchronize across data sources, nor would you need to if you restrict the mappings in the way I describe above. If the data sources represent two different blockchains, then the user should specify, at query time, as of which block they wish to query each data source.

One question is whether we want to support multiple transaction trigger types for the same blockchain (i.e. Solidity events, external transaction triggers, block triggers, and internal transaction triggers in Ethereum) within a single data source. Synchronization is not the challenge here, since they all happen within a single block; rather, it's that the ordering of these triggers relative to one another would need to be specified.
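
One way to pin that ordering down, as a hedged sketch (the names and the chosen order are assumptions):

// Hypothetical: an explicit total order over trigger kinds within a block;
// sorting a block's triggers by this key makes processing deterministic.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
enum TriggerKind {
    Block,               // block triggers first
    ExternalTransaction, // then top-level transaction triggers
    InternalTransaction, // then internal transaction triggers
    Event,               // Solidity event triggers last
}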

Zerim commented 5 years ago

Update based on our conversation in Discord:

In the future, we could introduce the notion of compound data sources, which require that an Indexing Node interact with multiple blockchains or storage networks, and which specify an interleaving strategy.

However, in the default case, an Indexing Node should not need to index every data source in a subgraph in order to run the mappings for a single data source.
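
Purely as an illustration of that future direction, a sketch with invented names:

// Hypothetical "compound" data source: it spans several networks and names
// an explicit strategy for interleaving their events.
enum InterleavingStrategy {
    ByBlockTimestamp, // merge the streams ordered by block timestamp
}

struct CompoundDataSource {
    networks: Vec<String>, // e.g. an Ethereum network plus a storage network
    interleaving: InterleavingStrategy,
}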