Thus far we've defined data availability as a secondary concern for the protocol, because The Graph does not claim to be a source of truth for any data, only an indexed view of data that exists elsewhere: on Ethereum and IPFS.
However, even if we keep our goals modest (failing gracefully when data is unavailable, rather than guaranteeing that data is available), data availability still presents challenges and attack vectors that our protocol design must account for.
Scenarios
There are several possible scenarios for why an IPFS object might be unavailable:
The data was available once, but no one pinned the data to keep it available.
The data was available once, but only one Indexing Node pinned it to keep it available.
The data was never available (i.e., the IPFS hash was generated randomly, not by hashing actual content).
The data was never widely available (i.e., an Indexing Node or user posted a valid content hash, but never made the content widely available to the network).
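From a single node's vantage point, the last two scenarios can be hard to tell apart from simple unpinned data: a retrieval that times out could mean the content never existed, or merely that no reachable peer is providing it right now. A minimal sketch of that limitation (the `fetch` callback and the timeout value are illustrative assumptions, not part of any protocol):

```python
import concurrent.futures

def classify_availability(cid, fetch, timeout_seconds=30):
    """Try to retrieve an IPFS object; report 'available' or 'unavailable'.

    Note the inherent subjectivity: a timeout cannot distinguish
    "the hash was generated randomly and no content exists" from
    "the content exists but no reachable peer is providing it".
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch, cid)
        try:
            return ("available", future.result(timeout=timeout_seconds))
        except concurrent.futures.TimeoutError:
            # The fetch is still running; we simply stop waiting for it.
            return ("unavailable", None)

# Stand-in fetcher that responds immediately:
status, data = classify_availability("QmExampleHash", lambda cid: b"payload", 5)
# status == "available", data == b"payload"
```

In the timeout case the node learns nothing definitive about why the data was unreachable, which is what makes these scenarios indistinguishable in practice.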
Additionally, there are several possible responses an Indexing Node might take in these scenarios:
Give up on indexing the subgraph.
Engage in anticompetitive behavior. Monopolistically collect fees in the data retrieval market, knowing that no one else can index the subgraph.
Engage in malicious behavior. Serve incorrect data to end users knowing that no one can index the subgraph to check their work.
Possible Solutions
1. If a centralized Indexing Node (run by The Graph team) or a set of Oracles cannot index a subgraph, then the data retrieval marketplace for that subgraph is disabled.
2. Make Indexing Nodes responsible for pinning data. Any Indexing Node that processes a query as of a certain block is also attesting that it will keep the underlying data available to the network for X amount of time. We could then apply some sort of Swarm or Filecoin-style "Proof of Retrievability" game to enforce this.
3. Require economic guarantees from an external protocol such as SWARM or Filecoin that data corresponding to a content hash is available, before it is considered valid for indexing.
4. Skip indexing events which reference data that is unavailable.
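The attestation in Solution 2 could be modeled roughly as below. The record fields, the retention window, and the slashing condition are assumptions sketched for illustration, not a worked-out Proof of Retrievability scheme:

```python
import time
from dataclasses import dataclass

RETENTION_SECONDS = 30 * 24 * 3600  # the "X amount of time"; an assumed value

@dataclass
class AvailabilityAttestation:
    """Issued when an Indexing Node serves a query as of some block:
    a promise to keep the referenced IPFS objects retrievable."""
    indexer: str
    ipfs_hashes: tuple
    issued_at: float

    def is_active(self, now=None):
        now = time.time() if now is None else now
        return now < self.issued_at + RETENTION_SECONDS

def should_slash(attestation, challenged_hash, could_retrieve, now=None):
    """A challenge succeeds (the indexer is slashed) only if the attestation
    covers the hash, is still within its retention window, and the
    challenged object could not be retrieved from the indexer."""
    return (
        attestation.is_active(now)
        and challenged_hash in attestation.ipfs_hashes
        and not could_retrieve
    )

att = AvailabilityAttestation("indexer-1", ("QmFoo",), issued_at=0.0)
# Within the window and retrieval failed -> slashable.
assert should_slash(att, "QmFoo", could_retrieve=False, now=1000.0)
# After the window expires, the promise no longer binds.
assert not should_slash(att, "QmFoo", could_retrieve=False, now=RETENTION_SECONDS + 1.0)
```

Note that `could_retrieve` is exactly the subjective judgment a real Proof of Retrievability game would have to make objective; this sketch simply takes it as an input.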
Analysis
Solution 1 seems to provide an adequate incentive for Curators and Indexing Nodes to keep data available, but it doesn't address the scenario where someone maliciously uploads an invalid or unavailable content hash in order to deny service to a specific subgraph.
Solution 2 seems reasonable for mitigating anticompetitive behavior; however, it also doesn't mitigate the denial-of-service attack. Additionally, if the size of objects on IPFS isn't bounded, it presents another attack vector: an attacker could force Indexing Nodes to store an untenable amount of data or else be slashed.
Solution 3 presents difficulties with subjectivity. For example, what if SWARM guarantees the availability of data when one Indexing Node indexes a subgraph, but some time later, when another Indexing Node is indexing the data, SWARM no longer provides this guarantee? Should the first Indexing Node have to continuously monitor the state of every IPFS object it has seen, and re-index if one of them is no longer available? Also, SWARM and Filecoin only provide economic guarantees, so even with reasonable assurances from the external protocol, there is still the possibility that data is unavailable.
Solution 4 also presents the subjectivity problem. How does an Indexing Node know when it's okay to skip an event, as opposed to when it simply needs to fetch the data from another Indexing Node, or when another Indexing Node is withholding the data required to index the event? It could perhaps be used in conjunction with one of the other solutions.
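The skip behavior of Solution 4, with bookkeeping so skipped events can be retried later, might look like the following sketch. The event shape and the fetch interface are assumptions for illustration:

```python
def index_events(events, fetch_or_none):
    """Process events in order, skipping (and recording) any whose
    referenced IPFS object cannot be retrieved right now.

    fetch_or_none(cid) returns the object's bytes, or None if the
    data is unavailable -- an inherently subjective judgment from a
    single node's point of view.
    """
    indexed, skipped = [], []
    for event in events:
        cid = event.get("ipfs_hash")
        if cid is None:
            indexed.append((event["id"], None))  # no off-chain data referenced
            continue
        data = fetch_or_none(cid)
        if data is None:
            skipped.append(event["id"])          # candidate for a later retry
        else:
            indexed.append((event["id"], data))
    return indexed, skipped

# Stand-in store in which only one hash resolves:
store = {"QmAvailable": b"blob"}
events = [
    {"id": 1, "ipfs_hash": "QmAvailable"},
    {"id": 2, "ipfs_hash": "QmMissing"},
    {"id": 3},
]
indexed, skipped = index_events(events, store.get)
# indexed -> [(1, b"blob"), (3, None)], skipped -> [2]
```

Two nodes with different connectivity will produce different `skipped` sets from the same event stream, which is exactly the subjectivity problem described above.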
Have I missed anything? Note: I haven't touched on "Index Chains" in this write-up, since I'm thinking of the v1 hybrid solution right now, but I think much of the same analysis holds true.