graphprotocol / graph-node

Graph Node indexes data from blockchains such as Ethereum and serves it over GraphQL
https://thegraph.com
Apache License 2.0

Arbitrary HTTP File Data Sources #4847

Open azf20 opened 1 year ago

azf20 commented 1 year ago

Extend File Data Sources to support fetching arbitrary off-chain files, based on an HTTP URL.

kind: file/http

The fetching process should be aware of HTTP status codes, which provide useful information on the "liveness" of a given endpoint. This might require more robust retry and back-off rules (for example, if an endpoint is no longer active).
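A minimal sketch of what status-aware retry rules might look like, assuming Node-style `fetch` and hypothetical thresholds (this is an illustration, not graph-node's actual fetch pipeline):

```typescript
// Status-aware retry with exponential back-off; all names hypothetical.
async function fetchWithBackoff(
  url: string,
  maxAttempts: number = 5,
  baseDelayMs: number = 1000,
): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(url);
    // 2xx: success, hand the body to the file data source handler
    if (res.ok) return res;
    // 4xx (except 429): the endpoint answered, but the file is gone or
    // forbidden; retrying will not help, so fail deterministically
    if (res.status >= 400 && res.status < 500 && res.status !== 429) {
      throw new Error(`permanent failure: HTTP ${res.status}`);
    }
    // 429/5xx: likely transient, so back off exponentially and retry
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
  throw new Error(`endpoint unresponsive after ${maxAttempts} attempts`);
}
```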

Unlike IPFS and Arweave File Data Sources, files from HTTP endpoints may change over time, which implies a need to refetch and reprocess. In that case, the new data would need to overwrite previous entities. This refetching could be triggered by a new on-chain entity creating a file data source with the same URL, or by a manual update on the indexing status API:

refetchFileDataSource(deployment: "Qm...", file: "http://myfile.com/1.json")

Note: there is the case of NFTs where the actual tokenURI changes over time. This might call for a further pattern where the tokenURI is refetched as of the latest block, but that requires further definition.

azf20 commented 1 year ago

An additional use case which the above design doesn't allow for is a change in on-chain state that needs to be reflected in a subgraph, but for which no event is emitted.

A specific example is where the tokenURI(tokenId) is updated on an ERC721 contract, perhaps to point to a new piece of token metadata. In this case, the subgraph needs to:

  1. make an eth_call to fetch the latest value for the tokenURI
  2. create a new file data source with that value
  3. update subgraph state accordingly: either update the "chain-based" state with the new tokenURI (which would make the on-chain data non-deterministic), or update the "file-based" state from the new data (which would break the current causality region constraints)

This process would be triggered by an off-chain event (the equivalent to "refresh metadata", on sites such as Opensea)
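For concreteness, steps 1 and 2 in a mapping might look roughly like the AssemblyScript sketch below. The ERC721 binding and import paths are assumptions for illustration, and TokenMetadata stands in for a hypothetical file/http data source template:

```typescript
// AssemblyScript mapping sketch; hypothetical names throughout.
// The ERC721 bindings would be generated by `graph codegen`.
import { Address, BigInt } from "@graphprotocol/graph-ts";
import { ERC721 } from "../generated/POAP/ERC721";
import { TokenMetadata } from "../generated/templates";

export function refreshTokenMetadata(contract: Address, tokenId: BigInt): void {
  // 1. eth_call to fetch the latest value of tokenURI(tokenId)
  let erc721 = ERC721.bind(contract);
  let uri = erc721.try_tokenURI(tokenId);
  if (uri.reverted) {
    return;
  }

  // 2. create a file data source with that value; re-seeing a known
  //    URL is what would trigger the refetch discussed above
  TokenMetadata.create(uri.value);

  // 3. the file handler would then overwrite the earlier file-based
  //    entities, which is where the causality region tension arises
}
```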

Note that you could argue that with Substreams you could track changes in storage slots, but storage layouts don't necessarily adhere to a standard across contracts, at least for the NFT example above.

mangas commented 1 year ago

From a security perspective, a couple of concerns come to mind: this could be "easily" leveraged with malicious intent. For instance, an unprotected node endpoint could be used to make a lot of calls to a single endpoint, or to access data that was accidentally made available over HTTP. We can mitigate by setting an allow list, but then we are back to the feature being less than useful, since indexers would need to know ahead of time which URLs can be accessed. (We could consider adding base_urls to the manifest, which would be enforced, but that does not prevent the above.)
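For illustration, a base_urls field might look like this in the manifest; this is a hypothetical extension, not an existing manifest key:

```yaml
# Hypothetical manifest extension; base_urls does not exist today.
templates:
  - kind: file/http
    name: TokenMetadata
    base_urls:
      - https://api.example.com/metadata/ # only URLs under these
      - https://assets.example.com/       # prefixes could be fetched
```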

From a determinism point of view this is really hard to manage, as there are too many variables, from unavailability to data changing over time. Considering the security measures, determinism, usability, and so on, I think the implementation effort is quite high for something that could easily be solved by the end user, who could either publish the relevant data to IPFS/Arweave or do the fetching on the app's side: the subgraph can still store the URL, and when consuming the information the subgraph consumer fetches this data in parallel.

There is also an operational concern around data sizes. This is obviously not new to Arweave and IPFS, but those storage layers have an incentive (cost) to keep files small, whereas endpoints like S3 are very cost-efficient for quite large files. Storing large blobs may also add overhead in terms of database efficiency; graph-node can limit the size, of course, which once again may impact usability. (This may already have been discussed and settled, since it is similar to IPFS and Arweave in the extreme cases.)

In terms of providing an alternative, I think an interesting discussion could be the introduction of some query-time WASM. This could provide some of the flexibility by supporting queries against existing subgraphs (potentially more than one), plus the ability to fetch files from decentralised, or maybe even arbitrary HTTP, data sources. The difference is that it could be metered, stored temporarily (or permanently as well), and would be more flexible overall, since it builds on the existing solid foundation of subgraphs while still allowing very flexible behaviour.

schmidsi commented 11 months ago

I really think this is one of the features that would unlock a completely new class of use cases. That being said, I also understand that it is tricky to get right. So maybe it helps if we specify a real-world use case and then design the system backwards from there?

Use Case: POAP

I suggest we take the POAP subgraph as an example.

The tracked POAP contract is basically an ERC721 contract with tokenURIs hosted on their own servers, which in turn reference an image_url also hosted on their own servers (examples linked).

Requirements

In my opinion, a consumer of such a subgraph might have the following requirements:

github-actions[bot] commented 5 months ago

Looks like this issue has been open for 6 months with no activity. Is it still relevant? If not, please remember to close it.

alex-pakalniskis commented 1 month ago

Ford, Simon, and I recently met with Andreas, a dev from the LUKSO project. Our conversation was mainly so Andreas could share feedback on LUKSO's use of The Graph, but we briefly discussed how LUKSO has tackled arbitrary HTTP file data sources.

Andreas mentioned that LUKSO stores Keccak-256 hashes of HTTP file data source metadata on chain. This lets them validate an HTTP file data source against the on-chain hash, giving integrity guarantees comparable to retrieving via Arweave or IPFS.

I think their erc725.js tool, the concept of VerifiableURI, and the encodeDataSourceWithHash method might be relevant here.
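To make the pattern concrete, here is a minimal TypeScript sketch of verifying an HTTP file against an on-chain Keccak-256 hash, in the spirit of VerifiableURI (erc725.js ships its own helpers for this; fetchVerified and its parameters are made up for illustration):

```typescript
import { keccak256 } from "ethers";

// Fetch a file over HTTP and verify it against a hash stored on chain.
// `expectedHash` would be read from contract storage; the URL itself
// can point anywhere, since integrity comes from the hash check.
async function fetchVerified(
  url: string,
  expectedHash: string,
): Promise<Uint8Array> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status} fetching ${url}`);
  const bytes = new Uint8Array(await res.arrayBuffer());
  if (keccak256(bytes) !== expectedHash.toLowerCase()) {
    throw new Error("file does not match its on-chain Keccak-256 hash");
  }
  return bytes;
}
```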