graphprotocol / graph-node

Graph Node indexes data from blockchains such as Ethereum and serves it over GraphQL
https://thegraph.com
Apache License 2.0

Query time computation (custom GraphQL resolvers) #880

Open davekaj opened 5 years ago

davekaj commented 5 years ago

Description

Solidity contracts can have view/constant functions that do calculations for free. A somewhat common pattern is for Solidity developers to expose these, and then hit an Ethereum node for free from the front end to show the most up-to-date data.

Example of a protocol doing this

Specifically, Compound does this, and for their case it makes a lot of sense. Each block, their users accumulate more interest as the block number increases. The storage of the smart contract doesn't update, but if you were to cash out your earned interest, the contract would accrue the amount right before cashing out.

So we have this idea of freshEarnedInterest and staleEarnedInterest, where the fresh one is calculated up to the most recent block, and the stale one is calculated as of the most recent block in which the user interacted with the protocol, and is the actual value stored on the blockchain.
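To illustrate the distinction, here is a minimal sketch. The linear per-block interest math is illustrative only, not Compound's actual accrual formula, and all names are hypothetical:

```typescript
// Illustrative only: linear per-block interest, not Compound's actual math.

// Value as stored on-chain: interest accrued up to the user's last interaction.
function staleEarnedInterest(
  principal: number,
  ratePerBlock: number,
  startBlock: number,
  lastInteractionBlock: number
): number {
  return principal * ratePerBlock * (lastInteractionBlock - startBlock);
}

// Value a query-time resolver could compute: accrued up to the latest block.
function freshEarnedInterest(
  principal: number,
  ratePerBlock: number,
  startBlock: number,
  currentBlock: number
): number {
  return principal * ratePerBlock * (currentBlock - startBlock);
}
```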

Proposed Ideas on handling this

Next Steps

So I suggest query-time computation. I imagine this is an epic and might take a while, so I'm looking for input on the following:

davekaj commented 5 years ago

Ran into this with Uniswap as well (first time I ran into it was Compound)

For reference, here is a practical explanation of what is needed for Uniswap:

We have all the values to calculate their return:

originalInvestmentInEth = 1.5 ETH + 250 DAI = 3 ETH
returnInEth = (userLiquidityTokenBalance / totalLiquidityTokens) * (ethBalanceExchange + tokenBalanceExchange / tokenPrice) - originalInvestmentInEth
returnInEth = (2/100) * (77 + 13000/168.83) - 3
returnInEth = 3.08 - 3 = 0.08

The pattern to pay attention to is that the exchange is always up to date with live numbers, while the user info is out of date but can still be combined with the live exchange info to calculate their return.

And there is currently no good way to show this in subgraphs, which are only triggered when users interact with the protocol. Calculating this every block isn't a good idea because you would need to do it for thousands of users, all the time. A query-time calculation works really well. The pattern is very similar in Compound.
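As a sketch of what such a query-time calculation could look like, the formula above translates directly into a function over a stale user position and a live exchange state. The entity shapes here are hypothetical, purely for illustration:

```typescript
// Sketch only: hypothetical entity shapes for the Uniswap example above.
interface UserLiquidity {
  liquidityTokenBalance: number;   // stale: from the user's last interaction
  originalInvestmentInEth: number; // e.g. 1.5 ETH + 250 DAI = 3 ETH
}

interface Exchange {
  totalLiquidityTokens: number; // live
  ethBalance: number;           // live
  tokenBalance: number;         // live
  tokenPrice: number;           // live, tokens per ETH
}

// Live exchange numbers combined with the stale user position.
function returnInEth(user: UserLiquidity, exchange: Exchange): number {
  const share = user.liquidityTokenBalance / exchange.totalLiquidityTokens;
  const poolValueInEth =
    exchange.ethBalance + exchange.tokenBalance / exchange.tokenPrice;
  return share * poolValueInEth - user.originalInvestmentInEth;
}

// With the numbers above: (2/100) * (77 + 13000/168.83) - 3 ≈ 0.08
```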

leoyvens commented 5 years ago

Rationale / Use Cases

See comments above for use cases.

Requirements

The user must be able to define fields whose value is determined at query time, by executing user-defined logic that has access to APIs such as store.get and contract calls. This logic will be defined in a WASM function, so that it synergizes with the existing support for WASM mappings. The input to this function will be the entity to which the field being resolved belongs, with all of the stored fields set. In the MVP, only scalar fields can be custom-resolved, not entity fields.

Proposed User Experience

A resolverModule field is added as a subfield of schema in the manifest; it specifies a WASM file, just as we currently do for mappings. In the GraphQL schema, an entity with a custom-resolved field looks like:

type Entity {
    attr1: OtherEntity,
    attr2: BigInt!,
    field: Int! @resolvedBy(func: "resolverName", args: ["attr1", "attr2"]) 
}

where resolverName is a function exported by resolverModule, and the list provided in args contains the fields of Entity that will be provided as input to resolverName. That function would have the signature:

export function resolverName(entity: Entity): Value

It takes the generated entity type as an argument, the same one we generate for the schema (though not all fields will be present, only those specified in args), and returns a store value. That function has access to the same imports from the host as mappings, except those that mutate the store. At least for the MVP, ipfs.cat is also forbidden: none of the current use cases require it, and the semantics of an IPFS timeout during a query are not obvious.
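For concreteness, a resolver for the schema above might look like the following. This is a sketch only: the field logic is invented, the import path follows the usual codegen convention, and it assumes the generated Entity class plus the graph-ts Value type:

```typescript
import { Value } from "@graphprotocol/graph-ts";
import { Entity } from "./generated/schema"; // generated entity class (hypothetical path)

// Sketch of a custom resolver for the schema above. Only attr1 and attr2
// are set on the entity, per the args list in the @resolvedBy directive.
export function resolverName(entity: Entity): Value {
  // Invented logic: derive an Int from a stored BigInt field. A real
  // resolver could also make contract calls, but may not mutate the store.
  let base = entity.attr2.toI32();
  return Value.fromI32(base * 2);
}
```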

Limitations:

Proposed Implementation

graph-node: Needs to accept the new manifest field and the schema directive. The manifest field is added to the subgraph of subgraphs. The important changes are in the GraphQL resolver logic. When an object is being resolved, we make sure to fetch the args fields in addition to the fields requested in the query, so that custom-resolved fields are executed with all necessary data available.

The resolver checks for the directive, and if present it will instantiate a WasmiModule and execute the custom resolver by passing in an entity with the args fields set. The returned store value is then converted to a query value and coerced. Parallel resolving is something the design allows, but will not be done for the MVP.

To instantiate the WASM module the first time, the resolver must query the IPFS hash of the resolver module from the subgraph of subgraphs, fetch it from IPFS and create a ValidModule. The resolver will keep an LRU cache of the last 100 ValidModules to speed up instantiation.

graph-cli: The new manifest field will be accepted and the WASM file uploaded to IPFS, as we do for mappings (a sketch of the manifest addition appears below). No special validation will be done for the MVP.

graph-ts: No changes required.
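For reference, a sketch of how the proposed manifest addition might look. The resolverModule subfield is the proposal from this issue, not an existing manifest field, and the file path is illustrative:

```yaml
# Sketch only: resolverModule is proposed here, not yet implemented.
schema:
  file: ./schema.graphql
  resolverModule: ./src/resolvers.wasm # compiled and uploaded like a mappings module
```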

Proposed Documentation Updates

The manifest and schema changes will be documented in the appropriate places. A new section is added to "Define a subgraph" called "Custom resolvers" which documents example usage of the feature.

Proposed Tests / Acceptance Criteria

It should be possible to write a comprehensive test for this feature as a graph-node unit test. We should also prototype one of the real use cases before merging.

Tasks

davekaj commented 5 years ago

Looks good on my end for what is required/desirable for the functionality!

lutter commented 5 years ago

@leodasvacas looks good to me. Custom resolvers require that we do a post-processing step after we query the database, i.e., we transform the user's query into one that only mentions data we actually store, get that data from the database, and then execute a resolver over that data that runs user-supplied functions.

One other thought: this functionality doesn't really change what a subgraph is, but with the current design, changing a resolver function would result in an entirely new subgraph. What about defining these functions in the query itself? What I am not clear on for that is how the user would get WASM code into their queries.

lutter commented 5 years ago

One other thing: we could also go with something like this:

type Entity {
    attr1, .., attrN
    field: Int! @resolvedBy(func: "resolverName") 
}

and then in queries have users write

entities {
  attr1
  field { attr2 attr3 }
}

which would cause the function for field to be called with a JSON value representing { attr2 attr3 }.

The query we run to fetch data from the store would simply be

entities {
  attr1 attr2 attr3
}

The thing that's a little mindbending is that the parent object for resolving { attr2 attr3 } would be the entity, i.e. descending into field does not change the parent object as it does for other relations. It would make it possible to pass the result of more complex subqueries into the resolver function though.

In general, it would be good to disallow calling store.get in the resolver function since that can easily lead to N+1 query problems; I think allowing subqueries as arguments to user-supplied functions would greatly reduce the need for store.get.
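Under this alternative, the resolver would receive the subquery result as a JSON-like value rather than a typed entity. A rough sketch, assuming graph-ts's JSONValue and Value types and invented field logic:

```typescript
import { JSONValue, Value } from "@graphprotocol/graph-ts";

// Sketch: the resolver for `field` receives { attr2 attr3 } as a JSON value.
export function resolverName(args: JSONValue): Value {
  let obj = args.toObject();
  // Invented logic: combine the two subquery fields into one result.
  let attr2 = obj.get("attr2")!.toBigInt();
  let attr3 = obj.get("attr3")!.toBigInt();
  return Value.fromBigInt(attr2.plus(attr3));
}
```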

leoyvens commented 5 years ago

@lutter Thanks for the review!

Custom resolvers require that we do a post-processing step after we query the database

Yes, just like GraphQL coercions; how big a change this will be depends on whether we'll have coercions in the first implementation of SQL resolvers or not. I left a comment on the SQL resolvers issue about this.

Also, I realized a flaw in my implementation plan (I'll update it): we can't just re-use the result map, because the inputs to the resolver may well not be present in the query, so we'll need to modify the query to include those inputs. This complication makes me wonder if we should block this on SQL query combination.

User-supplied functions can be slow

Good point, we should differentiate the time taken by custom resolvers when logging the query time.

It would be better to not make the arguments to user-supplied functions the entire entity.

I totally agree; I'll update the design to include the args parameter. Do you think it can be optional, so that by default we pass the entire entity? In that case it could be added after the initial implementation. This does relate to the sub-entity question; see the end of this comment.

This functionality doesn't really change what a subgraph is

I agree that adding a custom-resolved field is simply a convenience and is 100% backwards compatible for a subgraph. However, this does change the capabilities of the subgraph and should change the ID. It's a waste to re-index the subgraph because of this, but there is a more general solution, which is data source de-duplication.

What about defining these functions in the query itself?

Consumers can already do whatever calculation they want on top of the data; if this were moved into queries, the re-use and abstraction value would be lost.

and then in queries have users write

I don't see the value in having inputs written in queries when the query doesn't have any control over them.

I think allowing subqueries as arguments to user-supplied functions would greatly reduce the need for store.get.

That would be a nice optimization, but I think we can do it backwards-compatibly by having store.get return a pre-fetched object rather than hitting the store.

leoyvens commented 5 years ago

Conclusions from our meeting on this (@Jannis @lutter do say if I forgot an important point):

I've updated the plan and tried to break up the estimates more, and give myself more time. The detail level in the tasks is not great but I'm at the limit of my capacity to plan and estimate ahead.

leoyvens commented 5 years ago

A design issue that wasn't covered in the plan is what ABIs the resolver module will have access to for the contract calls. I see two options:

  1. The resolver specifies its own ABIs. I'm not a fan of this because it puts Ethereum-specific information in the schema section, and that ABI is probably duplicated from a data source anyway. The manifest would look like:

    schema:
      file: ./schema.graphql
      ethereumAbis:
        - name: Foo
          file: ./abis/Foo.json
  2. The resolver can implicitly access any ABI defined in the manifest. This is my preference: it doesn't require changes to the manifest, and the resolver can import any generated contract class and it will just work. The concern is the case where two ABIs with the same name point to different files. We should detect and warn on this case; it doesn't seem worth supporting, and it's plausible that in the future we'll change the manifest structure so that all ABIs are in fact in one place.

davekaj commented 5 years ago

I believe we are prioritizing this for the compound subgraph. @Jannis @leoyvens

I imagine this will have to get queued up for engineering priorities?

davekaj commented 5 years ago

@Zerim Did you have thoughts on how to implement this in the past?

Zerim commented 5 years ago

As I mentioned in Slack my main interest here was that the interface for custom resolvers be implemented in the client-side query engine rather than adding additional complexity to the graph node.

I think the resolvers could be written in WASM, but they could also simply be written in JavaScript (ES5), since they will be running on the end user's machine and that is what the query engine is transpiled to.

This would be related to the work of decoupling the query engine from the graph node and making the main query interface for the graph node use the read interface that @lutter has been designing for the hybrid network.

I only scanned the conversation above, but one piece of feedback is that I would like to keep the data model (GraphQL schema) free from concerns around how those fields are computed, indexed, etc. That information should go elsewhere in the manifest or mappings.

un7c0rn commented 3 years ago

Following up from a Discord thread, this seems very useful. The case we have is as follows:

Royalties are deposited into a contract address over a time period. There are multiple rightsholders entitled to these royalties, each in proportion to the number of ERC-20 tokens they hold.

When a rightsholder attempts to withdraw royalties, the computation is essentially:

withdrawal_amount = sum_over_time(tokens_received - tokens_sent) * sum(royalties) / total_number_of_tokens

This computation is infeasible on-chain, so instead we use a graph node to index the transfer events. The problem is that there's no native support in GraphQL to perform the filtered sum operation. Implementing this in the form of a client-side query engine means we need a third component in the architecture, e.g. an off-chain service just to execute the query, when really what we want is to execute the query against the node and have the query-time computation done there.
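A sketch of what that query-time computation could look like over the indexed transfer events. The entity shape and parameter names are hypothetical:

```typescript
// Hypothetical shape of an indexed ERC-20 transfer event.
interface Transfer {
  from: string;
  to: string;
  amount: number;
}

// withdrawal_amount = sum_over_time(received - sent) * sum(royalties) / total tokens
function withdrawalAmount(
  transfers: Transfer[],  // all indexed transfers of the token
  holder: string,         // the rightsholder's address
  totalRoyalties: number, // royalties deposited over the period
  totalTokens: number     // total number of tokens
): number {
  let net = 0;
  for (const t of transfers) {
    if (t.to === holder) net += t.amount;   // tokens received
    if (t.from === holder) net -= t.amount; // tokens sent
  }
  return (net * totalRoyalties) / totalTokens;
}
```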

reubenr0d commented 1 year ago

+1, this feature would be really useful, but this issue seems stale. Are there any plans to implement this?