freiksenet opened this issue 9 years ago
Discussed this with @dschafer in IRC a bit, he asked to provide some concrete examples of what we would like to be able to do with executor.
As I understood, the current executor was made with the assumption that it's okay to make multiple round trips, so it's built from the top down. If `resolve` is defined on some lower-level fields, they get access to the parent result and can query the backend once more. One can say the current executor does a map over the tree, calling the appropriate `resolve` function on each node.
In our system, we assumed that extra round trips are bad and that, as our system allows it, we would like to express a GraphQL query with one backend query when possible. We also figured we could do optimizations along the way, like requesting only the selection-set fields in the query. Our executor collected the results of what is similar to `resolve` in graphql-js and then combined them together. One can say we did a reduce over the tree, calling the `resolve` function along the way and then combining the results.
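The map-versus-reduce distinction can be sketched as a plain tree walk. `executeMap`, `executeReduce`, and the node shape here are hypothetical illustrations, not graphql-js APIs:

```javascript
// Map-style (top-down): each node resolves using its parent's result,
// so every level can trigger another backend round trip.
function executeMap(node, parentResult, resolve) {
  const result = resolve(node, parentResult);
  for (const child of node.children) {
    result[child.name] = executeMap(child, result, resolve);
  }
  return result;
}

// Reduce-style (bottom-up): children produce partial results first,
// and the parent combines them into one value (e.g. one backend query).
function executeReduce(node, resolve, combine) {
  const childResults = node.children.map(c => executeReduce(c, resolve, combine));
  return combine(resolve(node), childResults);
}
```

The key difference is when the backend is consulted: the map variant may hit it at every node, while the reduce variant can defer until the whole tree has been combined.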
Let's imagine this schema:

```graphql
type User {
  id: String!
  name: String
  posts: [Post]
}

type Post {
  id: String!
  text: String
  createdAt: DateTime
  author: User
}
```
We assume we have query fields to get a `User` and a `Post` by id. The resolve for one of them would look something like:
```js
resolve(parent, {id}, {db, conn}) {
  return db.table('User').get(id).run(conn); // Promise
}
```
For relations (like `author` or `posts`) we'll have something like:
```js
resolve(parent, args, {db, conn}) {
  return db.table('Posts').get(parent.author).run(conn);
}
```
This is approximately how we do it. Here we would need some abstraction which can later be converted to a database query. We call it `Query` (duh); the main idea is that it's easily convertible to an actual database request.
```js
resolve(parent, {id}, {db, conn}) {
  let baseQuery = new Query({
    selector: new IDSelector({id: id, tableName: tableName}),
  });
  // This calls resolve on each child and returns their results (queries)
  let childrenResults = this.resolveChildren(baseQuery);
  // Combines the queries together
  let finalQuery = combineQueries(baseQuery, childrenResults);
  // We could actually just return the query and run it at later stages
  return finalQuery.run(db, conn);
}
```
Relation resolves would be like this:
```js
resolve(baseQuery) {
  let newQuery = new Query({
    selector: new RelatedSelector({
      tableName: tableName,
      relatedField: fieldName,
    }),
  });
  let childrenResults = this.resolveChildren(newQuery);
  let finalQuery = combineQueries(newQuery, childrenResults);
  return baseQuery.addJoin(fieldName, finalQuery);
}
```
One can also apply some optimizations by having custom resolves for scalars:
```js
resolve(baseQuery) {
  return baseQuery.addToSelect(fieldName);
}
```
What I propose is to add a `reduce` method to fields, in addition to `resolve`. The default implementation of `reduce` would do what the executor does now: take the results of the children's `resolve` calls and put them in the appropriate places in the object. Simplified example:
```js
reduce(ownResult, childrenResultMap) {
  for (let key of childrenResultMap.keys()) {
    ownResult[key] = childrenResultMap.get(key);
  }
  return ownResult;
}
```
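To make the proposal concrete, here is a toy executor (a sketch under simplified assumptions, not graphql-js's actual execution code) that calls `resolve` top-down and `reduce` bottom-up; the `field.children` map is a stand-in for real selection-set handling:

```javascript
// The default reduce mirrors the snippet above: merge child results
// into the parent's own result.
const defaultReduce = (ownResult, childrenResultMap) => {
  for (const [key, value] of childrenResultMap) {
    ownResult[key] = value;
  }
  return ownResult;
};

// Resolve each field, recurse into its children, then hand the child
// results to the field's reduce (or the default one).
function execute(field, parent, args, context) {
  const ownResult = field.resolve(parent, args, context);
  const childrenResultMap = new Map();
  for (const [name, child] of Object.entries(field.children || {})) {
    childrenResultMap.set(name, execute(child, ownResult, {}, context));
  }
  return (field.reduce || defaultReduce)(ownResult, childrenResultMap);
}
```

A custom `reduce` at any node could instead combine child queries into one backend request, which is exactly the hook the proposal asks for.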
This would convert what we had in version 1 to:
```js
// query field
resolve(parent, {id}) {
  return new Query({
    selector: new IDSelector({id: id, tableName: tableName}),
  });
}

reduce(parent, ownResult, childrenResultMap, {conn}) {
  return combineQueries(ownResult, childrenResultMap).run(conn);
}
```
```js
// join
resolve(parent, {id}) {
  return new Query({
    selector: new RelatedSelector({
      tableName: tableName,
      relatedField: fieldName,
    }),
  });
}

reduce(parent, ownResult, childrenResultMap) {
  return parent.addJoin(
    combineQueries(ownResult, childrenResultMap)
  );
}
```
The idea of a reducing executor is appealing, but I think any reducing executor will have to make assumptions about the domain-specific structure of the application.
An alternative I am using is query collapsing. It kind of achieves the same goal but without making the spec too opinionated. Basically, a resolver sends its query to a custom query-merger module and listens for an event on the merger before resolving the promise.
The query merger is configured to reduce round trips to the DB and waits a specified amount of time to receive more queries to merge before sending them to the database.
For example, all the mapping-executor examples you provided assume that a relationship actually lives in a different location and thus requires a query to be resolved, whereas if I am using a document-based datastore I might not need that query.
I don't think it makes any assumptions; I was just showing how it solves the problem in my domain. In your alternative solution, the reducer could collect individual queries in the reduce step and execute them all together at the end. Similarly, the executor doesn't dictate where you add `reduce` and `resolve`, so if you keep your data inline, you just won't need to add them.
I agree with this approach, and it seems general enough to be included in the spec. It's still useful with document stores for limiting the requested fields.
Using a debounced merger seems like a very specific solution for multithreaded or evented architectures, and has implications on latency. In addition, the query flow becomes much more difficult to reason about.
I'm interested in how FB approaches this; you probably have something figured out already.
There's a little bit of discussion about this going on in #19 too, might help to merge the discussion somewhere.
@pushrax The debounced merger I suggested is definitely not a solution that I am suggesting for the spec. I just noted how I am solving this particular issue in the context of my implementation.
The problem I see with the reducing solution above is that it assumes we can always find a way to reduce a field selection into smaller queries. I am just saying that the reduce part will always be domain specific and might not be a good fit for the spec.
My vote would be to give the resolver functions enough metadata about the context of the selection so that this kind of implementation is possible.
> The problem I see with the reducing solution above is that it assumes we can always find a way to reduce a field selection into smaller queries. I am just saying that the reduce part will always be domain specific and might not be a good fit for the spec.
I fear a similar outcome, but I'm interested in what could become a more general utility to provide.
Right now we offer the field AST as one of the arguments to the `resolve` function, but I agree this is not enough to implement an effective reducer. It doesn't make fragment resolution or sub-field merging easy to manage.
Any reducer needs to support this case:

```graphql
{
  me {
    ...friendNames
    ...friendBirthdays
  }
}

fragment friendNames on User {
  friends {
    name
  }
}

fragment friendBirthdays on User {
  friends {
    birthday
  }
}
```
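The tricky part is that both fragments select the same `friends` field, so any reducer first has to merge those selection sets before planning a query. A simplified sketch of that merge, ignoring directives, aliases, and nested fragment spreads:

```javascript
// Collect fields per response key, merging duplicate selections the
// way a collectFields-style pass must for the query above: both
// fragments' `friends` selections end up under one key.
function mergeSelections(selectionSets) {
  const merged = new Map();
  for (const selections of selectionSets) {
    for (const field of selections) {
      if (!merged.has(field.name)) {
        merged.set(field.name, {name: field.name, children: []});
      }
      if (field.children) {
        merged.get(field.name).children.push(field.children);
      }
    }
  }
  // Recursively merge the collected child selection sets as well.
  for (const field of merged.values()) {
    field.children = field.children.length
      ? [...mergeSelections(field.children).values()]
      : [];
  }
  return merged;
}
```

After merging, `friends` carries both `name` and `birthday`, so a reducer can plan a single query for it rather than one per fragment.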
Once reduced, the query executor is still going to want to use a top-down descent into the result so that it can ensure the returned result adheres to the type system contract: that non-nullables are in fact not null, that scalars are of the correct types, and that no additional fields have leaked through.
Chatted with @leebyron just now.
It seems pretty clear that we're missing something here to make this case easier. There seem to be a couple of approaches that might work:
My instincts personally run towards the collect step, but I think the way we'll get to the ideal solution here is by experimentation and trying things out.
So I'd suggest you take whichever option makes the most sense to you and build it as a temporary extension to the core right now. For example, if you wanted to build out the "collect" pass, it could be implemented as a temporary step prior to execution that uses the `TypeInfo` and `visitor` abstractions in the core to construct a "collected data" object, which would be passed into the `execute` function as the root object. So you wouldn't use the built-in `graphql` function, but instead build one that does your extension; for example, it might parse, validate, collect, then execute.
By building out a solution as a temporary extension to the core, we'll be able to figure out which of the options ends up being easiest and cleanest, and we'll see if different implementations end up requiring similar techniques. Once we know that, we'll hopefully discover the general piece that we're missing, and figure out the right way to add it.
Super excited to see how this development proceeds, I think seeing what the temporary extension looks like will make it really clear what the right changes to make to the core are!
@dschafer I don't quite understand your idea about collect. What exactly would "collecting children" imply?
@leebyron I don't see a problem with the reducer and fragments; fragments seem to be resolved to a flat field map at the `collectFields` stage, so for the reducer there would be no difference between having fragments or just getting both fields 'directly'.
And yes, `reduce` would be domain specific, exactly like `resolve` is domain specific. They are just tools you can use to customize the way the executor works. I've given an example of a generic/default `reduce`, which does exactly what `executeFields` does after executing the field.
I'll try to hack implementation to illustrate what I mean, I think it'd be easier :)
Ah, I think I get what @dschafer means. That would work too.
Riiight, I was wrong about `collectFields` and fragments. Now I get what @leebyron was talking about. Back to the drawing board.
Hey all, we're opening up a community slack to have real time discussion about GraphQL. This is probably an interesting topic to continue on. https://graphql-slack.herokuapp.com
Has anybody made any progress on this? Is there any other place with a discussion related to this? Here's a specific example that might help push the conversation forward.
You have the following types, Client, Project, and Task, with the obvious relations and resolvers in place.
```graphql
type Client {
  id: String!
  name: String
  projects: [Project]
}

type Project {
  id: String!
  name: String
  client: Client
  tasks: [Task]
}

type Task {
  id: String!
  name: String
  project: Project
}
```
From what I've seen so far of the way resolvers are written, a query like the one below
```graphql
{
  client(id: 1) {
    id
    name
    projects {
      id
      name
      tasks {
        id
        name
      }
    }
  }
}
```
would generate the following queries to the backend (DB):
```sql
select id, name from clients where id = 1;         -- resolver A
select id, name from projects where client_id = 1; -- resolver B
select id, name from tasks where project_id = 1;   -- resolver C
select id, name from tasks where project_id = 2;   -- resolver C
select id, name from tasks where project_id = 3;   -- resolver C
...
```
The major problem is the fact that "resolver C" is called multiple times, once for each "project" item that was returned by "resolver B".
All the queries above can be "reduced" to the one below (PostgreSQL specific).
```sql
select
  "clients"."id",
  "clients"."name",
  coalesce(
    (
      select array_to_json(array_agg(row_to_json("projects")))
      from (
        select
          "projects"."id",
          "projects"."name",
          coalesce(
            (
              select array_to_json(array_agg(row_to_json("tasks")))
              from (
                select "tasks"."id", "tasks"."name"
                from "tasks"
                where "tasks"."project_id" = "projects"."id"
              ) "tasks"
            ),
            '[]'
          ) as "tasks"
        from "projects"
        where "projects"."client_id" = "clients"."id"
      ) "projects"
    ),
    '[]'
  ) as "projects"
from "clients"
where "clients"."id" = '1'
```
@ruslantalpa If you use smart batching with something like dataloader, what you'll actually see is this:
```sql
select id, name from clients where id = 1;                -- resolver A
select id, name from projects where client_id = 1;        -- resolver B
select id, name from tasks where project_id in (1, 2, 3); -- resolver C
```
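The batching behaviour can be reproduced without dependencies. This is a stripped-down sketch of DataLoader's scheduling idea (not its actual implementation): collect the keys requested in the same tick, then invoke one batch function for all of them:

```javascript
// Collects every key requested in the same microtask tick and calls
// batchFn once with the whole set, resolving each caller's promise
// with the matching result, similar to DataLoader's batch scheduling.
function makeBatchLoader(batchFn) {
  let queue = [];
  return function load(key) {
    return new Promise((resolve) => {
      queue.push({key, resolve});
      if (queue.length === 1) {
        // Flush after the current tick so sibling resolvers can enqueue.
        Promise.resolve().then(async () => {
          const batch = queue;
          queue = [];
          const results = await batchFn(batch.map(item => item.key));
          batch.forEach((item, i) => item.resolve(results[i]));
        });
      }
    });
  };
}
```

With this, the three `tasks` resolver calls for projects 1, 2, and 3 would hit the batch function once, which can then issue a single `WHERE project_id IN (1, 2, 3)` query.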
The merging you speak of only makes sense if all of these resolvers actually fire a query to the sql database, and the execution planner has no way of knowing this at the moment. It is however possible to do this optimization 'manually' in your resolve function for client or project, because you have access to the AST. So instead of just fetching the client and then passing that to the project field, you could run the entire query with JOIN and pass the JSON object down, in which case your project and tasks resolvers won't have to run any queries any more. The disadvantage now is that your resolvers become closely tied to each other, which makes it harder and harder to change the schema without a lot of effort. In my opinion, it's an optimization that should only be used as a last resort.
What I described above might be a bit faster, but it might also not be. You can't really tell until you try this out with different parameters. If you're thinking about doing that comparison, I'd be very interested in seeing the results. I've been wanting to do something like that myself, but I have yet to find the time for it.
If someone else has time and interest in doing such a comparison, just let me know and I can tell you about the experimental setup that I had in mind for it.
@ruslantalpa Thanks for the example. I'm actively working on this, at my current rate it will take me several weeks to reformulate #304 to address the excellent feedback received.
The resolving algorithm is quite elegant, I don't feel constrained by it. I DO think that the resolve interface is focused on the wrong abstraction. The parameters passed belong to the world of the query. I believe the parameters passed to the resolve should belong to the world of the schema. The difference is subtle.
Think of it this way. What if one wanted to have a completely different query language, but use the current GraphQL schema and resolve process? The current resolve implementation is tightly coupled to the syntax of the query. This is an obstacle to a vibrant ecosystem. The coupling needs to be attacked. After the parameter impedance mismatch is eliminated, I believe it will take confusion off the table and unlock the path to new solutions for this problem.
So ... what would parameters to resolve look like that would allow you to easily generate that PostgreSQL query?
@helfer are you saying that your 3 queries are (or might be) faster than the one I provided? I guarantee they are not, simply because your 3rd query has to wait for the 2nd to complete (and this is a trivial example with only 2 sequential steps; most of the time there will be more). Something like dataloader is not going to beat the PG query planner. About doing the optimisation in the resolver: I don't think what you are suggesting is possible. Even if you fetch all the data with a top-level resolver, there is no way to "pass it down", unless you are talking about each resolver first checking some global "cache object" to see if the data was already fetched, and that breaks down if different parts of the AST tree fetch the same object but with a different set of fields.
@JeffRMoore I am not sure how the parameters to resolvers will help solve this problem, but I haven't given enough time to what you said (nor to your PR), so you might be right. I think the main problem is that for each query there is a fixed number of resolvers that need to be executed (equal to the number of nodes in the AST), and you can't combine/merge those resolvers while leaving the number of nodes the same. What (I think) needs to happen is to allow the possibility of another function "traversing" the AST tree prior to executing it and "collapsing" the resolvers, i.e. creating custom optimised resolvers on the fly (maybe even by composing the original resolvers) and attaching them to specific nodes in the AST. A good start for that would be changing this line from

```js
const resolveFn = fieldDef.resolve || defaultResolveFn;
```

to

```js
const resolveFn = fieldAST.resolve || fieldDef.resolve || defaultResolveFn;
```

Sure, there are other places that need changes, but this is the core of the idea: there needs to be a way to create "resolvers" on the fly and attach them to the AST. This alone might allow other functions to dynamically reduce the number of resolvers that need execution, in places that permit it (all going to the same backend).
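That per-node override could be prototyped outside the executor as a pre-execution pass. This is a hypothetical sketch; graphql-js AST nodes do not actually carry a `resolve` property, and the node shape here is simplified:

```javascript
// Walk a simplified AST and attach on-the-fly resolvers to nodes an
// optimizer has decided to collapse. The executor would then prefer
// the node's resolver over the field definition's.
function attachResolvers(astNode, optimizer) {
  const optimized = optimizer(astNode);
  if (optimized) {
    astNode.resolve = optimized;
  }
  for (const child of astNode.selections || []) {
    attachResolvers(child, optimizer);
  }
  return astNode;
}

// Mirrors the proposed lookup order:
// fieldAST.resolve || fieldDef.resolve || defaultResolveFn
function pickResolver(fieldAST, fieldDef, defaultResolveFn) {
  return fieldAST.resolve || fieldDef.resolve || defaultResolveFn;
}
```

An optimizer that recognizes a collapsible subtree could attach one prefetching resolver at its root and no-op resolvers below it.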
That `array_to_json` thing is quite specific to your use case, though, and it presupposes that everything can be converted meaningfully to JSON.
In the general case, for that sort of to-many relationship, most ORMs even (with full knowledge of the query) use a subquery load strategy, which does follow @helfer's strategy of issuing multiple queries to the database, and doing the join in-memory after receiving the database rows, to avoid the cartesian explosion issues from actually running this query in the straightforward way with a join.
This isn't to say that the idea of the reducing executor isn't useful – it's just that it's less useful than you might think, even in the context of e.g. talking directly to a SQL database, because the general approach for handling to-many relationships is to use an extra query anyway, which is exactly what you get out of the DataLoader.
tl;dr waterfalled loads with DataLoader closely approximate subquery load strategies for relationships in SQL ORMs, which most ORMs use by default for to-many relationships anyway, so the bar for a reducing executor being useful in at least that context is higher than you'd initially think
@ruslantalpa You're probably right that three queries won't be faster than the one joined on in most cases, but I'm not as convinced as you are that it's impossible. There are definitely some cases that I'd want to test for.
That's not really the main point though. The main point is that any difference might not matter at all. As @taion said, ORMs work in a very similar way, and for many (if not most) use cases, they are completely appropriate. The reason people use ORMs is the same reason I would suggest sticking to making multiple queries in the first implementation: simplicity. Even if it is slower, I wouldn't want to make my GraphQL server more complicated by optimizing, unless I really had to. Anything else seems like premature optimization to me.
You can definitely pass the data down from the top level resolvers. The first argument to the resolve function of a field is simply what the parent's resolve function returned. If your root resolve function returns an object that has the exact shape of the data queried for, then you don't even need to define your other resolvers. Does that make sense?
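A small sketch of that point: graphql-js's default resolver is essentially a property lookup on the parent's return value, so a root resolver that returns the full query shape makes the child resolvers unnecessary. The data below is hard-coded for illustration; in practice it would come from one joined query:

```javascript
// Roughly what graphql-js does when a field has no resolver: read the
// property off the parent object (calling it if it's a function).
const defaultFieldResolver = (source, args, context, info) =>
  typeof source[info.fieldName] === 'function'
    ? source[info.fieldName](args, context)
    : source[info.fieldName];

// A client resolver that returns the whole joined shape up front.
function resolveClient(parent, {id}) {
  // Hard-coded here; in practice one joined SQL query would produce this.
  return {
    id,
    name: 'ACME',
    projects: [{id: 10, name: 'Website', tasks: [{id: 100, name: 'Design'}]}],
  };
}
```

Because `projects` and `tasks` are already present on the returned objects, default resolution walks them without any further backend queries.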
I don't think the goal is necessarily to reduce the number of resolve calls; it's to reduce the number of backend queries generated by calling the resolve functions.
@helfer
The subquery load is actually better than a straight join, because otherwise you end up fetching a bunch of duplicate rows. Without doing dumb JSON tricks, that query above becomes:
```sql
SELECT
  clients.id AS client_id,
  clients.name AS client_name,
  projects.id AS project_id,
  projects.name AS project_name,
  tasks.id AS task_id,
  tasks.name AS task_name
FROM clients
LEFT OUTER JOIN projects
  ON projects.client_id = clients.id
LEFT OUTER JOIN tasks
  ON tasks.project_id = projects.id
```
This is rather less nice than the subquery load with the join done by the SQL client.
SQL ORMs don't do subquery loads over joined loads for to-many relations because it's easier – they do subquery loads over joined loads because it avoids making the database return a lot of extra data to you.
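The extra data is easy to quantify. With one client, P projects, and T tasks per project, a joined load returns P·T wide rows that repeat the client and project columns on every row, while subquery loads return three narrow result sets. A back-of-the-envelope sketch (the column counts are made-up parameters, not measurements):

```javascript
// Total cells transferred, assuming c/p/t columns selected from the
// clients/projects/tasks tables respectively.
function joinedCells(P, T, c, p, t) {
  // Every one of the P*T rows carries client + project + task columns.
  return P * T * (c + p + t);
}

function subqueryCells(P, T, c, p, t) {
  // Three separate result sets: 1 client row, P project rows, P*T task rows.
  return c + P * p + P * T * t;
}
```

For, say, 10 projects with 20 tasks each and 2 columns per table, the joined load moves roughly three times as many cells, which is the duplication the subquery strategy avoids.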
@taion Yeah, that's a good point. I've thought about that before, but I wasn't sure if database engines do some optimization by compressing the duplicate data. I would have guessed that they must be doing that if it's a performance concern.
I'm going off of http://docs.sqlalchemy.org/en/latest/orm/loading_relationships.html#what-kind-of-loading-to-use, which suggests that (at least for Python) this sort of thing isn't optimized away, most of the time.
@taion Thanks for the link, that's a great resource! It also touches into another aspect which we haven't even mentioned here: you could cache the nodes not just on a per-request basis, but across requests. We've been looking into doing that in Apollo.
@helfer you are right about passing data down (looking into it). Using that, one could create "smarter" resolvers that prefetch data for their children (and with that eliminate the need to reduce the number of resolvers). I don't think the coupling is a problem, since it's dictated by the problem being modeled (the client-project relation is not going to change).
@JeffRMoore reducing the number of resolvers would have the same effect, but you are right, strictly speaking the goal is to reduce the number of backend trips.
@taion I don't want to start a flame war about ORMs :) but just because they do it that way does not mean they are doing it the right way. If you look more closely at my query you'll see there is no "cartesian explosion"; the shape of the response is EXACTLY the shape GraphQL will return. The fact that it uses functions like array_to_json has nothing to do with the executor or the generality this implementation needs to maintain; that is handled by my resolvers anyway, so I am taking advantage of the features the backend (PostgreSQL) offers me. The reason ORMs do it that (suboptimal) way is that they target all databases, so they use only features available in all of them (standard SQL). But since I am using only one backend (as does everybody using an ORM, actually), I am going to take advantage of the features it offers (and throw away the broken abstraction).
In conclusion, first of all I would like to apologize for derailing the discussion a bit. While in the context of an SQL backend what @helfer suggested solves the problem of optimising the backend trips, there might be situations where this is useful. What are your thoughts on providing a way to "reduce" the number of resolvers (that generate backend requests) by allowing resolvers to exist independently of type definitions and be attached to the AST nodes? This would give an outside function a way to control the "execution" phase to some extent.
@ruslantalpa
Even in the PostgreSQL context, your example only works if you're okay with all of your data being coerced into JSON. It's not generally appropriate, and you're also making the database do more work with type coercion. If any of those are e.g. `TIMESTAMP` fields or whatever, then the solution you have doesn't work. Your choice of query there isn't broken per se, but even targeting a single database backend, it's not generically applicable, in that it requires specific assumptions about JSON serializability.
That doesn't mean that I think this sort of up-front planning is a bad idea, necessarily – the bar just has to be higher than "it forces me to do the equivalent of a waterfalled subquery load".
@taion It works in all cases, since GraphQL has to return JSON, so the data needs to be serialisable, and coercion in the DB is always faster than in any scripting language. Let's move this discussion to Slack if you are interested.
That's only in the case where you don't need to do anything with that value on the database client application. Again, I'm not saying that what you have doesn't work for your use case – but it's not a generic solution.
I don't object to the idea of having support for this sort of thing; just that the scope is necessarily limited, and the value-add over DataLoader for general use cases has limited scope – not zero scope, but not full scope either.
I like the idea of a reducing executor.
I'm running into a similar problem when trying to optimize the queries to the DB when querying through GraphQL.
I got a way to make it work with graphene, but it feels hacky.
I don't believe I followed the entire reducer discussion above. I'm going to try to work through some of the examples mentioned in the links above and see how far I get, but can anybody report back on how the explorations mentioned above worked out?
Or does anybody have anything more fleshed out, updated for 2021?
Possibly related.... #3303
Hello!
In our system we took a GraphQL query and translated it to ReQL (the RethinkDB query language) for execution. This allowed us to optimize the query, e.g. use built-in joins in ReQL instead of querying the database twice. I can see how a similar system could be used for, e.g., SQL queries. We also did some further optimizations, like returning only the fields that are requested by GraphQL, reducing the response payload size for tables with many keys.
I see a way to do this in graphql-js by manually going through selection sets in resolve, but that feels very clumsy and hacky, as the library does this walking already. Some ways to do it would be to allow children's `resolve` to modify the root, or to split the collect/resolve/complete steps into independent parts that can be used separately. Do you think deferred execution and query translation is a good idea? Should we stick to the current model and do multiple requests to the database? Is there some solution to this that I don't see?
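For the narrower goal of requesting only the selected fields, walking the selection set from the resolver's info argument works today, if clumsily. A simplified sketch that ignores fragments, aliases, and directives (`fieldNode` mimics the shape of a graphql-js field AST node):

```javascript
// Extract the requested scalar field names from a (simplified) field
// AST node, to build e.g. a ReQL pluck(...) or a SQL SELECT list
// instead of fetching every column.
function requestedFields(fieldNode) {
  if (!fieldNode.selectionSet) return [];
  return fieldNode.selectionSet.selections
    .filter(sel => !sel.selectionSet) // keep scalar leaves only
    .map(sel => sel.name.value);
}
```

A resolver could then pass the result to the backend query builder, so a table with many keys only returns the columns the client asked for.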
Thanks!