freiksenet opened this issue 9 years ago
Discussed this with @dschafer in IRC a bit, he asked to provide some concrete examples of what we would like to be able to do with executor.
As I understood, the current executor was made with the assumption that it's okay to make multiple round trips, so it's built from the top down. If `resolve` is defined on some lower-level fields, they get access to the parent result and can query the backend once more. One can say the current executor does a map over the tree, calling the appropriate `resolve` function on each node.
In our system, we assumed that extra round trips are bad and that, as our system allows it, we would like to express a GraphQL query with one backend query when possible. We also figured we could do optimizations along the way, like requesting only the selection-set fields in the query. Our executor collected the results of what is similar to `resolve` in graphql-js and then combined them together. One can say we did a reduce over the tree, calling the `resolve` function along the way and then combining the results.
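The map-versus-reduce distinction can be sketched as a plain tree walk. `executeMap`, `executeReduce`, and the node shape here are hypothetical illustrations, not graphql-js APIs:

```javascript
// Map-style (top-down): each node resolves using its parent's result,
// so every level can trigger another backend round trip.
function executeMap(node, parentResult, resolve) {
  const result = resolve(node, parentResult);
  for (const child of node.children) {
    result[child.name] = executeMap(child, result, resolve);
  }
  return result;
}

// Reduce-style (bottom-up): children produce partial results first,
// and the parent combines them into one value (e.g. one backend query).
function executeReduce(node, resolve, combine) {
  const childResults = node.children.map(c => executeReduce(c, resolve, combine));
  return combine(resolve(node), childResults);
}
```

The key difference is when the backend is consulted: the map variant may hit it at every node, while the reduce variant can defer until the whole tree has been combined.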
Let's imagine this schema:

```graphql
type User {
  id: String!
  name: String
  posts: [Post]
}

type Post {
  id: String!
  text: String
  createdAt: DateTime
  author: User
}
```
We assume we have query fields to get a `User` and a `Post` by id. The resolve for one of them would look something like:
```js
resolve(parent, {id}, {db, conn}) {
  return db.table('User').get(id).run(conn); // Promise
}
```
For relations (like `author` or `posts`) we'll have something like:
```js
resolve(parent, args, {db, conn}) {
  return db.table('Posts').get(parent.author).run(conn);
}
```
This is approximately how we do it. Here we would need some abstraction which can later be converted to a database query. We call it `Query` (duh); the main idea is that it's easily convertible to an actual database request.
```js
resolve(parent, {id}, {db, conn}) {
  let baseQuery = new Query({
    selector: new IDSelector({id: id, tableName: tableName}),
  });
  // This calls resolve on each child and returns their results (queries)
  let childrenResults = this.resolveChildren(baseQuery);
  // Combines the queries together
  let finalQuery = combineQueries(baseQuery, childrenResults);
  // We could actually just return the query and run it at later stages
  return finalQuery.run(db, conn);
}
```
Relation resolves would be like this:
```js
resolve(baseQuery) {
  let newQuery = new Query({
    selector: new RelatedSelector({
      tableName: tableName,
      relatedField: fieldName,
    }),
  });
  let childrenResults = this.resolveChildren(newQuery);
  let finalQuery = combineQueries(newQuery, childrenResults);
  return baseQuery.addJoin(fieldName, finalQuery);
}
```
One can also apply some optimizations by having custom resolves for scalars:
```js
resolve(baseQuery) {
  return baseQuery.addToSelect(fieldName);
}
```
What I propose is to add a `reduce` method to fields, in addition to `resolve`. The default implementation of `reduce` would do what the executor does now: take the results of the children's `resolve` calls and put them in the appropriate places in the object. Simplified example:
```js
reduce(ownResult, childrenResultMap) {
  for (let key of childrenResultMap.keys()) {
    ownResult[key] = childrenResultMap.get(key);
  }
  return ownResult;
}
```
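To make the proposal concrete, here is a toy executor (a sketch under simplified assumptions, not graphql-js's actual execution code) that calls `resolve` top-down and `reduce` bottom-up; the `field.children` map is a stand-in for real selection-set handling:

```javascript
// The default reduce mirrors the snippet above: merge child results
// into the parent's own result.
const defaultReduce = (ownResult, childrenResultMap) => {
  for (const [key, value] of childrenResultMap) {
    ownResult[key] = value;
  }
  return ownResult;
};

// Resolve each field, recurse into its children, then hand the child
// results to the field's reduce (or the default one).
function execute(field, parent, args, context) {
  const ownResult = field.resolve(parent, args, context);
  const childrenResultMap = new Map();
  for (const [name, child] of Object.entries(field.children || {})) {
    childrenResultMap.set(name, execute(child, ownResult, {}, context));
  }
  return (field.reduce || defaultReduce)(ownResult, childrenResultMap);
}
```

A custom `reduce` at any node could instead combine child queries into one backend request, which is exactly the hook the proposal asks for.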
This would convert what we had in version 1 to:
```js
// query field
resolve(parent, {id}) {
  return new Query({
    selector: new IDSelector({id: id, tableName: tableName}),
  });
}

reduce(parent, ownResult, childrenResultMap, {conn}) {
  return combineQueries(ownResult, childrenResultMap).run(conn);
}
```
```js
// join
resolve(parent, {id}) {
  return new Query({
    selector: new RelatedSelector({
      tableName: tableName,
      relatedField: fieldName,
    }),
  });
}

reduce(parent, ownResult, childrenResultMap) {
  return parent.addJoin(
    combineQueries(ownResult, childrenResultMap)
  );
}
```
The idea of a reducing executor is appealing, but I think any reducing executor will have to make assumptions about the domain-specific structure of the application.
An alternative I am using is query collapsing. It kind of achieves the same goal but without making the spec too opinionated. Basically, a resolver sends its query to a custom query-merger module and listens for an event on the merger before resolving the promise.
The query merger is configured to reduce round trips to the DB and waits a specified amount of time to receive more queries to merge before sending them to the database.
For example, all the mapping-executor examples you provided assume that a relationship actually lives in a different location and thus requires a query to be resolved, whereas if I am using a document-based datastore I might not need that query.
I don't think it makes any assumptions; I was just showing how it solves the problem in my domain. In your alternative solution, the reducer could collect individual queries in the reduce step and execute them all together at the end. Similarly, the executor doesn't dictate where you add `reduce` and `resolve`, so if you keep your data inline, you just won't need to add them.
I agree with this approach, and it seems general enough to be included in the spec. It's still useful with document stores for limiting the requested fields.
Using a debounced merger seems like a very specific solution for multithreaded or evented architectures, and has implications on latency. In addition, the query flow becomes much more difficult to reason about.
I'm interested in how FB approaches this; you probably have something figured out already.
There's a little bit of discussion about this going on in #19 too, might help to merge the discussion somewhere.
@pushrax The debounced merger I suggested is definitely not a solution that I am suggesting for the spec. I just noted how I am solving this particular issue in the context of my implementation.
The problem I see with the reducing solution above is that it assumes we can always find a way to reduce a field selection into smaller queries. I am just saying that the reduce part will always be domain specific and might not be a good fit for the spec.
My vote would be to give the resolver functions enough metadata about the context of the selection so that this kind of implementation is possible.
> The problem I see with the reducing solution above is that it assumes we can always find a way to reduce a field selection into smaller queries. I am just saying that the reduce part will always be domain specific and might not be a good fit for the spec.
I fear a similar outcome, but I'm interested in what could become a more general utility to provide.
Right now we offer the field AST as one of the arguments to the `resolve` function, but I agree this is not enough to implement an effective reducer. It doesn't make fragment resolution or sub-field merging easy to manage.
Any reducer needs to support this case:

```graphql
{
  me {
    ...friendNames
    ...friendBirthdays
  }
}

fragment friendNames on User {
  friends {
    name
  }
}

fragment friendBirthdays on User {
  friends {
    birthday
  }
}
```
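The tricky part is that both fragments select the same `friends` field, so any reducer first has to merge those selection sets before planning a query. A simplified sketch of that merge, ignoring directives, aliases, and nested fragment spreads:

```javascript
// Collect fields per response key, merging duplicate selections the
// way a collectFields-style pass must for the query above: both
// fragments' `friends` selections end up under one key.
function mergeSelections(selectionSets) {
  const merged = new Map();
  for (const selections of selectionSets) {
    for (const field of selections) {
      if (!merged.has(field.name)) {
        merged.set(field.name, {name: field.name, children: []});
      }
      if (field.children) {
        merged.get(field.name).children.push(field.children);
      }
    }
  }
  // Recursively merge the collected child selection sets as well.
  for (const field of merged.values()) {
    field.children = field.children.length
      ? [...mergeSelections(field.children).values()]
      : [];
  }
  return merged;
}
```

After merging, `friends` carries both `name` and `birthday`, so a reducer can plan a single query for it rather than one per fragment.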
Once reduced, the query executor is still going to want to use a top-down descent into the result so that it can ensure the returned result adheres to the type system contract: that non-nullables are in fact not null, that scalars are of the correct types, and that no additional fields have leaked through.
Chatted with @leebyron just now.
It seems pretty clear that we're missing something here to make this case easier. There seem to be a couple of approaches that might work:
My instincts personally run towards the collect step, but I think the way we'll get to the ideal solution here is by experimentation and trying things out.
So I'd suggest you take whichever option makes the most sense to you and build it as a temporary extension to the core right now. For example, if you wanted to build out the "collect" pass, it could be implemented as a temporary step prior to execution that uses the `TypeInfo` and `visitor` abstractions in the core to construct a "collected data" object, which would be passed into the `execute` function as the root object. So you wouldn't use the built-in `graphql` function, but instead build one that does your extension; for example, it might parse, validate, collect, then execute.
By building out a solution as a temporary extension to the core, we'll be able to figure out which of the options ends up being easiest and cleanest, and we'll see if different implementations end up requiring similar techniques. Once we know that, we'll hopefully discover the general piece that we're missing, and figure out the right way to add it.
Super excited to see how this development proceeds, I think seeing what the temporary extension looks like will make it really clear what the right changes to make to the core are!
@dschafer I don't quite understand your idea about collect. What exactly would "collecting children" imply?
@leebyron I don't see a problem with the reducer and fragments; fragments seem to be resolved to a flat field map at the `collectFields` stage, so for the reducer there would be no difference between having fragments or just getting both fields 'directly'.
And yes, `reduce` would be domain specific, exactly like `resolve` is domain specific. They are just tools you can use to customize the way the executor works. I've given an example of a generic/default `reduce`, which does exactly what `executeFields` does after executing the field.
I'll try to hack implementation to illustrate what I mean, I think it'd be easier :)
Ah, I think I get what @dschafer means. That would work too.
Riiight, I was wrong about `collectFields` and fragments. Now I get what @leebyron was talking about. Back to the drawing board.
Hey all, we're opening up a community slack to have real time discussion about GraphQL. This is probably an interesting topic to continue on. https://graphql-slack.herokuapp.com
Has anybody made any progress on this? Is there any other place with a discussion related to this? Here's a specific example that might help push the conversation forward.
You have the following types, Client, Project, and Task, with the obvious relations and resolvers in place.
```graphql
type Client {
  id: String!
  name: String
  projects: [Project]
}

type Project {
  id: String!
  name: String
  client: Client
  tasks: [Task]
}

type Task {
  id: String!
  name: String
  project: Project
}
```
From what I've seen so far of the way resolvers are written, a query like the one below
```graphql
{
  client(id: 1) {
    id
    name
    projects {
      id
      name
      tasks {
        id
        name
      }
    }
  }
}
```
would generate the following queries to the backend (DB):
```sql
select id, name from clients where id = 1;         -- resolver A
select id, name from projects where client_id = 1; -- resolver B
select id, name from tasks where project_id = 1;   -- resolver C
select id, name from tasks where project_id = 2;   -- resolver C
select id, name from tasks where project_id = 3;   -- resolver C
...
```
The major problem is the fact that "resolver C" is called multiple times, once for each "project" item that was returned by "resolver B".
All the queries above can be "reduced" to the one below (PostgreSQL specific).
```sql
select
  "clients"."id",
  "clients"."name",
  coalesce(
    (
      select array_to_json(array_agg(row_to_json("projects")))
      from (
        select
          "projects"."id",
          "projects"."name",
          coalesce(
            (
              select array_to_json(array_agg(row_to_json("tasks")))
              from (
                select "tasks"."id", "tasks"."name"
                from "tasks"
                where "tasks"."project_id" = "projects"."id"
              ) "tasks"
            ),
            '[]'
          ) as "tasks"
        from "projects"
        where "projects"."client_id" = "clients"."id"
      ) "projects"
    ),
    '[]'
  ) as "projects"
from "clients"
where "clients"."id" = '1'
```
@ruslantalpa If you use smart batching with something like dataloader, what you'll actually see is this:
```sql
select id, name from clients where id = 1;                -- resolver A
select id, name from projects where client_id = 1;        -- resolver B
select id, name from tasks where project_id in (1, 2, 3); -- resolver C
```
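The batching behaviour can be reproduced without dependencies. This is a stripped-down sketch of DataLoader's scheduling idea (not its actual implementation): collect the keys requested in the same tick, then invoke one batch function for all of them:

```javascript
// Collects every key requested in the same microtask tick and calls
// batchFn once with the whole set, resolving each caller's promise
// with the matching result, similar to DataLoader's batch scheduling.
function makeBatchLoader(batchFn) {
  let queue = [];
  return function load(key) {
    return new Promise((resolve) => {
      queue.push({key, resolve});
      if (queue.length === 1) {
        // Flush after the current tick so sibling resolvers can enqueue.
        Promise.resolve().then(async () => {
          const batch = queue;
          queue = [];
          const results = await batchFn(batch.map(item => item.key));
          batch.forEach((item, i) => item.resolve(results[i]));
        });
      }
    });
  };
}
```

With this, the three `tasks` resolver calls for projects 1, 2, and 3 would hit the batch function once, which can then issue a single `WHERE project_id IN (1, 2, 3)` query.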
The merging you speak of only makes sense if all of these resolvers actually fire a query to the sql database, and the execution planner has no way of knowing this at the moment. It is however possible to do this optimization 'manually' in your resolve function for client or project, because you have access to the AST. So instead of just fetching the client and then passing that to the project field, you could run the entire query with JOIN and pass the JSON object down, in which case your project and tasks resolvers won't have to run any queries any more. The disadvantage now is that your resolvers become closely tied to each other, which makes it harder and harder to change the schema without a lot of effort. In my opinion, it's an optimization that should only be used as a last resort.
What I described above might be a bit faster, but it might also not be. You can't really tell until you try this out with different parameters. If you're thinking about doing that comparison, I'd be very interested in seeing the results. I've been wanting to do something like that myself, but I have yet to find the time for it.
If someone else has time and interest in doing such a comparison, just let me know and I can tell you about the experimental setup that I had in mind for it.
@ruslantalpa Thanks for the example. I'm actively working on this, at my current rate it will take me several weeks to reformulate #304 to address the excellent feedback received.
The resolving algorithm is quite elegant, I don't feel constrained by it. I DO think that the resolve interface is focused on the wrong abstraction. The parameters passed belong to the world of the query. I believe the parameters passed to the resolve should belong to the world of the schema. The difference is subtle.
Think of it this way. What if one wanted to have a completely different query language, but use the current GraphQL schema and resolve process? The current resolve implementation is tightly coupled to the syntax of the query. This is an obstacle to a vibrant ecosystem. The coupling needs to be attacked. After the parameter impedance mismatch is eliminated, I believe it will take confusion off the table and unlock the path to new solutions for this problem.
So ... what would parameters to resolve look like that would allow you to easily generate that PostgreSQL query?
@helfer are you saying that your 3 queries are (or might be) faster than the one I provided? I guarantee they are not, simply because your 3rd query has to wait for the 2nd to complete (and this is a trivial example with only 2 sequential steps; most of the time there will be more). Something like dataloader is not going to beat the PG query planner. About doing the optimisation in the resolver: I don't think what you are suggesting is possible. Even if you fetch all the data with a top-level resolver, there is no way to "pass it down", unless you are talking about each resolver first checking some global "cache object" to see if the data was already fetched, and that breaks down if different parts of the AST tree fetch the same object but with a different set of fields.
@JeffRMoore I am not sure how the parameters to resolvers will help solve this problem, but I haven't given enough time to what you said (nor to your PR), so you might be right. I think the main problem is that for each query there is a fixed number of resolvers that need to be executed (equal to the number of nodes in the AST), and you can't combine/merge those resolvers while leaving the number of nodes the same. What (I think) needs to happen is to allow the possibility of another function "traversing" the AST tree prior to executing it and "collapsing" the resolvers, i.e. creating custom optimised resolvers on the fly (maybe even by composing the original resolvers) and attaching them to specific nodes in the AST. A good start for that would be changing this line from

```js
const resolveFn = fieldDef.resolve || defaultResolveFn;
```

to

```js
const resolveFn = fieldAST.resolve || fieldDef.resolve || defaultResolveFn;
```

Sure, there are other places that need changes, but this is the core of the idea: there needs to be a way to create "resolvers" on the fly and attach them to the AST. This alone might allow other functions to dynamically reduce the number of resolvers that need execution, in places that permit it (all going to the same backend).
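That per-node override could be prototyped outside the executor as a pre-execution pass. This is a hypothetical sketch; graphql-js AST nodes do not actually carry a `resolve` property, and the node shape here is simplified:

```javascript
// Walk a simplified AST and attach on-the-fly resolvers to nodes an
// optimizer has decided to collapse. The executor would then prefer
// the node's resolver over the field definition's.
function attachResolvers(astNode, optimizer) {
  const optimized = optimizer(astNode);
  if (optimized) {
    astNode.resolve = optimized;
  }
  for (const child of astNode.selections || []) {
    attachResolvers(child, optimizer);
  }
  return astNode;
}

// Mirrors the proposed lookup order:
// fieldAST.resolve || fieldDef.resolve || defaultResolveFn
function pickResolver(fieldAST, fieldDef, defaultResolveFn) {
  return fieldAST.resolve || fieldDef.resolve || defaultResolveFn;
}
```

An optimizer that recognizes a collapsible subtree could attach one prefetching resolver at its root and no-op resolvers below it.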
That `array_to_json` thing is quite specific to your use case, though, and it presupposes that everything can be converted meaningfully to JSON.
In the general case, for that sort of to-many relationship, most ORMs even (with full knowledge of the query) use a subquery load strategy, which does follow @helfer's strategy of issuing multiple queries to the database, and doing the join in-memory after receiving the database rows, to avoid the cartesian explosion issues from actually running this query in the straightforward way with a join.
This isn't to say that the idea of the reducing executor isn't useful – it's just that it's less useful than you might think, even in the context of e.g. talking directly to a SQL database, because the general approach for handling to-many relationships is to use an extra query anyway, which is exactly what you get out of the DataLoader.
tl;dr waterfalled loads with DataLoader closely approximate subquery load strategies for relationships in SQL ORMs, which most ORMs use by default for to-many relationships anyway, so the bar for a reducing executor being useful in at least that context is higher than you'd initially think
@ruslantalpa You're probably right that three queries won't be faster than the one joined on in most cases, but I'm not as convinced as you are that it's impossible. There are definitely some cases that I'd want to test for.
That's not really the main point though. The main point is that any difference might not matter at all. As @taion said, ORMs work in a very similar way, and for many (if not most) use cases, they are completely appropriate. The reason people use ORMs is the same reason I would suggest sticking to making multiple queries in the first implementation: simplicity. Even if it is slower, I wouldn't want to make my GraphQL server more complicated by optimizing, unless I really had to. Anything else seems like premature optimization to me.
You can definitely pass the data down from the top level resolvers. The first argument to the resolve function of a field is simply what the parent's resolve function returned. If your root resolve function returns an object that has the exact shape of the data queried for, then you don't even need to define your other resolvers. Does that make sense?
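A small sketch of that point: graphql-js's default resolver is essentially a property lookup on the parent's return value, so a root resolver that returns the full query shape makes the child resolvers unnecessary. The data below is hard-coded for illustration; in practice it would come from one joined query:

```javascript
// Roughly what graphql-js does when a field has no resolver: read the
// property off the parent object (calling it if it's a function).
const defaultFieldResolver = (source, args, context, info) =>
  typeof source[info.fieldName] === 'function'
    ? source[info.fieldName](args, context)
    : source[info.fieldName];

// A client resolver that returns the whole joined shape up front.
function resolveClient(parent, {id}) {
  // Hard-coded here; in practice one joined SQL query would produce this.
  return {
    id,
    name: 'ACME',
    projects: [{id: 10, name: 'Website', tasks: [{id: 100, name: 'Design'}]}],
  };
}
```

Because `projects` and `tasks` are already present on the returned objects, default resolution walks them without any further backend queries.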
I don't think the goal is necessarily to reduce the number of resolve calls; it's to reduce the number of backend queries generated by calling the resolve functions.
@helfer
The subquery load is actually better than a straight join, because otherwise you end up fetching a bunch of duplicate rows. Without doing dumb JSON tricks, that query above becomes:
```sql
SELECT
  clients.id AS client_id,
  clients.name AS client_name,
  projects.id AS project_id,
  projects.name AS project_name,
  tasks.id AS task_id,
  tasks.name AS task_name
FROM clients
LEFT OUTER JOIN projects
  ON projects.client_id = clients.id
LEFT OUTER JOIN tasks
  ON tasks.project_id = projects.id
```
This is rather less nice than the subquery load with the join done by the SQL client.
SQL ORMs don't do subquery loads over joined loads for to-many relations because it's easier – they do subquery loads over joined loads because it avoids making the database return a lot of extra data to you.
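The extra data is easy to quantify. With one client, P projects, and T tasks per project, a joined load returns P·T wide rows that repeat the client and project columns on every row, while subquery loads return three narrow result sets. A back-of-the-envelope sketch (the column counts are made-up parameters, not measurements):

```javascript
// Total cells transferred, assuming c/p/t columns selected from the
// clients/projects/tasks tables respectively.
function joinedCells(P, T, c, p, t) {
  // Every one of the P*T rows carries client + project + task columns.
  return P * T * (c + p + t);
}

function subqueryCells(P, T, c, p, t) {
  // Three separate result sets: 1 client row, P project rows, P*T task rows.
  return c + P * p + P * T * t;
}
```

For, say, 10 projects with 20 tasks each and 2 columns per table, the joined load moves roughly three times as many cells, which is the duplication the subquery strategy avoids.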
@taion Yeah, that's a good point. I've thought about that before, but I wasn't sure if database engines do some optimization by compressing the duplicate data. I would have guessed that they must be doing that if it's a performance concern.
I'm going off of http://docs.sqlalchemy.org/en/latest/orm/loading_relationships.html#what-kind-of-loading-to-use, which suggests that (at least for Python) this sort of thing isn't optimized away, most of the time.
@taion Thanks for the link, that's a great resource! It also touches into another aspect which we haven't even mentioned here: you could cache the nodes not just on a per-request basis, but across requests. We've been looking into doing that in Apollo.
@helfer you are right about passing data down (looking into it). Using that, one could create "smarter" resolvers that prefetch data for their children (and with that eliminate the need to reduce the number of resolvers). I don't think the coupling is a problem, since it's dictated by the problem being modeled (the client-project relation is not going to change).
@JeffRMoore reducing the number of resolvers would have the same effect, but you are right, strictly speaking the goal is to reduce the number of backend trips.
@taion I don't want to start a flame war about ORMs :) but just because they do it that way does not mean they are doing it the right way. If you look more closely at my query you'll see there is no "cartesian explosion"; the shape of the response is EXACTLY the shape GraphQL will return. The fact that it uses functions like array_to_json has nothing to do with the executor or the generality this implementation needs to maintain; that is handled by my resolvers anyway, so I am taking advantage of the features the backend (PostgreSQL) offers me. The reason ORMs do it that (suboptimal) way is that they target all databases, so they use only features available in all of them (standard SQL). But since I am using only one backend (as does everybody using an ORM, actually), I am going to take advantage of the features it offers (and throw away the broken abstraction).
In conclusion, first of all I would like to apologize for derailing the discussion a bit. While in the context of an SQL backend what @helfer suggested solves the problem of optimising the backend trips, there might be situations where this is useful. What are your thoughts on providing a way to "reduce" the number of resolvers (that generate backend requests) by allowing resolvers to exist independently of type definitions and be attached to the AST nodes? This would give an outside function a way to control the "execution" phase to some extent.
@ruslantalpa
Even in the PostgreSQL context, your example only works if you're okay with all of your data being coerced into JSON. It's not generally appropriate, and you're also making the database do more work with type coercion. If any of those are e.g. `TIMESTAMP` fields or whatever, then the solution you have doesn't work. Your choice of query there isn't broken per se, but even targeting a single database backend, it's not generically applicable, in that it requires specific assumptions about JSON serializability.
That doesn't mean that I think this sort of up-front planning is a bad idea, necessarily – the bar just has to be higher than "it forces me to do the equivalent of a waterfalled subquery load".
@taion It works in all cases, since GraphQL has to return JSON, so the data needs to be serialisable, and coercion in the DB is always faster than in any scripting language. Let's move this discussion to Slack if you are interested.
That's only in the case where you don't need to do anything with that value on the database client application. Again, I'm not saying that what you have doesn't work for your use case – but it's not a generic solution.
I don't object to the idea of having support for this sort of thing; just that the scope is necessarily limited, and the value-add over DataLoader for general use cases has limited scope – not zero scope, but not full scope either.
I like the idea of a reducing executor.
I'm running into a similar problem when trying to optimize the queries to the DB when querying through GraphQL.
I got a way to make it work with graphene, but it feels hacky.
I don't believe I followed the entire reducer discussion above. I'm going to try to work through some of the examples mentioned in the links above and see how far I get, but can anybody report back on how the explorations mentioned above worked out?
Or does anybody have anything more fleshed out, updated for 2021?
Possibly related.... #3303
Hello!
In our system we took a GraphQL query and translated it to ReQL (the RethinkDB query language) for execution. This allowed us to optimize the query, e.g. use built-in joins in ReQL instead of querying the database twice. I can see how a similar system could be used for, e.g., SQL queries. We also did some further optimizations, like returning only the fields that are requested by GraphQL, reducing the response payload size for tables with many keys.
I see a way to do this in graphql-js by manually going through selection sets in resolve, but that feels very clumsy and hacky, as the library does this walking already. Some ways to do it would be to allow children's `resolve` to modify the root, or to split the collect/resolve/complete steps into independent parts that can be used separately. Do you think deferred execution and query translation is a good idea? Should we stick to the current model and do multiple requests to the database? Is there some solution to this that I don't see?
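For the narrower goal of requesting only the selected fields, walking the selection set from the resolver's info argument works today, if clumsily. A simplified sketch that ignores fragments, aliases, and directives (`fieldNode` mimics the shape of a graphql-js field AST node):

```javascript
// Extract the requested scalar field names from a (simplified) field
// AST node, to build e.g. a ReQL pluck(...) or a SQL SELECT list
// instead of fetching every column.
function requestedFields(fieldNode) {
  if (!fieldNode.selectionSet) return [];
  return fieldNode.selectionSet.selections
    .filter(sel => !sel.selectionSet) // keep scalar leaves only
    .map(sel => sel.name.value);
}
```

A resolver could then pass the result to the backend query builder, so a table with many keys only returns the columns the client asked for.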
Thanks!