I hope this will be helpful for you: https://gist.github.com/itkrt2y/1e1a947c71772044f5d67f358b4772fc
@itkrt2y, could you share some "before" and "after" SQL queries when that association loader is used? (AFAIK, there's no good way to batch has_many associations, especially when pagination is used -- but I'd love to see an example to the contrary!)
In my experience, there are a few ways of handling has_manys:

- Load the ids, then load all the records by ID. I think this is how Facebook's TAO architecture works, and if you were using something like https://github.com/shopify/identity_cache, maybe it would be a good way to leverage your cache. But especially when paging through a filtered relation, it's a lot of work to identify which IDs to fetch. Then, you end up making two trips to the database: one to fetch IDs, then another to fetch records. Wouldn't it have been better to simply load the records at the beginning? (Maybe not, if you know that the batched load-by-id call will have other records in it, besides the ones from the relation.)
- Batch-load the relation up front, then have WHERE, OFFSET, LIMIT, etc. applied to it after it's loaded. Won't ActiveRecord have to make another database roundtrip when those are applied? (Maybe this doesn't apply if you're returning unbounded result sets to clients, which is a no-no in the first place, but 🤷.)

I have seen one way to actually batch SQL calls for has_many relations, using Postgres window functions, but I've never heard of anyone using it IRL: https://pganalyze.com/blog/efficient-graphql-queries-in-ruby-on-rails-and-postgres
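(For illustration, the first option might be sketched roughly like this; RecordSource and the other names are made up, not part of graphql-ruby.)

class RecordSource < GraphQL::Dataloader::Source
  def initialize(model)
    @model = model
  end

  def fetch(ids)
    # One query for all the IDs requested during this batch.
    records = @model.where(id: ids).index_by(&:id)
    ids.map { |id| records[id] }
  end
end

# First trip: figure out which IDs belong to this parent's (possibly filtered) relation.
ids = object.blogs.pluck(:id)
# Second trip, batched across parents: load the records by ID.
dataloader.with(RecordSource, Blog).load_all(ids)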
Anyways, all that to say:
After those questions are answered, we can investigate options for getting from one to the other.
@rmosolgo I created a sample app for my association dataloader => https://github.com/itkrt2y/rails-association-loader-sample
before/after of association dataloader
{users {blogs {title}}}
User Load (0.1ms) SELECT "users".* FROM "users"
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC [["user_id", 1]]
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC [["user_id", 2]]
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC [["user_id", 3]]
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC [["user_id", 4]]
Blog Load (0.2ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC [["user_id", 5]]
{users {blogDataloader {title}}}
User Load (0.1ms) SELECT "users".* FROM "users"
Blog Load (0.3ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" IN (?, ?, ?, ?, ?) ORDER BY "blogs"."id" ASC [["user_id", 1], ["user_id", 2], ["user_id", 3], ["user_id", 4], ["user_id", 5]]
{users {blogsConnection {nodes {title}}}}
User Load (0.2ms) SELECT "users".* FROM "users"
Blog Load (0.2ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC LIMIT -1 OFFSET ? [["user_id", 1], ["OFFSET", 0]]
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC LIMIT -1 OFFSET ? [["user_id", 2], ["OFFSET", 0]]
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC LIMIT -1 OFFSET ? [["user_id", 3], ["OFFSET", 0]]
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC LIMIT -1 OFFSET ? [["user_id", 4], ["OFFSET", 0]]
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC LIMIT -1 OFFSET ? [["user_id", 5], ["OFFSET", 0]]
{users {blogsConnectionDataloader {nodes {title}}}}
User Load (0.1ms) SELECT "users".* FROM "users"
Blog Load (0.3ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" IN (?, ?, ?, ?, ?) ORDER BY "blogs"."id" ASC [["user_id", 1], ["user_id", 2], ["user_id", 3], ["user_id", 4], ["user_id", 5]]
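(For context, a source that produces a batched IN query like the one above might look roughly like this. It's a sketch with illustrative names, not the sample app's actual code.)

class HasManySource < GraphQL::Dataloader::Source
  def initialize(model, foreign_key)
    @model = model              # e.g. Blog
    @foreign_key = foreign_key  # e.g. :user_id
  end

  def fetch(parent_ids)
    # SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" IN (...) ORDER BY "blogs"."id"
    grouped = @model.where(@foreign_key => parent_ids).order(:id).group_by(&@foreign_key)
    parent_ids.map { |id| grouped[id] || [] }
  end
end

# In the field resolver:
# dataloader.with(HasManySource, Blog, :user_id).load(object.id)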
{patients {physicians {name}}}
Patient Load (0.1ms) SELECT "patients".* FROM "patients"
Physician Load (0.1ms) SELECT "physicians".* FROM "physicians" INNER JOIN "appointments" ON "physicians"."id" = "appointments"."physician_id" WHERE "appointments"."patient_id" = ? [["patient_id", 1]]
Physician Load (0.2ms) SELECT "physicians".* FROM "physicians" INNER JOIN "appointments" ON "physicians"."id" = "appointments"."physician_id" WHERE "appointments"."patient_id" = ? [["patient_id", 2]]
Physician Load (0.3ms) SELECT "physicians".* FROM "physicians" INNER JOIN "appointments" ON "physicians"."id" = "appointments"."physician_id" WHERE "appointments"."patient_id" = ? [["patient_id", 3]]
Physician Load (0.2ms) SELECT "physicians".* FROM "physicians" INNER JOIN "appointments" ON "physicians"."id" = "appointments"."physician_id" WHERE "appointments"."patient_id" = ? [["patient_id", 4]]
Physician Load (0.2ms) SELECT "physicians".* FROM "physicians" INNER JOIN "appointments" ON "physicians"."id" = "appointments"."physician_id" WHERE "appointments"."patient_id" = ? [["patient_id", 5]]
{patients {physiciansDataloader {name}}}
Patient Load (0.1ms) SELECT "patients".* FROM "patients"
Appointment Load (0.2ms) SELECT "appointments".* FROM "appointments" WHERE "appointments"."patient_id" IN (?, ?, ?, ?, ?) [["patient_id", 1], ["patient_id", 2], ["patient_id", 3], ["patient_id", 4], ["patient_id", 5]]
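(A has_many :through version could batch in two hops, join rows first and then the target records; again, just a sketch, not the sample app's code.)

class ThroughSource < GraphQL::Dataloader::Source
  def fetch(patient_ids)
    # Hop 1: all join rows for the batch.
    appointments = Appointment.where(patient_id: patient_ids).group_by(&:patient_id)
    # Hop 2: all target records referenced by those join rows.
    physician_ids = appointments.values.flatten.map(&:physician_id).uniq
    physicians = Physician.where(id: physician_ids).index_by(&:id)
    patient_ids.map do |id|
      (appointments[id] || []).map { |appt| physicians[appt.physician_id] }
    end
  end
end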
Thanks for sharing those examples. What if you include pagination parameters, for example, first: 3?
It also works as expected
{users {blogsConnectionDataloader(first: 3) {nodes {title}}}}
Started POST "/graphql" for 127.0.0.1 at 2022-06-02 10:14:35 +0900
Processing by GraphqlController#execute as */*
Parameters: {"operationName"=>nil, "variables"=>{}, "query"=>"{\n users {\n blogsConnectionDataloader(first: 3) {\n nodes {\n title\n }\n }\n }\n}\n", "graphql"=>{"operationName"=>nil, "variables"=>{}, "query"=>"{\n users {\n blogsConnectionDataloader(first: 3) {\n nodes {\n title\n }\n }\n }\n}\n"}}
User Load (0.1ms) SELECT "users".* FROM "users"
Blog Load (0.6ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" IN (?, ?, ?, ?, ?) ORDER BY "blogs"."id" ASC [["user_id", 1], ["user_id", 2], ["user_id", 3], ["user_id", 4], ["user_id", 5]]
as expected
I guess that's where I disagree. If the GraphQL query is asking for first: 3, I would expect a bounded SELECT of some kind (LIMIT or WHERE could both work). But by sending an unbounded select, we risk causing a lot of DB and application latency while results are fetched and prepared, then related objects are initialized by Rails.
Maybe unbounded selects are OK for your use cases, but I've seen them go wrong often enough that I can't recommend them as an approach!
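(For what it's worth, the window-function approach from the pganalyze article linked earlier is one way to get a query that's both batched and bounded. A rough sketch, assuming Postgres and an illustrative user_ids variable:)

# Rank each user's blogs with a window function, then keep only the first 3 per user.
ranked = Blog.select("blogs.*, row_number() OVER (PARTITION BY user_id ORDER BY id) AS rn")
             .where(user_id: user_ids)
Blog.from(ranked, :blogs).where("rn <= ?", 3)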
It sounds like we got to the end of our discussion here. Thanks again for sharing what you found! Hopefully it will be helpful to anyone else who is looking for the same behavior.
@rmosolgo Sometimes when it comes to has_many relationships, where we know those items are likely to be requested again later on, we've ended up doing things like dataloader.with(Sources::ActiveRecord, Child).load_all(object.children.pluck(:id))

This doesn't look very nice, and to me it looks like doing the same lookup twice in the hope that the dataloader will then have it cached for later. Could we push these straight into the dataloader's cache if we already have the records?
We could definitely add something like that... maybe:
results = object.children.each_with_object({}) { |child, hash| hash[child.id] = child }
# Add these `{ key => object }` pairs to the Dataloader source:
dataloader.with(Sources::ActiveRecord, Child).merge(results)
What do you think of that?
@rmosolgo Something like that would be awesome. It would be even nicer if it just allowed you to .push(record), since it would know how to create a cache key for each record anyway?
"know how to create a cache key"
That's the part I don't quite know about ... I know it can go from key to record, but how could it go from record to key?
@rmosolgo Ah yeah, I see what you're saying. I think I had it a bit mixed up in my mind with the GraphQL::Batch gem, which has a cache_key(...) method. Either way, any way of pushing into the existing cache would be awesome!
I suggested an implementation in https://github.com/rmosolgo/graphql-ruby/pull/4186
Yay thank you so much! That was fast!
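(For anyone finding this later: with a merge like the one proposed above, seeding the cache would look roughly like this. Sketch only, assuming the API from that PR.)

# Seed the source's cache with records we already have in memory,
# so later load(id) calls for them don't hit the database again.
source = dataloader.with(Sources::ActiveRecord, Child)
source.merge(object.children.index_by(&:id))
source.load(child_id) # served from the cache if child_id was merged above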
Is your feature request related to a problem? Please describe.
Dataloaders are amazing for avoiding N+1 queries, but I find in many cases I've ended up with some N+1 queries regardless, and I haven't figured out a good way to solve it.
Admittedly, this is a failing on my end, but I find it difficult with the very limited amount of documentation and examples on dataloader sources, and I'm honestly surprised there isn't more information online on how to implement some basic ones for has_many and has_many ..., through: relationships.

In many cases I've resorted to solutions along the lines of the sketch below (illustrative, with made-up model names):
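# Illustrative only (hypothetical Team/Membership models):
# fetch the join table for the parent, collect the user IDs,
# then batch-load the users by ID through the dataloader.
def users
  user_ids = Membership.where(team_id: object.id).pluck(:user_id)
  dataloader.with(Sources::ActiveRecord, User).load_all(user_ids)
end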
But this then requires us to first fetch that join-table and get all the User IDs, and then load the users, to be able to return promises for each of those IDs.
The alternative would have been to cache the entire collection by the parent's ID, but that's not really that helpful.
Describe the solution you'd like
It would be amazing to have some default dataloader sources like

dataloader.with(GraphQL::Dataloader::Association, User).load(parent, :association_name)

or similar, caching the records both by their ID (once they have been resolved) and under the parent's ID and association if that association gets resolved again.

In general, it would be great to have some more advanced examples, or even built-in sources, for the most common use cases such as has-many associations.
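A hypothetical source along those lines (not something graphql-ruby ships today; this sketch assumes Rails 7's Preloader keyword API) might look roughly like:

class AssociationSource < GraphQL::Dataloader::Source
  def initialize(association_name)
    @association_name = association_name
  end

  def fetch(parents)
    # Preload the association for every parent in this batch with one query,
    # then return each parent's loaded records in order.
    ActiveRecord::Associations::Preloader.new(records: parents, associations: @association_name).call
    parents.map { |parent| parent.public_send(@association_name).to_a }
  end
end

# dataloader.with(AssociationSource, :blogs).load(user)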
Describe alternatives you've considered
We often use lookaheads here to make sure we've joined associated tables and have them available in the query, and it happens that those lookaheads get wasted when a type uses a dataloader to return a field.
Admittedly, I discovered lookaheads way later and haven't been able to implement them thoroughly in all our resolvers, as our schema is quite big. Lookaheads help us a lot from the resolver side, but for deeply nested queries it gets difficult when selections are deep.