I hope this will be helpful for you: https://gist.github.com/itkrt2y/1e1a947c71772044f5d67f358b4772fc
@itkrt2y, could you share some "before" and "after" SQL queries when that association loader is used? (AFAIK, there's no good way to batch has_many associations, especially when pagination is used -- but I'd love to see an example to the contrary!)
In my experience, there are a few ways of handling has_manys:

- Load the ids, then load all the records by ID. I think this is how Facebook's TAO architecture works, and if you were using something like https://github.com/shopify/identity_cache, maybe it would be a good way to leverage your cache. But especially when paging through a filtered relation, it's a lot of work to identify which IDs to fetch. Then, you end up making two trips to the database: one to fetch IDs, then another to fetch records. Wouldn't it have been better to simply load the records at the beginning? (Maybe not, if you know that the batched load-by-id call will have other records in it, besides the ones from the relation.)
- Batch-load the relation up front, then have WHERE, OFFSET, LIMIT, etc. applied to it after it's loaded. Won't ActiveRecord have to make another database roundtrip when those are applied? (Maybe this doesn't apply if you're returning unbounded result sets to clients, which is a no-no in the first place, but 🤷.)

I have seen one way to actually batch SQL calls for has_many relations, using Postgres window functions, but I've never heard of anyone using it IRL: https://pganalyze.com/blog/efficient-graphql-queries-in-ruby-on-rails-and-postgres
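(For illustration, the first option might be sketched roughly like this; RecordSource and the other names are made up, not part of graphql-ruby.)

class RecordSource < GraphQL::Dataloader::Source
  def initialize(model)
    @model = model
  end

  def fetch(ids)
    # One query for all the IDs requested during this batch.
    records = @model.where(id: ids).index_by(&:id)
    ids.map { |id| records[id] }
  end
end

# First trip: figure out which IDs belong to this parent's (possibly filtered) relation.
ids = object.blogs.pluck(:id)
# Second trip, batched across parents: load the records by ID.
dataloader.with(RecordSource, Blog).load_all(ids)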
Anyways, all that to say:
After those questions are answered, we can investigate options for getting from one to the other.
@rmosolgo I created a sample app for my association dataloader => https://github.com/itkrt2y/rails-association-loader-sample
before/after of association dataloader
{users {blogs {title}}}
User Load (0.1ms) SELECT "users".* FROM "users"
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC [["user_id", 1]]
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC [["user_id", 2]]
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC [["user_id", 3]]
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC [["user_id", 4]]
Blog Load (0.2ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC [["user_id", 5]]
{users {blogDataloader {title}}}
User Load (0.1ms) SELECT "users".* FROM "users"
Blog Load (0.3ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" IN (?, ?, ?, ?, ?) ORDER BY "blogs"."id" ASC [["user_id", 1], ["user_id", 2], ["user_id", 3], ["user_id", 4], ["user_id", 5]]
{users {blogsConnection {nodes {title}}}}
User Load (0.2ms) SELECT "users".* FROM "users"
Blog Load (0.2ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC LIMIT -1 OFFSET ? [["user_id", 1], ["OFFSET", 0]]
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC LIMIT -1 OFFSET ? [["user_id", 2], ["OFFSET", 0]]
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC LIMIT -1 OFFSET ? [["user_id", 3], ["OFFSET", 0]]
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC LIMIT -1 OFFSET ? [["user_id", 4], ["OFFSET", 0]]
Blog Load (0.1ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" = ? ORDER BY "blogs"."id" ASC LIMIT -1 OFFSET ? [["user_id", 5], ["OFFSET", 0]]
{users {blogsConnectionDataloader {nodes {title}}}}
User Load (0.1ms) SELECT "users".* FROM "users"
Blog Load (0.3ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" IN (?, ?, ?, ?, ?) ORDER BY "blogs"."id" ASC [["user_id", 1], ["user_id", 2], ["user_id", 3], ["user_id", 4], ["user_id", 5]]
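(For context, a source that produces a batched IN query like the one above might look roughly like this. It's a sketch with illustrative names, not the sample app's actual code.)

class HasManySource < GraphQL::Dataloader::Source
  def initialize(model, foreign_key)
    @model = model              # e.g. Blog
    @foreign_key = foreign_key  # e.g. :user_id
  end

  def fetch(parent_ids)
    # SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" IN (...) ORDER BY "blogs"."id"
    grouped = @model.where(@foreign_key => parent_ids).order(:id).group_by(&@foreign_key)
    parent_ids.map { |id| grouped[id] || [] }
  end
end

# In the field resolver:
# dataloader.with(HasManySource, Blog, :user_id).load(object.id)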
{patients {physicians {name}}}
Patient Load (0.1ms) SELECT "patients".* FROM "patients"
Physician Load (0.1ms) SELECT "physicians".* FROM "physicians" INNER JOIN "appointments" ON "physicians"."id" = "appointments"."physician_id" WHERE "appointments"."patient_id" = ? [["patient_id", 1]]
Physician Load (0.2ms) SELECT "physicians".* FROM "physicians" INNER JOIN "appointments" ON "physicians"."id" = "appointments"."physician_id" WHERE "appointments"."patient_id" = ? [["patient_id", 2]]
Physician Load (0.3ms) SELECT "physicians".* FROM "physicians" INNER JOIN "appointments" ON "physicians"."id" = "appointments"."physician_id" WHERE "appointments"."patient_id" = ? [["patient_id", 3]]
Physician Load (0.2ms) SELECT "physicians".* FROM "physicians" INNER JOIN "appointments" ON "physicians"."id" = "appointments"."physician_id" WHERE "appointments"."patient_id" = ? [["patient_id", 4]]
Physician Load (0.2ms) SELECT "physicians".* FROM "physicians" INNER JOIN "appointments" ON "physicians"."id" = "appointments"."physician_id" WHERE "appointments"."patient_id" = ? [["patient_id", 5]]
{patients {physiciansDataloader {name}}}
Patient Load (0.1ms) SELECT "patients".* FROM "patients"
Appointment Load (0.2ms) SELECT "appointments".* FROM "appointments" WHERE "appointments"."patient_id" IN (?, ?, ?, ?, ?) [["patient_id", 1], ["patient_id", 2], ["patient_id", 3], ["patient_id", 4], ["patient_id", 5]]
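(A has_many :through version could batch in two hops, join rows first and then the target records; again, just a sketch, not the sample app's code.)

class ThroughSource < GraphQL::Dataloader::Source
  def fetch(patient_ids)
    # Hop 1: all join rows for the batch.
    appointments = Appointment.where(patient_id: patient_ids).group_by(&:patient_id)
    # Hop 2: all target records referenced by those join rows.
    physician_ids = appointments.values.flatten.map(&:physician_id).uniq
    physicians = Physician.where(id: physician_ids).index_by(&:id)
    patient_ids.map do |id|
      (appointments[id] || []).map { |appt| physicians[appt.physician_id] }
    end
  end
end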
Thanks for sharing those examples. What if you include pagination parameters, for example, first: 3?
It also works as expected
{users {blogsConnectionDataloader(first: 3) {nodes {title}}}}
Started POST "/graphql" for 127.0.0.1 at 2022-06-02 10:14:35 +0900
Processing by GraphqlController#execute as */*
Parameters: {"operationName"=>nil, "variables"=>{}, "query"=>"{\n users {\n blogsConnectionDataloader(first: 3) {\n nodes {\n title\n }\n }\n }\n}\n", "graphql"=>{"operationName"=>nil, "variables"=>{}, "query"=>"{\n users {\n blogsConnectionDataloader(first: 3) {\n nodes {\n title\n }\n }\n }\n}\n"}}
User Load (0.1ms) SELECT "users".* FROM "users"
Blog Load (0.6ms) SELECT "blogs".* FROM "blogs" WHERE "blogs"."user_id" IN (?, ?, ?, ?, ?) ORDER BY "blogs"."id" ASC [["user_id", 1], ["user_id", 2], ["user_id", 3], ["user_id", 4], ["user_id", 5]]
as expected
I guess that's where I disagree. If the GraphQL query is asking for first: 3, I would expect a bounded SELECT of some kind (LIMIT or WHERE could both work). But by sending an unbounded select, we risk causing a lot of DB and application latency while results are fetched and prepared, then related objects are initialized by Rails.
Maybe unbounded selects are OK for your use cases, but I've seen them go wrong often enough that I can't recommend them as an approach!
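(For what it's worth, the window-function approach from the pganalyze article linked earlier is one way to get a query that's both batched and bounded. A rough sketch, assuming Postgres and an illustrative user_ids variable:)

# Rank each user's blogs with a window function, then keep only the first 3 per user.
ranked = Blog.select("blogs.*, row_number() OVER (PARTITION BY user_id ORDER BY id) AS rn")
             .where(user_id: user_ids)
Blog.from(ranked, :blogs).where("rn <= ?", 3)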
It sounds like we got to the end of our discussion here. Thanks again for sharing what you found! Hopefully it will be helpful to anyone else who is looking for the same behavior.
@rmosolgo Sometimes when it comes to has_many relationships, where we know those items are likely to be requested again later on, we've ended up doing things like dataloader.with(Sources::ActiveRecord, Child).load_all(object.children.pluck(:id))

This doesn't look very nice, and to me it looks like doing the same lookup twice in the hope that the dataloader will then have it cached for later. Could we push these straight into the dataloader's cache if we already have the records?
We could definitely add something like that... maybe:
results = object.children.each_with_object({}) { |child, hash| hash[child.id] = child }
# Add these `{ key => object }` pairs to the Dataloader source:
dataloader.with(Sources::ActiveRecord, Child).merge(results)
What do you think of that?
@rmosolgo Something like that would be awesome. It would be even nicer if it just allowed you to .push(record), since it would know how to create a cache key for each record anyway?
"know how to create a cache key"
That's the part I don't quite know about ... I know it can go from key to record, but how could it go from record to key?
@rmosolgo Ah yeah, I see what you're saying. I think I had it a bit mixed up in my mind with the GraphQL::Batch gem, which has a cache_key(...) method. Either way, any way of pushing into the existing cache would be awesome!
I suggested an implementation in https://github.com/rmosolgo/graphql-ruby/pull/4186
Yay thank you so much! That was fast!
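(For anyone finding this later: with a merge like the one proposed above, seeding the cache would look roughly like this. Sketch only, assuming the API from that PR.)

# Seed the source's cache with records we already have in memory,
# so later load(id) calls for them don't hit the database again.
source = dataloader.with(Sources::ActiveRecord, Child)
source.merge(object.children.index_by(&:id))
source.load(child_id) # served from the cache if child_id was merged above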
Is your feature request related to a problem? Please describe.
Dataloaders are amazing for avoiding N+1 queries, but I find in many cases I've ended up with some N+1 queries regardless, and I haven't figured out a good way to solve it.
Admittedly, this is a failing on my end, but I find it difficult with the very limited amount of documentation and examples on dataloader sources, and I'm honestly surprised there isn't more information online on how to implement some basic ones for has_many and has_many ..., through: relationships.

In many cases I've resorted to solutions along the lines of the sketch below (illustrative, with made-up model names):
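# Illustrative only (hypothetical Team/Membership models):
# fetch the join table for the parent, collect the user IDs,
# then batch-load the users by ID through the dataloader.
def users
  user_ids = Membership.where(team_id: object.id).pluck(:user_id)
  dataloader.with(Sources::ActiveRecord, User).load_all(user_ids)
end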
But this then requires us to first fetch that join-table and get all the User IDs, and then load the users, to be able to return promises for each of those IDs.
The alternative would have been to cache the entire collection by the parent's ID, but that's not really that helpful.
Describe the solution you'd like
It would be amazing to have some default dataloader sources like

dataloader.with(GraphQL::Dataloader::Association, User).load(parent, :association_name)

or similar, caching the records both by their ID (once they have been resolved) and under the parent's ID and association if that association gets resolved again.

In general, it would be great to have some more advanced examples, or even built-in sources, for the most common use cases such as has-many associations.
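A hypothetical source along those lines (not something graphql-ruby ships today; this sketch assumes Rails 7's Preloader keyword API) might look roughly like:

class AssociationSource < GraphQL::Dataloader::Source
  def initialize(association_name)
    @association_name = association_name
  end

  def fetch(parents)
    # Preload the association for every parent in this batch with one query,
    # then return each parent's loaded records in order.
    ActiveRecord::Associations::Preloader.new(records: parents, associations: @association_name).call
    parents.map { |parent| parent.public_send(@association_name).to_a }
  end
end

# dataloader.with(AssociationSource, :blogs).load(user)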
Describe alternatives you've considered
We often use lookaheads here to make sure we've joined associated tables and have them available in the query, and it happens that those lookaheads get wasted when a type uses a dataloader to return a field.
Admittedly, I discovered lookaheads way later and haven't been able to implement them thoroughly in all our resolvers, as our schema is quite big. Lookaheads help us a lot from the resolver side, but for deeply nested queries it gets difficult when selections are deep.