inherited connections are not query-optimized

TWeidi commented 11 months ago

Describe the Bug

When inheriting from a strawberry_django.type the relay-connections defined on the base-type are not respected by the query-optimizer anymore when querying the relay-connection on the sub-typ

Example:

@type(models.Place)
class PlaceType(relay.Node):
    '''This is the the base-type'''
    some_m2m_relation: ListConnectionWithTotalCount[SomeRelatedType] = connection(prefetch_related=['some_m2m_relation'])

@type(models.Restaurant)
class RestaurantType(PlaceType):
    '''This is the the sub-type'''
    another_field: auto

Querying the RestaurantType for the some_m2m_relation shows the n+1 problem unless i repeat the field definition of some_m2m_relation on the RestaurantType, preventing me from keeping my code DRY:

This works well:

@type(models.Place)
class PlaceType(relay.Node):
    '''This is the the base-type'''
    some_m2m_relation: ListConnectionWithTotalCount[SomeRelatedType] = connection(prefetch_related=['some_m2m_relation'])

@type(models.Restaurant)
class RestaurantType(PlaceType):
    '''This is the the sub-type'''
    another_field: auto
    some_m2m_relation: ListConnectionWithTotalCount[SomeRelatedType] = connection(prefetch_related=['some_m2m_relation'])

Is this the intended behavior?

System Information

strawberry-graphql 0.204.0
strawberry-graphql-django 0.16.0

Additional Context

Upvote & Fund

We're using Polar.sh so you can upvote and help fund this issue.
We receive the funding once the issue is completed & confirmed by you.
Thank you in advance for helping prioritize & fund our backlog.

bellini666 commented 11 months ago

Hi @TWeidi ,

This seems to be the same issue reported in https://github.com/strawberry-graphql/strawberry-graphql-django/issues/340 and https://github.com/strawberry-graphql/strawberry-graphql-django/issues/337

As I mentioned in both, can you try removing the prefetch_related from the connection and tell me if it starts working properly?

TWeidi commented 11 months ago

Hey,

thank you for your response. I tried all possible combinations of field definitions on both Place and Restaurant types:

Variant1:

some_m2m_relation: ListConnectionWithTotalCount[SomeRelatedType] = connection(prefetch_related='some_m2m_relation'])

Variant2:

some_m2m_relation: ListConnectionWithTotalCount[SomeRelatedType] = connection()

Variant3: some_m2m_relation field not defined

PQ: Place query:

places {
  edges {
    node {
      some_m2m_relation {
        edges {
          node {
            id
          }
        }
      }
    }
  }
}

RQ: Restaurant query:

restaurants {
  edges {
    node {
      some_m2m_relation {
        edges {
          node {
            id
          }
        }
      }
    }
  }
}

                                                     Place                         
                                Variant1            Variant2           Variant3
                              PQ       RQ         PQ       RQ        PQ        RQ

               Variant1       Yes      Yes        No       Yes       N/A       Yes
 Restaurant    Variant2       Yes      No         No       No        N/A       No
               Variant3       Yes      No*        No       No        N/A       N/A

* I would expect a 'Yes' here due to inheriance. All other cases behave as I would expect

Currently I need to use Variant1 on both Place and Restaurant types to enable optimization for both PQ and RQ

bellini666 commented 11 months ago

Hi @TWeidi ,

Ohhh, I'm so sorry, I got confused by the bug report because it was very similar to the other 2 I mentioned. You are mentioning n+1 here, not the lack of query optimization for those queries.

Regarding those issues:

Nested queries are currently skipped by the optimizer, as you can see in this line

The reason for that is because connections are used for pagination. If you prefetch all the results, you would either be fetching a lot more than you want to then slice a possibly very large list in memory, or your cache would be thrown away and you would have to make another query for the connection itself.

To avoid that the optimizer would need to prefetch results for all items in the parent connection considering the applied pagination (together with any kind of filtering/ordering), and I haven't find a way to be able to do that correctly yet. Not sure if it is even possible TBH, consider this example:

Parent1: ChildConnection1
Parent2: ChildConnection2
Parent3: ChildConnection3

For lists django can easily prefetch all 3 child connections for the 3 parents, but if it does that trying to apply pagination on those, how can it tell that query to limit it to like 10 results for each child connection? Maybe this would require the usage of RowNumber and filter the results which rownumber is < 10, but I'm not sure if performance-wise this would be better than letting the n+1 for very large results. e.g. say each childconnection has 1m+ results, rownumber as a window function would be applied after the results are materialized, meaning you could be replacing 3 ~1ms extra queries by 1 ~100ms.

The prefetch_related issue is that, if you ask the optimizer to do that, it will do the prefetch for all items I mentioned in the previous issue. Meaning that you would be fetching in memory all the rows for them to be sliced in a list, or even to be thrown away after pagination is applied to the queryset, making it even worse than not optimizing.

For those kinds of nested relations, specially considering those limitations, I would say that you either want a simple list if the results from it are small enough, or let the n+1 issue happen and try to avoid letting too many queries happen if possible. e.g. avoid exposing a root query with a connection/list of a type that has nested connections

TWeidi commented 10 months ago

Hey @bellini666,

I guess i get your point regarding the nested m2m queries not being limited leading to a lot of overhead, but as far as i have tested your library handles that case already as I would expect. The query below is working perfectly fine:

places(first: 50) {
  edges {
    node {
      some_m2m_relation(first: 10) {
        edges {
          node {
            id
          }
        }
      }
    }
  }
}

So, i can indeed limit the nested lists to some sensible length and get the expected result within acceptable time while it costs me only 2 database queries (no matter how many "some_m2m_relation"s are associated with one place) according to the django toolbar, but if i replace the places query with the restaurants query, the toolbar shows that this hits the database a lot more often, depending on the number of "some_m2m_relation"s associated with each place.

So, to rephrase my actual question: Why is this property:

some_m2m_relation: ListConnectionWithTotalCount[SomeRelatedType] = connection(prefetch_related=['some_m2m_relation'])

not inherited 'as is' when inheriting the type for the Restaurant model from that of the Place model but obviously replaced by:

some_m2m_relation: ListConnectionWithTotalCount[SomeRelatedType] = connection()

bellini666 commented 9 months ago

A quick comment in here. I'll soon try to workaround this with something like this

strawberry-graphql / strawberry-django