Limiting SQL query to defined fields/columns

somada141 commented 6 years ago

A full working demo can be found under https://github.com/somada141/demo-graphql-sqlalchemy-falcon.

Consider the following SQLAlchemy ORM class:

class Author(Base, OrmBaseMixin):
    __tablename__ = "authors"

    author_id = sqlalchemy.Column(
        sqlalchemy.types.Integer(),
        primary_key=True,
    )

    name_first = sqlalchemy.Column(
        sqlalchemy.types.Unicode(length=80),
        nullable=False,
    )

    name_last = sqlalchemy.Column(
        sqlalchemy.types.Unicode(length=80),
        nullable=False,
    )

Simply wrapped in an SQLAlchemyObjectType as such:

class TypeAuthor(SQLAlchemyObjectType):
    class Meta:
        model = Author

and exposed through:

    author = graphene.Field(
        TypeAuthor,
        author_id=graphene.Argument(type=graphene.Int, required=False),
        name_first=graphene.Argument(type=graphene.String, required=False),
        name_last=graphene.Argument(type=graphene.String, required=False),
    )

    @staticmethod
    def resolve_author(
        args,
        info,
        author_id: Union[int, None] = None,
        name_first: Union[str, None] = None,
        name_last: Union[str, None] = None,
    ):
        query = TypeAuthor.get_query(info=info)

        if author_id:
            query = query.filter(Author.author_id == author_id)

        if name_first:
            query = query.filter(Author.name_first == name_first)

        if name_last:
            query = query.filter(Author.name_last == name_last)

        author = query.first()

        return author

A GraphQL query such as:

query GetAuthor{
  author(authorId: 1) {
    nameFirst
  }
}

will cause the following raw SQL to be emitted (taken from the echo logs of the SQLA engine):

SELECT authors.author_id AS authors_author_id, authors.name_first AS authors_name_first, authors.name_last AS authors_name_last
FROM authors
WHERE authors.author_id = ?
 LIMIT ? OFFSET ?
2018-05-24 16:23:03,669 INFO sqlalchemy.engine.base.Engine (1, 1, 0)

As one can see we may only want the nameFirst field, i.e., the name_first column but the entire row is fetched. Of course the GraphQL response only contains the requested fields, i.e.,

{
  "data": {
    "author": {
      "nameFirst": "Robert"
    }
  }
}

but we have still fetched the entire row, which becomes a major issue when dealing with wide tables.

Is there a way to automagically communicate which columns are needed to SQLAlchemy so as preclude this form of over-fetching?

DrPyser commented 6 years ago

I think right now, you have to inspect the query AST(info.query) to see what fields are queried. Then you can use the load_only(*columns) option in the sqlalchemy query.

fields = get_query_fields(info.query, info.fragments)
query = TypeAuthor.get_query(info=info).options(load_only(*fields))

or something similar. See https://github.com/graphql-python/graphene/issues/348 for ideas about how to implement get_query_fields and alternatives.

Of course this is a simplification, since this only works for the first level of the query. You'd actually have to also look at relationships, recursively, and apply the load_only to those too. It might look like this:

fields = get_query_fields(info.query, info.fragments)
query = TypeAuthor.get_query(info=info)\
    .options(*(
         joinedload(relation).load_only(*relation_fields) 
         if relation is not None else load_only(*fields) # for top level entity
         for relation, relation_fields in fields
))

but this still isn't right, it only takes care of the first level of relationships.

The joinedload option specifies an eager loading method, which we also want to avoid the n+1 problem(see also https://yacine.org/2017/02/27/graphqlgraphene-sqlalchemy-and-the-n1-problem/).

http://docs.sqlalchemy.org/en/latest/orm/loading_relationships.html http://docs.sqlalchemy.org/en/latest/orm/loading_columns.html

somada141 commented 6 years ago

Thanks a lot @DrPyser that was a very helpful response! I will give it a go now and update the issue.

somada141 commented 6 years ago

Hey @DrPyser. I tried the implementation of get_query_fields under https://github.com/graphql-python/graphene/issues/348#issuecomment-267717809 but the field-names are 'mapped' to GraphQLised version of themselves.

For example, the return of the aforementioned implementation against this query:

query GetAuthor{
  author(authorId: 1) {
    nameFirst
  }
}

will return ['author', 'author.nameFirst']. While its fairly easy to remove the author prefix (as the table fields won't resolve with that in there), it gets a little trickier reverting camel-case field-names to their original underscored versions, e.g., nameFirst to name_first.

Is there a more concrete way of performing this mapping? Graphene/GrapheneSQLAlchemy does it somehow when mapping ORM fields to GraphQL fields so I was hoping to leverage that somehow.

DrPyser commented 6 years ago

Yup, look in graphene.utils.str_converters. There's a function to_snake_case that converts camel case to snake case(and to_camel_case to do the opposite).

somada141 commented 6 years ago

That's beautiful, thanks a lot @DrPyser.

somada141 commented 6 years ago

Should anyone land here wondering about implementation I got around to implementing my own version of the get_query_fields function as such:

def extract_requested_fields(
    info: graphql.execution.base.ResolveInfo,
    fields: List[Union[Field, FragmentSpread]],
    do_convert_to_snake_case: bool = True,
) -> Dict:
    """Extracts the fields requested in a GraphQL query by processing the AST
    and returns a nested dictionary representing the requested fields.

    Note:
        This function should support arbitrarily nested field structures
        including fragments.

    Example:
        Consider the following query passed to a resolver and running this
        function with the `ResolveInfo` object passed to the resolver.

        >>> query = "query getAuthor{author(authorId: 1){nameFirst, nameLast}}"
        >>> extract_requested_fields(info, info.field_asts, True)
        {'author': {'name_first': None, 'name_last': None}}

    Args:
        info (graphql.execution.base.ResolveInfo): The GraphQL query info passed
            to the resolver function.
        fields (List[Union[Field, FragmentSpread]]): The list of `Field` or
            `FragmentSpread` objects parsed out of the GraphQL query and stored
            in the AST.
        do_convert_to_snake_case (bool): Whether to convert the fields as they
            appear in the GraphQL query (typically in camel-case) back to
            snake-case (which is how they typically appear in ORM classes).

    Returns:
        Dict: The nested dictionary containing all the requested fields.
    """

    result = {}
    for field in fields:

        # Set the `key` as the field name.
        key = field.name.value

        # Convert the key from camel-case to snake-case (if required).
        if do_convert_to_snake_case:
            key = to_snake_case(name=key)

        # Initialize `val` to `None`. Fields without nested-fields under them
        # will have a dictionary value of `None`.
        val = None

        # If the field is of type `Field` then extract the nested fields under
        # the `selection_set` (if defined). These nested fields will be
        # extracted recursively and placed in a dictionary under the field
        # name in the `result` dictionary.
        if isinstance(field, Field):
            if (
                hasattr(field, "selection_set") and
                field.selection_set is not None
            ):
                # Extract field names out of the field selections.
                val = extract_requested_fields(
                    info=info,
                    fields=field.selection_set.selections,
                )
            result[key] = val
        # If the field is of type `FragmentSpread` then retrieve the fragment
        # from `info.fragments` and recursively extract the nested fields but
        # as we don't want the name of the fragment appearing in the result
        # dictionary (since it does not match anything in the ORM classes) the
        # result will simply be result of the extraction.
        elif isinstance(field, FragmentSpread):
            # Retrieve referened fragment.
            fragment = info.fragments[field.name.value]
            # Extract field names out of the fragment selections.
            val = extract_requested_fields(
                info=info,
                fields=fragment.selection_set.selections,
            )
            result = val

    return result

which parses the AST into a dict preserving the structure of the query and (hopefully) matching the structure of the ORM.

Running the info object of a query like:

query getAuthor{
  author(authorId: 1) {
    nameFirst,
    nameLast
  }
}

produces

{'author': {'name_first': None, 'name_last': None}}

while a more complex query like this:

query getAuthor{
  author(nameFirst: "Brandon") {
    ...authorFields
    books {
      ...bookFields
    }
  }
}

fragment authorFields on TypeAuthor {
  nameFirst,
  nameLast
}

fragment bookFields on TypeBook {
  title,
  year
}

produces:

{'author': {'books': {'title': None, 'year': None},
  'name_first': None,
  'name_last': None}}

Now these dictionaries can be used to define what is a field on the primary-table (Author in this case) as they'll have a value of None such as name_first or a field on a relationship of that primary-table such as field title on the books relationship.

A simplistic approach to auto-applying those fields can take the form of the following function:

def apply_requested_fields(
    info: graphql.execution.base.ResolveInfo,
    query: sqlalchemy.orm.Query,
    orm_class: Type[OrmBaseMixin]
) -> sqlalchemy.orm.Query:
    """Updates the SQLAlchemy Query object by limiting the loaded fields of the
    table and its relationship to the ones explicitly requested in the GraphQL
    query.

    Note:
        This function is fairly simplistic in that it assumes that (1) the
        SQLAlchemy query only selects a single ORM class/table and that (2)
        relationship fields are only one level deep, i.e., that requestd fields
        are either table fields or fields of the table relationship, e.g., it
        does not support fields of relationship relationships.

    Args:
        info (graphql.execution.base.ResolveInfo): The GraphQL query info passed
            to the resolver function.
        query (sqlalchemy.orm.Query): The SQLAlchemy Query object to be updated.
        orm_class (Type[OrmBaseMixin]): The ORM class of the selected table.

    Returns:
        sqlalchemy.orm.Query: The updated SQLAlchemy Query object.
    """

    # Extract the fields requested in the GraphQL query.
    fields = extract_requested_fields(
        info=info,
        fields=info.field_asts,
        do_convert_to_snake_case=True,
    )

    # We assume that the top level of the `fields` dictionary only contains a
    # single key referring to the GraphQL resource being resolved.
    tl_key = list(fields.keys())[0]
    # We assume that any keys that have a value of `None` (as opposed to
    # dictionaries) are fields of the primary table.
    table_fields = [
        key for key, val in fields[tl_key].items()
        if val is None
    ]

    # We assume that any keys that have a value being a dictionary are
    # relationship attributes on the primary table with the keys in the
    # dictionary being fields on that relationship. Thus we create a list of
    # `[relatioship_name, relationship_fields]` lists to be used in the
    # `joinedload` definitions.
    relationship_fieldsets = [
        [key, val.keys()]
        for key, val in fields[tl_key].items()
        if isinstance(val, dict)
    ]

    # Assemble a list of `joinedload` definitions on the defined relationship
    # attribute name and the requested fields on that relationship.
    options_joinedloads = []
    for relationship_fieldset in relationship_fieldsets:
        relationship = relationship_fieldset[0]
        rel_fields = relationship_fieldset[1]
        options_joinedloads.append(
            sqlalchemy.orm.joinedload(
                getattr(orm_class, relationship)
            ).load_only(*rel_fields)
        )

    # Update the SQLAlchemy query by limiting the loaded fields on the primary
    # table as well as by including the `joinedload` definitions.
    query = query.options(
        sqlalchemy.orm.load_only(*table_fields),
        *options_joinedloads
    )

    return query

aonamrata commented 5 years ago

Is this still the only way to do it? even with the new version out?

somada141 commented 5 years ago

@aonamrata I haven't seen any new approach being available. Which new version are you referring to?

aonamrata commented 5 years ago

never mind, it does not have it. I saw couple of PRs to add it and was assuming it should be in the next version .. after 2.0.

thejcannon commented 5 years ago

I know this is closed, but should the code snippet @somada141 provided be merged into this library? Assuming it gets thoroughly tested, I can't think of a reason why we wouldn't want to fix the overfetching of data from the DB and fix the N+1 problem.

I believe this is the missing piece that would make using SQLAlchemy + graphene + graphene-sqlalchemy a truly beautiful thing.

thejcannon commented 5 years ago

Note this would probably solve #100 and #109, which mention overfetching and N+1, respectively.

DrPyser commented 5 years ago

Yeah, this is a fairly complex problem and solution, you don't want everyone to implement their own version. This should also be a basic concern of a library like this one. It might help if the graphene or graphql-core libraries provided more utilities to manipulate the AST.

jnak commented 5 years ago

Yeah I agree that something like that should be merged in to this library at some point. Ideally we should write some sort of design document before we delve into the actual implementation because there are many considerations and edge cases to take into account. For example, I think the N+1 issue is somewhat orthogonal to this problem and could be solved with different approaches.

I'm currently working on improving the out-of-the-box performance of graphene-sqlalchemy. I'll probably get to this at some point.

jnak commented 4 years ago

Hi, just a heads up that I sent a PR to address the N+1 problem. I ended up following a different approach than @somada141 because one of my goals was that this optimization should work well with schemas that include non-graphene-sqlalchemy types. I would love to hear your thoughts. Cheers.

aldiyar-zharkimbayev commented 3 years ago

Hi, I didn't find full answer, so I combined answers and made solution for "n+1", but "cartesian product" appears.

edited version of extract_requested_fields function from @somada141, that converts requested fields to dict
my makeLoadOnlyOptions function, that converts dict to query options
custom ConnectionField class LoadOnlyConnectionField - if you don't want custom class, you can use this example to resolve query manually

def extract_requested_fields(
        info: graphql.execution.base.ResolveInfo,
        fields: List[Field],
        do_convert_to_snake_case: bool = True,
) -> Dict:
    result = {}
    for field in fields:

        key = field.name.value
        if do_convert_to_snake_case:
            key = to_snake_case(name=key)

        if key == "id":
            continue
        val = None
        if isinstance(field, Field):
            if (
                    hasattr(field, "selection_set") and
                    field.selection_set is not None
            ):
                val = extract_requested_fields(
                    info=info,
                    fields=field.selection_set.selections,
                    do_convert_to_snake_case=do_convert_to_snake_case
                )
            if val and 'edges' in val:
                val = val['edges']
            if val and 'node' in val:
                val = val['node']
            result[key] = val
        elif isinstance(field, FragmentSpread):
            fragment = info.fragments[field.name.value]
            val = extract_requested_fields(
                info=info,
                fields=fragment.selection_set.selections,
                do_convert_to_snake_case=do_convert_to_snake_case
            )
            result = val
    return result

def makeLoadOnlyOptions(fieldsDict, model):
    if not fieldsDict:
        return []
    options = []
    modelFields = []
    relationshipFields = []
    for field in list(fieldsDict):
        if fieldsDict[field] is None:
            modelFields.append(field)
        else:
            relationshipFields.append(field)
    if modelFields:
        options.append(load_only(*modelFields))
    for relationshipField in relationshipFields:
        relationshipOptions = makeLoadOnlyOptions(fieldsDict[relationshipField],
                                                 getattr(model, relationshipField).property.mapper.class_)
        options.append(joinedload(getattr(model, relationshipField)).options(*relationshipOptions))
    return options

class LoadOnlyConnectionField(graphene_sqlalchemy.SQLAlchemyConnectionField):
    def __init__(self, connection, *args, **kwargs):
        super().__init__(connection, *args, **kwargs)

    @classmethod
    def get_query(cls, model, info: 'ResolveInfo', sort=None, **args):
        query = super().get_query(model, info, sort, **args)
        queryName = info.field_name

        fieldsDict = extract_requested_fields(info, info.field_asts, False)
        fieldsDict = fieldsDict[queryName]
        options = makeLoadOnlyOptions(fieldsDict, model)
        query = query.options(*options)
        # print(query)
        return query

class Query(graphene.ObjectType):
    node = relay.Node.Field()

    searchTitlesRelay = LoadOnlyConnectionField(connection=Title, sort=Title.sort_argument())

{
    searchTitlesRelay{
        edges {
            cursor
            node {
                uuid
                code
                titleAuthors {
                    edges {
                        node {
                            uuid
                            user {
                                uuid
                                name
                            }
                        }
                    }
                }
                chapters {
                    edges {
                        node {
                            uuid
                        }
                    }
                }
            }
        }
    }
}

SELECT  anon_1.title_id AS anon_1_title_id, anon_1.title_code AS anon_1_title_code,
    title_author_1.id AS title_author_1_id,
    users_1.id AS users_1_id, users_1.name AS users_1_name,
    chapter_1.id AS chapter_1_id
FROM (
    SELECT title.id AS title_id, title.code AS title_code
    FROM title ORDER BY title.id ASC
    LIMIT 2
) AS anon_1
LEFT OUTER JOIN title_author AS title_author_1 ON anon_1.title_id = title_author_1.title_id
LEFT OUTER JOIN users AS users_1 ON users_1.id = title_author_1.user_id
LEFT OUTER JOIN chapter AS chapter_1 ON anon_1.title_id = chapter_1.title_id
ORDER BY anon_1.title_id ASC

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related topics referencing this issue.

graphql-python / graphene-sqlalchemy

Limiting SQL query to defined fields/columns #134