strawberry-graphql / strawberry

A GraphQL library for Python that leverages type annotations 🍓
https://strawberry.rocks
MIT License

Uncaught exceptions lead to `on_execute` / `on_operation` lifecycle hooks completing before some resolvers #3414

Open kkom opened 7 months ago

kkom commented 7 months ago

TL;DR

Strawberry short-circuits the HTTP response whenever there is an uncaught exception. This reduces the latency, but leads to:

~(i) (a) incomplete and (b) nondeterministic responses~ (edit: established in the comments that this is expected) and (ii) hooks completing before some resolvers, which appears to violate a contract.

I wonder if it would be possible to make Strawberry run all resolvers to completion, even if some of them raise uncaught exceptions?

Describe the Bug

  1. Strawberry executes all resolvers, even if there was an uncaught exception which triggered an early HTTP response with errors.
  2. However, it eagerly returns a response with errors, as soon as an (edit: incoercible) exception is raised.
  3. Finally, it completes all lifecycle hooks before returning the response – including `on_execute` and `on_operation`.

This last point can lead to issues – it violates the invariant that on_execute / on_operation lifecycle hooks wrap around all resolver executions.

This can be problematic when these hooks do state management, like in the example given in Strawberry's docs. As a result, in addition to seeing the original uncaught exception in our observability suite, we have additional noise from knock-on failures – caused by premature completion of hooks.

Is this natural behaviour given how the various async tasks are orchestrated, or is it possible to tweak this a little? I'm thinking:

  1. cancelling the tasks that won't be used for the response anyway (as it's been already returned)
  2. waiting until all resolvers finish to return the response

~In fact, 2 may have other benefits – making the responses more (a) complete and (b) predictable. Currently, the GraphQL responses (i.e. which fields will return data and which won't) are non-deterministic (albeit a little faster thanks to the uncaught exception short-circuit).~ (edit: established in the comments that the short-circuiting is expected)
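The two options above can be sketched with plain `asyncio` (no Strawberry involved – `fail` and `wait` here are just stand-ins for the resolvers in the repro below): `gather()` propagates the first exception immediately, but the sibling task keeps running in the background, which is essentially the behaviour being described.

```python
import asyncio

async def fail():
    await asyncio.sleep(0.1)
    raise RuntimeError("'fail' resolver has failed")

async def wait():
    await asyncio.sleep(0.5)
    return "foo"

async def main():
    tasks = [asyncio.ensure_future(fail()), asyncio.ensure_future(wait())]
    try:
        # gather() propagates the first exception immediately...
        await asyncio.gather(*tasks)
    except RuntimeError:
        # ...but the sibling task is still running at this point.
        pending = [t for t in tasks if not t.done()]
        # Option 1: cancel the tasks whose results won't be used:
        #     for t in pending:
        #         t.cancel()
        # Option 2: wait for them to finish before returning the response:
        return await asyncio.gather(*pending, return_exceptions=True)

asyncio.run(main())  # returns ["foo"] after ~0.5 s
```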

Repro code

Schema:

```python
from asyncio import sleep
from datetime import datetime

import strawberry


@strawberry.type
class Query:
    @strawberry.field
    @staticmethod
    async def fail() -> str:
        await sleep(0.5)
        raise Exception(f"'fail' resolver has failed ({datetime.now()})")

    @strawberry.field
    @staticmethod
    async def wait() -> str:
        await sleep(2)
        print(f"'wait' resolver is about to return ({datetime.now()})")
        return "foo"
```

Logging extension:

```python
from collections.abc import Generator
from datetime import datetime
from typing import Callable, override

from graphql import GraphQLResolveInfo
from strawberry.extensions import SchemaExtension
from strawberry.utils.await_maybe import AwaitableOrValue, await_maybe


class MyCustomExtension(SchemaExtension):
    @override
    def on_execute(self) -> Generator[None, None, None]:
        print(f"'on_execute' start ({datetime.now()})")
        yield
        print(f"'on_execute' end ({datetime.now()})")

    @override
    async def resolve(
        self,
        _next: Callable[..., object],
        root: object,
        info: GraphQLResolveInfo,
        *args,
        **kwargs,
    ) -> AwaitableOrValue[object]:
        print(f"'{info.field_name}' resolver start ({datetime.now()})")
        result = await await_maybe(_next(root, info, *args, **kwargs))
        print(f"'{info.field_name}' resolver end ({datetime.now()})")
        return result
```

Example query:

```graphql
query {
  fail
  wait
}
```

Example response:

```json
{
  "data": null,
  "errors": [
    {
      "message": "'fail' resolver has failed (2024-03-19 21:08:12.088337)",
      "locations": [
        {
          "line": 2,
          "column": 3
        }
      ],
      "path": [
        "fail"
      ]
    }
  ]
}
```

Logs demonstrating that the resolvers continue being executed after hooks complete:

```
'on_execute' start (2024-03-19 21:08:11.587192)
'fail' resolver start (2024-03-19 21:08:11.587345)
'wait' resolver start (2024-03-19 21:08:11.587378)
'fail' resolver has failed (2024-03-19 21:08:12.088337)

GraphQL request:2:3
1 | query {
2 |   fail
  |   ^
3 |   wait
Traceback (most recent call last):
  File "/Users/kkom/Repos/isometric/python/services/backend/.venv/lib/python3.12/site-packages/graphql/execution/execute.py", line 528, in await_result
    return_type, field_nodes, info, path, await result
                                          ^^^^^^^^^^^^
  File "/Users/kkom/Repos/isometric/python/services/backend/backend/api/graphql/extensions/extensions.py", line 30, in resolve
    result = await await_maybe(_next(root, info, *args, **kwargs))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kkom/Repos/isometric/python/services/backend/.venv/lib/python3.12/site-packages/strawberry/utils/await_maybe.py", line 12, in await_maybe
    return await value
           ^^^^^^^^^^^
  File "/Users/kkom/Repos/isometric/python/services/backend/.venv/lib/python3.12/site-packages/strawberry/schema/schema_converter.py", line 682, in _async_resolver
    return await await_maybe(
           ^^^^^^^^^^^^^^^^^^
  File "/Users/kkom/Repos/isometric/python/services/backend/.venv/lib/python3.12/site-packages/strawberry/utils/await_maybe.py", line 12, in await_maybe
    return await value
           ^^^^^^^^^^^
  File "/Users/kkom/Repos/isometric/python/services/backend/backend/api/graphql/schemas/public.py", line 55, in fail
    raise Exception(f"'fail' resolver has failed ({datetime.now()})")
Exception: 'fail' resolver has failed (2024-03-19 21:08:12.088337)
'on_execute' end (2024-03-19 21:08:12.096968)
INFO:     127.0.0.1:58138 - "POST /graphql HTTP/1.1" 200 OK
'wait' resolver is about to return (2024-03-19 21:08:13.588281)
'wait' resolver end (2024-03-19 21:08:13.588422)
```


patrick91 commented 7 months ago

I think that's expected when fields aren't optional – if you make the failing field optional it should work as expected. Here's a test without the exception: https://play.strawberry.rocks/?gist=05181d8366f2584d390ccdac21d5d000

kkom commented 7 months ago

Thanks!

I think that's expected when fields aren't optional

When you say this, do you mean according to the GraphQL spec or Strawberry's design decision?

patrick91 commented 7 months ago

@kkom from the spec: https://spec.graphql.org/October2021/#sec-Handling-Field-Errors

Here's an example that better shows the behaviour of failures in nested fields: https://play.strawberry.rocks/?gist=b1ca6f1c33325331a76e36fb0d80ebfa
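For reference, the spec's propagation rule can be sketched in a few lines of plain Python (a toy model of a flat selection set, not graphql-core's actual implementation): a field error in a non-null position bubbles up to the nearest nullable ancestor – here, `data` itself – while a nullable field simply becomes `null`.

```python
class FieldError(Exception):
    """Stand-in for a resolver raising an uncaught exception."""

def execute(fields):
    """Toy model of spec section 'Handling Field Errors'.

    `fields` maps a field name to (resolved value, is_non_null), in
    resolution order. FieldError values represent failed resolvers.
    """
    data, errors = {}, []
    for name, (value, non_null) in fields.items():
        if isinstance(value, FieldError):
            errors.append({"message": str(value), "path": [name]})
            if non_null:
                # Non-null position: the null propagates to the parent –
                # at the root, the whole `data` becomes null.
                return {"data": None, "errors": errors}
            value = None  # nullable position: swallowed as null
        data[name] = value
    return {"data": data, "errors": errors}

# Mirrors the repro: non-null `fail` nulls the whole response.
execute({"fail": (FieldError("'fail' resolver has failed"), True), "wait": ("foo", False)})
```

Making `fail` nullable instead yields `{"fail": None, "wait": "foo"}` with the error recorded alongside. The same toy model also shows the ordering effect discussed below: once a non-null failure terminates execution, errors from later nullable failures are never recorded.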

kkom commented 7 months ago

Ah, I see it now – thanks! Yeah, the short-circuiting behaviour makes perfect sense :)

I tested it in anger in a lot of configurations and every behaviour agreed with the GraphQL spec.

PS: I only found some small non-determinism in the errors field. The speed with which coercible and incoercible errors occur affects the contents of errors. Say you have two fields that fail – one nullable and one not. If the nullable one fails first – you'll see two error entries. But if the non-nullable one fails first – we won't include the error for the nullable one. I don't think that's a huge deal though and doubt the spec worries about it.

kkom commented 7 months ago

However, we still have the problem of on_execute and on_operation hooks completing before resolvers can finish. 😓

We use a custom Strawberry extension for context management. Basically the same as this – it opens and closes a SQLAlchemy session which can be used in the resolvers.

If a response is prematurely concluded because of an incoercible field error, the unfinished resolvers will run in the background. However, it's very likely they'll raise more errors – which we do see in our Datadog logs.

The errors are varied, but all of them are noise – they don't matter anymore and happen because the context for the field resolvers is just messed up by the lifecycle hooks...

Solutions I'm thinking of

  1. Strawberry could somehow instruct the Python event loop to abandon the no-longer-necessary tasks.
    • Not sure if this is possible.
    • But if it were possible, it would have resource usage benefits (though not latency benefits). Marginal though.
  2. Strawberry could be configured to wait for all resolvers to finish, even the unneeded ones. Only then would it finish the on_execute and on_operation lifecycle hooks and return the response.
    • If this were a configuration option, we'd definitely turn it on. We'd increase the latency of failed responses, but they're failed anyway ¯\_(ツ)_/¯ and the benefit of clean error logs seems more important to us now.
  3. Maybe there is a way to instruct the Datadog extension to ignore errors that happen in orphaned resolver execution. Not sure if this is easy or not.
    • Even if possible at one point, I feel like it could end up being an endless game of Whac-A-Mole...

What do you think? Any ideas appreciated! 🙏

kkom commented 7 months ago

I guess there is theoretically a 4th option: the lifecycle hooks could also wait for the unneeded resolvers to finish.

I wonder, though, whether the HTTP response concluding before the on_execute / on_operation lifecycle hooks finish would cause even more problems...

kkom commented 7 months ago

Oh, I think there is a decent variation on option 4!

What if it was possible to configure a schema extension to wait for all resolvers to finish before concluding? Then we could selectively apply it to the extension where it matters (e.g. the one that manages our DB session).

Not sure how difficult it would be to implement and how much complexity it would add to the code, but I think it would solve all our problems. 🤞
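A minimal sketch of the mechanism such an extension would need, using only `asyncio` (the `ResolverTracker` name and its wiring into `resolve` / `on_execute` are hypothetical – this is not an existing Strawberry API):

```python
import asyncio

class ResolverTracker:
    """Counts in-flight resolver coroutines so a teardown hook can wait
    for all of them – even ones whose result the response no longer needs."""

    def __init__(self) -> None:
        self._in_flight = 0
        self._all_done = asyncio.Event()
        self._all_done.set()  # nothing in flight yet

    async def track(self, coro):
        self._in_flight += 1
        self._all_done.clear()
        try:
            return await coro
        finally:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._all_done.set()

    async def wait_all(self) -> None:
        await self._all_done.wait()

async def demo() -> str:
    tracker = ResolverTracker()

    async def slow_resolver():
        await asyncio.sleep(0.05)
        return "foo"

    task = asyncio.ensure_future(tracker.track(slow_resolver()))
    await asyncio.sleep(0)    # let the resolver start
    await tracker.wait_all()  # what the extension's teardown would do
    return task.result()

asyncio.run(demo())
```

In the extension, `resolve` would wrap `_next(root, info, *args, **kwargs)` in `tracker.track(...)`, and the code after `yield` in `on_execute` would `await tracker.wait_all()` before tearing down the DB session.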

patrick91 commented 7 months ago

What if it was possible to configure a schema extension to wait for all resolvers to finish before concluding? Then we could selectively apply it to the extension where it matters (e.g. the one that manages our DB session).

Is there any particular reason why you're doing this on an extension and not using your web framework functionality?

In any case I think we could introduce a hook for when the request is done 😊 @nrbnlulu what do you think?

nrbnlulu commented 7 months ago

What an endeavour @kkom :upside_down_face:!

So what I understand from this issue:

When a NonNull field fails, on_execute | on_operation exits BEFORE all the other resolvers are gathered. This can be problematic for life-cycle resources because they might have been cleaned up by the exit hook.

I think this issue does not relate to extensions but to graphql-core here – they don't cancel tasks if the execution is terminated – so I suggest opening an issue there.


kkom commented 7 months ago

Is there any particular reason why you're doing this on an extension and not using your web framework functionality?

That's a very good question! The context we prepare is highly dependent on whether the GraphQL operation is a query or a mutation.

In queries we instantiate a special DB client optimised for reads – without the overhead of explicit transactions and leveraging separate sessions for each SQL query (to support concurrent reads). In writes we instantiate a DB client that works off a single transaction for the entire mutation.

```python
from collections.abc import AsyncGenerator
from copy import copy
from typing import cast, override

from strawberry.extensions import SchemaExtension
from strawberry.types.graphql import OperationType

from backend.api.context import get_read_context, get_write_context
from backend.domain.context import ReadContext, RequestContext, WriteContext


class ContextManagerExtension(SchemaExtension):
    """
    Depending on whether it's a mutation or query, replace appropriate keys in the context.
    At the end, go back to the original context injected by FastAPI.

    Mutating context in this way is suggested in Strawberry docs, see:
    https://strawberry.rocks/docs/guides/custom-extensions#execution-context
    """

    # We are managing the context in the more narrow `on_execute` hook (which wraps just the
    # execution of the query/mutation, after parsing & validation have completed), instead of the
    # overarching `on_operation` hook, because we need the knowledge whether the operation is a
    # query or a mutation (so that we can prepare a "read" or "write" context).
    @override
    async def on_execute(self) -> AsyncGenerator[None, None]:
        # Initially, we have just RequestContext injected by FastAPI
        context = cast(RequestContext, self.execution_context.context)

        def mutate_context(new_context: RequestContext | ReadContext | WriteContext) -> None:
            # Clear existing keys
            keys = list(context.keys())
            for k in keys:
                del context[k]

            # Replace keys with new values
            for k, v in new_context.items():
                context[k] = v

        request_context = copy(context)

        # `self.execution_context.operation_type` relies on the parsing step having already run.
        # This is why this extension uses the `on_execute` lifecycle hook.
        match self.execution_context.operation_type:
            case OperationType.MUTATION:
                async with get_write_context(context) as write_context:
                    mutate_context(write_context)
                    try:
                        yield
                    finally:
                        mutate_context(request_context)
            case OperationType.QUERY:
                read_context = get_read_context(context)
                mutate_context(read_context)
                try:
                    yield
                finally:
                    mutate_context(request_context)
            case OperationType.SUBSCRIPTION:
                raise NotImplementedError()
```

```python
@asynccontextmanager
async def get_write_context(
    request_context: Annotated[RequestContext, Depends(get_request_context)],
) -> AsyncGenerator[WriteContext, None]:
    async with request_context["session_maker"]() as session:
        db_client = ReadWriteDbClient(session)

        actions = build_actions(
            db_client,
            request_context["authentication_context"],
            request_context["clients"]["event_logger"],
        )
        adhoc_data_loaders = build_adhoc_data_loaders(db_client)
        queries = build_queries(
            db_client,
            request_context["authentication_context"],
            request_context["clients"]["blob_store"],
            actions,
        )
        services = build_services(
            db_client,
            request_context["authentication_context"],
            request_context["clients"],
        )

        yield {
            "actions": actions,
            "authentication_context": request_context["authentication_context"],
            "clients": request_context["clients"],
            "adhoc_data_loaders": adhoc_data_loaders,
            "queries": queries,
            "services": services,
            "db_client": db_client,
        }

        await session.commit()


def get_read_context(
    request_context: Annotated[RequestContext, Depends(get_request_context)],
) -> ReadContext:
    db_client = ReadOnlyDbClient(request_context["read_only_session_maker"])

    actions = build_actions(
        db_client,
        request_context["authentication_context"],
        request_context["clients"]["event_logger"],
    )
    adhoc_data_loaders = build_adhoc_data_loaders(db_client)
    queries = build_queries(
        db_client,
        request_context["authentication_context"],
        request_context["clients"]["blob_store"],
        actions,
    )

    return {
        "actions": actions,
        "authentication_context": request_context["authentication_context"],
        "clients": request_context["clients"],
        "adhoc_data_loaders": adhoc_data_loaders,
        "queries": queries,
        "db_client": db_client,
    }
```

That's why we use the on_execute hook – to leverage self.execution_context.operation_type already prepared by Strawberry. It is quite elaborate, but the performance benefits of using different DB access patterns are huge.

But thanks for the suggestion – maybe there is a decent way to leverage the web framework for it! We could use a heuristic on the body of the request to distinguish queries from mutations before Strawberry kicks in. Or always construct both kinds of contexts (though this may be inefficient).
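A crude version of that heuristic might look like this (`guess_operation_type` is a hypothetical helper – a robust implementation would parse the document, e.g. with graphql-core's `parse`, and honour `operationName` for multi-operation documents):

```python
import json
import re

# Look at the first top-level keyword of the GraphQL document.
_OPERATION_RE = re.compile(r"^\s*(query|mutation|subscription)\b", re.IGNORECASE)

def guess_operation_type(raw_body: bytes) -> str:
    document = json.loads(raw_body)["query"]
    match = _OPERATION_RE.match(document)
    # Per the GraphQL grammar, a document starting with a bare selection
    # set ("{ ... }") is shorthand for a query.
    return match.group(1).lower() if match else "query"

guess_operation_type(b'{"query": "mutation { doIt }"}')  # "mutation"
guess_operation_type(b'{"query": "{ me { id } }"}')      # "query"
```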

I think this issue does not relate to extensions but for graphql-core here They don't cancel tasks if the execution is terminated, So I suggest to open an issue there.

Thanks so much for the pointer! I'll try to report it there – feels like fixing this should be valuable whichever way we go! :)

kkom commented 7 months ago

Ok, reported this to graphql-core: https://github.com/graphql-python/graphql-core/issues/217 - thanks for the suggestion @nrbnlulu !

@patrick91 – I'll play with using FastAPI for it, but I've realised that there may be a problem. If these tasks are somehow abandoned on the event loop, I'm not sure if even FastAPI would be able to wrap around them. Remember that the HTTP response is returned by that time...

kkom commented 7 months ago

A little update from graphql-core - it was acknowledged as something the maintainer did want to address as well. :) https://github.com/graphql-python/graphql-core/issues/217#issuecomment-2015919484

So the issue is definitely real – but it's not yet clear when it will be addressed.