graphql-python / graphene-sqlalchemy

Graphene SQLAlchemy integration
http://docs.graphene-python.org/projects/sqlalchemy/en/latest/
MIT License

Session is None in batch_load_fn #370

Closed msmedes closed 1 year ago

msmedes commented 1 year ago

Versions:

graphene==2.1.9
graphene-sqlalchemy==2.3.0
sqlalchemy==1.3.18
flask-sqlalchemy==2.4.4
flask==1.1.2

We've been using batching ever since we created our GraphQL service, but recently started randomly getting the error 'NoneType' object has no attribute 'dirty' on this line: https://github.com/graphql-python/graphene-sqlalchemy/blob/32d0d184c74386886b8e67763c6b7db836b323ad/graphene_sqlalchemy/batching.py#L67. I also noticed the code comment at the top of the block: https://github.com/graphql-python/graphene-sqlalchemy/blob/32d0d184c74386886b8e67763c6b7db836b323ad/graphene_sqlalchemy/batching.py#L61. What conditions would cause this issue to arise? Is SQLAlchemy just dropping a session mid-request? We are passing the db.session object from flask-sqlalchemy to the Flask GraphQL view, so I'm guessing the issue has to do with session scoping?

Thanks!

erikwrede commented 1 year ago

Without looking at your code in detail or having steps to reproduce, it's hard to provide guidance here. Your assumption about session scoping sounds promising to investigate. Generally, though, scoped sessions should never just be None, and a session shouldn't be dropped mid-request. Since we use Session.object_session to get the session from one of the parent objects, is there any chance you recently started returning objects that were never added to a session, or objects from an old, expired (e.g. cached) session, from your resolvers? Mutations could also be the cause: returning an object without adding it to a session also makes object_session return None.
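A minimal sketch of that behavior (assuming SQLAlchemy 1.3.x, as in this thread; the `User` model is hypothetical): `Session.object_session` returns the session an instance is attached to, and None for instances that were never added.

```python
# Session.object_session returns the owning session, or None for
# instances that were never added to (or were removed from) a session.
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import Session, sessionmaker

Base = declarative_base()

class User(Base):  # hypothetical model, just for illustration
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

user = User()
assert Session.object_session(user) is None     # never added -> None

session.add(user)
assert Session.object_session(user) is session  # attached now
```

This is exactly the lookup the batch resolvers perform, which is why a resolver returning an unattached object surfaces as a None session inside batching.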

However, judging from the fact that this issue started appearing randomly in your stack without any update to graphene-sqlalchemy, I can only assume it is a side effect of a recent change on your end. Depending on the frequency of the error, going back through your deployment history might help here. Please keep us posted!

msmedes commented 1 year ago

@erikwrede Thanks for the response. The one change I neglected to mention is that we started using promise-based DataLoaders for relationships in custom resolvers. Could a mix of the graphene-sqlalchemy dataloaders and our manually written ones cause this? Say, if graphene-sqlalchemy were using aiodataloader (which I think may be the dataloader used by our version of graphene-sqlalchemy) and we were using the promise-based dataloaders, would that cause issues with SQLAlchemy?

erikwrede commented 1 year ago

Since you're on 2.x, I doubt you're using aiodataloader in graphene-sqlalchemy. Async support was first added in graphene 3.0, and the source lines you quoted above are from the 3.0 beta. The 2.x dataloaders are promise-based.

Maybe your custom dataloaders return None at some point, or create objects not bound to a session? Is there a specific pattern of queries/mutations where this behavior occurs, versus other queries that execute nominally?

msmedes commented 1 year ago

I was finally able to get some time to look into this and managed to isolate the part of the query that causes the issue. The fields themselves don't matter much, but the structure does:

{
   ...rest of query
   foo {
      relationshipFromACustomResolverThatUsesACustomDataLoader {
         attributeA # resolves fine
         attributeB # resolves fine
         relationshipThatUsesTheDefaultBatchResolver { # blows up
            ...someProperties
         }
      }
   }
}

This query doesn't always fail. It will occasionally work once or twice and then start throwing the 'NoneType' object has no attribute 'dirty' error, and in the traceback I can see the promise queue looks something like this:

[
<Promise at 0x7fe7ff8028e0 rejected with AttributeError("'NoneType' object has no attribute 'dirty'")>, 
<Promise at 0x7fe7e4c84ee0 fulfilled with attributeA>, 
<Promise at 0x7fe7ff802730 fulfilled with attributeB>
]

So in the example query above, attributes A and B resolve just fine, and then it blows up when it's time to grab the relationship object.

Also worth noting: it only blows up in prod, where we use gunicorn; we cannot repro locally, where we just use the standard Flask dev server. Am I missing something here? Is calling a relationship from inside an object returned by a dataloader a no-no? I would think not, since that pattern is used everywhere else, but I'm out of ideas.

erikwrede commented 1 year ago

To me this hints at an issue with session scoping, as mentioned above. Scoped sessions are thread-bound. Depending on your implementation, you might close a session at the end of a request, causing a new session to be opened automatically for that thread scope on the next request. Session.close calls Session.expunge_all, removing all objects from the session.

Long context, but this is the crucial step: once an object is expunged, it no longer holds a reference to its session, so Session.object_session returns None in the dataloaders.

So how can this happen? It seems like some of your resolvers return objects from a closed session.
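That expunge-on-close behavior can be demonstrated in a few lines (a sketch assuming SQLAlchemy 1.3.x semantics; the `Item` model is hypothetical):

```python
# Session.close() calls expunge_all(), so objects loaded through that
# session become detached and Session.object_session() returns None.
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import Session, sessionmaker

Base = declarative_base()

class Item(Base):  # hypothetical model, just for illustration
    __tablename__ = "items"
    id = Column(Integer, primary_key=True)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

item = Item()
session.add(item)
session.commit()
assert Session.object_session(item) is session  # attached

session.close()  # expunge_all(): every instance loses its session reference
assert Session.object_session(item) is None     # detached -> None
```

Any resolver that hands one of these detached objects to the batch resolver reproduces the 'NoneType' object has no attribute 'dirty' error.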

But the normal dataloaders are fine and only randomly fail when you're on Gunicorn in prod? Let's dissect the differences between those environments, then. On Gunicorn in prod/staging you likely have multiple worker threads running at the same time, and we can generally expect more requests. The problem doesn't appear locally, where you likely have a single busy worker at a time during testing, or at least less load.

With this information, and given my limited knowledge of your codebase, my main suspicion is that your custom dataloaders are instantiated globally (at module level), so a single instance, and its cache, is shared across requests and worker threads. A cached object can then come from another request's session, which has since been closed.

If that case sounds reasonable, can you try explicitly creating one dataloader per loader type per request, by injecting the instances into the request context and only using those instances throughout the request?
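A sketch of that suggestion (the names `RequestContext` and `UserLoader` are hypothetical, not graphene-sqlalchemy API): keep one loader instance per loader type on a per-request context object, so no loader ever outlives the request that created it.

```python
class RequestContext:
    """Hypothetical per-request context; graphene passes context to resolvers."""
    def __init__(self):
        self._loaders = {}

    def get_loader(self, loader_cls):
        # Lazily create exactly one instance of each loader type per request.
        if loader_cls not in self._loaders:
            self._loaders[loader_cls] = loader_cls()
        return self._loaders[loader_cls]

class UserLoader:
    """Stand-in for a promise-based DataLoader subclass."""

request_1 = RequestContext()
assert request_1.get_loader(UserLoader) is request_1.get_loader(UserLoader)

request_2 = RequestContext()  # a new request gets a fresh loader instance
assert request_2.get_loader(UserLoader) is not request_1.get_loader(UserLoader)
```

Resolvers would then call something like `info.context.get_loader(UserLoader).load(key)` instead of reaching for a module-level loader.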

Maybe some of my ideas are helpful. Please understand that there is only limited advice I can give without a reproduction or knowledge of your code. Keep me posted 😊

msmedes commented 1 year ago

I think the global dataloaders pattern is what bit us here. From the graphene docs:

DataLoader will coalesce all individual loads which occur within a single frame of execution
(executed once the wrapping event loop is resolved) and then call your batch function with all requested keys.

I mistakenly interpreted this as a single frame of execution being scoped to a request. Guess not!
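The failure mode can be sketched with a minimal stand-in (hypothetical, not the real promise DataLoader API): a module-level loader's per-key cache outlives the request, so a later request is handed an object that was loaded, and since detached, under an earlier request's session.

```python
class CachingLoader:
    """Hypothetical stand-in for a caching dataloader; batching omitted."""
    def __init__(self):
        self._cache = {}

    def load(self, key):
        # Only the per-key cache matters for this failure mode.
        if key not in self._cache:
            self._cache[key] = {"id": key}  # stand-in for a DB row
        return self._cache[key]

LOADER = CachingLoader()  # module-level: shared by every request and thread

row_request_1 = LOADER.load(42)  # loaded during request 1's session
row_request_2 = LOADER.load(42)  # request 2 gets the cached, detached object
assert row_request_1 is row_request_2
```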

I threw together a PR last night to do as you suggested and inject a new dataloader per type per request (I'm actually lazily constructing them and attaching them to a graphene.Context as needed, rather than creating them all for every single request). Basically everyone is out for the holidays right now, so we've paused releases; I won't know whether it solves the issue until next week, but I will be sure to report back here.

Thanks for all your help!

erikwrede commented 1 year ago

@msmedes Yes, that dataloader doc could use some improvement. Happy to take any PRs if we get this working 🙂 Looking forward to hearing how it goes. Honestly my suggestion was a bit of a moonshot, so I'm curious to see if this fixes anything. But a bit of detective work is interesting sometimes and potentially helps with catching and avoiding future design issues. Hear from you in the new year 😊

msmedes commented 1 year ago

@erikwrede I wanted to wait a little time just to make sure but that appears to have done it. I really appreciate all the help, thanks!

erikwrede commented 1 year ago

Glad I could help, feel free to open another issue if anything else comes up :)

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related topics referencing this issue.