Open jnak opened 4 years ago
My idea would be to port https://github.com/graphql/dataloader to Python, similarly to how GraphQL.js has been ported - it seems Syrus already started to work on this (https://github.com/syrusakbary/aiodataloader). Dataloader has seen a lot of new development recently, so this could be worthwhile.
Nice! I didn't realize Syrus had made another version of Dataloader specifically for asyncio. It seems he's indeed using loop.call_soon to schedule batching on the next loop cycle.
Re the recent developments of Dataloader in JS: the task scheduling in Dataloader v2 uses the same underlying mechanism as in v1. Since Syrus' repo is a straight port of Dataloader JS v1, let's use it as the basis for this conversation.
I see two issues with using it in multi-threaded mode:
1. It is not thread-safe, because the Dataloader queue is not thread-scoped. Scheduled coroutines could be moved to a different thread, leading to all sorts of bugs and security issues. We could fix that by subclassing threading.local (instead of object).
2. It assumes it runs inside a single global event loop. We can remove the assumption that there is a single loop by not caching the value of get_event_loop(), but it will still assume it runs inside an event loop.
Unless we wrap each graphql call in its own event loop (see my snippet), I don't see how we can enable batching in multi-threaded mode. Do you have a sense of whether that's a crazy idea?
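To make the first point concrete, here is a minimal sketch (hypothetical, not the actual aiodataloader code) of the thread-scoping idea: by hanging the pending-keys queue off a threading.local subclass, each thread sees its own queue, so a batch can never be flushed on the wrong thread. The names ThreadLocalQueue and enqueue are illustrative only.

```python
import threading


class ThreadLocalQueue(threading.local):
    """Each thread that touches this object sees its own 'queue' list.

    threading.local calls __init__ once per thread, on first access.
    """

    def __init__(self):
        self.queue = []


_state = ThreadLocalQueue()


def enqueue(key):
    # Keys enqueued on one thread are invisible to every other thread,
    # which is the property a thread-safe Dataloader queue needs.
    _state.queue.append(key)
    return list(_state.queue)
```

A real loader would batch and flush this queue via loop.call_soon; the sketch only shows the isolation property.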
Currently I don't have enough time to look deeper into this and make a qualified assessment. Maybe you should examine what they are doing in other languages, e.g. https://github.com/graphql/dataloader/issues/30.
Sounds good. I'll look into it when I start working on upgrading to v3 (probably not before Q2/Q3 2020). In the meantime, let's keep this issue open in case someone needs this earlier than I do, so we can discuss it here.
If someone is interested in working on this, refer to https://github.com/syrusakbary/promise/pull/81 to learn more about Dataloader and thread-safety in Python.
Looks like it might be possible to use asgiref's sync_to_async and aiodataloader to do this now.
```python
import asyncio
import os
import time

os.environ['ASGI_THREADS'] = '16'

from aiodataloader import DataLoader
from asgiref.sync import sync_to_async


class TestLoader(DataLoader):
    @sync_to_async
    def batch_load_fn(self, keys):
        t = time.time()
        return [f'{key}-{t}' for key in keys]


async def main():
    loader = TestLoader()
    one1 = loader.load('one')
    one2 = loader.load('one')
    two = loader.load('two')
    print(await one1, await one2)
    print(await two)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    loop.close()
```
@silas I'm not sure why you need asgiref's sync_to_async here, since aiodataloader.DataLoader.batch_load_fn is already async. Also, aiodataloader.DataLoader is not thread-safe, so you'll run into concurrency issues if you call / await load from multiple threads.
If you want to make this work in a multi-threaded environment, I recommend making aiodataloader.DataLoader thread-safe by using threading.local (see https://github.com/syrusakbary/promise/pull/81 for an example). Then you should be able to safely wrap the graphql call with my run_batched_query snippet above.
@jhgg batch_load_fn is a sync function in this case, which aiodataloader.DataLoader doesn't support (as far as I can tell).
You can't call load from sync code (because it's an async function). You could use asgiref's async_to_sync, which seems to hoist the call back up to the async thread (I haven't confirmed this), and which should make it safe.
That said, I never call load from batch_load_fn or any of its dependents, so it's fine for my use case. Probably not a general solution, but it seems it's possible to make this work now.
Why isn't the promise library supported anymore, or an equivalent? I want to use the dataloader pattern in a sync application. It's not even multi-threaded. All I want is to return an object from my resolver that will be resolved in a second tick, after the initial pass through execute. The promise library handled this use case nicely.
@miracle2k Async/await is the modern equivalent of promises. It's the official solution to this problem, supported by both Python syntax and the standard library; promises are not part of the standard library.
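As a hypothetical illustration of that equivalence (not from this thread): an asyncio Future plays roughly the role of a Promise, and awaiting it replaces a .then(...) callback.

```python
import asyncio


# Promise-style chaining, e.g. promise.then(lambda v: v + 1),
# maps onto a plain await inside a coroutine.
async def main():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()  # roughly the equivalent of a Promise
    fut.set_result(41)
    value = await fut           # roughly the equivalent of .then(...)
    return value + 1


result = asyncio.run(main())
print(result)  # 42
```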
@Cito Ok, but I don't want to use async in this code base, but I do want to use the data loader. I think this was a reasonable thing to have been supported previously.
So I am aware that I can do the following, and indeed this is what I am doing now and it works: use the aiodataloader library, implement the async def batch_load_fn function, but ignore the fact that it is async and simply do a blocking call.

```python
import asyncio

loop = asyncio.new_event_loop()


async def run_async():
    return await execute(
        schema,
        document,
        variable_values=params.variables,
        operation_name=params.operation_name,
        **kwargs,
    )


# (this snippet runs inside the request-handling function)
return loop.run_until_complete(run_async())
```
Now we are using asyncio as the "promise execution engine" for dataloader only, and otherwise the code is synchronous. I don't know how high the performance penalty is, but I would guess it's minimal / similar to what it was with the promise library. It feels pretty gross, though. If this is the recommended way to do the dataloader pattern in sync mode, ok.
Note for anyone attempting this: in my particular case this required changes to aiodataloader so it doesn't require an event loop to exist at __init__ time, but this can be avoided by simply moving the asyncio event loop higher up in the call stack.
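For concreteness, the "ignore that it is async" idea reduces to a minimal stand-alone sketch (no aiodataloader dependency; batch_load_fn and the fake blocking call are illustrative): a function declared async whose body is entirely blocking, driven by a private event loop.

```python
import asyncio
import time


# Declared async only to satisfy an async interface; the body is
# entirely blocking, so the event loop acts purely as the
# "promise execution engine".
async def batch_load_fn(keys):
    time.sleep(0.01)  # stands in for a blocking database call
    return [f"user:{k}" for k in keys]


loop = asyncio.new_event_loop()
users = loop.run_until_complete(batch_load_fn([1, 2]))
loop.close()
print(users)  # ['user:1', 'user:2']
```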
Hi,
I've been thinking a bit about how we could implement the Dataloader pattern in v3 while still running in multi-threaded mode. Since v3 does not support Syrus's Promise library, we need to come up with a story for batching in async mode, as well as in multi-threaded environments. There are many libraries that do not support asyncio, and there are many cases where it does not make sense to go fully async.
As far as I understand, the only way to batch resolver calls from a single frame of execution would be to use loop.call_soon. But since asyncio is not thread-safe, that means we would need to run a separate event loop in each worker thread. We would need to wrap the graphql call with something like this:
Is that completely crazy? If yes, do you see a less hacky way? I'm not very familiar with asyncio, so I would love to get feedback.
Cheers
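The snippet referenced above was lost in this copy of the thread. As a hedged sketch of the per-thread-event-loop idea being described (run_batched_query and its shape are assumptions, not the author's actual code): each worker thread lazily creates and reuses its own private event loop, so loop.call_soon batching never crosses threads.

```python
import asyncio
import threading

# Hypothetical reconstruction: every worker thread owns a private
# event loop, stored in thread-local state.
_local = threading.local()


def run_batched_query(coro_factory):
    """Execute an async call (e.g. graphql(...)) in this thread's own loop."""
    loop = getattr(_local, "loop", None)
    if loop is None:
        loop = asyncio.new_event_loop()
        _local.loop = loop
    return loop.run_until_complete(coro_factory())
```

A WSGI worker would then call something like run_batched_query(lambda: graphql(schema, query)) instead of invoking the async graphql call directly.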