Open winstxnhdw opened 6 days ago
Well I happen to implement that 😃
About until the DataFrame is created
- is the bottleneck here the creation of LazyFrames?
Because LazyFrames are actually always created eagerly from Python (in init, it creates a regular DataFrame by iterating over all objects, then calls .lazy()).
And later collect_async does only join+with_columns part asynchronously.
To support something like this, LazyFrames would have to support "scanning" Python iterables/generators, etc., which would actually be a great feature.
It does use the polars threadpool here, which runs collect in the background, and then only resolves the future with the GIL acquired. It should work with scan_parquet, scan_csv, scan_*, because they're executed on the Rust side in .collect and not eagerly.
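One way to check this claim empirically is to run a heartbeat coroutine alongside the awaited work: if the loop is blocked, the gaps between wake-ups balloon. This is just a sketch — the `asyncio.sleep(0.3)` stands in for `await lazyframe.collect_async()`:

```python
import asyncio
import time

async def heartbeat(interval: float = 0.05, beats: int = 5) -> list[float]:
    """Record the gap between wake-ups; a blocked loop inflates the gaps."""
    gaps, last = [], time.perf_counter()
    for _ in range(beats):
        await asyncio.sleep(interval)
        now = time.perf_counter()
        gaps.append(now - last)
        last = now
    return gaps

async def main() -> list[float]:
    task = asyncio.create_task(heartbeat())
    # Replace this sleep with `await lazyframe.collect_async()` to test it.
    await asyncio.sleep(0.3)
    return await task

gaps = asyncio.run(main())
print(gaps)  # gaps far above 0.05s would indicate a blocked loop
```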
About until the DataFrame is created - is the bottleneck here the creation of LazyFrames?
Hmm.. Is there a way I can verify this? The LazyFrame only has ~50 rows but it is blocking the event loop for more than 10 seconds, which I think is an unusually long time for creating such a trivial DataFrame.
It does use polars threadpool here that runs collect on background, and then only resolves the future with GIL acquired.
I see that you are spawning the pool in Rust. I am not sure about PyO3, but in my experience, spawning a separate background thread in a C++ extension would still block the event loop, as Python still has to constantly poll the job for completion (which is blocking unless you spawn a thread from Python and wrap the future).
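For reference, the Python-side pattern described here — run the blocking call in a thread spawned from Python and wrap the result in a future — looks roughly like this; `blocking_collect` is a hypothetical stand-in for any long native call:

```python
import asyncio
import time

def blocking_collect() -> int:
    # Stand-in for a long computation that releases the GIL in native code.
    time.sleep(0.1)
    return 42

async def main() -> int:
    # asyncio.to_thread runs the callable in a worker thread and hands back
    # an awaitable, so the event loop keeps servicing other tasks meanwhile.
    return await asyncio.to_thread(blocking_collect)

result = asyncio.run(main())
```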
Well it's a different story then if it's only 50 rows. Care to provide data to reproduce?
Theoretically it shouldn't block the event loop, because it's only resolving a future by calling this callback here at the end of execution of .collect on the Rust side, and does not hold the GIL during the actual collect.
I will get the data for you tomorrow.
not hold GIL during actual collect
Not holding the GIL is one thing, but it still needs to run in a separate thread for it to not block the event loop. If resolving the future involves some kind of polling of the results, the event loop will still be blocked.
Internally it may poll on the results, but if it were to poll by blocking the whole loop, that would defeat the whole purpose of a Future.
Hmm.. Is there a way I can verify this? (LazyFrame creation)
You can try to create the LazyFrames on separate lines and time only that (or even just print-debug). Because init is sync.
You are right. LazyFrame creation is indeed responsible for blocking the event loop. I took your advice and tried using scan_ndjson, but it was still blocking the event loop.
```python
from typing import IO, Awaitable

from polars import DataFrame, max_horizontal, min_horizontal, scan_ndjson


def get_intersection(transcription: IO[bytes], diarisation: IO[bytes]) -> Awaitable[DataFrame]:
    intersection_expression = min_horizontal(
        'end_time',
        'end_time_right',
    ) - max_horizontal(
        'start_time',
        'start_time_right',
    )

    return (
        scan_ndjson(transcription)
        .join(scan_ndjson(diarisation), how='cross')
        .with_columns(intersection_expression.alias('intersection'))
        .collect_async()
    )
```
When is the blocking happening? The output of the function is an awaitable, so it seems like we need more info.

Are you doing something simple like this?

```python
df_awaitable = get_intersection(tr, di)
df = await df_awaitable
```

Or more like this?

```python
asyncio.get_running_loop().run_until_complete(df_awaitable)
```

Or something else?
I am doing the former. I expect awaitables to completely use the event loop. It defeats the purpose of using collect_async if I were to also run it in a thread pool. I could just use collect then.
You still have the issue that you're scanning an IO object, so even though the polars binary would be non-blocking, it's getting its data from Python, which is blocking. Try either making the inputs eager DataFrames or files.

In the case where the input is LazyFrame(transcription: dict), you're going to have to wait for polars to parse through all its dicts to send to Arrow memory. In the case where the input is scan_ndjson(transcription: IO), every time the scan looks at data it needs Python, which (I think) would be blocking.
So maybe try one of these:

```python
def get_intersection(transcription: pl.DataFrame, diarisation: pl.DataFrame) -> Awaitable[DataFrame]:
    transcription = transcription.lazy()
    diarisation = diarisation.lazy()
```

or

```python
def get_intersection(transcription: Path, diarisation: Path) -> Awaitable[DataFrame]:
    transcription = pl.scan_ndjson(transcription)
    diarisation = pl.scan_ndjson(diarisation)
```
The problem is that polars should be non-blocking end-to-end. This should be possible, since reading from IO is async/await's strength. For now, collect_async is redundant in practice, and running collect in a thread pool and wrapping the futures would have the same effect. I suggest we add this to the documentation to reduce the amount of time wasted on this issue.
Issue description

In polars, most of the CPU-bound activities happen in Rust, where the Python GIL is dropped. Ideally, collect_async should take advantage of this for polars to maximise CPU usage. As of right now, collect_async will block the main event loop and stop your single-worker server from handling any more requests until the DataFrame is created.

EDIT: DataFrame creation is blocking the event loop. We can fix this by running it in a separate thread.
Expected behavior

collect_async should not block the main event loop, and should act as an actual async function that allows Python to perform context switching and process other tasks that drop the GIL concurrently.

Installed versions