tchaton opened this issue 7 months ago
Hey Team,
I would appreciate an answer :) This is blocking me. I am strongly considering dropping Polars as a possible backend for PyTorch Lightning's parquet backend.
Best, T.C
I think at the very least you need to use multiprocessing.get_context("spawn").Process instead of Process, as mentioned in the guide from the previous issue.
(I don't have much knowledge on the topic, so I'm not sure if this will fix things.)
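As an illustration only (the worker function and parquet path below are hypothetical placeholders, not from the original report), the change would look roughly like this:

```python
import multiprocessing

import polars as pl


def worker(path: str) -> None:
    # Hypothetical worker: read a parquet file inside the child process.
    df = pl.read_parquet(path)
    print(df.shape)


if __name__ == "__main__":
    # Use a spawn context so the child starts a fresh interpreter instead of
    # forking the (possibly multithreaded) parent process.
    ctx = multiprocessing.get_context("spawn")
    p = ctx.Process(target=worker, args=("data.parquet",))
    p.start()
    p.join()
```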
Read this: https://docs.pola.rs/user-guide/misc/multiprocessing/
It is not something we can fix. Python multiprocessing is badly designed and assumes the current process doesn't hold any state in mutexes, which is a very, very unsafe assumption.
To quote the Python multiprocessing documentation:
The parent process uses [os.fork()](https://docs.python.org/3/library/os.html#os.fork) to fork the Python interpreter.
The child process, when it begins, is effectively identical to the parent process.
All resources of the parent are inherited by the child process.
Note that safely forking a multithreaded process is problematic.
In other words, use spawn.
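For completeness, a minimal sketch of switching the default start method to spawn for the whole program (assuming this is called once, under the main guard, before any Process or Pool is created):

```python
import multiprocessing

if __name__ == "__main__":
    # Replace the default "fork" start method with "spawn" globally.
    # Must be called at most once, before creating any processes or pools.
    multiprocessing.set_start_method("spawn")
```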
Hey @ritchie46, @stinodego, @alexander-beedie
Checks
Reproducible example
I am trying to distribute reading parquet files across workers, and it seems that Polars' loading time either keeps increasing or the process hangs.
This seems to be a common issue: img2dataset works around it by not lazily loading the data and instead re-generating the shards: https://github.com/rom1504/img2dataset/blob/main/img2dataset/reader.py#L189
Here is a reproducible script:
Log output
Here are the logs. As you can observe, the read time just keeps increasing dramatically.
For comparison, here is PyArrow: still not great, but much better.
Issue description
Partial lazy loading of parquet slices is a key component of distributed data processing across workers and machines.
Additionally, if I get the row count using Polars instead of PyArrow, it seems to hang. This might be a second bug.
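For context, here is a rough sketch of the access pattern described above, with a hypothetical shard path and naive slicing (this is not the original benchmark script):

```python
import multiprocessing

import polars as pl
import pyarrow.parquet as pq


def read_slice(path: str, offset: int, length: int) -> None:
    # Each worker lazily scans the parquet file and collects only its slice.
    df = pl.scan_parquet(path).slice(offset, length).collect()
    print(offset, df.shape)


if __name__ == "__main__":
    path = "shard.parquet"  # hypothetical shard
    # Row count via PyArrow metadata; obtaining the same count through Polars
    # is where the reported hang shows up.
    num_rows = pq.ParquetFile(path).metadata.num_rows
    chunk = num_rows // 4

    # Spawn context, per the advice in the comments above.
    ctx = multiprocessing.get_context("spawn")
    workers = [
        ctx.Process(target=read_slice, args=(path, i * chunk, chunk))
        for i in range(4)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```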
Expected behavior
Reading is fast and the read time stays constant.
Installed versions