mike-grayhat opened this issue 5 years ago
Curious, would using the pyarrow_serialize=False argument to make_reader make a difference? We use pickle serialization for the IPC by default, since some types are not supported by pyarrow serialization; however, pyarrow serialization is much more efficient.
Down the road, we plan to use shared memory for communication between the process pool workers and the main process, which I hope will make a difference.
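For reference, a minimal sketch of passing that flag (the dataset URL and the other argument values are placeholders, not from this discussion):

```python
# Sketch only: the dataset URL and argument values are placeholders.
# pyarrow_serialize matters for the process pool, where rows are shipped
# between worker processes and the main process over IPC: False (the default)
# uses pickle, True switches to pyarrow serialization, which is faster but
# does not support every field type.
from petastorm import make_reader

with make_reader('hdfs:///path/to/dataset',
                 reader_pool_type='process',
                 workers_count=10,
                 pyarrow_serialize=False) as reader:
    for row in reader:
        pass  # consume rows
```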
Maybe I didn't explain it well enough. Let's say I have some task that I'm running in parallel using dask distributed multiprocessing, and this task does not rely on petastorm. Let's also say I'm doing it in a Jupyter notebook and saving the results in petastorm format later with another downstream task. The problem is that if I just put import petastorm at the top of the notebook, the first distributed task suffers performance degradation. So I'm not even using or calling any petastorm methods, just importing it, and I'm getting performance-related problems.
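A hypothetical minimal repro of that scenario might look like this (the workload, worker count, and numbers are illustrative, not taken from the report):

```python
# Hypothetical repro sketch: the dask task below does not touch petastorm;
# the only petastorm involvement is the import at the top.
import petastorm  # merely importing it is reported to trigger the slowdown

from dask.distributed import Client


def cpu_bound_task(n):
    # Plain Python work with no petastorm involvement.
    return sum(i * i for i in range(n))


if __name__ == '__main__':
    client = Client(processes=True, n_workers=10)  # dask distributed multiprocessing
    futures = client.map(cpu_bound_task, [5_000_000 + i for i in range(10)])
    results = client.gather(futures)
    print(results)
```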
I also found this warning in the Jupyter console output; maybe it's related to the problem:
.../anaconda3/lib/python3.7/runpy.py:125: RuntimeWarning: 'petastorm.workers_pool.exec_in_new_process' found in sys.modules after import of package 'petastorm.workers_pool', but prior to execution of 'petastorm.workers_pool.exec_in_new_process'; this may result in unpredictable behaviour
import petastorm triggers a number of heavy imports, such as pyarrow and pyspark. I'll investigate whether we can make at least some of them load more lazily.
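For illustration only, one possible way to defer those imports (a sketch using PEP 562 module-level __getattr__, not petastorm's actual code) would be to resolve the heavy attributes on first access instead of at import time:

```python
# Hypothetical lazy __init__.py sketch using PEP 562 (Python 3.7+); this is
# not petastorm's implementation, only an illustration of lazier loading.
import importlib

_LAZY_ATTRS = {
    'make_reader': 'petastorm.reader',
    'make_batch_reader': 'petastorm.reader',
    'TransformSpec': 'petastorm.transform',
}


def __getattr__(name):
    # Import the heavy submodule only when the attribute is first accessed,
    # so a bare `import petastorm` stays cheap.
    if name in _LAZY_ATTRS:
        module = importlib.import_module(_LAZY_ATTRS[name])
        value = getattr(module, name)
        globals()[name] = value  # cache for subsequent lookups
        return value
    raise AttributeError(f"module 'petastorm' has no attribute {name!r}")
```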
I am curious whether you observe the same behavior if you comment out either or both of these lines in petastorm/__init__.py:
from petastorm.reader import make_reader, make_batch_reader # noqa: F401
from petastorm.transform import TransformSpec # noqa: F401
I don't know if you have time for it, but it would be interesting to drill down and find which of the transitive imports actually causes this slowdown.
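One hypothetical way to drill down (the candidate module names and the workload below are illustrative; `python -X importtime -c "import petastorm"` is another option for looking at raw import costs):

```python
# Hypothetical bisection sketch: run the same petastorm-free dask workload
# several times, each time importing a different candidate module first
# (and once with no extra import as a baseline), then compare wall times.
import importlib
import sys
import time

candidate = sys.argv[1] if len(sys.argv) > 1 else None  # e.g. 'pyarrow', 'pyspark'
if candidate:
    importlib.import_module(candidate)

from dask.distributed import Client


def cpu_bound_task(n):
    return sum(i * i for i in range(n))


if __name__ == '__main__':
    client = Client(processes=True, n_workers=10)
    start = time.time()
    client.gather(client.map(cpu_bound_task, [5_000_000 + i for i in range(10)]))
    print(f'candidate={candidate}: {time.time() - start:.2f}s')
```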
I have never worked with dask myself, but I will try to find some time to investigate, as this might affect other users/frameworks as well.
Yes, the petastorm import basically triggers a lot of internal configuration. I just used it as the simplest example; in theory one wouldn't expect such problems to appear even when importing, let's say, only the reader. I don't have much time for investigation at the moment, but I will try to help as soon as I get back to the related project.
It seems like the way petastorm initializes its multiprocessing pool on import causes significant trouble for dask distributed multiprocessing jobs. Without the petastorm import I have, let's say, 10 workers at 100% CPU load each, but if I import petastorm before running the task I get 10 workers at 10% CPU each. I assume it's quite a rare case, but this side effect is very hard to debug.