mike-grayhat opened this issue 5 years ago
Curious, would using the pyarrow_serialize=False argument to make_reader make a difference? We use pickle serialization for the IPC by default, since some types are not supported by pyarrow serialization; however, pyarrow serialization is much more efficient.
Down the road, we plan to use shared memory for communication between the process pool workers and the main process, which I hope will make a difference.
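For reference, a minimal sketch of passing that flag (the dataset URL and the other argument values are placeholders, not from this discussion):

```python
# Sketch only: the dataset URL and argument values are placeholders.
# pyarrow_serialize matters for the process pool, where rows are shipped
# between worker processes and the main process over IPC: False (the default)
# uses pickle, True switches to pyarrow serialization, which is faster but
# does not support every field type.
from petastorm import make_reader

with make_reader('hdfs:///path/to/dataset',
                 reader_pool_type='process',
                 workers_count=10,
                 pyarrow_serialize=False) as reader:
    for row in reader:
        pass  # consume rows
```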
Maybe I didn't explain it well enough. Let's say I have some task that I'm running in parallel using dask distributed multiprocessing, and this task does not rely on petastorm. Let's also say I'm doing it in a Jupyter notebook and saving the results in petastorm format later with another downstream task. The problem is that if I just put import petastorm at the top of the notebook, the first distributed task suffers performance degradation. So I'm not even using or calling any petastorm methods, just importing it, and I'm getting performance-related problems.
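A hypothetical minimal repro of that scenario might look like this (the workload, worker count, and numbers are illustrative, not taken from the report):

```python
# Hypothetical repro sketch: the dask task below does not touch petastorm;
# the only petastorm involvement is the import at the top.
import petastorm  # merely importing it is reported to trigger the slowdown

from dask.distributed import Client


def cpu_bound_task(n):
    # Plain Python work with no petastorm involvement.
    return sum(i * i for i in range(n))


if __name__ == '__main__':
    client = Client(processes=True, n_workers=10)  # dask distributed multiprocessing
    futures = client.map(cpu_bound_task, [5_000_000 + i for i in range(10)])
    results = client.gather(futures)
    print(results)
```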
I also found this warning in the Jupyter console output; maybe it's related to the problem:
.../anaconda3/lib/python3.7/runpy.py:125: RuntimeWarning: 'petastorm.workers_pool.exec_in_new_process' found in sys.modules after import of package 'petastorm.workers_pool', but prior to execution of 'petastorm.workers_pool.exec_in_new_process'; this may result in unpredictable behaviour
import petastorm triggers a number of heavy imports, such as pyarrow and pyspark. I'll investigate whether we can make at least some of them load more lazily.
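For illustration only, one possible way to defer those imports (a sketch using PEP 562 module-level __getattr__, not petastorm's actual code) would be to resolve the heavy attributes on first access instead of at import time:

```python
# Hypothetical lazy __init__.py sketch using PEP 562 (Python 3.7+); this is
# not petastorm's implementation, only an illustration of lazier loading.
import importlib

_LAZY_ATTRS = {
    'make_reader': 'petastorm.reader',
    'make_batch_reader': 'petastorm.reader',
    'TransformSpec': 'petastorm.transform',
}


def __getattr__(name):
    # Import the heavy submodule only when the attribute is first accessed,
    # so a bare `import petastorm` stays cheap.
    if name in _LAZY_ATTRS:
        module = importlib.import_module(_LAZY_ATTRS[name])
        value = getattr(module, name)
        globals()[name] = value  # cache for subsequent lookups
        return value
    raise AttributeError(f"module 'petastorm' has no attribute {name!r}")
```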
I am curious whether you observe the same behavior if you comment out either or both of these lines in petastorm/__init__.py:
from petastorm.reader import make_reader, make_batch_reader # noqa: F401
from petastorm.transform import TransformSpec # noqa: F401
I don't know if you have time for it, but it would be interesting to drill down and find which of the transitive imports actually causes this slowdown.
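One hypothetical way to drill down (the candidate module names and the workload below are illustrative; `python -X importtime -c "import petastorm"` is another option for looking at raw import costs):

```python
# Hypothetical bisection sketch: run the same petastorm-free dask workload
# several times, each time importing a different candidate module first
# (and once with no extra import as a baseline), then compare wall times.
import importlib
import sys
import time

candidate = sys.argv[1] if len(sys.argv) > 1 else None  # e.g. 'pyarrow', 'pyspark'
if candidate:
    importlib.import_module(candidate)

from dask.distributed import Client


def cpu_bound_task(n):
    return sum(i * i for i in range(n))


if __name__ == '__main__':
    client = Client(processes=True, n_workers=10)
    start = time.time()
    client.gather(client.map(cpu_bound_task, [5_000_000 + i for i in range(10)]))
    print(f'candidate={candidate}: {time.time() - start:.2f}s')
```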
I have never worked with dask myself, but I will try to find some time to investigate, as this might affect other users/frameworks as well.
Yes, the petastorm import basically triggers a lot of internal configuration. I just used it as the simplest example; in theory one wouldn't expect such problems to appear even when importing, let's say, only the reader. I don't have much time for investigation at the moment, but I will try to help as soon as I get back to the related project.
It seems like the way petastorm initializes its multiprocessing pool on import causes significant trouble for dask distributed multiprocessing jobs. Without the petastorm import I have, let's say, 10 workers at 100% CPU load each, but if I import petastorm before running the task I get 10 workers at 10% CPU each. I assume it's quite a rare case, but this side effect is very hard to debug.