uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Do not use zero-memory-copy feature of zmq to prevent large memory footprint swings. #454

selitvin closed 4 years ago

selitvin commented 4 years ago

Added a zmq_copy_buffers argument to the ProcessPool constructor. It controls whether recv_multipart is called with copy=True or copy=False. Unfortunately, copy=False does not play nicely with the Python GC (at least with its default thresholds) and results in wild memory footprint swings when working with large buffers (such as images). Copying memory appears to be the safer option for the common user scenario. Control over this setting is currently not exposed to users of the make_reader or make_batch_reader interfaces, but may be added in the future. In advanced scenarios, users may construct a Reader object manually and configure it with a custom-configured ProcessPool object.
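
For readers unfamiliar with the two modes, here is a minimal sketch of what the copy flag changes at the pyzmq level. It uses pyzmq directly rather than petastorm, and it is not the code from this PR: the PAIR sockets, the inproc endpoint name, and the receive_both_ways helper are assumptions made for illustration only.

```python
import zmq


def receive_both_ways(payload: bytes):
    """Receive the same two-part message once with copy=True and once with copy=False.

    Hypothetical helper for illustration; not part of petastorm.
    """
    ctx = zmq.Context()
    sender = ctx.socket(zmq.PAIR)
    receiver = ctx.socket(zmq.PAIR)
    receiver.setsockopt(zmq.RCVTIMEO, 2000)  # fail instead of hanging if nothing arrives
    try:
        sender.bind("inproc://copy-demo")  # endpoint name is arbitrary
        receiver.connect("inproc://copy-demo")

        # copy=True: every message part is copied into a new Python bytes object.
        sender.send_multipart([b"header", payload])
        copied = receiver.recv_multipart(copy=True)

        # copy=False: every part is a zmq.Frame wrapping memory owned by libzmq.
        # That memory is released only once the Frame object itself is freed.
        sender.send_multipart([b"header", payload])
        frames = receiver.recv_multipart(copy=False)
        all_frames = all(isinstance(frame, zmq.Frame) for frame in frames)

        # Frame.bytes copies the data out so it can outlive the Frame.
        zero_copy_payload = frames[1].bytes
        return copied, all_frames, zero_copy_payload
    finally:
        sender.close(linger=0)
        receiver.close(linger=0)
        ctx.term()


if __name__ == "__main__":
    copied, all_frames, zero_copy_payload = receive_both_ways(b"x" * 1024)
    print(type(copied[1]).__name__, all_frames, len(zero_copy_payload))
```

With copy=True the receiver ends up holding ordinary bytes objects, so memory is reclaimed as soon as those objects are dropped. With copy=False the buffer stays alive for as long as the Frame does, which is consistent with the footprint swings described above when large frames linger until collection.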