The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0
Do not use zero-memory-copy feature of zmq to prevent large memory footprint swings. #454
Added a zmq_copy_buffers argument to the ProcessPool constructor. It controls whether recv_multipart is called with copy=True or copy=False.
Unfortunately, copy=False does not play well with the Python GC (at least with its default thresholds) and results in wild memory footprint swings when working with large buffers such as images.
Copying memory appears to be the safer option for the common user scenario. Control over this setting is currently not exposed to users of the make_reader or make_batch_reader interfaces, but may be added in the future.
In advanced scenarios, users may construct a Reader object manually and configure it with a custom-configured ProcessPool object.
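
For readers unfamiliar with the underlying pyzmq behavior, the following is a minimal sketch of what the copy flag of recv_multipart changes. It is not Petastorm's internal code: the PAIR socket type, the inproc endpoint name, and the 1 MiB payload are illustrative assumptions chosen so the snippet runs in a single process.

```python
import zmq

# Illustrative setup (an assumption, not Petastorm code): two PAIR sockets
# talking over an in-process transport.
ctx = zmq.Context.instance()
sender = ctx.socket(zmq.PAIR)
receiver = ctx.socket(zmq.PAIR)
sender.bind("inproc://copy-demo")       # hypothetical endpoint name
receiver.connect("inproc://copy-demo")

payload = b"x" * (1 << 20)  # 1 MiB stand-in for a large buffer such as an image

# copy=True: every message part is copied into a new Python bytes object.
sender.send_multipart([b"header", payload])
copied = receiver.recv_multipart(copy=True)

# copy=False: every part is returned as a zmq.Frame that wraps the zmq
# message memory; that memory is released only once the Frame is freed.
sender.send_multipart([b"header", payload])
frames = receiver.recv_multipart(copy=False)

# A Frame exposes its data without copying through the .buffer memoryview.
frame_payload_size = len(frames[1].buffer)

sender.close(linger=0)
receiver.close(linger=0)
```

With copy=True the received parts are ordinary bytes objects that Python owns outright. With copy=False the lifetime of each large buffer is tied to the lifetime of its Frame object, which is consistent with the memory swings described above when many large frames are in flight.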