The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can also be used from pure Python code.
Apache License 2.0
PyTorch Batched Non-shuffle Buffer Large Memory Consumption #763
The batched non-shuffle buffer keeps creating copies of the existing row group when making every single batch: _make_batch() is called for each batch. This leads to large memory consumption when generating batches from a very large row group. We use the scripts below to test and profile memory usage:
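The original test scripts are not reproduced here. As an illustration only, the copy-per-batch behavior described above can be simulated with stdlib tools; the names `ROW_GROUP`, `make_batch_by_copy`, and the sizes are hypothetical stand-ins, not Petastorm internals:

```python
import tracemalloc

# Hypothetical stand-in: a "row group" is a list of 20k rows with 50 columns.
ROW_GROUP = [tuple(range(50)) for _ in range(20_000)]
BATCH_SIZE = 8192

def make_batch_by_copy(row_group, start, size):
    # Mirrors the reported problem: the row group is copied on every call,
    # so peak memory scales with row-group size, not batch size.
    rows = list(row_group)
    return rows[start:start + size]

tracemalloc.start()
batches = [make_batch_by_copy(ROW_GROUP, i, BATCH_SIZE)
           for i in range(0, len(ROW_GROUP), BATCH_SIZE)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak traced memory: {peak / 1e6:.1f} MB")
```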
For example, a memory profiler is injected at https://github.com/uber/petastorm/blob/fa8a8812e49f9a0fa604f557a0513bcf4d74281e/petastorm/reader_impl/pytorch_shuffling_buffer.py#L98:
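The exact profiler used is not shown. One stdlib way to instrument a single function's allocations, sketched here with a hypothetical `trace_peak` decorator and a toy `make_batch`, uses `tracemalloc`:

```python
import functools
import tracemalloc

def trace_peak(fn):
    """Report the peak traced allocation of each call to fn."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        try:
            return fn(*args, **kwargs)
        finally:
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            print(f"{fn.__name__}: peak {peak / 1e6:.2f} MB")
    return wrapper

@trace_peak
def make_batch(rows):
    # Toy stand-in for the buffer's batch-making step.
    return list(rows)

_ = make_batch([0.0] * 1_000_000)
```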
The result for batch_size = 8192 on a Parquet file on HDFS with 20M rows per row group and 50 double-type columns is as follows:
Applying PR #762, which reuses the same row-group buffer, brings the memory usage down.
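The core idea behind reusing the row-group buffer can be sketched as follows; `ReusedRowGroupBuffer` is a hypothetical illustration, not the actual PR #762 code. The row group is allocated once, and each batch is a zero-copy view into it:

```python
from array import array

class ReusedRowGroupBuffer:
    """Illustrative sketch only: load a row group once and hand out
    zero-copy views per batch instead of copying data each time."""

    def __init__(self, values):
        self._buf = array("d", values)   # one allocation per row group
        self._pos = 0

    def next_batch(self, batch_size):
        # Slicing a memoryview yields a view into the shared buffer,
        # so no per-batch copy of the row group is made.
        view = memoryview(self._buf)[self._pos:self._pos + batch_size]
        self._pos += len(view)
        return view

buf = ReusedRowGroupBuffer(range(20))
first = buf.next_batch(8)
second = buf.next_batch(8)
```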