Closed: Mmiglio closed this issue 5 years ago
Can you please take a look at this message? It explains how to estimate your memory footprint. Let me know if it does not help, so we can investigate further: https://github.com/uber/petastorm/issues/306#issuecomment-461511736
Thanks for the reply. I read the comment and I have one question: how do you compute the number of rows in a row group? I can't understand how you get ~600 rows starting from a row-group size of 256 MB. That can be the only issue, since
memory footprint = 10 × 35 MB + 5060 KB ≈ 350 MB
I don't remember all the details of that thread, but I would imagine it is row-group-size-in-mb / size-of-the-compressed-row.
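As a rough illustration (not numbers from this thread: assuming the 801x19 array is float32, i.e. roughly 60 KB per row uncompressed), a 256 MB row group would hold on the order of 256 MB / 60 KB ≈ 4,300 rows; the actual count depends on how well the rows compress.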
You can double-check this value by using parquet-tools (https://github.com/apache/parquet-mr/tree/master/parquet-tools). You can also open your Parquet dataset using some software API (e.g. pyarrow); you should be able to load a single row group and look at the row count.
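For example, a minimal pyarrow sketch along these lines prints the row count and size of each row group (the dataset path is a placeholder, not taken from this thread):

```python
import glob
import pyarrow.parquet as pq

# Placeholder path: point this at the directory containing your Parquet files.
for path in sorted(glob.glob('/tmp/my_dataset/*.parquet')):
    meta = pq.ParquetFile(path).metadata
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)  # RowGroupMetaData: num_rows, total_byte_size, ...
        print('%s  row group %d: %d rows, %d bytes of column data'
              % (path, i, rg.num_rows, rg.total_byte_size))
```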
Thanks for the help, I managed to solve the problem by tuning rowgroup_size_mb and workers_count.
Problem
Python process killed while creating a reader because it runs out of RAM.
How to reproduce
Dataset Generation
The dataset is generated by modifying the schema of the Hello World example. In my dataset I have two columns: an integer id and an nd-array with shape 801x19.
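A minimal sketch of what the modified generation script could look like, adapted from the petastorm Hello World example; the float32 dtype, output path, row count, and Spark settings are assumptions, not taken from the original report:

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

from petastorm.codecs import ScalarCodec, NdarrayCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

# Two columns: an integer id and an 801x19 array (dtype assumed to be float32).
MySchema = Unischema('MySchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('features', np.float32, (801, 19), NdarrayCodec(), False),
])


def row_generator(x):
    """Returns a single dictionary-typed row matching MySchema."""
    return {'id': x,
            'features': np.random.rand(801, 19).astype(np.float32)}


def generate_dataset(output_url='file:///tmp/my_dataset',
                     rows_count=10000, rowgroup_size_mb=256):
    spark = SparkSession.builder.master('local[2]').getOrCreate()
    sc = spark.sparkContext

    # materialize_dataset passes rowgroup_size_mb to the Parquet writer and
    # stores the Unischema alongside the data.
    with materialize_dataset(spark, output_url, MySchema, rowgroup_size_mb):
        rows_rdd = sc.parallelize(range(rows_count)) \
            .map(row_generator) \
            .map(lambda x: dict_to_spark_row(MySchema, x))

        spark.createDataFrame(rows_rdd, MySchema.as_spark_schema()) \
            .write \
            .mode('overwrite') \
            .parquet(output_url)
```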
Issue
If I run
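(The original snippet is not preserved in this thread; roughly, it would be a reader creation along these lines, assuming the make_reader API and the dataset URL from the sketch above:)

```python
from petastorm import make_reader

# workers_count defaults to 10; together with rowgroup_size_mb it drives how
# much decoded data may be buffered in memory at once.
with make_reader('file:///tmp/my_dataset', workers_count=10) as reader:
    for row in reader:
        print(row.id)
```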
after a couple of seconds the process gets killed because it runs out of memory. What is causing it? I tried to play around with the reader parameters and with rowgroup_size_mb during dataset generation, but I didn't find a solution.