noirello / pyorc

Python module for Apache ORC file format
Apache License 2.0

Memory leak in Writer? #4

Open JohnEmhoff opened 4 years ago

JohnEmhoff commented 4 years ago

Hello! Thanks for pyorc; using it has been a pleasure so far, with the exception that we seem to be running into memory issues. I think Writer is leaking memory? Our workload is roughly:

Memory usage grows without bound between iterations. This, coupled with the fact that lowering the stripe size all the way down to 1M has no effect, makes me suspect a memory leak. Below is a script that will reproduce it -- around iteration 10 it gets to 20G and is then killed by the OOM killer on my machine. Let me know if there's anything I can do to help track it down!

https://gist.github.com/JohnEmhoff/274f6e05cba3f17a16683eb394bfe6b5

JohnEmhoff commented 4 years ago

I managed to trim down the script a good bit -- it turns out writing data is unnecessary; the leak happens just from creating writers:

https://gist.github.com/JohnEmhoff/55f562c2de701dfb426643a3e7751ef8
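For readers who cannot open the gist, here is a minimal sketch of this kind of reproduction. It is not the gist's actual contents: the helper names, the column count, and the made-up column names are assumptions for illustration. It creates and closes writers without writing any rows and samples the process's peak resident set size after each one. It relies on the Unix-only `resource` module, and `ru_maxrss` is reported in KiB on Linux but in bytes on macOS.

```python
import io
import resource


def peak_rss():
    """Peak resident set size of this process (KiB on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def track_peak_rss(action, iterations):
    """Call action() repeatedly and record the peak RSS after each call."""
    samples = []
    for _ in range(iterations):
        action()
        samples.append(peak_rss())
    return samples


def create_and_close_writer(columns=200):
    """Create a pyorc Writer over an in-memory buffer and close it without writing rows."""
    import pyorc  # imported here so the helpers above can be used without pyorc

    # Hypothetical wide schema; the column names c0, c1, ... are made up.
    fields = ",".join(f"c{i}:string" for i in range(columns))
    writer = pyorc.Writer(io.BytesIO(), f"struct<{fields}>")
    writer.close()
```

Calling `track_peak_rss(create_and_close_writer, 20)` and printing the samples should show whether the peak keeps climbing from one writer to the next. A peak that levels off after the first few iterations would point away from a per-writer leak.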

noirello commented 4 years ago

Thank you for reporting it.

I think I have pinpointed the problem: it occurs when Writer's constructor builds an orc::Type from the TypeDescription.

I'm still looking for the concrete source of the leak.

JohnEmhoff commented 4 years ago

Thanks for looking into it. I think you're right -- I noticed that when my spec in the script above is just a column or two, it leaks much, much more slowly.

noirello commented 4 years ago

Even after bypassing the TypeDescription object, the script still failed to run the iterations to the end. It seems like the orc::Writer object is somehow mishandled. Valgrind is not very helpful (although using it was never my strong suit).

clynamen commented 4 years ago

I have the same problem. I tried to dig a bit, and it seems the source of the leak is the creation of multiple ColumnWriter objects (of any type: string, float or int). The leak is proportional to the number of columns. Even more memory is leaked when ZLIB or ZSTD compression is enabled (compression is currently enabled by default).

carlosfvp commented 3 years ago

Also, I noticed that the stripe size is not being honored. The stripe is not being flushed to disk, and (probably) the memory is not being freed either. This part is handled by the C++ library, which makes it harder to debug :(

carlosfvp commented 3 years ago

I found this recommendation: calling a method named writeIntermediateFooter will flush the content to the file and free some memory, but this only exists in the Java version of the OrcWriter 😥

https://www.mail-archive.com/issues@orc.apache.org/msg00225.html

https://orc.apache.org/api/orc-core/org/apache/orc/impl/WriterImpl.html#writeIntermediateFooter--

pokerc commented 3 years ago

Found a similar problem: I cannot flush content to the file manually, and the batch_size parameter of Writer seems to have no effect. Any solutions?