kata198 / python-lrzip

Python bindings to LRZIP
GNU Lesser General Public License v3.0

possible memory leak (was: add fileobj/filehandler support) #1

Open johnnybubonic opened 7 years ago

johnnybubonic commented 7 years ago

I'd love to be able to do something like:


```python
with open('/tmp/somefile.lrz', 'wb') as f, \
  lrzip.compress(fileobj = f) as lrz, \
  tarfile.open(fileobj = lrz, mode = 'w') as tarball:
    tarball.add('/tmp/somedir', recursive = True)
(...)
```
kata198 commented 7 years ago

Sorry I am just seeing this comment.

Can you explain more what you'd like? I'm happy to write the interface. I currently have it implemented per the Python standard lib (like the gzip module, bz2 module, and lzma module on py3+), with compress/decompress methods.
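So right now the usage mirrors bz2/lzma: whole buffer in, whole buffer out. Roughly like this (just a sketch; the exact argument names in the bindings may differ slightly):

```python
import lrzip

# whole-buffer compress/decompress, same shape as bz2.compress / bz2.decompress
with open('somefile', 'rb') as f:
    data = f.read()

compressed = lrzip.compress(data)
assert lrzip.decompress(compressed) == data
```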

These standard modules also have an "open" function, so you can do something like:

```python
import gzip
with gzip.open('blah.gz') as f:
    fileContents = f.read()
```

and use it as a file object. Behind the scenes I would guess it either reads the whole file into a stream, or does it block-at-a-time. lrzip, however, is not a block-compression algorithm like that. That's the "r" (long-range) part of it: it does an extra pass on top of the chosen compression algorithm (like LZMA) to greatly increase compression (I average 30% savings!).
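For example, gzip will hand the decompressed data back a chunk at a time, so the whole file never has to sit in memory at once:

```python
import gzip

# read the decompressed stream in 64KB chunks instead of all at once
total = 0
with gzip.open('blah.gz', 'rb') as f:
    while True:
        chunk = f.read(64 * 1024)
        if not chunk:
            break
        total += len(chunk)   # "process" each chunk; here we just count bytes
print(total)
```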

Would implementing that interface do what you want?

A tar file is a linear archive (tar stands for "Tape Archive" - linear data), so you can just append to the end ad infinitum. lrzip doesn't work like that: it needs the full archive to perform the long-range pass and the chosen compression algorithm.
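That's why appending to a plain (uncompressed) tar in place is trivial (paths made up):

```python
import tarfile

# appending to an existing uncompressed tar just writes new entries at the end
with tarfile.open('/tmp/backup.tar', 'a') as tarball:
    tarball.add('/tmp/another-dir', recursive=True)
```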

I could see your example working if it used the "open" interface like other compress modules, like:

```python
with lrzip.open('/tmp/somefile.lrz', 'w') as f, \
     tarfile.open(fileobj=f, mode='w') as tarball:
    tarball.add( ... )
```

In which case we would be able to just do the full job at "close" (and maybe at "flush" as well, but given that most things are written with block algorithms in mind that can do 512 bytes at a time, and we'd have to do a full write and recompress on every flush, it might be better to just hook "close").
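Just to sketch the "buffer everything and compress at close" idea (not real code in the module yet; the name and details are made up, and it assumes the whole-buffer compress() above):

```python
import io
import lrzip   # assuming the module-level compress() described above

class LrzipWriter(io.BytesIO):
    """Sketch: buffer every write in memory, then long-range-zip the whole
    buffer in one shot when the file is closed."""

    def __init__(self, path):
        super().__init__()
        self._path = path

    def close(self):
        if not self.closed:
            data = self.getvalue()
            with open(self._path, 'wb') as out:
                out.write(lrzip.compress(data))   # one whole-buffer compress at close
        super().close()
```

Then your example would become `with LrzipWriter('/tmp/somefile.lrz') as f, tarfile.open(fileobj=f, mode='w') as tarball: ...` and nothing actually gets compressed until the close.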

Thoughts? Expansions?

johnnybubonic commented 7 years ago

heh, now it's my turn to be late. :)

a lot of it is me not RTFMing; i didn't realize that lrzip isn't a block-compression algorithm. from my understanding, gzip is, for example (which is why it lets you do things like compressing data streams where the "end" isn't known).
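e.g. (just to illustrate what i mean) gzip will happily take data a chunk at a time without knowing the total size up front:

```python
import gzip

# gzip compresses block by block, so a stream can be fed in incrementally
# without knowing its total size before you start writing
with gzip.open('stream.gz', 'wb') as gz:
    for i in range(1000):                # stand-in for an open-ended data stream
        gz.write(b'chunk %d\n' % i)      # each write gets compressed as it arrives
```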

i definitely think an "open" interface would make it "feel" more pythonic, but you do bring up good points about it. knowing now that lrzip isn't block-based, i don't think my original request makes much sense. (sorry!)

really, the whole reason i brought it up is that i suspect there's a memory leak. i run a keyserver in the SKS PGP keyserver pool, and i have a script to dump, compress, and sync those keys from a private box (set up specifically for this purpose) to the public keyserver (dumps are necessary for people who want to turn up a new keyserver).

ANYWAYS, the dumps split the server's key database into n dump files of about 30MB each (looking at this morning's dump, the largest split is 46MB). currently, there are 319 parts to this dump. (if you'd like to examine the data itself, you're more than welcome to.)

to the point: this "prep box" has 2GB of RAM, and generally around 1.6GB free at any given time. however, when i run the dump and it begins the iteration to compress (line 157...179-182 in the script i linked above), after some time i get "failed" and "Killed" messages. the oddest part is that this worked briefly for a day or two. and if i run it from the shell using the lrzip utility, it seems(?) to work fine.
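for reference, the compression part of that loop boils down to roughly this (heavily simplified, paths are placeholders, and i'm assuming the whole-buffer compress() from your bindings):

```python
import os
import lrzip

dump_dir = '/srv/sks/dump/2017-09-19'        # placeholder path
for fname in sorted(os.listdir(dump_dir)):
    path = os.path.join(dump_dir, fname)
    with open(path, 'rb') as f:
        data = f.read()                      # each dump part is ~30-46MB
    compressed = lrzip.compress(data)        # memory seems to balloon here
    with open(path + '.lrz', 'wb') as f:
        f.write(compressed)
```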

here is some dmesg output from around when the errors occur. you can see the various OOM kills the kernel does because the python process is consuming too much memory. as shown, the anon-rss (anonymous resident memory, i.e. real mapped memory) is maxed out.

here's free output:

```
[root@dwarf 2017-09-19]# free
              total        used        free      shared  buff/cache   available
Mem:        1939768      174264      150140         576     1615364     1595576
Swap:       1048552      120720      927832
```

do you have any ideas as to why it'd be eating memory like that, based on those lines (179-182 in the script)? is the gc not c'ing the g correctly?