uktrade / stream-zip

Python function to construct a ZIP archive on the fly
https://stream-zip.docs.trade.gov.uk/
MIT License

Killed because Out of memory #24

Closed thisnugroho closed 1 year ago

thisnugroho commented 1 year ago

Hello, thank you for creating this stream-zip :sparkles:

I have a question: I'm trying to create a zip of a directory containing many files (total size 1.5 GB) with stream-zip, but in the middle of the process my script gets killed. I checked /var/log/syslog and found this error:

Feb  4 03:47:29 vps-db2c7be7 kernel: [11122.859200] Out of memory: Killed process 74114 (python3) total-vm:1584100kB, anon-rss:1430492kB, file-rss:2956kB, shmem-rss:0kB, UID:1000 pgtables:2896kB oom_score_adj:0

Seems like it's out of memory, here is my code

import zipfile
import io
import os
import glob
import datetime
from stream_zip import ZIP_64, ZIP_32, NO_COMPRESSION_64, NO_COMPRESSION_32, stream_zip
from storage import upload_obj

target="ojs_light"

def to_file_like_obj(iterable):
    chunk = b''
    offset = 0
    it = iter(iterable)

    def up_to_iter(size):
        nonlocal chunk, offset

        while size:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break
                else:
                    offset = 0
            to_yield = min(size, len(chunk) - offset)
            offset = offset + to_yield
            size -= to_yield
            yield chunk[offset - to_yield:offset]

    class FileLikeObj:
        def read(self, size=-1):
            return b''.join(up_to_iter(float('inf') if size is None or size < 0 else size))

    return FileLikeObj()

def raw_content(path):
    for file in glob.iglob(f"{path}/**", recursive=True):
        file = os.path.abspath(file)
        if os.path.isdir(file): continue
        with open(file, "rb") as f:
            content = f.read()
            def yield_content(content):
                yield content
            yield os.path.join('ojs3313', os.path.basename(file)), datetime.datetime.now(), 0o777, NO_COMPRESSION_64 ,yield_content(content)
def main():
    zipped = stream_zip(raw_content(target))
    zipped_obj = to_file_like_obj(zipped)
    # print(zipped_obj)
    upload_obj(zipped_obj)

if __name__ == '__main__':
    main()

Do I have some misunderstanding here, or did I implement it in a wrong way?

Thank you.

michalc commented 1 year ago

Hi @thisnugroho ,

So I see two potential issues. Firstly:

NO_COMPRESSION_64

If creating a zip with no compression, then unfortunately each file won't be streamed - all of its contents will be loaded into memory at the same time. This is to then be able to work out how long the uncompressed data is in order to output the size in the zip file before the data itself.

Maybe this could be worked around in some cases in the future, but that is how stream-zip works at the moment.

Right now, to solve this you could instead use

ZIP_64

The other issue is this:

content = f.read()

This will also load the entire contents of this file into memory. This can be avoided using a different yield_content, passing it the file rather than its content:

def yield_content(f):
    while chunk := f.read(65536):
        yield f

Putting these together, the raw_content function would be something like:

def raw_content(path):
    for file in glob.iglob(f"{path}/**", recursive=True):
        file = os.path.abspath(file)
        if os.path.isdir(file): continue
        with open(file, "rb") as f:
            def yield_content(f):
                while chunk := f.read(65536):
                    yield f
            yield (
                os.path.join('ojs3313', os.path.basename(file)),
                datetime.datetime.now(),
                0o777,
                ZIP_64,
                yield_content(f),
            )

(Also 0o777 I think makes every single member file executable by everyone? Feels slightly risky - but I suspect unrelated to memory issues)

thisnugroho commented 1 year ago

Thank you for your detailed explanation. I successfully zipped and uploaded the file to my storage following your instructions, but the file size is only 10 MB (compared to the original 1.5 GB). When I tried to download and extract it, the process failed. Is there a specific tool or command I need to use to extract it?


Update: I apologize, there was an error in my script. However, when I tried to run the script again, this error message appeared

Error: object of type '_io.BufferedReader' has no len()

The current script looks like this:

import zipfile
import io
import os
import glob
import datetime
from stream_zip import ZIP_64, ZIP_32, NO_COMPRESSION_64, NO_COMPRESSION_32, stream_zip
from storage import upload_obj

target="all"

def to_file_like_obj(iterable):
    chunk = b''
    offset = 0
    it = iter(iterable)

    def up_to_iter(size):
        nonlocal chunk, offset

        while size:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break
                else:
                    offset = 0
            to_yield = min(size, len(chunk) - offset)
            offset = offset + to_yield
            size -= to_yield
            yield chunk[offset - to_yield:offset]

    class FileLikeObj:
        def read(self, size=-1):
            return b''.join(up_to_iter(float('inf') if size is None or size < 0 else size))

    return FileLikeObj()

def raw_content(path):
    for file in glob.iglob(f"{path}/**", recursive=True):
        file_name = os.path.basename(file)
        file = os.path.join(os.path.dirname(file), file_name)
        if os.path.isdir(file): continue
        with open(file, "rb") as f:
            def yield_content(f):
                while chunk := f.read(65536):
                    yield f

            print(f"Zipping {file}, {datetime.datetime.now()}, {oct(os.stat(file).st_mode)}\n")

            yield (file, datetime.datetime.now(), 0o777, ZIP_64 ,yield_content(f))
def main():
    #raw_content(target)
    zipped = stream_zip(raw_content(target))
    zipped_obj = to_file_like_obj(zipped)
    upload_obj(zipped_obj)

if __name__ == '__main__':
    main()

(Also 0o777 I think makes every single member file executable by everyone? Feels slightly risky - but I suspect unrelated to memory issues)

Yes, I'm aware of that, thank you.

michalc commented 1 year ago

Oh! I had a bug in my answer. So it shouldn't be:

def yield_content(f):
    while chunk := f.read(65536):
        yield f

but instead

def yield_content(f):
    while chunk := f.read(65536):
        yield chunk
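
For future readers, the raw_content from earlier with that fix applied would look something like this (same structure as before, and assuming the same imports as in the original script; the only change is yielding the chunk rather than the file object):

def raw_content(path):
    for file in glob.iglob(f"{path}/**", recursive=True):
        file = os.path.abspath(file)
        if os.path.isdir(file): continue
        with open(file, "rb") as f:
            def yield_content(f):
                while chunk := f.read(65536):
                    yield chunk
            yield (
                os.path.join('ojs3313', os.path.basename(file)),
                datetime.datetime.now(),
                0o777,
                ZIP_64,
                yield_content(f),
            )
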

thisnugroho commented 1 year ago

Awesome, it works perfectly now, thank you so much for your help! :sparkles:

Arcitec commented 1 year ago

Hi Michal, wow, thanks for creating such an incredibly well-written and powerful library!

I'm currently using Python's zipfile.ZipFile() and it uses a chonky 8.5 GB RAM to generate a 3.5 GB ZIP file (and that's with me using .write(f.read())), so I'll definitely be switching to stream-zip in the coming days!

I've been looking through the source here and in your other projects and discussions, and everything is very impressive. You clearly have a deep understanding of the ZIP standard.

I'm planning to use a DEFLATE STORE (compression level 0) mode since the data I am storing is already compressed (images etc), so there's no sense in wasting time trying to compress (even at level 9 I only saved 0.4% space).

1. Compression Level 0 (Store / No Compression)

I'll quote from this thread:

If creating a zip with no compression, then unfortunately each file won't be streamed - all of its contents will be loaded into memory at the same time. This is to then be able to work out how long the uncompressed data is in order to output the size in the zip file before the data itself.

Maybe this could be worked around in some cases in the future, but that is how stream-zip works at the moment.

I'm quoting your comment above because it gave me the wrong impression about there "being no way to store uncompressed data without issues". But it turned out that you were only referring to true uncompressed data. I then saw that you've implemented support for "uncompressed DEFLATE (level 0)" a while ago, here:

https://github.com/uktrade/stream-zip/issues/17#issuecomment-1221471985

Your comment there was:

Have now made it possible to set the zlib options for the zip file, by adding a get_compressobj function to the stream_zip function. Released in v0.0.48.

So to force no compression, and avoid buffering the file into memory, can use:

get_compressobj=lambda: zlib.compressobj(wbits=-zlib.MAX_WBITS, level=0)

Related commit/usage instructions (for anyone else reading):

https://github.com/uktrade/stream-zip/pull/18/files
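
And to make that concrete for anyone else reading, a minimal sketch of passing that option through (this assumes the get_compressobj keyword argument described in the quoted comment, with files standing in for whatever member-file iterable you already have):

import zlib
from stream_zip import stream_zip

# Force "stored" (level 0) raw DEFLATE for every member, while still streaming:
zipped_chunks = stream_zip(
    files,  # iterable of (name, modified_at, mode, method, content_chunks) tuples
    get_compressobj=lambda: zlib.compressobj(wbits=-zlib.MAX_WBITS, level=0),
)
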

But thankfully your library has a nice function for automating this: ZIP_AUTO(), documented at https://stream-zip.docs.trade.gov.uk/modes/, which describes that it automatically chooses ZIP_32 vs ZIP_64 (based on input size and distance from the start of the ZIP), and also applies the desired compression level.

The ZIP_AUTO implementation is here, for those who are curious:

https://github.com/uktrade/stream-zip/blob/31245ce4f89aeede3e260db1430f4fb42911a51d/stream_zip.py#LL25C1-L41C30

And a usage example is visible in the "tests" file:

https://github.com/uktrade/stream-zip/blob/31245ce4f89aeede3e260db1430f4fb42911a51d/test_stream_zip.py#LL129C37-L129C70

yield 'file-1', now, perms, ZIP_AUTO(4293656841, level=0), gen_bytes(4293656841)

That would generate a level 0 (store, no compression) DEFLATE entry. Awesome! That's one issue taken care of! :)

2. Efficient chunk reader...

I saw your fixed code for efficiently reading files as smaller chunks:

def yield_content(f):
    while chunk := f.read(65536):
        yield chunk

I then noticed that the official "recipes documentation" contains the same algorithm, along with a usage example.

However, in that example it's generating a static tuple of all files immediately, and is immediately running a function which instantly opens the file and reads and yields the 1st chunk of every file into the files tuple, which I am not a fan of.

Is there any way to defer that creation until the stream-zip library's requests for that file's data actually begin? Sure, it's only 64 KB per file, so it's no big issue, but I'd love to avoid reading anything until each file is actually requested.

I guess the best solution is to rewrite that example to make it yield the files one by one, thereby only creating the chunk-readers when each file is being yielded? Right?

Edit: As explained by Michal below, the sample code is actually creating a generator and doesn't read anything from any file until that file is requested. Phew! :)

3. Are files added sequentially?

Are files always added to the ZIP archive sequentially, compressed one by one? I assume that's the case based on the fact that ZIPs seem to consist of individual chunks with 1 file at a time.

I ask because my file source doesn't like random seeking, so I hope to sequentially add each file one by one to the archive. I've read the stream-zip source but didn't really understand the flow of the read_func() in the for ... in files loop.

4. Is there a way to dynamically add files one by one to stream_zip()?

So, I noticed that the stream_zip() API takes a files parameter whose type hasn't been specified, but the official docs for stream-zip talk about using a premade tuple...

My use-case is to stream files dynamically from disk via a for p in Path(...).glob("**/*"): along with transforming some of the files live (rewriting their contents/patching their contents before forwarding the data to the ZIP file).

I'm looking at the "Getting started: Generators" section and it seems like it would solve that:

https://stream-zip.docs.trade.gov.uk/getting-started/#generators

Would something like this work (in hastily written, untested code that I wrote here on GitHub)?

def files_generator(source: Path):
    for p in source.glob("**/*"):
        with p.open("rb") as f:
            if p.name == "index.foo":
                # Read entire file and patch it, generating a BytesIO readable buffer with the new data.
                # NOTE: "with" is used to ensure that the BytesIO is closed, otherwise memory sticks around.
                with io.BytesIO(patch_data(p.read())) as bf:
                    yield bf
            else:
                # Just yield the open file handle.
                yield f

# IMPORTANT FOR FUTURE READERS: I intentionally skipped all
# of the REQUIRED tuple information such as filename/sizes/ZIP_AUTO,
# chunk read-generator, etc, to show more clarity about the actual
# buffer usage instead.

And then using the suggested "chunk reader" from bullet point 2 above, to .read() from those sources on-demand?

I really look forward to trying all this after sleep! The library is clearly extremely powerful, and I just need to grok it! :)

😺

michalc commented 1 year ago

Hi @Arcitec,

Pleased it seems to be useful :-)

I then noticed that the official "recipes documentation"

However, in that example it's generating a static tuple of all files immediately, and is immediately running a function which instantly opens the file and reads and yields the 1st chunk of every file into the files tuple, which I am not a fan of.

I didn't think it does this, for 2 reasons.

Firstly, in the recipe:

return (
    (name, now, 0o600, ZIP_32, contents(name))
    for name in names
)

this doesn't return a tuple, but a generator expression. Like a list comprehension but it's not evaluated up front. I know the (...) makes it look like a... tuple-comprehension, but I don't think there is such a thing in Python. A simple example:

>>> print((a for a in [1,2]))
<generator object <genexpr> at 0x103520110>

And secondly, even if it was a tuple and all evaluated up front, contents(name) doesn't actually open the file and fetch a chunk. Only once you start iterating over contents(name) will it do that. And this isn't done by stream-zip until it gets to the file.

Are files always added to the ZIP archive sequentially, compressed one by one? I assume that's the case based on the fact that ZIPs seem to consist of individual chunks with 1 file at a time.

Yes

Would something like this work (in hastily written, untested code that I wrote here on GitHub)?

I think for the most part. I might tweak it a bit in a similarly hasty way.

def files_generator(source: Path):
    for p in source.glob("**/*"):
        with p.open("rb") as f:
            if p.name == "index.foo":
                yield (p.name, now, 0o600, ZIP_32, (patch_data(p.read()),))
            else:
                yield (p.name, now, 0o600, ZIP_32, chunk_reader(f))

Specifically, if you have the data in memory as a bytes instance, then there isn't really a need to wrap it in anything other than a tuple to make it an iterable - it will just be an iterable of one.

Arcitec commented 1 year ago

Hi @michalc!

I just did my first preliminary test about half an hour ago and was gonna report back! Oh man... Python's zipfile method uses 12.5 GB RAM at peak. And your stream-zip method uses ................ 0.0065 GB RAM (6.5 MiB) at peak. So that's almost 2000x less RAM usage!

It also gives me per-file control over the compression level, so I can use 0 (store) or 1 ("fastest" of the compressed modes) on the incompressible files, and apply some balanced "level 6" or so to all other files, to gain a massive time benefit too! I will be able to cut the runtime by around 80% thanks to not having to apply level 6 compression to all files universally anymore. Only where compression is needed! This is mindblowing.
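
Roughly, the per-file choice I have in mind is something like this (another hastily written sketch: the extension set and names are illustrative, and ZIP_AUTO takes the uncompressed size plus a level keyword as in the test file linked above):

import datetime
from pathlib import Path
from stream_zip import ZIP_AUTO, stream_zip

ALREADY_COMPRESSED = {".jpg", ".jpeg", ".png", ".mp4", ".zip"}  # illustrative only

def member_files(source: Path):
    now = datetime.datetime.now()
    for p in source.glob("**/*"):
        if p.is_dir():
            continue
        size = p.stat().st_size  # ZIP_AUTO needs the uncompressed size up front
        level = 0 if p.suffix.lower() in ALREADY_COMPRESSED else 6
        def chunks(path=p):  # default argument binds the current path
            with path.open("rb") as f:
                while chunk := f.read(65536):
                    yield chunk
        yield str(p.relative_to(source)), now, 0o600, ZIP_AUTO(size, level=level), chunks()

zipped_chunks = stream_zip(member_files(Path("my_source_dir")))  # placeholder path
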

The API is also absolutely incredible. The whole generator/iterator concept is so beautifully designed in this library that there's so much flexibility for what a person can do with the inputs and output!

Thank you so much for creating this library and for the answers above!

I'll need to refine my code a bit more to finalize it as I learn this library. :)

Regarding what we spoke about above:

this doesn't return a tuple, but a generator expression. Like a list comprehension but it's not evaluated up front. I know the (...) makes it look like a... tuple-comprehension, but I don't think there is such a thing in Python.

Ahh, so the call to contents(name) won't be evaluated until that generator-expression evaluates the "next()" item and reaches that item? That's such a cool technique, and I've never heard of it before!

Specifically, if you have the data in memory as a bytes instance, then there isn't really a need to wrap it in anything other than a tuple to make it an iterable - it will just be an iterable of one: (patch_data(p.read()),)

Ah that's a super nice technique too! At first I was ready to say "but... an iterator of a bytes instance returns 1 byte at a time which would be super inefficient", but then I saw that you're wrapping it in an outer (foo,) to turn it into an iterable (a one-element tuple) whose only item is the bytes object. So if anyone in the future reads this thread: be careful to copy the wrapping parentheses AND THE COMMA too.

:)

Although the interesting thing about an io.BytesIO() is that it can absorb a bytes immediately without copying (it just points its own internal ->buf at the bytes data). It then provides a .read() interface, which in turn would be compatible with a while chunk := f.read(65535): yield chunk wrapper. So I'm curious if there's any memory benefit to using a BytesIO wrapper?

Imagine that you have a 100 MB bytes object. If it's passed all at once, doesn't that mean that the compressor generates 100MB of ZIP data at once?

If so, it would make sense to use an io.BytesIO(bytes) wrapper and the chunked 64 KiB reader technique to ensure that the zipped data buffer never exceeds a certain memory limit.

But hey, since everything else in this library is so brilliant already, it wouldn't surprise me if you internally perform chunking during compression to ensure that you never generate huge internal buffers of compressed data. Wouldn't surprise me! :D

And if that's the case (if the compressor limits how much data it compresses regardless of the input iterator's chunk size), then passing the full bytes object would definitely be the fastest since it wouldn't have to generate smaller 64 KiB chunks in RAM.

// Johnny

michalc commented 1 year ago

Ahh, so the call to contents(name) won't be evaluated until that tuple generator-expression evaluates the "next()" item and reaches that item?

Yes (other than I wouldn't call it a tuple generator-expression)

So I'm curious if there's any memory benefit by using a BytesIO wrapper?

I might have thought that a BytesIO wrapper would have a small cost in memory over wrapping in a tuple - just since a tuple is simpler than a BytesIO wrapper. Note that stream-zip does its own chunking from the iterable passed to it, so chunking again might have a bit of an increased cost, in both time and memory. I'm guessing a bit though...

Arcitec commented 1 year ago

Yes (other than I wouldn't call it a tuple generator-expression)

Oh yeah, sorry I got confused with the simultaneous tuple discussion. So to clarify it for future readers, I just realized that this is not a tuple, and that the outer parentheses combined with the for loop are actually a "generator expression" <genexpr> which creates a <generator> object:

return (
    (name, now, 0o600, ZIP_32, contents(name))
    for name in names
)

I tried to figure out when they are evaluated and they're indeed lazy-evaluated:

def generator_process(item):
    print("process", item)
    return item.upper()

def generator_test(stuff):
    print("run", stuff)
    return ((generator_process(item),) for item in stuff)

gen = generator_test(["Aa", "Bb", "Cc"])
print("---")
for x in gen:
    print(x)
    print("***")

The output:

run ['Aa', 'Bb', 'Cc']
---
process Aa
('AA',)
***
process Bb
('BB',)
***
process Cc
('CC',)
***

So that's perfect. I learned a new way to make generators today! Thanks for that too! ❤️

So I'm curious if there's any memory benefit by using a BytesIO wrapper?

I might have thought that a BytesIO wrapper would have a small cost in memory over wrapping in a tuple - just since a tuple is simpler than a BytesIO wrapper. Note that stream-zip does its own chunking from the iterable passed to it, so chunking again might have a bit of an increased cost, in both time and memory. I'm guessing a bit though...

Yeah that makes perfect sense. There's definitely an extra cost with my idea! Wrapping in a BytesIO and doing the chunked reading would do the following steps:

  1. Create a BytesIO object.
  2. Set the ->buf pointer to a PyRef to the bytes buffer.
  3. Loop over the custom read-chunking generator which has to retrieve 64 KiB each time.
  4. Every .read(65535) creates a new bytes buffer with a copy of the original data (not sure if it's a memory view/slice which would be faster than a copy, but it's still yet another object).
  5. And finally stream-zip itself buffers and chunks the incoming chunks to get the proper compressed output chunk size, as you mentioned.

So there's plenty of pointlessly wasteful steps there "for no good reason".

Whereas your solution of merely giving the full bytes buffer directly as a (buf,) tuple has a lot less complexity:

  1. Create a tuple() which is pretty much the fastest object in Python (I've heard that it's heavily optimized since it is used everywhere in the multiple return/multiple assignment feature of the language).
  2. The tuple object instantly wraps the existing bytes object (basically just creates a PyRef to the bytes object itself).
  3. Finally, stream-zip reads efficiently from that whole buffer, grabbing the chunks/slices/views that it needs, and it never risks "not having enough data and needing to wait for the next chunk" etc since it's all there immediately.

So 100% right, people should definitely use the (buf,) technique when the bytes already exist. Very smart idea. I really appreciate it.

 

In fact, your idea even gave me an idea for an optimization when compressing EMPTY files. I already do stat() to get the size of all files to hand its information to stream-zip. So if I detect a size == 0 I will simply pass a static (b"",) (empty bytes) directly to stream-zip (instead of giving it the file's actual file handle/name), so that I can avoid the need for the "open/read/return" machinery for all empty files! :)
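
In sketch form (the stat() check and the (b"",) literal are the whole trick; the rest of the names and the level choice are illustrative):

import datetime
from pathlib import Path
from stream_zip import ZIP_AUTO

def empty_aware_member(p: Path, now: datetime.datetime):
    size = p.stat().st_size
    if size == 0:
        # Empty file: skip the open/read machinery entirely and pass an iterable of one empty bytes
        return str(p), now, 0o600, ZIP_AUTO(0, level=0), (b"",)
    def chunks():
        with p.open("rb") as f:
            while chunk := f.read(65536):
                yield chunk
    return str(p), now, 0o600, ZIP_AUTO(size, level=0), chunks()
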

I feel much more ready to finish porting everything to stream-zip now. To sum up the performance again, the difference was already massive in my first test today:

zipfile.ZipFile writer: ~12.5 GB peak RAM (screenshot omitted)

stream-zip writer: ~6.5 MiB peak RAM (screenshot omitted)
The beauty of this library is incredible, my happiness is immeasurable and my day is made. :)