thisnugroho closed this issue 1 year ago.
Hi @thisnugroho ,
So I see two potential issues. Firstly: NO_COMPRESSION_64.
If creating a zip with no compression, then unfortunately each file won't be streamed - all of its contents will be loaded into memory at the same time. This is to then be able to work out how long the uncompressed data is in order to output the size in the zip file before the data itself.
Maybe this could be worked around in some cases in the future, but that is how stream-zip works at the moment.
Right now, to work around this, you could instead use ZIP_64.
The other issue is this:
content = f.read()
This will also load the entire contents of this file into memory. This can be avoided using a different read_content, passing it the file rather than its content:
def yield_content(f):
    while chunk := f.read(65536):
        yield f
Putting these together, the raw_content function would be something like:
def raw_content(path):
    for file in glob.iglob(f"{path}/**", recursive=True):
        file = os.path.abspath(file)
        if os.path.isdir(file): continue
        with open(file, "rb") as f:
            def yield_content(f):
                while chunk := f.read(65536):
                    yield f
            yield (
                os.path.join('ojs3313', os.path.basename(file)),
                datetime.datetime.now(),
                0o777,
                ZIP_64,
                yield_content(f),
            )
(Also, 0o777 I think makes every single member file executable by everyone? Feels slightly risky - but I suspect unrelated to memory issues)
Thank you for your detailed explanation. I successfully zipped and uploaded the file to my storage following your instructions, but the file size is only 10 MB (compared to the original 1.5 GB). When I tried to download and extract it, the process failed. Is there a specific tool or command I need to use to extract it?
Update: I apologize, there was an error in my script. However, when I tried to run the script again, this error message appeared:
Error: object of type '_io.BufferedReader' has no len()
My current script looks like this:
import zipfile
import io
import os
import glob
import datetime
from stream_zip import ZIP_64, ZIP_32, NO_COMPRESSION_64, NO_COMPRESSION_32, stream_zip
from storage import upload_obj
target="all"
def to_file_like_obj(iterable):
    chunk = b''
    offset = 0
    it = iter(iterable)

    def up_to_iter(size):
        nonlocal chunk, offset
        while size:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break
                else:
                    offset = 0
            to_yield = min(size, len(chunk) - offset)
            offset = offset + to_yield
            size -= to_yield
            yield chunk[offset - to_yield:offset]

    class FileLikeObj:
        def read(self, size=-1):
            return b''.join(up_to_iter(float('inf') if size is None or size < 0 else size))

    return FileLikeObj()

def raw_content(path):
    for file in glob.iglob(f"{path}/**", recursive=True):
        file_name = os.path.basename(file)
        file = os.path.join(os.path.dirname(file), file_name)
        if os.path.isdir(file): continue
        with open(file, "rb") as f:
            def yield_content(f):
                while chunk := f.read(65536):
                    yield f
            print(f"Zipping {file}, {datetime.datetime.now()}, {oct(os.stat(file).st_mode)}\n")
            yield (file, datetime.datetime.now(), 0o777, ZIP_64, yield_content(f))

def main():
    #raw_content(target)
    zipped = stream_zip(raw_content(target))
    zipped_obj = to_file_like_obj(zipped)
    upload_obj(zipped_obj)

if __name__ == '__main__':
    main()
(Also 0o777 I think makes every single member file executable by everyone? Feels slightly risky - but I suspect unrelated to memory issues)
Yes, I'm aware of that, thank you.
Oh! I had a bug in my answer. So it shouldn't be:
def yield_content(f):
    while chunk := f.read(65536):
        yield f
but instead
def yield_content(f):
    while chunk := f.read(65536):
        yield chunk
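For future readers, here is a hedged sketch of the earlier raw_content example with only that one-line fix applied (yield chunk instead of yield f); everything else, including the 'ojs3313' prefix and the 0o777 mode, mirrors the code above:

import datetime
import glob
import os

from stream_zip import ZIP_64

def raw_content(path):
    for file in glob.iglob(f"{path}/**", recursive=True):
        file = os.path.abspath(file)
        if os.path.isdir(file): continue
        with open(file, "rb") as f:
            def yield_content(f):
                while chunk := f.read(65536):
                    yield chunk  # the fix: yield the bytes chunk, not the file object
            yield (
                os.path.join('ojs3313', os.path.basename(file)),
                datetime.datetime.now(),
                0o777,  # see the earlier note: this makes every member executable by everyone
                ZIP_64,
                yield_content(f),
            )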
Awesome, it works perfectly now, thank you so much for your help! :sparkles:
Hi Michal, wow, thanks for creating such an incredibly well-written and powerful library!
I'm currently using Python's zipfile.ZipFile() and it uses a chonky 8.5 GB RAM to generate a 3.5 GB ZIP file (and that's with me using .write(f.read())), so I'll definitely be switching to stream-zip in the coming days!
I've been looking through the source here and in your other projects and discussions, and everything is very impressive. You clearly have a deep understanding of the ZIP standard.
I'm planning to use a DEFLATE STORE (compression level 0) mode since the data I am storing is already compressed (images etc), so there's no sense in wasting time trying to compress (even at level 9 I only saved 0.4% space).
I'll quote from this thread:
If creating a zip with no compression, then unfortunately each file won't be streamed - all of its contents will be loaded into memory at the same time. This is to then be able to work out how long the uncompressed data is in order to output the size in the zip file before the data itself.
Maybe this could be worked around in some cases in the future, but that is how stream-zip works at the moment.
I'm quoting your comment above because it gave me the wrong impression about there "being no way to store uncompressed data without issues". But it turned out that you were only referring to true uncompressed data. I then saw that you've implemented support for "uncompressed DEFLATE (level 0)" a while ago, here:
https://github.com/uktrade/stream-zip/issues/17#issuecomment-1221471985
Your comment there was:
Have now made it possible to set the zlib options for the zip file, by adding a get_compressobj function to the stream_zip function. Released in v0.0.48. So to force no compression, and avoid buffering the file into memory, can use:
get_compressobj=lambda: zlib.compressobj(wbits=-zlib.MAX_WBITS, level=0)
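For anyone else reading, a minimal sketch of what that quoted option looks like when calling stream_zip; member_files is just a placeholder for whatever iterable of member tuples you already build:

import zlib
from stream_zip import stream_zip

# Force "stored" output (DEFLATE level 0) for the whole archive while still
# streaming each member's contents, per the quoted comment above.
member_files = ()  # placeholder: your iterable of (name, modified_at, mode, method, contents) tuples
zipped_chunks = stream_zip(
    member_files,
    get_compressobj=lambda: zlib.compressobj(wbits=-zlib.MAX_WBITS, level=0),
)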
Related commit/usage instructions (for anyone else reading):
https://github.com/uktrade/stream-zip/pull/18/files
But thankfully your library has a nice function for automating this: ZIP_AUTO(), documented at https://stream-zip.docs.trade.gov.uk/modes/, which describes that it automatically chooses ZIP32 vs ZIP64 (based on input size and distance from start of ZIP), and also applies the desired compression level.
The ZIP_AUTO implementation is here, for those who are curious:
And a usage example is visible in the "tests" file:
yield 'file-1', now, perms, ZIP_AUTO(4293656841, level=0), gen_bytes(4293656841)
That would generate a level 0 (store, no compression) DEFLATE entry. Awesome! That's one issue taken care of! :)
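To make that concrete, here is a hedged sketch (not taken from the docs) of choosing the level per member with ZIP_AUTO, using the size from os.stat; the extension check and the member_for name are just illustrations:

import datetime
import os

from stream_zip import ZIP_AUTO

def member_for(path, arcname):
    # Store already-compressed formats as-is (level 0), compress everything else at level 6.
    size = os.stat(path).st_size
    level = 0 if path.lower().endswith(('.jpg', '.png', '.zip')) else 6

    def contents():
        with open(path, 'rb') as f:
            while chunk := f.read(65536):
                yield chunk

    return arcname, datetime.datetime.now(), 0o600, ZIP_AUTO(size, level=level), contents()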
I saw your fixed code for efficiently reading files as smaller chunks:
def yield_content(f):
    while chunk := f.read(65536):
        yield chunk
I then noticed that the official "recipes documentation" contains the same algorithm, along with a usage example.
However, in that example it's generating a static tuple of all files immediately, and is immediately running a function which instantly opens the file and reads and yields the 1st chunk of every file into the files tuple, which I am not a fan of.
Is there any way to defer that creation until the stream-zip library's requests for that file's data actually begin? Sure, it's only 64 KB per file, so it's no big issue, but I'd love to avoid reading anything until each file is actually requested.
I guess the best solution is to rewrite that example to make it yield the files one by one, thereby only creating the chunk-readers when each file is being yielded? Right?
Edit: As explained by Michal below, the sample code is actually creating a generator and doesn't read anything from any file until that file is requested. Phew! :)
Are files always added to the ZIP archive sequentially, compressed one by one? I assume that's the case based on the fact that ZIPs seem to consist of individual chunks with 1 file at a time.
I ask because my file source doesn't like random seeking, so I hope to sequentially add each file one by one to the archive. I've read the stream-zip source but didn't really understand the flow of the read_func() in the for ... in files loop.
So, I noticed that the stream_zip() API takes a files parameter whose type hasn't been specified, but the official docs for stream-zip talk about using a premade tuple...
My use-case is to stream files dynamically from disk via a for p in Path(...).glob("**/*"): loop, while transforming some of the files live (rewriting/patching their contents before forwarding the data to the ZIP file).
I'm looking at the "Getting started: Generators" section and it seems like it would solve that:
https://stream-zip.docs.trade.gov.uk/getting-started/#generators
Would something like this work (in hastily written, untested code that I wrote here on GitHub)?
def files_generator(source: Path):
    for p in source.glob("**/*"):
        with p.open("rb") as f:
            if p.name == "index.foo":
                # Read entire file and patch it, generating a BytesIO readable buffer with the new data.
                # NOTE: "with" is used to ensure that the BytesIO is closed, otherwise memory sticks around.
                with io.BytesIO(patch_data(p.read())) as bf:
                    yield bf
            else:
                # Just yield the open file handle.
                yield f

# IMPORTANT FOR FUTURE READERS: I intentionally skipped all
# of the REQUIRED tuple information such as filename/sizes/ZIP_AUTO,
# chunk read-generator, etc, to show more clarity about the actual
# buffer usage instead.
And then using the suggested "chunk reader" from bullet point 2 above, to .read() from those sources on-demand?
I really look forward to trying all this after sleep! The library is clearly extremely powerful, and I just need to grok it! :)
😺
Hi @Arcitec,
Pleased it seems to be useful :-)
I then noticed that the official "recipes documentation"
However, in that example it's generating a static tuple of all files immediately, and is immediately running a function which instantly opens the file and reads and yields the 1st chunk of every file into the files tuple, which I am not a fan of.
I didn't think it does this, for 2 reasons.
Firstly, in the recipe:
return (
    (name, now, 0o600, ZIP_32, contents(name))
    for name in names
)
this doesn't return a tuple, but a generator expression. Like a list comprehension but it's not evaluated up front. I know the (...) makes it look like a... tuple-comprehension, but I don't think there is such a thing in Python. A simple example:
>>> print((a for a in [1,2]))
<generator object <genexpr> at 0x103520110>
And secondly, even if it was a tuple and all evaluated up front, contents(name) doesn't actually open the file and fetch a chunk. Only once you start iterating over contents(name) will it do that. And this isn't done by stream-zip until it gets to the file.
Are files always added to the ZIP archive sequentially, compressed one by one? I assume that's the case based on the fact that ZIPs seem to consist of individual chunks with 1 file at a time.
Yes
Would something like this work (in hastily written, untested code that I wrote here on GitHub)?
I think for the most part. I might tweak it a bit in a similarly hasty way.
def files_generator(source: Path):
    for p in source.glob("**/*"):
        with p.open("rb") as f:
            if p.name == "index.foo":
                yield (p.name, now, 0o600, ZIP_32, (patch_data(p.read()),))
            else:
                yield (p.name, now, 0o600, ZIP_32, chunk_reader(f))
Specifically, if you have the data in memory as a bytes instance, then there isn't really a need to wrap it in anything other than a tuple to make it an iterable - it will just be an iterable of one.
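A tiny illustration of that point for future readers (the names are placeholders):

# A bytes object wrapped in a one-element tuple is already a valid iterable
# of bytes for a member's contents - note the trailing comma.
data = b'already-in-memory contents'
member_contents = (data,)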
Hi @michalc!
I just did my first preliminary test about half an hour ago and was gonna report back! Oh man... Python's zipfile method uses 12.5 GB RAM at peak. And your stream-zip method uses ................ 0.0065 GB RAM (6.5 MiB) at peak. So that's almost 2000x less RAM usage!
It also gives me per-file control over the compression level, so I can use 0 (store) or 1 ("fastest" of the compressed modes) on the incompressible files, and apply some balanced "level 6" or so to all other files, to gain a massive time benefit too! I will be able to cut the runtime by around 80% thanks to not having to apply level 6 compression to all files universally anymore. Only where compression is needed! This is mind-blowing.
The API is also absolutely incredible. The whole generator/iterator concept is so beautifully designed in this library that there's so much flexibility for what a person can do with the inputs and output!
Thank you so much for creating this library and for the answers above!
I'll need to refine my code a bit more to finalize it as I learn this library. :)
Regarding what we spoke about above:
this doesn't return a tuple, but a generator expression. Like a list comprehension but it's not evaluated up front. I know the (...) makes it look like a... tuple-comprehension, but I don't think there is such a thing in Python.
Ahh, so the call to contents(name) won't be evaluated until that generator-expression evaluates the "next()" item and reaches that item? That's such a cool technique, and I've never heard of it before!
Specifically, if you have the data in memory as a bytes instance, then there isn't really a need to wrap it in anything other than a tuple to make it an iterable - it will just be an iterable of one: (patch_data(p.read()),)
Ah that's a super nice technique too! At first I was ready to say "but... an iterator of a bytes instance returns 1 byte at a time, which would be super inefficient", but then I saw that you're wrapping it in an outer (foo,) to turn it into a one-element tuple that yields the bytes object as its first and only item. So if anyone in the future reads this thread: be careful and copy the parenthesis AND COMMA too.
:)
Although the interesting thing about an io.BytesIO() is that it can absorb a bytes object immediately without copying (it just assigns its own internal ->buf to point at the bytes data). And then it provides a .read() interface, which in turn would be compatible with a while chunk := f.read(65536): yield chunk wrapper. So I'm curious if there's any memory benefit to using a BytesIO wrapper?
Imagine that you have a 100 MB bytes object. If it's passed all at once, doesn't that mean that the compressor generates 100 MB of ZIP data at once?
If so, it would make sense to use an io.BytesIO(bytes) wrapper and the chunked 64 KiB reader technique to ensure that the zipped data buffer never exceeds a certain memory limit.
But hey, since everything else in this library is so brilliant already, it wouldn't surprise me if you internally perform chunking during compression to ensure that you never generate huge internal buffers of compressed data. Wouldn't surprise me! :D
And if that's the case (if the compressor limits how much data it compresses regardless of the input iterator's chunk size), then passing the full bytes object would definitely be the fastest since it wouldn't have to generate smaller 64 KiB chunks in RAM.
// Johnny
Ahh, so the call to contents(name) won't be evaluated until that tuple generator-expression evaluates the "next()" item and reaches that item?
Yes (other than I wouldn't call it a tuple generator-expression)
So I'm curious if there's any memory benefit by using a BytesIO wrapper?
I might have thought that a BytesIO wrapper would have a small cost in memory over wrapping in a tuple - just since a tuple is simpler than a BytesIO wrapper. Note that stream-zip does its own chunking from the iterable passed to it, so chunking again might have a bit of an increased cost, in both time and memory. I'm guessing a bit though...
Yes (other than I wouldn't call it a tuple generator-expression)
Oh yeah, sorry, I got confused with the simultaneous tuple discussion. So to clarify it for future readers, I just realized that this is not a tuple and that the outer parenthesis combined with the for loop is actually a "generator expression" <genexpr> which creates a <generator> object:
return (
    (name, now, 0o600, ZIP_32, contents(name))
    for name in names
)
I tried to figure out when they are evaluated and they're indeed lazy-evaluated:
def generator_process(item):
    print("process", item)
    return item.upper()

def generator_test(stuff):
    print("run", stuff)
    return ((generator_process(item),) for item in stuff)

gen = generator_test(["Aa", "Bb", "Cc"])
print("---")

for x in gen:
    print(x)
    print("***")
run ['Aa', 'Bb', 'Cc']
---
process Aa
('AA',)
***
process Bb
('BB',)
***
process Cc
('CC',)
***
So that's perfect. I learned a new way to make generators today! Thanks for that too! ❤️
So I'm curious if there's any memory benefit by using a BytesIO wrapper?
I might have thought that a BytesIO wrapper would have a small cost in memory over wrapping in a tuple - just since a tuple is simpler than a BytesIO wrapper. Note that stream-zip does its own chunking from the iterable passed to it, so chunking again might have a bit of an increased cost, in both time and memory. I'm guessing a bit though...
Yeah that makes perfect sense. There's definitely an extra cost with my idea! Wrapping in a BytesIO and doing the chunked reading would do the following steps:
- Create the BytesIO object.
- It assigns its internal ->buf pointer to a PyRef to the bytes buffer.
- Every .read(65536) creates a new bytes buffer with a copy of the original data (not sure if it's a memory view/slice, which would be faster than a copy, but it's still yet another object).
- stream-zip itself buffers and chunks the incoming chunks to get the proper compressed output chunk size, as you mentioned.
So there's plenty of pointlessly wasteful steps there "for no good reason".
Whereas your solution of merely giving the full bytes buffer directly as a (buf,) tuple has a lot less complexity:
- Create a tuple(), which is pretty much the fastest object in Python (I've heard that it's heavily optimized since it is used everywhere in the multiple return/multiple assignment feature of the language).
- The tuple just holds the bytes object (basically just creates a PyRef to the bytes object itself).
- stream-zip reads efficiently from that whole buffer, grabbing the chunks/slices/views that it needs, and it never risks "not having enough data and needing to wait for the next chunk" etc, since it's all there immediately.
So 100% right, people should definitely use the (buf,) technique when the bytes already exist. Very smart idea. I really appreciate it.
In fact, your idea even gave me an idea for an optimization when compressing EMPTY files. I already do stat() to get the size of all files to hand that information to stream-zip. So if I detect a size == 0, I will simply pass a static (b"",) (empty bytes) directly to stream-zip (instead of giving it the file's actual file handle/name), so that I can avoid the need for the "open/read/return" machinery for all empty files! :)
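A hedged sketch of that empty-file shortcut; contents_for is a hypothetical helper name, not part of stream-zip:

def contents_for(path, size):
    # Zero-byte files: skip the open/read machinery entirely and hand
    # stream-zip a one-element iterable containing empty bytes.
    if size == 0:
        return (b'',)

    def chunks():
        with open(path, 'rb') as f:
            while chunk := f.read(65536):
                yield chunk

    return chunks()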
I feel much more ready to finish porting everything to stream-zip now. To sum up the performance again, the difference was already massive in my first test today:
zipfile.ZipFile writer: 12.5 GB RAM at peak.
stream-zip writer: 0.0065 GB (6.5 MiB) RAM at peak.
The beauty of this library is incredible, my happiness is immeasurable and my day is made. :)
Hello, thank you for creating stream-zip :sparkles:
I've a question: I'm trying to create a zip of a directory with many files in it (total size 1.5 GB) using stream-zip, but in the middle of the process my script got killed. I checked /var/log/syslog and found this error:
Seems like it's out of memory. Here is my code:
Do I have some misunderstanding here, or did I implement it in a wrong way?
Thank you.