uktrade / stream-unzip

Python function to stream unzip all the files in a ZIP archive on the fly
https://stream-unzip.docs.trade.gov.uk/
MIT License

[question] asyncio Zip File of Zipped Chunks #24

Closed by gkedge 3 years ago

gkedge commented 3 years ago

Consider streaming in a zip file.

import io
from pathlib import PurePath

def zipped_chunks(zipfile_name: PurePath):
    # Iterable that yields the bytes of a zip file
    with open(zipfile_name, "r+b", buffering=io.DEFAULT_BUFFER_SIZE) as zip_f:
        yield zip_f.read()

I am attempting to concurrently unzip a number of large zip files that are hosted on a slow network drive. Do you see any value in leveraging the aiofiles package to stream the read, like so:

async def zipped_chunks(zipfile_name: PurePath):
    # Iterable that yields the bytes of a zip file
    async with aiofiles.open(zipfile_name, "r+b", buffering=io.DEFAULT_BUFFER_SIZE) as zip_f:
        yield await zip_f.read()

async def unzip_tar_files(self, zipfile_name: PurePath):
    chunks: List[bytes] = [data async for data in self.zipped_chunks(zipfile_name)]
    for file_name, tar_file_size, unzipped_chunks in stream_unzip(chunks):
        ....

That seems to work well for me (so far). Do you see any downside?

If not, it might be a nice addition to the README, as I have finally come across a program I am writing from scratch that benefits from an asyncio solution, with stream-unzip being a key part of that solution. It took me forever to understand that only list comprehension supports async iteration.

michalc commented 3 years ago

Hello 👋

At the moment, I don't think stream-unzip can "async" stream unzip. It would need to accept and return async iterables, rather than just iterables. This might not be hard to achieve really: it would essentially need a copy+paste of stream_unzip with async put in function definitions, for loops etc (or some equivalent involving factoring out common code). But I don't think it's possible just with client code.
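To illustrate the kind of change described above, here is a minimal sketch using a toy transformation rather than stream_unzip itself. The names `upper_chunks`, `async_upper_chunks`, `fake_source` and `collect` are hypothetical and are not part of stream-unzip. The synchronous function uses a plain for loop, so it cannot consume an async iterable; the async copy differs only in its use of async def and async for.

```python
import asyncio
from typing import AsyncIterable, AsyncIterator, Iterable, Iterator, List


def upper_chunks(chunks: Iterable[bytes]) -> Iterator[bytes]:
    # Synchronous version: a plain "for" loop over a plain iterable.
    for chunk in chunks:
        yield chunk.upper()


async def async_upper_chunks(chunks: AsyncIterable[bytes]) -> AsyncIterator[bytes]:
    # Async copy: same body, but with "async def" and "async for".
    async for chunk in chunks:
        yield chunk.upper()


async def fake_source() -> AsyncIterator[bytes]:
    # Hypothetical async source standing in for a slow file or network read.
    for chunk in (b"ab", b"cd"):
        await asyncio.sleep(0)  # yield control to the event loop
        yield chunk


async def collect() -> List[bytes]:
    # Gathers results only to keep the demonstration short.
    return [chunk async for chunk in async_upper_chunks(fake_source())]
```

Passing `fake_source()` to the synchronous `upper_chunks` would fail on iteration, because an async generator does not support the plain iteration protocol. That is the sense in which the library function itself would need an async variant.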

I also don't think the example given is really streaming. For 2 reasons:

I have to admit ignorance of aiofiles, but right now, for the reasons above, I'm not in favour of adding this to the README.
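For comparison, here is a rough sketch of a synchronous generator that reads the file in fixed-size pieces. In the question's code, `zip_f.read()` is called with no size argument, so the whole file is read in one call and yielded as a single chunk. The `chunk_size` parameter and its default of 65536 bytes are assumptions chosen for illustration, not values recommended by stream-unzip.

```python
from pathlib import Path
from typing import Iterator, Union


def zipped_chunks(zipfile_name: Union[str, Path], chunk_size: int = 65536) -> Iterator[bytes]:
    # Yield the file's bytes a piece at a time, so that at most
    # chunk_size bytes of the raw file are held by this generator at once.
    with open(zipfile_name, "rb") as zip_f:
        while True:
            chunk = zip_f.read(chunk_size)
            if not chunk:
                # An empty bytes object signals end of file.
                break
            yield chunk
```

Following the call shape used in the question, the generator would be passed directly, as in `stream_unzip(zipped_chunks(zipfile_name))`, rather than being collected into a list first.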

gkedge commented 3 years ago

Thank you for the thoughtful and thorough answer! I have learned the hard way that I was definitely off the mark here. In other words, if I may partially quote you: I have to admit ignorance of asyncio! In the absence of an asyncio unzip, stream-unzip is a fine substitute: even though I may not be able to achieve the level of concurrency I had hoped for, I can manage memory better. So, thanks so much for stream-unzip and your very helpful response.