pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

Building recipes from files located within a large tar.gz file #442

Open jbusecke opened 1 year ago

jbusecke commented 1 year ago

I wanted to highlight a use case I have encountered multiple times in the past weeks and which is only partially supported by pangeo-forge-recipes.

The core situation always is the following:

The source files for a recipe are contained within a large compressed archive (see https://github.com/pangeo-forge/staged-recipes/issues/219#issuecomment-1317335341 for an example). As a recipe builder, I want to be able to work with the files contained in there and, e.g., merge or concat them.

As I learned, if the container is a .zip or .tar file you can already index into it, but gzip does not offer that possibility.
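For reference, a minimal sketch of that kind of member addressing via fsspec's URL chaining; the archive URL here is purely illustrative:

import fsspec

# "zip://<member>::<archive-url>" opens a single member of a remote zip file
with fsspec.open("zip://file1.nc::https://example.org/archive.zip") as f:
    data = f.read()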

I wonder if it would be possible to expand the functionality of pgf-recipes and allow some syntax that triggers caching/unpacking of .gz files but still maintains something akin to a URL per file.

Suppose, e.g., you have a file container.tar.gz which contains file1.nc and file2.nc and can be downloaded from http://zenodo.org/<project>/container.tar.gz.

Would it be at all possible to have some special command like GUNZIP that one could insert into a URL like this:

def build_urls(filenumber):
    return f"http://zenodo.org/<project>/container.tar.gz/GUNZIP/file{filenumber}.nc"

If pgf-recipes could recognize this 'command' (there is probably a better word for it), then the recipe could just require the data to be cached locally, unpack it, and do its usual thing?
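For illustration, a minimal sketch of how such a marker could be split out of a pseudo-URL; the GUNZIP token and the helper below are purely hypothetical, not an existing pgf-recipes feature:

def split_pseudo_url(url, marker="/GUNZIP/"):
    # Split "<archive-url>/GUNZIP/<member>" into its two halves
    archive_url, _, member = url.partition(marker)
    if not member:
        raise ValueError(f"no {marker!r} marker in {url!r}")
    return archive_url, member

# -> ("http://zenodo.org/<project>/container.tar.gz", "file1.nc")
split_pseudo_url("http://zenodo.org/<project>/container.tar.gz/GUNZIP/file1.nc")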

rabernat commented 1 year ago

Thanks for opening this issue. I agree we need to support this workflow somehow, since these kinds of archives are unfortunately very common.

I think once #369 is done, it will be much clearer how to do this. Basically we will just create a custom PTransform to do the unzipping.

cisaacstern commented 1 year ago

I agree we'll still want to wait for beam-refactor to go in before approaching this, and the following is not necessarily a drop-in fix for this, but noting what seems to be a related line of work:

https://github.com/sozip/sozip-spec/blob/master/blog/01-announcement.md

via

https://twitter.com/howardbutler/status/1612457687949901825?s=20&t=krbnOD1DVC6BeyEsfPz3_g

martindurant commented 1 year ago

I have read about sozip since @rabernat pointed it out to me elsewhere. I would add a couple of things:

cisaacstern commented 1 year ago

Thanks for the clarifications, @martindurant!

rabernat commented 1 year ago

I spent a little time playing with Python's tarfile and got the following little code snippet working:

import fsspec
import tarfile

url = "https://zenodo.org/record/6609035/files/datasets.tar.gz?download=1"

# Open the remote archive over HTTP and hand the file object to tarfile
fp = fsspec.open(url)
tf = tarfile.open(fileobj=fp.open(), mode='r:gz')

# Walk the archive member by member; tf.next() returns None at the end
while True:
    member = tf.next()
    if member is None:
        break
    print(member)

This could be the basis for a Beam PTransform that emits each file as an element.

https://gist.github.com/rabernat/616deabf2e12576f999470cbd82e9950
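As a rough illustration, a minimal sketch of such a transform; OpenTarMembers is a hypothetical name, not an existing pangeo-forge-recipes API, and mode 'r|gz' streams the archive sequentially:

import tarfile

import apache_beam as beam
import fsspec


class OpenTarMembers(beam.PTransform):
    # Expand each archive URL in the input collection into
    # (member_name, member_bytes) elements

    @staticmethod
    def _expand_archive(url):
        with fsspec.open(url) as f:
            # 'r|gz' reads the gzip stream front to back, so each member's
            # bytes must be consumed before advancing to the next member
            with tarfile.open(fileobj=f, mode="r|gz") as tf:
                for member in tf:
                    if member.isfile():
                        yield member.name, tf.extractfile(member).read()

    def expand(self, pcoll):
        return pcoll | beam.FlatMap(self._expand_archive)

Usage would be something like urls | OpenTarMembers(), with downstream transforms receiving one element per archive member.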

martindurant commented 1 year ago

The fsspec one-liner might be

allfiles = fsspec.open_files(
    "tar://*::https://zenodo.org/record/6609035/files/datasets.tar.gz?download=1",
    tar={"compression": "gzip"},
)

but it still must read the entire stream through. Other versions of the command are possible, but you can't get around the fact that gzip is a single monolithic stream.
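For completeness, reading the members would then look something like the loop below; each open still decompresses the stream up to (and including) that member:

# Each entry in allfiles is an fsspec OpenFile
for of in allfiles:
    with of as f:
        print(of.path, len(f.read()))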