pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

Building recipes from files located within a large tar.gz file #442

Open jbusecke opened 1 year ago

jbusecke commented 1 year ago

I wanted to highlight a use case I have encountered multiple times in the past weeks and which is only partially supported by pangeo-forge-recipes.

The core situation always is the following:

The source files for a recipe are contained within a large compressed archive (see https://github.com/pangeo-forge/staged-recipes/issues/219#issuecomment-1317335341 for an example). As a recipe builder, I want to be able to work with the files contained in there and, e.g., merge or concat them.

As I learned, if the container is a .zip or .tar file you can already index into it, but gzip does not offer that possibility.
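For reference, a minimal sketch of that kind of member addressing via fsspec's URL chaining; the archive URL here is purely illustrative:

import fsspec

# "zip://<member>::<archive-url>" opens a single member of a remote zip file
with fsspec.open("zip://file1.nc::https://example.org/archive.zip") as f:
    data = f.read()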

I wonder if it would be possible to expand the functionality of pgf-recipes and allow some syntax that triggers caching/unpacking of .gz files but still maintains something akin to a URL per file.

Suppose, e.g., you have a file container.tar.gz which contains file1.nc and file2.nc and can be downloaded from http://zenodo.org/<project>/container.tar.gz.

Would it be at all possible to have some special command like GUNZIP that one could insert into a URL like this:

def build_urls(filenumber):
    return f"http://zenodo.org/<project>/container.tar.gz/GUNZIP/file{filenumber}.nc"

If pgf-recipes could recognize this 'command' (there is probably a better word for it), then the recipe could just require the data to be cached locally, unpack it, and do its usual thing?
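For illustration, a minimal sketch of how such a marker could be split out of a pseudo-URL; the GUNZIP token and the helper below are purely hypothetical, not an existing pgf-recipes feature:

def split_pseudo_url(url, marker="/GUNZIP/"):
    # Split "<archive-url>/GUNZIP/<member>" into its two halves
    archive_url, _, member = url.partition(marker)
    if not member:
        raise ValueError(f"no {marker!r} marker in {url!r}")
    return archive_url, member

# -> ("http://zenodo.org/<project>/container.tar.gz", "file1.nc")
split_pseudo_url("http://zenodo.org/<project>/container.tar.gz/GUNZIP/file1.nc")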

rabernat commented 1 year ago

Thanks for opening this issue. I agree we need to support this workflow somehow, since these kinds of archives are unfortunately very common.

I think once #369 is done, it will be much clearer how to do this. Basically we will just create a custom PTransform to do the unzipping.

cisaacstern commented 1 year ago

I agree we'll still want to wait for beam-refactor to go in before approaching this, and the following is not necessarily a drop-in fix for this, but noting what seems to be a related line of work:

https://github.com/sozip/sozip-spec/blob/master/blog/01-announcement.md

via

https://twitter.com/howardbutler/status/1612457687949901825?s=20&t=krbnOD1DVC6BeyEsfPz3_g

martindurant commented 1 year ago

I have read about sozip since @rabernat pointed it out to me elsewhere. I would add a couple of things:

cisaacstern commented 1 year ago

Thanks for the clarifications, @martindurant!

rabernat commented 1 year ago

I spent a little time playing with Python's tarfile and got the following little code snippet working:

import fsspec
import tarfile

url = "https://zenodo.org/record/6609035/files/datasets.tar.gz?download=1"

# Open the remote archive over HTTP and hand the file object to tarfile
fp = fsspec.open(url)
tf = tarfile.open(fileobj=fp.open(), mode='r:gz')

# Walk the archive member by member; tf.next() returns None at the end
while True:
    member = tf.next()
    if member is None:
        break
    print(member)

This could be the basis for a Beam PTransform that emits each file as an element.

https://gist.github.com/rabernat/616deabf2e12576f999470cbd82e9950
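As a rough illustration, a minimal sketch of such a transform; OpenTarMembers is a hypothetical name, not an existing pangeo-forge-recipes API, and mode 'r|gz' streams the archive sequentially:

import tarfile

import apache_beam as beam
import fsspec


class OpenTarMembers(beam.PTransform):
    # Expand each archive URL in the input collection into
    # (member_name, member_bytes) elements

    @staticmethod
    def _expand_archive(url):
        with fsspec.open(url) as f:
            # 'r|gz' reads the gzip stream front to back, so each member's
            # bytes must be consumed before advancing to the next member
            with tarfile.open(fileobj=f, mode="r|gz") as tf:
                for member in tf:
                    if member.isfile():
                        yield member.name, tf.extractfile(member).read()

    def expand(self, pcoll):
        return pcoll | beam.FlatMap(self._expand_archive)

Usage would be something like urls | OpenTarMembers(), with downstream transforms receiving one element per archive member.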

martindurant commented 1 year ago

The fsspec one-liner might be

allfiles = fsspec.open_files(
    "tar://*::https://zenodo.org/record/6609035/files/datasets.tar.gz?download=1",
    tar={"compression": "gzip"},
)

but it still must read the entire stream through. Other versions of the command are possible, but you can't get around the fact that gzip is a single monolithic stream.
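For completeness, reading the members would then look something like the loop below; each open still decompresses the stream up to (and including) that member:

# Each entry in allfiles is an fsspec OpenFile
for of in allfiles:
    with of as f:
        print(of.path, len(f.read()))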