jbusecke opened this issue 1 year ago
Thanks for opening this issue. I agree we need to support this workflow somehow, since these kinds of archives are unfortunately very common.
I think once #369 is done, it will be much more clear how to do this. Basically we will just create a custom PTransform to do the unzipping.
I agree we'll still want to wait for the beam-refactor to go in before approaching this, and the following is not necessarily a drop-in fix, but it seems like a related line of work:
https://github.com/sozip/sozip-spec/blob/master/blog/01-announcement.md
via
https://twitter.com/howardbutler/status/1612457687949901825?s=20&t=krbnOD1DVC6BeyEsfPz3_g
I have read about sozip following @rabernat pointing it out to me elsewhere. I would add a couple of things:
curl | tar
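As a rough illustration (a sketch, not part of the original comment), the programmatic equivalent of curl | tar is tarfile's streaming mode, which reads the archive sequentially rather than seeking; the URL is the Zenodo archive discussed below, and the target directory is an assumption.

import tarfile

import fsspec

# Sketch of the "curl | tar" pattern in Python: stream the remote archive
# and unpack it sequentially, without ever seeking in the gzip stream.
url = "https://zenodo.org/record/6609035/files/datasets.tar.gz?download=1"
with fsspec.open(url, "rb") as f:
    with tarfile.open(fileobj=f, mode="r|gz") as tf:
        tf.extractall("./unpacked")  # hypothetical local target directory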
Thanks for the clarifications, @martindurant!
I spent a little time playing with Python's tarfile and got the following code snippet working:
import fsspec
import tarfile

url = "https://zenodo.org/record/6609035/files/datasets.tar.gz?download=1"
fp = fsspec.open(url)
tf = tarfile.open(fileobj=fp.open(), mode='r:gz')
while True:
    member = tf.next()
    if member is None:
        break
    print(member)
This could be the basis for a Beam PTransform that emits each file as an element.
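A hedged sketch of what such a PTransform could look like (class name, element format, and the streaming mode are assumptions, not an existing pangeo-forge-recipes API):

import tarfile

import apache_beam as beam
import fsspec


class ExtractTarMembers(beam.PTransform):
    # Hypothetical transform: expand each archive URL into (name, bytes) pairs.
    def expand(self, pcoll):
        return pcoll | beam.FlatMap(self._extract)

    @staticmethod
    def _extract(url):
        with fsspec.open(url, "rb") as f:
            # Stream mode ("r|gz") walks the members sequentially without seeking.
            with tarfile.open(fileobj=f, mode="r|gz") as tf:
                for member in tf:
                    if member.isfile():
                        yield member.name, tf.extractfile(member).read()


# Example wiring (hypothetical):
# members = pipeline | beam.Create([url]) | ExtractTarMembers()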
https://gist.github.com/rabernat/616deabf2e12576f999470cbd82e9950
The fsspec one-liner might be
allfiles = fsspec.open_files(
    "tar://*::https://zenodo.org/record/6609035/files/datasets.tar.gz?download=1",
    tar={"compression": "gzip"},
)
but it still has to read through the entire stream. Other versions of the command are possible, but you can't get around gzip's single monolithic stream.
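For completeness, a short sketch (untested; assumes the listing above succeeded and the archive has at least one member) of how that result might be used:

# Each OpenFile is lazy, but opening one still decompresses the gzip stream
# up to and including that member.
print([of.path for of in allfiles])
with allfiles[0] as f:
    first_bytes = f.read(4)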
I wanted to highlight a use case I have encountered multiple times in recent weeks and which is only partially supported by pangeo-forge-recipes.
The core situation is always the following:
The files that are the source for a recipe are contained in a large compressed file (see https://github.com/pangeo-forge/staged-recipes/issues/219#issuecomment-1317335341 for an example). As a recipe builder I want to be able to work with the files contained in there and e.g. merge or concatenate them.
As I learned, if the container is a .zip or .tar file you can already index into it, but gzip does not offer that possibility.
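To illustrate the difference, here is a sketch with a hypothetical archive URL and member name (not a real dataset): zip keeps a central directory, so fsspec can list and open individual members without reading the whole file.

import fsspec

# Hypothetical archive URL and member name, for illustration only.
fs = fsspec.filesystem("zip", fo="https://example.org/container.zip")
print(fs.ls("/"))  # listing uses the zip's central directory
with fs.open("file1.nc", "rb") as f:
    header = f.read(4)  # read just the start of one member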
I wonder if there is a way to expand the functionality of pgf-recipes and allow some syntax that triggers caching/unpacking of .gz files but still maintains something akin to a URL per file.
If e.g. you have a file container.tar.gz which contains file1.nc and file2.nc and can be downloaded at http://zenodo.org/<project>/container.tar.gz, would it be at all possible to have some special command like GUNZIP that one could insert into a URL like this:

If pgf-recipes could recognize this 'command' (there is probably a better word for this), then the recipe could just require the data to be cached locally, unpack it, and do its usual thing?
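To make the idea concrete, here is a rough sketch (function name, layout, and the .nc glob are mine, not an existing pangeo-forge-recipes API) of what "cache locally, unpack, then do the usual thing" could look like:

import tarfile
from pathlib import Path

import fsspec


def cache_and_unpack(url, cache_dir="./cache"):
    """Hypothetical helper: fetch a .tar.gz once, unpack it locally, and
    return per-file paths that a recipe could treat like ordinary URLs."""
    dest = Path(cache_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with fsspec.open(url, "rb") as f:
        with tarfile.open(fileobj=f, mode="r|gz") as tf:
            tf.extractall(dest)
    return sorted(str(p) for p in dest.rglob("*.nc"))


# e.g. cache_and_unpack("http://zenodo.org/<project>/container.tar.gz")
# might return ["cache/file1.nc", "cache/file2.nc"]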