Open ncw opened 1 year ago
Finally had time to read this and digest the issue being described (sorry!) :sweat_smile:
You're right, this is a little annoying; I've encountered similar issues with the `Extract` method before: it's slow on large archives, like if you want to extract just one file from a big .tar archive, the file might be at the very end. Or if you want to list a directory, yes, you have to scan the entire archive. Memoization can help, like you mentioned. I haven't implemented that yet since I haven't decided on a good way to manage the memory implicitly in this case. But I'm open to it! (That way the caller wouldn't necessarily have to do it, possibly.)
I'm not sure seeking tar files is really useful unless you know the offset for every file in the archive; you'd have to scan the whole archive once first to build that index, since I don't think tar keeps one like zip does. It would be worth some experimenting to see if it's possible to get the byte offset reliably, especially using just the standard lib reader. If it doesn't do any/much buffering, then one possibility is to create a new method -- not sure what to call it, maybe `Scan()` or `Index()` -- that does a quick pass through the file to record the offsets as it goes. Then future lookups are O(1). (This would probably not work for compressed archives, like you mentioned.)
Ok, so now that I'm on the same page, I don't have a great solution to this either :( I've kind of come to accept the fact that tar archives can be slow to read in some cases -- and I should probably update the docs to mention that until we find a way to speed things up. But I'm very interested in speeding things up if possible.
I know tar files are the darling archive format of the Unix world, but frankly they're only good at one thing: streaming tape archives. They're not good if you need random access or structured file listings. :( In many ways, I do consider zip to be a superior format in that sense...
> Finally had time to read this and digest the issue being described (sorry!) :sweat_smile:
Sorry it was a bit of a brain dump issue!
> You're right, this is a little annoying; I've encountered similar issues with the `Extract` method before: it's slow on large archives, like if you want to extract just one file from a big .tar archive, the file might be at the very end. Or if you want to list a directory, yes, you have to scan the entire archive. Memoization can help, like you mentioned. I haven't implemented that yet since I haven't decided on a good way to manage the memory implicitly in this case. But I'm open to it! (That way the caller wouldn't necessarily have to do it, possibly.)
I can turn on caching for this in rclone and it will cache the archive to disk which would make the performance acceptable at the cost of disk usage. I could make this automatic for everything except .zips and that would probably satisfy most rclone users.
> I'm not sure seeking tar files is really useful unless you know the offset for every file in the archive; you'd have to scan the whole archive once first to build that index, since I don't think tar keeps one like zip does.
You are right. It would require scanning the whole file to find out what files are within.
I was thinking this might be useful if you had a big tar with only a few files (imagine a 1GB tar with 10 x 100MB files) and just wanted to extract one of them, but with a tar with lots of files, seeking to each tar header in the stream will be very slow! (A seek in an HTTP stream means re-opening it with a new Range request and typically takes about 1s).
Most tars are compressed though, and seeking in a compressed tar is very hard. If the tar was compressed with gzip with sync points then it is easier, but that isn't standard.
> It would be worth some experimenting to see if it's possible to get the byte offset reliably, especially using just the standard lib reader. If it doesn't do any/much buffering, then one possibility is to create a new method -- not sure what to call it, maybe `Scan()` or `Index()` -- that does a quick pass through the file to record the offsets as it goes. Then future lookups are O(1). (This would probably not work for compressed archives, like you mentioned.)
I think this is possible, but not with `archive/tar`. I looked at the code in that and there is a lot of state in the reader you'd have to replace to go back and read a file.
> Ok, so now that I'm on the same page, I don't have a great solution to this either :( I've kind of come to accept the fact that tar archives can be slow to read in some cases -- and I should probably update the docs to mention that until we find a way to speed things up. But I'm very interested in speeding things up if possible.
I guess one possibility is to integrate this: https://pkg.go.dev/github.com/google/crfs@v0.0.0-20191108021818-71d77da419c9/stargz which reads and writes tars with individually compressed files and an index -- that would seek perfectly! Yes, that is a bit of bradfitz magic :-) It's worth reading the explainer on the parent package.
That could be another compression format for archiver to support - not sure how you'd deal with the overlap with the existing tar format, or if you can easily tell a stargz file from normal .tar.gz.
> I know tar files are the darling archive format of the Unix world, but frankly they're only good at one thing: streaming tape archives. They're not good if you need random access or structured file listings. :( In many ways, I do consider zip to be a superior format in that sense...
Aye :-)
Thanks for the discussion!
Stargz looks like an improvement, although it is definitely non-standard... I'd be surprised if it was encountered in any significant number of places in the wild. I might be open to a proposal to how that would look/work with this library, but might only be convinced to accept it if it is popular enough.
Another option may be to simply not support the advanced/seeking feature for .tar (or at least compressed .tar) -- basically just support the functionality for file formats that are properly designed for such a thing. I know that kind of sucks, but so does the .tar format for seeking. :upside_down_face:
From an rclone point of view, only the `.zip` archiver works really well for reading. The in-development archiver backend will write all the formats archiver supports, but downloading is a problem for everything except `.zip`, as rclone wants to read a file listing before transferring the files.
I guess I could do a backend-specific command which would effectively run `Extract` on all the files if you wanted to be efficient about it. So an `rclone backend extract :archive:drive:my.tar.gz destination:whatever` command.
I'm attracted to the stargz format as it would work well for reading and writing with the rclone backend so from that point of view it would be useful if it was part of Archiver.
If you want I can have a go at a PR.
Sure, I'd be curious what a stargz PR would look like if it's not too invasive :slightly_smiling_face: :+1:
So, in #426, I clarified the documentation to state that files should not be used after the handler returns (they should be closed).
I gave this a lot more thought over the last few days and came to the same conclusion you started with, but I will still hope to see some enhancements in either the std lib or a tar package that can support seeking better once the offset of a target file is known.
What would you like to have changed?
Currently when you call Extract on a .zip file the handler method is given `File` objects with `Open` methods which can be called some time in the future, after the handler has returned. This is because the zip reader is upgraded into an `io.ReaderAt` so the files can be seeked and opened later.
However if you do the same with tar, the `File.Open` methods are only valid as long as the handler hasn't returned. If you call them after the handler has returned they return 0 bytes.

One solution would be to run Extract on the archive again with just the file name required. This would be an O(n^2) solution though :-( In rclone's case this would mean reading the entire tar file off remote storage in its entirety for each file extracted, which is super inefficient. Rclone could use local caching for this, which would help somewhat.
For plain `.tar` files, they should be quite seekable in theory, but I don't think this is something that we can get from the standard library. For compressed `.tar.gz`, for example, one would need to keep the state of the decompressor at the start of each file in order to seek, which is starting to sound very complicated. You can create seekable gzip files (rclone does this in its compress backend) but that is complicated and non-standard.

I think for rclone's purposes compressed tar files would mean you've got to read the entire file just to see the directory entries -- I don't think there is any way around that. Whereas in theory at least you should be able to skip the actual file data when scanning an uncompressed `.tar`. I looked through the source of `archive/tar` and if the reader is capable of seeking then it uses it to skip data where necessary, so that is good.

I think reading the seek position of the underlying reader for each file in an uncompressed `.tar` could be interesting, but that would need a fork of `archive/tar`.

Any ideas?