Open wking opened 8 years ago
Thanks for filing this. I explored this somewhat and I'm opposed to providing this functionality for the following reasons:
tar.Header.Bytes
field has very little use other than for people who want to have low-level access to the tar file. However, this can be still achieved today by using an io.TeeWriter
on the input reader (e.g., https://play.golang.org/p/xXOIJ0d63A). This work-around is hacky, but trying to get access to the low-level format is already hacky.Bytes
field when writing? Do we simply ignore it? Do we write Bytes
instead of what the Writer would have generated? We could document the exact behavior, but I'm not fond of a header field that is only valid for the Reader, but not for the Writer. As an API it behaves in unexpected ways.tar
archive can easily be forked to provide the new functionality. Suppose some new functionality came around, the user still needs to implement that functionality themselves. Either forking the package or using the above workaround seems sufficient.On Sat, Oct 29, 2016 at 02:46:22PM -0700, Joe Tsai wrote:
- The requested
tar.Header.Bytes
field has very little use other than for people who want to have low-level access to the tar file. However, this can be still achieved today by using anio.TeeWriter
on the input reader (e.g., https://play.golang.org/p/xXOIJ0d63A). This work-around is hacky, but trying to get access to the low-level format is already hacky.
I agree that tar.Header.Bytes is hacky and am open to the TeeWriter approach. However, it's possible that tar.Reader.Next skips elides two headers (e.g. an ‘x’ and a ‘g’). In that case, I'm not sure how you'd find the middle header short of implementing your own parser. You could drop tar.Header.Bytes (in favor of the TeeWriter approach) if you had NextRaw to give single steps.
- I believe the reason of "accessing information that is not yet represented in the tar structure" is somewhat weak since the
tar
archive can easily be forked to provide the new functionality. Suppose some new functionality came around, the user still needs to implement that functionality themselves. Either forking the package or using the above workaround seems sufficient.
Forking is so drastic, when NextRaw seems pretty straightforward.
- A better API might be to add a Format field to Header… However…
I agree that this is a bad fit for the stdlib for the reasons you give. I think NextRaw and TeeWriter would give sufficient access to graft this sort of thing onto the stdlib's reader. Do you have problems with the NextRaw interface? Would it help if I worked up a patch implementing it?
I think a Header.Bytes
field is off the table given:
An API to add NextRaw
is still a possibility, but I have many concerns.
NextRaw
do when it decodes a old GNU header for a sparse file? Do we parse the extended blocks as well? (I would assume yes, otherwise the next NextRaw
call will fail). This seems to break your assumption that "raw" always means 512B.NextRaw
is relevant to properly parse in the next call to NextRaw
? For example, let's say the next header is for the PAX headers, and the "size" record is set. When we call NextRaw
again, how does it know about information from a previous raw header (like the filesize) that is important to parse the next file?Next
and NextRaw
calls?On Wed, Nov 09, 2016 at 01:55:26PM -0800, Joe Tsai wrote:
I think a
Header.Bytes
field is off the table given:
- It's low number of use-cases
- A workaround with io.TeeWriter
Agreed, although the TeeWriter approach depends on NextRaw or similar 1.
An API to add
NextRaw
is still a possibility, but I have many concerns…
These are good points. The main issue with Bytes access and tar validation is that we need access to the tip of the entry point. How about:
// Next2 is just like Next, but if callback is non-nil it is called // after parsing a header block (512 bytes after the start of the // record). A single Next2 call may result in multiple callback // calls, e.g. in the case of pax extended headers. If callback // returns an error, Next2 returns the same error and does no // further reading. For records with GNU's "extra" or "sparse" // header extensions, callback will be called before the additional // headers are parsed, which means the values set in the callback's // header argument may be incomplete. If the callback returns with // no errors, Next2 will continue parsing as usual. Callbacks must // not modify the passed tar reader internally unless they can do so // without altering the state of the Reader reference assocaited // with Next2. func (tr _Reader) Next2(callback *HeaderCallback) (_Header, error)
type HeaderCallback(tar Reader, header Header) error
That seems reasonable to me, although the “don't break the reader” warning is a bit awkward. You could avoid that with:
type HeaderCallback(header *Header) error
but folks might want to use the Reader reference in a non-desctuctive way. For example, Python exposes global pax headers on the Reader/Writer 2, and Go might decide to handle global pax headers the same way (as an alternative to packing them all into the same Header map 3). On the other hand, folks can always use a closure or some such with the Header-only form if they do need a reference to the reader, so I'm ok either way.
Before we dig deeper into this API (which seems fairly complex), I want to understand the use case more. How does Next2
help achieve the goal of "validating the tar file"? Can you give practical and specific things that you would be checking for it to be "valid"?
a possible option (and one that I would love to reconcile with golang upstream) is the approach taken for https://github.com/vbatts/tar-split.
Currently we carry archive/tar
but add a functionality of enabling access to raw bytes.
See the docs: https://godoc.org/github.com/vbatts/tar-split/archive/tar
The only additions to the archive/tar
API is the RawAccounting bool
field and the RawBytes()
func on the Reader
struct.
(coincidentally I discovered this thread while preparing to open the discussion for pushing the RawBytes
feature to upstream)
On Mon, Jan 16, 2017 at 01:54:05PM -0800, Vincent Batts wrote:
See the docs: https://godoc.org/github.com/vbatts/tar-split/archive/tar The only additions to the
archive/tar
API is theRawAccounting bool
field and theRawBytes()
func on theReader
struct.
Does this collect all the bytes (from both the headers and payloads) between RawBytes calls when RawAccounting is set? If so, it sounds a lot like the io.TeeWriter approach 1. Is it just a more conventient API for that behavior, or does it give you a way to trigger on tar headers that are currently elided by the upstream Go library (e.g. pax x typeflag records)?
It does not seem like the RawBytes
API provides a way to trigger on elided headers.
As @wking mention, it seems that the RawBytes
API is identical to using a TeeReader
around the source io.Reader
except the RawBytes
does not record the body.
It seems that this could be accomplished with a modified TeeReader
approach: https://play.golang.org/p/ZUanaldFcu
However, my modification falls short in the following two areas:
Next
, which incorrectly causes the TeeReader
to read padding that is not part of the Header.Next
, otherwise part of the body is accidentally copied into TeeReader
. This is inefficient when the body is a sparse file or the input io.Reader
is actually a io.Seeker
and data could have been quickly discarded.If those are the only two deficiencies (correct me if I'm wrong), I would much rather see the following happen instead:
Read
function guarantees that it consumes the padding when it hits io.EOF
. This could be documented behavior with a test ensuring it's persistence in the standard library.Reader.Discard
method to tell the read to drop some number of bytes. A discard method is useful for the handling of sparse files (see my comment on #13548).The first change requires no API change and is something I would want to see anyways. The later API change at least provides benefit for another functionality more popularly requested of the tar
package.
@dsnet understandable, but the need was in accessing only the raw headers and padding, and not the body of each entry, as the tar.Reader.Read() already provides this. @wking in this way, it does allow for collecting all the bytes. That's the purpose of tar-split, such that all the non-payload bytes can be preserved for re-assembling the cryptographic-equivalent original.
There is nothing for triggering on tar headers. Though a parser of the RawBytes ought not be too terrible.
Additionally, that RawBytes approach does not affect the tar.Writer at all. It is only for reading.
The current master implementation elides records, for example pax
x
typeflag records are collapsed into the following record with the extended headers being merged into the normal file's headers. This is useful for most consumers, who are only interested in consuming the tar content. But two use cases would benefit from more direct access to the lower-level tar details:tar.Header
does not currently expose pax extended headers outside of these names and prefixes. There are plans to rectify this limitation (#14472), but until the tar structures expose all of these sorts of details it will be useful to have a way to directly access the header and file data. This will allow users to add their own parsers for any extension they need which is not yet supported by the stdlib, without requiring them to fork archive/tar or replace it with a completely novel parser.Next()
never returns thex
typeflag headers).I think we can address both of these cases without breaking the existing API with two changes:
tar.Reader.NextRaw()
that works liketar.Reader.Next()
, but always returns the next record regardless of its typeflag.tar.Header.Bytes
exposing a[]byte
slice of the raw header data, to allow clients to access header fields that are not yet mapped to othertar.Header
attributes. Alternatively, this could be a methodtar.Header.Bytes()
(or an attribute exposing astring
or whatever) to make the data clearly read-only.Folks who wish to control writing at a low level may want similar changes on the
tar.Writer
side, but I think that is orthogonal enough to punt to a separate issue (if anyone wants it).CC @dsnet.