golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
122.84k stars 17.51k forks source link

archive/tar: add Reader.NextRaw method to read only one raw header #17657

Open wking opened 7 years ago

wking commented 7 years ago

The current master implementation elides records, for example pax x typeflag records are collapsed into the following record with the extended headers being merged into the normal file's headers. This is useful for most consumers, who are only interested in consuming the tar content. But two use cases would benefit from more direct access to the lower-level tar details:

I think we can address both of these cases without breaking the existing API with two changes:

  1. Adding a tar.Reader.NextRaw() that works like tar.Reader.Next(), but always returns the next record regardless of its typeflag.
  2. Adding tar.Header.Bytes exposing a []byte slice of the raw header data, to allow clients to access header fields that are not yet mapped to other tar.Header attributes. Alternatively, this could be a method tar.Header.Bytes() (or an attribute exposing a string or whatever) to make the data clearly read-only.

Folks who wish to control writing at a low level may want similar changes on the tar.Writer side, but I think that is orthogonal enough to punt to a separate issue (if anyone wants it).

CC @dsnet.

dsnet commented 7 years ago

Thanks for filing this. I explored this somewhat and I'm opposed to providing this functionality for the following reasons:

wking commented 7 years ago

On Sat, Oct 29, 2016 at 02:46:22PM -0700, Joe Tsai wrote:

  • The requested tar.Header.Bytes field has very little use other than for people who want to have low-level access to the tar file. However, this can be still achieved today by using an io.TeeWriter on the input reader (e.g., https://play.golang.org/p/xXOIJ0d63A). This work-around is hacky, but trying to get access to the low-level format is already hacky.

I agree that tar.Header.Bytes is hacky and am open to the TeeWriter approach. However, it's possible that tar.Reader.Next skips elides two headers (e.g. an ‘x’ and a ‘g’). In that case, I'm not sure how you'd find the middle header short of implementing your own parser. You could drop tar.Header.Bytes (in favor of the TeeWriter approach) if you had NextRaw to give single steps.

  • I believe the reason of "accessing information that is not yet represented in the tar structure" is somewhat weak since the tar archive can easily be forked to provide the new functionality. Suppose some new functionality came around, the user still needs to implement that functionality themselves. Either forking the package or using the above workaround seems sufficient.

Forking is so drastic, when NextRaw seems pretty straightforward.

  • A better API might be to add a Format field to Header… However…

I agree that this is a bad fit for the stdlib for the reasons you give. I think NextRaw and TeeWriter would give sufficient access to graft this sort of thing onto the stdlib's reader. Do you have problems with the NextRaw interface? Would it help if I worked up a patch implementing it?

dsnet commented 7 years ago

I think a Header.Bytes field is off the table given:

An API to add NextRaw is still a possibility, but I have many concerns.

wking commented 7 years ago

On Wed, Nov 09, 2016 at 01:55:26PM -0800, Joe Tsai wrote:

I think a Header.Bytes field is off the table given:

  • It's low number of use-cases
  • A workaround with io.TeeWriter

Agreed, although the TeeWriter approach depends on NextRaw or similar 1.

An API to add NextRaw is still a possibility, but I have many concerns…

These are good points. The main issue with Bytes access and tar validation is that we need access to the tip of the entry point. How about:

// Next2 is just like Next, but if callback is non-nil it is called // after parsing a header block (512 bytes after the start of the // record). A single Next2 call may result in multiple callback // calls, e.g. in the case of pax extended headers. If callback // returns an error, Next2 returns the same error and does no // further reading. For records with GNU's "extra" or "sparse" // header extensions, callback will be called before the additional // headers are parsed, which means the values set in the callback's // header argument may be incomplete. If the callback returns with // no errors, Next2 will continue parsing as usual. Callbacks must // not modify the passed tar reader internally unless they can do so // without altering the state of the Reader reference assocaited // with Next2. func (tr _Reader) Next2(callback *HeaderCallback) (_Header, error)

type HeaderCallback(tar Reader, header Header) error

That seems reasonable to me, although the “don't break the reader” warning is a bit awkward. You could avoid that with:

type HeaderCallback(header *Header) error

but folks might want to use the Reader reference in a non-desctuctive way. For example, Python exposes global pax headers on the Reader/Writer 2, and Go might decide to handle global pax headers the same way (as an alternative to packing them all into the same Header map 3). On the other hand, folks can always use a closure or some such with the Header-only form if they do need a reference to the reader, so I'm ok either way.

dsnet commented 7 years ago

Before we dig deeper into this API (which seems fairly complex), I want to understand the use case more. How does Next2 help achieve the goal of "validating the tar file"? Can you give practical and specific things that you would be checking for it to be "valid"?

vbatts commented 7 years ago

a possible option (and one that I would love to reconcile with golang upstream) is the approach taken for https://github.com/vbatts/tar-split. Currently we carry archive/tar but add a functionality of enabling access to raw bytes. See the docs: https://godoc.org/github.com/vbatts/tar-split/archive/tar The only additions to the archive/tar API is the RawAccounting bool field and the RawBytes() func on the Reader struct.

(coincidentally I discovered this thread while preparing to open the discussion for pushing the RawBytes feature to upstream)

wking commented 7 years ago

On Mon, Jan 16, 2017 at 01:54:05PM -0800, Vincent Batts wrote:

See the docs: https://godoc.org/github.com/vbatts/tar-split/archive/tar The only additions to the archive/tar API is the RawAccounting bool field and the RawBytes() func on the Reader struct.

Does this collect all the bytes (from both the headers and payloads) between RawBytes calls when RawAccounting is set? If so, it sounds a lot like the io.TeeWriter approach 1. Is it just a more conventient API for that behavior, or does it give you a way to trigger on tar headers that are currently elided by the upstream Go library (e.g. pax x typeflag records)?

dsnet commented 7 years ago

It does not seem like the RawBytes API provides a way to trigger on elided headers.

As @wking mention, it seems that the RawBytes API is identical to using a TeeReader around the source io.Reader except the RawBytes does not record the body.

It seems that this could be accomplished with a modified TeeReader approach: https://play.golang.org/p/ZUanaldFcu

However, my modification falls short in the following two areas:

If those are the only two deficiencies (correct me if I'm wrong), I would much rather see the following happen instead:

The first change requires no API change and is something I would want to see anyways. The later API change at least provides benefit for another functionality more popularly requested of the tar package.

dsnet commented 7 years ago

18710 is about adding functionality to tell the Writer what format to use (and conversely for the Reader to report what format was read). @wking, I'm wondering if this will solve your original use-case for this issue where you needed to do validation of the format.

vbatts commented 7 years ago

@dsnet understandable, but the need was in accessing only the raw headers and padding, and not the body of each entry, as the tar.Reader.Read() already provides this. @wking in this way, it does allow for collecting all the bytes. That's the purpose of tar-split, such that all the non-payload bytes can be preserved for re-assembling the cryptographic-equivalent original.

There is nothing for triggering on tar headers. Though a parser of the RawBytes ought not be too terrible.

Additionally, that RawBytes approach does not affect the tar.Writer at all. It is only for reading.