golang / go

The Go programming language

https://go.dev

BSD 3-Clause "New" or "Revised" License

124.24k stars 17.7k forks source link

archive/tar: add Reader.NextRaw method to read only one raw header #17657

Open wking opened 8 years ago

wking commented 8 years ago

The current master implementation elides records, for example pax x typeflag records are collapsed into the following record with the extended headers being merged into the normal file's headers. This is useful for most consumers, who are only interested in consuming the tar content. But two use cases would benefit from more direct access to the lower-level tar details:

Accessing information that is not yet represented in the the tar structure. For example, tar.Header does not currently expose pax extended headers outside of these names and prefixes. There are plans to rectify this limitation (#14472), but until the tar structures expose all of these sorts of details it will be useful to have a way to directly access the header and file data. This will allow users to add their own parsers for any extension they need which is not yet supported by the stdlib, without requiring them to fork archive/tar or replace it with a completely novel parser.
Validating tar files against a particular standard. If a protocol places requirements on the tar elements, users will need more direct access to the tar details to check the tar file against those requirements. For an example along these lines, see opencontainers/image-spec#342 and the extremely naive stub here (which currently doesn't work because Next() never returns the x typeflag headers).

I think we can address both of these cases without breaking the existing API with two changes:

Adding a tar.Reader.NextRaw() that works like tar.Reader.Next(), but always returns the next record regardless of its typeflag.
Adding tar.Header.Bytes exposing a []byte slice of the raw header data, to allow clients to access header fields that are not yet mapped to other tar.Header attributes. Alternatively, this could be a method tar.Header.Bytes() (or an attribute exposing a string or whatever) to make the data clearly read-only.

Folks who wish to control writing at a low level may want similar changes on the tar.Writer side, but I think that is orthogonal enough to punt to a separate issue (if anyone wants it).

CC @dsnet.

dsnet commented 8 years ago

Thanks for filing this. I explored this somewhat and I'm opposed to providing this functionality for the following reasons:

The requested tar.Header.Bytes field has very little use other than for people who want to have low-level access to the tar file. However, this can be still achieved today by using an io.TeeWriter on the input reader (e.g., https://play.golang.org/p/xXOIJ0d63A). This work-around is hacky, but trying to get access to the low-level format is already hacky.
What is the meaning of the Bytes field when writing? Do we simply ignore it? Do we write Bytes instead of what the Writer would have generated? We could document the exact behavior, but I'm not fond of a header field that is only valid for the Reader, but not for the Writer. As an API it behaves in unexpected ways.
I believe the reason of "accessing information that is not yet represented in the tar structure" is somewhat weak since the tar archive can easily be forked to provide the new functionality. Suppose some new functionality came around, the user still needs to implement that functionality themselves. Either forking the package or using the above workaround seems sufficient.
The desire to validate the tar format seems like a legitimate need, but the proposed API doesn't easily solve this. It seems that your approach was to get low-level access to the header bytes and to validate them yourselves. The workaround I presented allows you to do this.
A better API might be to add a Format field to Header. When reading, the Format will be filled out with a enum representing the format. When writing, Format tells the Writer what format to encode the header using.
- However this is harder than it sounds. The unfortunate nature of Postel's Law is that there are many tar readers that liberally decode non-compliant tar files. Even Go's reader will liberally decode files that are somewhat invalid (See #17665 for an example of this). While "specifications" exist for the USTAR, GNU, and PAX formats, the world has a very liberal conformance to those said "specifications". I have even seen instances of where the canonical implementation of the specification behaves differently from what the specification even says. The reality is: the tar format (with all of its variations) is not a format friendly to formal specifications (and thus not friendly to easy validation).

wking commented 8 years ago

On Sat, Oct 29, 2016 at 02:46:22PM -0700, Joe Tsai wrote:

The requested tar.Header.Bytes field has very little use other than for people who want to have low-level access to the tar file. However, this can be still achieved today by using an io.TeeWriter on the input reader (e.g., https://play.golang.org/p/xXOIJ0d63A). This work-around is hacky, but trying to get access to the low-level format is already hacky.

I agree that tar.Header.Bytes is hacky and am open to the TeeWriter approach. However, it's possible that tar.Reader.Next skips elides two headers (e.g. an ‘x’ and a ‘g’). In that case, I'm not sure how you'd find the middle header short of implementing your own parser. You could drop tar.Header.Bytes (in favor of the TeeWriter approach) if you had NextRaw to give single steps.

I believe the reason of "accessing information that is not yet represented in the tar structure" is somewhat weak since the tar archive can easily be forked to provide the new functionality. Suppose some new functionality came around, the user still needs to implement that functionality themselves. Either forking the package or using the above workaround seems sufficient.

Forking is so drastic, when NextRaw seems pretty straightforward.

A better API might be to add a Format field to Header… However…

I agree that this is a bad fit for the stdlib for the reasons you give. I think NextRaw and TeeWriter would give sufficient access to graft this sort of thing onto the stdlib's reader. Do you have problems with the NextRaw interface? Would it help if I worked up a patch implementing it?

dsnet commented 8 years ago

I think a Header.Bytes field is off the table given:

It's low number of use-cases
A workaround with io.TeeWriter

An API to add NextRaw is still a possibility, but I have many concerns.

What does NextRaw do when it decodes a old GNU header for a sparse file? Do we parse the extended blocks as well? (I would assume yes, otherwise the next NextRaw call will fail). This seems to break your assumption that "raw" always means 512B.
Let's suppose you decode the raw header for a PAX header, is it now your responsibility to parse the raw key-value PAX headers yourself?
What happens when some value obtained from a previous call to NextRaw is relevant to properly parse in the next call to NextRaw? For example, let's say the next header is for the PAX headers, and the "size" record is set. When we call NextRaw again, how does it know about information from a previous raw header (like the filesize) that is important to parse the next file?
What happens when you intertwine Next and NextRaw calls?

wking commented 8 years ago

On Wed, Nov 09, 2016 at 01:55:26PM -0800, Joe Tsai wrote:

I think a Header.Bytes field is off the table given:

It's low number of use-cases

A workaround with io.TeeWriter

Agreed, although the TeeWriter approach depends on NextRaw or similar 1.

An API to add NextRaw is still a possibility, but I have many concerns…

These are good points. The main issue with Bytes access and tar validation is that we need access to the tip of the entry point. How about:

// Next2 is just like Next, but if callback is non-nil it is called // after parsing a header block (512 bytes after the start of the // record). A single Next2 call may result in multiple callback // calls, e.g. in the case of pax extended headers. If callback // returns an error, Next2 returns the same error and does no // further reading. For records with GNU's "extra" or "sparse" // header extensions, callback will be called before the additional // headers are parsed, which means the values set in the callback's // header argument may be incomplete. If the callback returns with // no errors, Next2 will continue parsing as usual. Callbacks must // not modify the passed tar reader internally unless they can do so // without altering the state of the Reader reference assocaited // with Next2. func (tr _Reader) Next2(callback *HeaderCallback) (_Header, error)

type HeaderCallback(tar Reader, header Header) error

That seems reasonable to me, although the “don't break the reader” warning is a bit awkward. You could avoid that with:

type HeaderCallback(header *Header) error

but folks might want to use the Reader reference in a non-desctuctive way. For example, Python exposes global pax headers on the Reader/Writer 2, and Go might decide to handle global pax headers the same way (as an alternative to packing them all into the same Header map 3). On the other hand, folks can always use a closure or some such with the Header-only form if they do need a reference to the reader, so I'm ok either way.

dsnet commented 8 years ago

Before we dig deeper into this API (which seems fairly complex), I want to understand the use case more. How does Next2 help achieve the goal of "validating the tar file"? Can you give practical and specific things that you would be checking for it to be "valid"?

vbatts commented 7 years ago

a possible option (and one that I would love to reconcile with golang upstream) is the approach taken for https://github.com/vbatts/tar-split. Currently we carry archive/tar but add a functionality of enabling access to raw bytes. See the docs: https://godoc.org/github.com/vbatts/tar-split/archive/tar The only additions to the archive/tar API is the RawAccounting bool field and the RawBytes() func on the Reader struct.

(coincidentally I discovered this thread while preparing to open the discussion for pushing the RawBytes feature to upstream)

wking commented 7 years ago

On Mon, Jan 16, 2017 at 01:54:05PM -0800, Vincent Batts wrote:

See the docs: https://godoc.org/github.com/vbatts/tar-split/archive/tar The only additions to the archive/tar API is the RawAccounting bool field and the RawBytes() func on the Reader struct.

Does this collect all the bytes (from both the headers and payloads) between RawBytes calls when RawAccounting is set? If so, it sounds a lot like the io.TeeWriter approach 1. Is it just a more conventient API for that behavior, or does it give you a way to trigger on tar headers that are currently elided by the upstream Go library (e.g. pax x typeflag records)?

dsnet commented 7 years ago

It does not seem like the RawBytes API provides a way to trigger on elided headers.

As @wking mention, it seems that the RawBytes API is identical to using a TeeReader around the source io.Reader except the RawBytes does not record the body.

It seems that this could be accomplished with a modified TeeReader approach: https://play.golang.org/p/ZUanaldFcu

However, my modification falls short in the following two areas:

Consuming the extra padding for the body only occurs at Next, which incorrectly causes the TeeReader to read padding that is not part of the Header.
This requires reading the entire body before calling Next, otherwise part of the body is accidentally copied into TeeReader. This is inefficient when the body is a sparse file or the input io.Reader is actually a io.Seeker and data could have been quickly discarded.

If those are the only two deficiencies (correct me if I'm wrong), I would much rather see the following happen instead:

That the Read function guarantees that it consumes the padding when it hits io.EOF. This could be documented behavior with a test ensuring it's persistence in the standard library.
That we add a Reader.Discard method to tell the read to drop some number of bytes. A discard method is useful for the handling of sparse files (see my comment on #13548).

The first change requires no API change and is something I would want to see anyways. The later API change at least provides benefit for another functionality more popularly requested of the tar package.

dsnet commented 7 years ago

18710 is about adding functionality to tell the Writer what format to use (and conversely for the Reader to report what format was read). @wking, I'm wondering if this will solve your original use-case for this issue where you needed to do validation of the format.

vbatts commented 7 years ago

@dsnet understandable, but the need was in accessing only the raw headers and padding, and not the body of each entry, as the tar.Reader.Read() already provides this. @wking in this way, it does allow for collecting all the bytes. That's the purpose of tar-split, such that all the non-payload bytes can be preserved for re-assembling the cryptographic-equivalent original.

There is nothing for triggering on tar headers. Though a parser of the RawBytes ought not be too terrible.

Additionally, that RawBytes approach does not affect the tar.Writer at all. It is only for reading.