
archive/zip: add File.OpenRaw, Writer.CreateRaw, Writer.Copy #34974

Closed saracen closed 3 years ago

saracen commented 5 years ago

If you have a compressed file and know its uncompressed size and CRC32 checksum, it'd be nice to be able to add it to an archive as-is.

I have three current use-cases for this:

  • Repackaging a zip file, removing or adding files, without incurring the associated compression overheads for files that already exist (somewhat achieving #15626)
  • Compressing first and including later, based on whether the compressed size was smaller than the original, without having to perform the compression twice
  • Supporting concurrent compression: compressing many files concurrently and then copying the already compressed files to the archive (similar to Apache Commons Compress' ParallelScatterZipCreator)

I see three different ways we could achieve this:

  1. Create a zip.CreateHeaderRaw(fh *FileHeader) (io.Writer, error) function, that uses the FileHeader's CRC32 and UncompressedSize64 fields set by the user.
  2. Use the existing CreateHeader(fh *FileHeader) (io.Writer, error) function, but have a new FileHeader field that indicates we're going to write already compressed data (and then use CRC32 and UncompressedSize64 fields set by the user)
  3. Use the existing CreateHeader(fh *FileHeader) (io.Writer, error) function, but if CompressedSize64 has already been set, assume that data written is already compressed.

I'm going to assume that option 3 would be a no-go, because existing code might suddenly break if a user has already set the CompressedSize for whatever reason, but hopefully the other options are viable.

julieqiu commented 5 years ago

/cc @dsnet

gopherbot commented 5 years ago

Change https://golang.org/cl/202217 mentions this issue: archive/zip: support adding raw files

saracen commented 5 years ago

Added a change that uses option 1 (CreateHeaderRaw), because it feels weird to use a field in FileHeader solely for archiving when that structure is also used when reading an archive.

ianlancetaylor commented 5 years ago

This is going to be new API, so turning this into a proposal.

@dsnet Please add any thoughts you may have about this functionality. Thanks.

escholtz commented 4 years ago

The project I'm working on would appreciate this functionality for performance reasons. We need to replace a single file within a zip. The DEFLATE tax is high and wasteful since we decode/encode without needing or using the intermediate data.

I like @rsc's Copy implementation in https://github.com/rsc/zipmerge. It might not be quite as flexible as the other proposals, but I think it fits well within the existing API and supports efficiently creating a new zip from an existing zip while allowing the user to replace/delete/update specific files.

// Copy copies the file f (obtained from a Reader) into w.
// It copies the compressed form directly.
func (w *Writer) Copy(f *File) error {
    dataOffset, err := f.DataOffset()
    if err != nil {
        return err
    }
    if err := w.closeLastWriter(); err != nil {
        return err
    }

    fh := f.FileHeader
    h := &header{
        FileHeader: &fh,
        offset:     uint64(w.cw.count),
    }
    fh.Flags |= 0x8 // we will write a data descriptor
    w.dir = append(w.dir, h)

    if err := writeHeader(w.cw, &fh); err != nil {
        return err
    }

    r := io.NewSectionReader(f.zipr, dataOffset, int64(f.CompressedSize64))
    if _, err := io.Copy(w.cw, r); err != nil {
        return err
    }

    return writeDesc(w.cw, &fh)
}

Here are the latest compress/flate benchmarks from master at level 5 which archive/zip uses by default (Intel Xeon W-2135 CPU @ 3.70GHz × 6):

goos: linux
goarch: amd64
pkg: compress/flate
BenchmarkDecode/Digits/Level5/1e4-12                6530        164347 ns/op      60.85 MB/s       40598 B/op          7 allocs/op
BenchmarkDecode/Digits/Level5/1e5-12                 960       1131532 ns/op      88.38 MB/s       40842 B/op         13 allocs/op
BenchmarkDecode/Digits/Level5/1e6-12                 117       9673619 ns/op     103.37 MB/s       44800 B/op         79 allocs/op
BenchmarkDecode/Newton/Level5/1e4-12                8076        164681 ns/op      60.72 MB/s       41114 B/op         17 allocs/op
BenchmarkDecode/Newton/Level5/1e5-12                1653        765625 ns/op     130.61 MB/s       44283 B/op         31 allocs/op
BenchmarkDecode/Newton/Level5/1e6-12                 180       6408771 ns/op     156.04 MB/s       61144 B/op        156 allocs/op
BenchmarkEncode/Digits/Level5/1e4-12                5710        215617 ns/op      46.38 MB/s
BenchmarkEncode/Digits/Level5/1e5-12                 370       3232511 ns/op      30.94 MB/s
BenchmarkEncode/Digits/Level5/1e6-12                  32      34852520 ns/op      28.69 MB/s
BenchmarkEncode/Newton/Level5/1e4-12                4774        238334 ns/op      41.96 MB/s
BenchmarkEncode/Newton/Level5/1e5-12                 357       2974790 ns/op      33.62 MB/s
BenchmarkEncode/Newton/Level5/1e6-12                  39      30595764 ns/op      32.68 MB/s
PASS
saracen commented 4 years ago

@escholtz I agree that having a (*Writer) Copy(f *File) would be nice, but as you said, it isn't quite as flexible.

For comparison, copying a file with the patch I've submitted would look something like this:

func Copy(zw *zip.Writer, zf io.ReaderAt, f *zip.File) error {
    offset, err := f.DataOffset()
    if err != nil {
        return err
    }

    w, err := zw.CreateHeaderRaw(&f.FileHeader)
    if err != nil {
        return err
    }

    _, err = io.Copy(w, io.NewSectionReader(zf, offset, int64(f.CompressedSize64)))

    return err
}

Having Copy() would be great, but if for some reason only one solution could be added, I'd prefer it to be something along the lines of CreateHeaderRaw.

For the other use cases originally mentioned, I've written https://github.com/saracen/fastzip that takes advantage of CreateHeaderRaw also (but I've had to vendor archive/zip and include my modification for it to work).

klauspost commented 4 years ago

Implemented in https://github.com/klauspost/compress/pull/214

rsc commented 4 years ago

I vaguely remember discussing this with @rogpeppe years ago, but I can't find the discussion. Any clearer memory, Roger?

rsc commented 4 years ago

My memory is coming back. I think what I remember @rogpeppe and I talking about was the ability to modify an existing zip file by appending new files to it.

saracen commented 4 years ago

@rsc https://groups.google.com/g/golang-dev/c/ZSubqlF2G4k?pli=1 this might be the discussion, and yeah, was related to appending.

rsc commented 4 years ago

Adding a whole new Create method just to set this extra bit seems unfortunate - what if there is a second bit we want control over later? Add four new methods? And so on.

Probably better to add a new bool in the CreateHeader. Perhaps WriteRaw bool?

saracen commented 4 years ago

Probably better to add a new bool in the CreateHeader. Perhaps WriteRaw bool?

@rsc I originally preferred this too. The reason I didn't for my implementation is that FileHeader is also used when reading an archive. It felt odd having a WriteRaw bool as part of that structure in the context of reads. Is this not an issue?

rsc commented 4 years ago

@saracen it seems like less of an issue than an exponential blowup of create methods. The NonUTF8 field is probably most analogous, although it does get set in the reader.

I suppose the most interesting analogy is whether there's a need to get the raw bytes from the reader as well. There at least we could add an OpenRaw method to zip.File. It's much harder to do that in what CreateHeader returns, because it's an io.Writer.

I suppose another way to store compressed files would be to register a no-op compressor.

Another way would be to have a method on zip.Writer, WriteRaw(bool), to set whether the compressors should be avoided. You'd call z.WriteRaw(true) before calling CreateHeader.
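
As a rough sketch, that last idea would read something like this (hypothetical API; the method name and semantics are assumed, not implemented):

zw := zip.NewWriter(f)
zw.WriteRaw(true) // subsequent CreateHeader writers would accept precompressed bytes
w, _ := zw.CreateHeader(hdr)
// ... write raw DEFLATE data to w ...
zw.WriteRaw(false) // back to normal compressing behavior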

Do any of these sound like a winner to anyone?

saracen commented 4 years ago

I suppose the most interesting analogy is whether there's a need to get the raw bytes from the reader as well.

@rsc At the moment, I believe there are two ways to access the raw bytes:

  • using File.DataOffset() and reading CompressedSize64 bytes directly from the underlying io.ReaderAt
  • registering a pass-through Decompressor with RegisterDecompressor, so that Open returns the compressed stream

I suppose another way to store compressed files would be to register a no-op compressor.

This sounds quite good, especially since it's similar to how you can handle raw reads by registering a decompressor.
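
For reference, a minimal sketch of that read-side trick, using the existing RegisterDecompressor API (note that Open's CRC32 check at EOF will report a mismatch, since it sees compressed bytes, so that error has to be ignored):

zr, _ := zip.NewReader(ra, size)
zr.RegisterDecompressor(zip.Deflate, func(r io.Reader) io.ReadCloser {
    return io.NopCloser(r) // hand the compressed stream back untouched
})
rc, _ := zr.File[0].Open() // now yields the raw DEFLATE bytes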

Using a zip.NoOpCompressor would need to mean that the CompressedSize64 and CRC32 fields of the header are left unmodified when the header is closed:

zw := zip.NewWriter(archive)
zw.RegisterCompressor(zip.Deflate, zip.NoOpCompressor)

hdr := &zip.FileHeader{
  Name: "file.txt",

  // the following fields would be left untouched because
  // the writer detects that zip.NoOpCompressor is being used
  CRC32: 0xffffffff,
  CompressedSize64: 1024,
}

w, _ := zw.CreateHeader(hdr)

// bytes are copied to the zip archive unmodified
io.Copy(w, buf)
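
(zip.NoOpCompressor is hypothetical; the sketch above assumes something like this pass-through Compressor:)

// hypothetical pass-through Compressor; bytes are written through unchanged
var NoOpCompressor zip.Compressor = func(w io.Writer) (io.WriteCloser, error) {
    return nopCloser{w}, nil
}

type nopCloser struct{ io.Writer }

func (nopCloser) Close() error { return nil }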
rsc commented 4 years ago

The problem with the no-op compressor is that the uncompressed size would be wrong. That would have to be left alone. It's probably a little too subtle.

Given that you can already read the compressed data directly, as @saracen points out, maybe it's fine to focus on writing.

What if pre-filling CompressedSize64 (> 0) in the FileHeader means that the writes are expected to be pre-compressed? In that case CRC32 has to be filled in too, although zero is a valid CRC32 so we can't double-check that. But I believe 0 is not a valid CompressedSize64, and in any event it's not an interesting one.

In essence, CompressedSize64 = 0 means "I don't know, you compress it for me".

Edit: I see that this was @saracen's suggestion 3 in the original comment at the top.
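
As a sketch, those semantics would look roughly like this (this variant was not ultimately adopted; variable names are assumed):

hdr := &zip.FileHeader{
    Name:               "file.txt",
    Method:             zip.Deflate,
    CRC32:              crc,  // caller-supplied checksum of the uncompressed data
    CompressedSize64:   clen, // nonzero would signal "writes are already compressed"
    UncompressedSize64: ulen,
}
w, _ := zw.CreateHeader(hdr)
// w would now accept raw DEFLATE bytes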

rsc commented 4 years ago

@saracen, @escholtz, @klauspost, any thoughts about the most recent suggestion (previous comment)?

dsnet commented 4 years ago

According to RFC 1951, the compressed stream requires at least one block (with the BFINAL bit set). Therefore, it is impossible to have a compressed DEFLATE stream that is zero bytes in length.

However, the ZIP file format allows a number of different compression methods to be used (DEFLATE happens to be the most common by far). I think we should think about 1) whether we want to support other compression methods, and 2) whether we ever want to support other compression formats that may theoretically have a zero-length output for an empty input.

escholtz commented 4 years ago

I've been hesitant to comment because I didn't want to hijack @saracen's original proposal.

For my use case, the func (w *Writer) Copy(f *File) error implementation would be sufficient. I think it fits cleanly with the existing package exports and still maintains simplicity.

I agree that having both CreateHeader and CreateHeaderRaw does not feel like a clean standard library package.

Is there precedent in the standard library for using the CompressedSize64 approach?

saracen commented 4 years ago

For my use case, func (w *Writer) Copy(f *File) error doesn't help, but I can see its usefulness and how it relates to this problem.

@rsc In my original proposal, I'd written:

  1. Use the existing CreateHeader(fh *FileHeader) (io.Writer, error) function, but if CompressedSize64 has already been set, assume that data written is already compressed.

I'm going to assume that option 3 would be a no-go, because existing code might suddenly break if a user has already set the CompressedSize for whatever reason, but hopefully the other options are viable.

Breaking existing code was my only reluctance to this approach. If we don't see that as an issue, it sounds like an easy way to support this.

escholtz commented 4 years ago

@saracen I took another look at the three examples you originally provided and I think they could all be achieved with a Copy method?

  • Repackaging a zip file, removing or adding files, without incurring the associated compression overheads for files that already exist (somewhat achieving #15626)

Using existing zip file(s), don't Copy files you want to remove. Copy existing files you want to keep. Use existing methods to add new files.

  • Compress first and include later based on whether the compressed size was smaller than the original, without having to perform the compression twice.

Write a new zip file, read the compressed size, only Copy if it's smaller.

  • Support concurrent compression: compress many files concurrently and then copy the already compressed files to the archive (similar to Apache Commons Compress' ParallelScatterZipCreator)

Write multiple zip files concurrently and then Copy files from each into a single zip file (see the sketch at the end of this comment).


Perhaps this approach isn't as elegant and has slightly more garbage/performance overhead, but it avoids the tax of repeatedly compressing the same file, which I would expect is the bottleneck for most cases.
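
For the concurrency case, a minimal sketch of that pattern, assuming the proposed (*Writer) Copy method exists (worker error handling elided for brevity):

func parallelZip(out io.Writer, files map[string][]byte) error {
    var (
        mu   sync.Mutex
        wg   sync.WaitGroup
        bufs []*bytes.Buffer
    )
    for name, data := range files {
        wg.Add(1)
        go func(name string, data []byte) {
            defer wg.Done()
            // compress into a private, in-memory zip
            buf := new(bytes.Buffer)
            tw := zip.NewWriter(buf)
            if fw, err := tw.Create(name); err == nil {
                fw.Write(data)
            }
            tw.Close()
            mu.Lock()
            bufs = append(bufs, buf)
            mu.Unlock()
        }(name, data)
    }
    wg.Wait()

    // stitch the per-file zips into one archive without recompressing
    zw := zip.NewWriter(out)
    for _, buf := range bufs {
        zr, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
        if err != nil {
            return err
        }
        for _, f := range zr.File {
            if err := zw.Copy(f); err != nil {
                return err
            }
        }
    }
    return zw.Close()
}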

dsnet commented 4 years ago

Breaking existing code was my only reluctance to this approach. If we don't see that as an issue, it sounds like an easy way to support this.

For all Go code inside Google, I only found one instance of someone implicitly setting CompressedSize. The usage pattern looked like the following:

rc, ... := rf.Open()
fileHeader := rf.FileHeader
wc, ... := w.CreateHeader(&fileHeader)
... io.Copy(wc, rc)

Essentially the code is copying one Zip file to another.

If we go with the semantics that a positive CompressedSize means the input should be interpreted as already compressed, then the pattern above will break, since io.Copy will read uncompressed data from rc and write it to wc, when the latter now expects compressed data.

We could "fix" this by adding a ReadFrom or WriteTo method that specially handles copying of compressed data from one zip.File to another. However, that approach seems a bit too magical and I can see other edge cases where it fails.

rsc commented 4 years ago

@dsnet, I don't think we have to fix this, but we could at least detect the problem: if the algorithm is DEFLATE and we expect precompressed data, we can at least check that the first few bytes form a proper DEFLATE header. That would make the io.Copy return an error. But it is a little concerning.

dsnet commented 4 years ago

proper deflate header.

Unfortunately DEFLATE doesn't really have any distinguishable magic numbers unlike GZIP, which typically has a 10-byte sequence that is readily identifiable. It seems that the best we could do is decompress some number of bytes and assume the input is valid if it passes that short test.
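
The best available check is probably a probe along these lines (a sketch, not part of any proposal):

// try to inflate a small prefix; an invalid stream usually fails fast
func looksLikeDeflate(prefix []byte) bool {
    fr := flate.NewReader(bytes.NewReader(prefix))
    defer fr.Close()
    _, err := io.CopyN(io.Discard, fr, 64)
    return err == nil || err == io.EOF || err == io.ErrUnexpectedEOF
}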

rsc commented 4 years ago

It sounds like I should look in the Go corpus I have for other code using this pattern. I'll try to do that for next week. Maybe overloading the CompressedSize here is too subtle.

rsc commented 4 years ago

Will try for next time.

escholtz commented 4 years ago

It feels like we're losing interest in this proposal and should draw a conclusion.

I don't think overloading CompressedSize is a good approach.

I still like the Copy approach that @rsc chose to use in the zipmerge package. I think all 3 use cases in the original proposal can be solved with a Copy method.

rsc commented 4 years ago

@escholtz, I haven't yet done the scan, but it doesn't seem like it's necessary - it's a plausible example that we'd break.

Re https://github.com/golang/go/issues/34974#issuecomment-659518287, the one thing that would still be good to support would be if you have the header bits in hand along with the compressed form, but you don't have them in an already-existing zip file. (Maybe you want more control over the compression, or you have DEFLATE or an alternate compressed form already from a different archive file, or something like that.)

After rereading this thread now with a bit more distance, I think the fundamental operation is really just access to raw data in both reading and writing. I suggest we add those directly, along with the Copy helper:

// like CreateHeader but the data written to the io.Writer should be precompressed
func (*Writer) CreateRaw(file *FileHeader) (io.Writer, error)

// like Open but the data read back is the compressed form
func (*File) OpenRaw() (io.Reader, error)

// helper - does f.OpenRaw, w.CreateRaw, io.Copy
func (w *Writer) Copy(f *File) error 
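
With those two primitives in hand, the helper is essentially this (a sketch against the proposed signatures above):

func (w *Writer) Copy(f *File) error {
    r, err := f.OpenRaw()
    if err != nil {
        return err
    }
    fh := f.FileHeader // shallow copy, so the source header isn't aliased
    fw, err := w.CreateRaw(&fh)
    if err != nil {
        return err
    }
    _, err = io.Copy(fw, r)
    return err
}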
escholtz commented 4 years ago

That sounds reasonable. Looking at the package, I assumed this type of access was intentionally not exported for simplicity and reliability. If we're ok with making the package a little less simple, this would enable the advanced use cases we're trying to achieve.

im7mortal commented 4 years ago

Hi everybody! :smile: I'd been thinking about this for a couple of months and had built my own solution; I was about to put my ideas into practice when I found this issue :laughing: I had the same two problems in mind:

  1. How to copy files without a decompress/recompress round trip.
  2. How to precompress multiple files, or compress in parallel.

I went through several iterations. The first was something like what @saracen mentioned, with a similar-looking ReadCompress method; it was dirty and monstrous. After much thought, I ended up with an exact copy of the Copy method from here :laughing: :laughing: :laughing: I even made the same optimization with the writeDesc func. But I had my genius moment :smile: when I discovered that both tasks can be done with the Copy method alone.

I'm happy that I can still add my two cents here and perhaps resolve this discussion about the blow-up of methods.

We don't need a CreateRaw(file *FileHeader) method, and probably not OpenRaw() (io.Reader, error) either. It can all be solved with Copy alone.

Here is the idea


func main() {
    // final writer
    zw := zip.NewWriter(bytes.NewBuffer([]byte{}))
    // get a precompressed *zip.File
    prFile := Precompress(data)
    // copy the precompressed file into the final archive
    zw.Copy(prFile)
}

func Precompress(b []byte) *zip.File {
    // create tmp zip writer
    buf := bytes.NewBuffer([]byte{})
    zw := zip.NewWriter(buf)
    // write/compress data in
    zwf, _ := zw.Create("test")
    zwf.Write(b)
    zw.Close() // flush the file and central directory before reading back
    // extract file from tmp zip
    zr, _ := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
    return zr.File[0]
}

In this way, all the compression and descriptor-validation logic stays encapsulated(!), and there is (I think) no overhead: we compress the data into a buffer and then io.Copy it to the destination writer, which is the same work as compressing it directly.

If a user wants to copy a couple of files from a big zip and keep them in memory, that can again be done with Copy: copy the needed files to a tmp zip, then copy them from there to the destination zip.

I hadn't thought about the OpenRaw() method before; there aren't many use cases for it so far. If a user needs to keep only one file in compressed form, they can use tmp zips. One use case might be extracting a raw file and sending it directly over HTTP to a client :thinking: . In my first iterations I achieved this through the Compressor interface: I copied data to a buffer and ignored the error from the CRC32 check. It worked pretty well for a long time. I will post the code here when I find it.

Conclusion

  1. The Copy method is critical
  2. Other methods can be implemented through wrappers without overhead
saracen commented 4 years ago

I suggest we add those directly, along with the Copy helper:

@rsc

@klauspost's compress library added both CreateHeaderRaw and Copy. Obviously, this is an external library, but it would be nice if the API continued to match.

@im7mortal Using just Copy does work and @escholtz suggested this too. It's a good solution for many problems, but it relies on either your precompressed data being in a zip file or you first copying it to one. Copying precompressed data from a different archive format would require a temporary copy to a new zip file, only to be read back again.

rsc commented 4 years ago

Personally, I'd prefer CreateHeaderRaw over CreateRaw, I think it's clearer that it's a variant of CreateHeader.

It seems unnecessarily long. There's no way to have CreateRaw that doesn't also take a header.

dsnet commented 4 years ago

If we plan on adding a CreateHeaderRaw, I suspect that we'll need to pay close attention to #22520. We'll want to make sure we get the semantics right from the start, otherwise they will be very hard to change later on.

rsc commented 4 years ago

Thanks for the pointer. I don't see why they are necessarily related. CreateRaw should do the same as CreateHeader, just preparing for raw writes. It shouldn't mix in other different semantics.

For that issue, I commented there but I think it would be fine to add a new field that CreateHeader (and then by extension CreateRaw) would use. I think that can be an independent decision though.

im7mortal commented 4 years ago

@saracen

... but it relies on either your precompressed data being in a zip file or you first copying it to one. Copying precompressed data from a different archive format would require a temporary copy to a new zip file, only to be read back again.

It's not avoidable in either case. It's just the same, but CreateHeaderRaw discards all the encapsulated logic; instead, the user must implement it themselves.

Precompression with CreateHeaderRaw

  1. Create a CRC32 checksum wrapper.
  2. Create a byte buffer.
  3. Create a compressor writer.
  4. Write the data to the compressor writer through the CRC32 wrapper.
  5. Create a raw header with CreateHeaderRaw. Ensure the CRC32 and uncompressed size are correctly set :exclamation:
  6. Copy from the buffer (2) to the writer (5).
func precompress(b []byte) (check uint32, c []byte, err error) {
    wsumm := crc32.NewIEEE() // the user must ensure a correct checksum
    _, err = wsumm.Write(b)
    if err != nil {
        return
    }
    buff := bytes.NewBuffer([]byte{})
    wcmp, err := flate.NewWriter(buff, flate.DefaultCompression) // MUST MATCH the actual compression method
    if err != nil {
        return
    }
    _, err = wcmp.Write(b)
    if err != nil {
        return
    }
    if err = wcmp.Close(); err != nil { // flush the final DEFLATE block
        return
    }
    return wsumm.Sum32(), buff.Bytes(), err
}

...
    w := zip.NewWriter(writer)
    crc32_, compressed, err := precompress(data)
    if err != nil {
        return
    }
    wf, err := w.CreateHeaderRaw(&zip.FileHeader{
        Name:  "tmp",
        CRC32: crc32_, // the user must ensure a correct checksum
        // the user must ensure a correct uncompressed size;
        // it requires additional logic in the case of a pipe writer (CRC32 too)
        UncompressedSize64: uint64(len(data)),
        Method:             zip.Deflate, // MUST MATCH the actual compression method
    })
    if err != nil {
        return
    }
    _, err = wf.Write(compressed)
    if err != nil {
        return
    }
...

:x: Needs a wrapper
:x: Additional write-read-write
:x: Unencapsulated: the user must compute the CRC32 sum and propagate it, along with the lengths, correctly
:x: Unencapsulated: the registered zip compressor goes unused, so the user can write data in a method different from the one set in the header
:x: Complex wrapper needed in the case of a pipe writer
:x: Allows a corrupted file into the archive :warning: which can crash receiving zip libraries :exclamation:

Precompress through tmp zip

  1. Create a tmp zip writer.
  2. Write the data into the tmp zip. The CRC32 and byte counters are encapsulated.
  3. Extract the zip.File.
  4. Perform zipWriter.Copy, which does an io.Copy.

In both cases there is a need to write wrappers.

func PrecompressInZip(b []byte, name string, method uint16) (zf *zip.File, err error) {
    // create tmp zip writer
    buf := bytes.NewBuffer([]byte{})
    zw := zip.NewWriter(buf)
    // write/compress data in
    zwf, err := zw.CreateHeader(&zip.FileHeader{
        Name:   name,
        Method: method,
    })
    if err != nil {
        return
    }
    _, err = zwf.Write(b)
    if err != nil {
        return
    }
    if err = zw.Close(); err != nil { // flush the central directory
        return
    }
    // extract file from tmp zip
    zr, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
    if err != nil {
        return
    }
    return zr.File[0], nil
}
...
    // final writer
    zw := zip.NewWriter(bytes.NewBuffer([]byte{}))
    // get a precompressed *zip.File
    prFile, err := PrecompressInZip(data, "tmp", zip.Deflate)
    if err != nil {
        return
    }
    // copy precompressed zip.File
    zw.Copy(prFile)
...

:x: Needs a wrapper
:x: Additional write-read-write
:warning: Adds directoryHeaderLen=46 bytes of metadata per file (it varies, but is very small)
:green_book: The CRC32 sum and lengths are encapsulated and ensured by well-tested logic
:green_book: The compressor encapsulation is fully used: the writer uses only the compressor corresponding to the header's method. No way to mix them up!
:green_book: Pipe wrappers require no additional logic and can be handled with the standard library
:green_book: Doesn't allow a corrupted file into the archive

Conclusion: @rsc, CreateHeaderRaw is very error-prone. All the logic the zip library encapsulates must be reimplemented by the user.

escholtz commented 4 years ago

I agree with @im7mortal on this.

I think most gophers using the archive/zip package are looking for simplicity. They don't care about the specifics of the zip format, they just want to be able to read and write files.

A much smaller number of gophers are looking for a zip package that exports all of the guts that archive/zip currently hides. Building blocks to make it easier to build a custom / optimized zip package.

I think the standard library should lean more towards the simple case and not the building blocks case. I worry if we try to do both, we end up with a muddled package that is more complicated and confusing for the typical consumer.

I still don't know if Copy should be added (even though I personally want it so I don't have to maintain a fork). On the one hand, it maintains safety and doesn't add too much complexity. On the other hand, it's only necessary to support less common use cases. And it still doesn't support advanced use cases (copying from other archive formats, although I would suspect this is even more rare than copying between zips).

saracen commented 4 years ago

@im7mortal

It's not avoidable in either case. It's just the same, but CreateHeaderRaw discards all the encapsulated logic.

There are cases where this is avoidable. You can have compressed data, know the uncompressed size and either already have the crc32 checksum of the uncompressed data or calculating it is still much faster than decompressing and recompressing.

The compressed data doesn't even need to be DEFLATE. This package already supports vendor compression methods, and there's certainly use-cases where I might want to move pre-existing compressed data (say zstd) into the zip structure, over another archive format.

The whole point of CreateRaw is to give more control over this and to go around the safety offered by the existing functionality. The zip package already allows for this to a certain degree, and it's already possible to create corrupt/invalid archives accidentally.

@escholtz Regarding simplicity, this is difficult to not agree with.

Supporting vendor compression methods and arbitrary extra data are also features that not everybody will use, but they've been added. I personally don't think that CreateRaw is going to confuse users or take away from the simplicity of the more common functions provided, but I understand your point.

im7mortal commented 4 years ago

@saracen

There are cases where this is avoidable. You can have compressed data, know the uncompressed size and either already have the crc32 checksum of the uncompressed data or calculating it is still much faster than decompressing and recompressing.

... into the zip structure, over another archive format.

I agree that there are cases. I think we could consult #15626 (the Delete, Append, Replace methods).

The most common use cases (parallel compression, precompression, copying between zips) can use Copy with negligible overhead.

rsc commented 4 years ago

@im7mortal, you are right that much can be done with Copy. That's why I included it in the suggestion in https://github.com/golang/go/issues/34974#issuecomment-701705168 to add CreateRaw, OpenRaw, and Copy (all three), even though technically Copy can be implemented in terms of the other two.

In contrast, dropping the other two makes the use cases that need that low-level access fairly awkward and non-obvious.

It doesn't seem harmful to have these two important building blocks alongside the higher-level helper.

rsc commented 4 years ago

Does anyone object to adding CreateRaw, OpenRaw, and Copy?

rsc commented 4 years ago

Based on the discussion above, this seems like a likely accept. This is subtle, though, so it would make sense to put off until Go 1.17 at this point.

rsc commented 4 years ago

No change in consensus, so accepted.

hherman1 commented 4 years ago

Maybe I missed this, but why not just open the file directly instead of using zip.Open if you want to write/read bytes directly?

ianlancetaylor commented 4 years ago

@hherman1 This proposal is about working with compressed files that are contained within zip archives. It's not about working with the zip archive itself.

gopherbot commented 3 years ago

Change https://golang.org/cl/312310 mentions this issue: archive/zip: add File.OpenRaw, Writer.CreateRaw, Writer.Copy

escholtz commented 3 years ago

When implementing this in https://golang.org/cl/312310 one challenge I discovered was figuring out how we should handle the trailing data descriptor (https://en.wikipedia.org/wiki/ZIP_(file_format)#Data_descriptor) for the OpenRaw and CreateRaw methods. @ianlancetaylor also raised concern in the review for good reason.

I chose to return the data descriptor in the OpenRaw reader and give the CreateRaw writer responsibility for calculating and appending the data descriptor when necessary. This seemed optimal in terms of giving the caller full control, which is the intention of this change, while also conforming to the interface we agreed to. And it also ensures the data descriptor isn't lost by the Copy method.
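
For context, the data descriptor is a small trailing record (APPNOTE section 4.3.9): an optional signature, then CRC-32, compressed size, and uncompressed size (4-byte fields, or 8-byte under ZIP64). Writing the non-ZIP64 form looks roughly like this (variable names assumed):

var dd [16]byte
binary.LittleEndian.PutUint32(dd[0:4], 0x08074b50) // de facto standard signature
binary.LittleEndian.PutUint32(dd[4:8], crc)
binary.LittleEndian.PutUint32(dd[8:12], uint32(compressedSize))
binary.LittleEndian.PutUint32(dd[12:16], uint32(uncompressedSize))
w.Write(dd[:])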

Perhaps I'm overlooking a different/better approach or perhaps we want to change the interface to allow the same data descriptor reading/writing functionality but in a safer way.

What are your thoughts?

saracen commented 3 years ago

For me, OpenRaw/CreateRaw was mostly about providing raw access to the file data and not modifying the fields that I provide in the header on my behalf. I still expected it to serialize the header data in the same way.

I don't know whether it's needed, but I understand the reason for adding it.

escholtz commented 3 years ago

The data descriptor is part of the spec. We need to support it for the raw methods. The CRC32 and uncompressed size fields are not things we can compute within the raw methods. So we need to allow the caller to specify them.

The 3 data descriptor fields are also in the FileHeader but I don't think it's a good idea to overload them by copying the data descriptor fields into the FileHeader. Since it's a separate segment it should be treated as such.

The only other option I can think of is adding a DataDescriptor struct within the archive/zip package that handles marshalling and unmarshalling of the byte segment. But since the Raw methods are intended for advanced/rare use cases, I'm fine with allowing the caller to handle the marshalling and unmarshalling.

@rsc Are you ok with this approach (returning the data descriptor bytes from OpenRaw and requiring the caller to write the data descriptor bytes to the CreateRaw Writer)?

ianlancetaylor commented 3 years ago

I think it's going to be error-prone to attach the data descriptor bytes to the reader. I suggest:

// DataDescriptor holds the data descriptor that optionally follows the file contents in the zip file.
type DataDescriptor struct {
    CRC32            uint32
    CompressedSize   uint64
    UncompressedSize uint64
}

// OpenRaw is like Open but reads the raw, compressed, data from the file.
// If dataDescriptor is not nil, it contains the contents of the optional data descriptor following the file contents.
func (*File) OpenRaw() (r io.Reader, dataDescriptor *DataDescriptor, err error)

// CreateRaw is like CreateHeader but the data written to the io.Writer should already be compressed.
// If dataDescriptor is not nil, it will be written as the optional data descriptor following the file contents.
func (*Writer) CreateRaw(file *FileHeader, dataDescriptor *DataDescriptor) (io.Writer, error)

@dsnet Any thoughts?

saracen commented 3 years ago

The 3 data descriptor fields are also in the FileHeader but I don't think it's a good idea to overload them by copying the data descriptor fields into the FileHeader. Since it's a separate segment it should be treated as such.

@escholtz We serialize the known crc32, uncompressed and compressed size fields from FileHeader when we write the local file header. What benefit is there to not doing the same in the data descriptor? Would they ever be different?

escholtz commented 3 years ago

For the CreateRaw case, if the FileHeader Data Descriptor 0x8 Flag is not set, but a non-nil dataDescriptor argument is passed, would we treat that as an error?

And if the CRC32, CompressedSize, and UncompressedSize fields are non-zero in the FileHeader and a non-nil dataDescriptor argument is passed, would we treat that as an error?