golang / go

The Go programming language
https://go.dev

proposal: archive/zip: extend the visibility of the countWriter #65569

Open grdw opened 4 months ago

grdw commented 4 months ago

Proposal Details

We're currently using a custom-built zip writer to "flush" zip headers and the EOCD footer, which is naturally about 90% identical to the one in writer.go. The use case for this custom zip writer is to "prepare a zip file" without needing the actual file data to be in the zip file yet, which allows the zip file to be streamed.

We currently can't use the standard Go zip library because we can't advance the position of the countWriter by hand. Ideally, we'd be able to set w.cw.count without the restriction of the data being written beforehand (so SetOffset() can't be used, unfortunately).

The suggestion here would be to add the following helpers, or some similar functionality to forward the w.cw.count variable without the restriction of SetOffset(), and to read out its value with the following public functions:

// pseudo code:
func (w *Writer) AdvanceOffset(n int64) {
    w.cw.count += n
}

func (w *Writer) GetOffset() int64 {
    return w.cw.count
}

This would make the standard Go zip library usable for our use case. We would use the Flush() function as it exists now to get out the intermediate headers and the EOCD footer.
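To illustrate the idea outside of archive/zip, here is a minimal standalone sketch of a counting writer with the two proposed helpers. The type below only imitates the unexported countWriter; the names AdvanceOffset and Offset are hypothetical and do not exist in the standard library.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// countWriter imitates the idea behind the unexported type in archive/zip:
// it forwards writes and tracks how many bytes have passed through.
type countWriter struct {
	w     io.Writer
	count int64
}

func (c *countWriter) Write(p []byte) (int, error) {
	n, err := c.w.Write(p)
	c.count += int64(n)
	return n, err
}

// AdvanceOffset is hypothetical: it moves the offset forward by n
// without writing anything to the underlying writer.
func (c *countWriter) AdvanceOffset(n int64) { c.count += n }

// Offset is hypothetical: it reports the current logical offset.
func (c *countWriter) Offset() int64 { return c.count }

// demo writes a header and a footer and skips 1 TiB in between.
// It returns the logical offset and the number of bytes actually written.
func demo() (offset int64, written int) {
	var buf bytes.Buffer
	cw := &countWriter{w: &buf}
	cw.Write([]byte("header")) // bytes.Buffer writes do not fail
	cw.AdvanceOffset(1 << 40)  // placeholder for file data streamed elsewhere
	cw.Write([]byte("footer"))
	return cw.Offset(), buf.Len()
}

func main() {
	offset, written := demo()
	fmt.Println(offset, written)
}
```

The logical offset moves past the skipped region while only the header and footer bytes reach the buffer, which is the behavior the proposal asks for.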

ianlancetaylor commented 4 months ago

That seems pretty special purpose. I struggle to see how anybody else would use this functionality. Is it really worth adding to the standard library?

It also seems to me that you can increment the offset by calling the Write method with a slice of the appropriate size. You could have the underlying writer discard the data, if necessary.

grdw commented 4 months ago

Thanks for the quick reply!

> It also seems to me that you can increment the offset by calling the Write method with a slice of the appropriate size.

Correct, that can also be done to solve this specific use case. The downside is that the files that will be 'squeezed' in between the zip elements (for lack of a better description) can become quite large in our specific case, and we're easily talking about more than 500 GB in some cases. To take an extreme, but not uncommon, example: doing the work for a 1 TiB file would result in the following code snippet:

package main

import (
    "archive/zip"
    "bytes"
    "fmt"
    "time"
)

func main() {
    // 1 TiB of placeholder data.
    fileSize := uint64(1024 * 1024 * 1024 * 1024)
    // Named buf rather than io to avoid shadowing the standard io package name.
    buf := new(bytes.Buffer)
    zipWriter := zip.NewWriter(buf)
    w, err := zipWriter.CreateHeader(&zip.FileHeader{
        Name:               "test.mov",
        Modified:           time.Now(),
        CRC32:              25,
        CompressedSize64:   fileSize,
        UncompressedSize64: fileSize,
    })
    if err != nil {
        panic(err)
    }
    // Flush out the header:
    zipWriter.Flush()
    fmt.Printf("Header: %x\n", buf.Bytes())
    buf.Reset()
    // Flush out the bytes (this allocates the full 1 TiB slice):
    w.Write(make([]byte, fileSize))
    zipWriter.Flush()
    fmt.Printf("Flushed: %d\n", len(buf.Bytes()))
    buf.Reset()
    // Closing writes the remaining trailer, including the EOCD footer:
    zipWriter.Close()
    fmt.Printf("EOCD footer: %x\n", buf.Bytes())
}

This is quite slow and memory-intensive. Not having to do this:

w.Write(make([]byte, fileSize))

... would make our lives a lot easier 😅.
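For comparison, here is a rough sketch of the "discard the data" suggestion that keeps memory use bounded by streaming zeros in fixed-size chunks through io.CopyN instead of allocating one huge slice. The sink, zeroReader and layout names are hypothetical helpers made up for this illustration, and the file size in main is kept small on purpose.

```go
package main

import (
	"archive/zip"
	"fmt"
	"io"
	"time"
)

// zeroReader yields an endless stream of zero bytes.
type zeroReader struct{}

func (zeroReader) Read(p []byte) (int, error) {
	for i := range p {
		p[i] = 0
	}
	return len(p), nil
}

// sink captures written bytes while keep is true and discards them otherwise.
type sink struct {
	captured []byte
	keep     bool
}

func (s *sink) Write(p []byte) (int, error) {
	if s.keep {
		s.captured = append(s.captured, p...)
	}
	return len(p), nil
}

// layout returns the bytes written before and after a placeholder file body
// of the given size. The body itself is written as zeros and discarded.
func layout(name string, size int64) (header, trailer []byte, err error) {
	s := &sink{keep: true}
	zw := zip.NewWriter(s)
	w, err := zw.CreateHeader(&zip.FileHeader{
		Name:     name,
		Method:   zip.Store,
		Modified: time.Now(),
	})
	if err != nil {
		return nil, nil, err
	}
	if err := zw.Flush(); err != nil {
		return nil, nil, err
	}
	header = s.captured

	// Advance the offset: io.CopyN copies through a small internal buffer,
	// so memory use does not grow with size.
	s.captured, s.keep = nil, false
	if _, err := io.CopyN(w, zeroReader{}, size); err != nil {
		return nil, nil, err
	}
	if err := zw.Flush(); err != nil {
		return nil, nil, err
	}

	// Capture everything Close writes, ending with the EOCD footer.
	s.keep = true
	if err := zw.Close(); err != nil {
		return nil, nil, err
	}
	return header, s.captured, nil
}

func main() {
	header, trailer, err := layout("test.mov", 1<<20) // 1 MiB placeholder
	if err != nil {
		panic(err)
	}
	fmt.Printf("Header: %d bytes, trailer: %d bytes\n", len(header), len(trailer))
}
```

This only addresses the memory cost. Every placeholder byte still passes through the zip writer, so the run time still grows with the file size, which is the part the proposed helpers would avoid.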