ulikunitz / xz

Pure golang package for reading and writing xz-compressed files

reader: Add option SplitStream #54

Closed LSChyi closed 1 year ago

LSChyi commented 1 year ago

Hi folks, I propose adding an option SplitStream to the ReaderConfig. When this option is enabled, the Read call will not try to append content from the next stream. When reading reaches the end of the current xz stream, the Read call returns the error EOS (End of Stream).

There are everyday use cases for this; for example, people may concatenate multiple .tar.xz files into one .xz file. Without an indication that the current stream is drained, callers cannot feed the correct content to the tar reader.

Users can run the following commands to create a concatenated .xz file from multiple .tar.xz files:

$ echo 'test1' > test1
$ echo 'test2' > test2
$ tar Jcvf test1.tar.xz test1
$ tar Jcvf test2.tar.xz test2
$ cat test1.tar.xz test2.tar.xz > combined.xz

Then decompress the combined .xz file with:

$ tar ixvf combined.xz
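
For illustration, here is a minimal sketch of how a caller could consume such a combined.xz if the proposed option existed. The SplitStream field, the xz.EOS error, and the behavior after EOS (the next Read continuing with the following stream) are all assumptions of this proposal, not part of the current API:

package main

import (
    "archive/tar"
    "errors"
    "io"
    "log"
    "os"

    "github.com/ulikunitz/xz"
)

func main() {
    f, err := os.Open("combined.xz")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Hypothetical: SplitStream and xz.EOS are the names proposed in this
    // PR; they do not exist in the released package.
    r, err := xz.ReaderConfig{SplitStream: true}.NewReader(f)
    if err != nil {
        log.Fatal(err)
    }

    for {
        // Each xz stream wraps one tar archive.
        tr := tar.NewReader(r)
        for {
            h, err := tr.Next()
            if err == io.EOF {
                break
            }
            if err != nil {
                log.Fatal(err)
            }
            log.Println("entry:", h.Name)
        }

        // Drain the trailing tar padding of the current stream. Under the
        // proposal the reader reports EOS at the stream boundary instead of
        // silently continuing with the next stream.
        _, err := io.Copy(io.Discard, r)
        if errors.Is(err, xz.EOS) {
            continue // assumed: the next Read starts the following stream
        }
        if err != nil {
            log.Fatal(err)
        }
        break // no error: the input is exhausted, no further streams
    }
}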
ulikunitz commented 1 year ago

While I still have to test it, I believe you can achieve the same result by using the SingleStream flag. The only difference would be that you would get the error io.EOF.

LSChyi commented 1 year ago

Thank you for the suggestion. However, with the SingleStream flag enabled I get the error xz: unexpected data after stream; the content of the first stream is returned, but I cannot get the content of the second stream.

Here is the code snippet; I hope I did not do something stupid:

package main

import (
        "bytes"
        "io"
        "log"
        "os"

        "github.com/ulikunitz/xz"
)

func main() {
        content, err := os.ReadFile("combined.xz")
        if err != nil {
                log.Fatal(err)
        }

        reader, err := xz.ReaderConfig{SingleStream: true}.NewReader(bytes.NewBuffer(content))
        if err != nil {
                log.Fatal(err)
        }

        for i := 0; i < 2; i++ {
                buf := new(bytes.Buffer)
                if _, err := io.Copy(buf, reader); err != nil {
                        log.Println(err)
                }
                log.Println(buf.String())
        }
}

and the output of running the snippet:

2023/06/12 19:42:54 xz: unexpected data after stream
2023/06/12 19:42:54 test1000644 000765 000024 00000000006 14441601250 012470 0ustar00lschyistaff000000 000000 test1

2023/06/12 19:42:54 xz: unexpected data after stream
2023/06/12 19:42:54
ulikunitz commented 1 year ago

Hi, if you don't use SingleStream you can read both tar files in the following way:

// TestReadCombinedStream addresses https://github.com/ulikunitz/xz/pull/54
func TestReadCombinedStream(t *testing.T) {
    const file = "testdata/combined.tar.xz"
    f, err := os.Open(file)
    if err != nil {
        t.Fatalf("os.Open(%q) error %s", file, err)
    }
    defer f.Close()
    r, err := xz.NewReader(f)
    if err != nil {
        t.Fatalf("xz.NewReader error %s", err)
    }
    defer r.Close()

    br := bufio.NewReader(r)

    files := 0
    tr := tar.NewReader(br)
    for {
        h, err := tr.Next()
        if err != nil {
            if err == io.EOF {
                break
            }
            t.Fatalf("tr.Next error %s", err)
        }
        files++
        t.Logf("header: %s", h.Name)
    }

    // We have to jump over the zero padding between the archives. This is what tar's -i option does.
loop:
    for {
        p, err := br.Peek(1024)
        if err != nil {
            t.Fatalf("br.Peek(%d) error %s", 1024, err)
        }
        for i, b := range p {
            if b != 0 {
                br.Discard(i)
                break loop
            }
        }
        br.Discard(1024)
    }

    tr = tar.NewReader(br)
    for {
        h, err := tr.Next()
        if err != nil {
            if err == io.EOF {
                break
            }
            t.Fatalf("tr.Next error %s", err)
        }
        files++
        t.Logf("header: %s", h.Name)
    }
    if files != 2 {
        t.Fatalf("read %d files; want %d", files, 2)
    }
}
ulikunitz commented 1 year ago

This uses the rewrite branch; any further updates will be made there. We now need to close the readers, because the reader supports parallel decompression.

Use go get github.com/ulikunitz/xz@v0.6.0-alpha.2. If you have any issues, let me know.

ulikunitz commented 1 year ago

Many thanks for pointing the issue out. I have fixed the behavior of SingleStream in v0.6.0-alpha.3 and added an additional test case. Since it changes the behavior of the code, I will not publish it in v0.5, but v0.6.0 will have the more reasonable behavior. Here is the new test code using the SingleStream flag in v0.6.0.

The workaround is the approach I pointed out above. It should work in v0.5 as well. I will close the pull request.

func TestReadSingleStream(t *testing.T) {
    const file = "testdata/combined.tar.xz"
    f, err := os.Open(file)
    if err != nil {
        t.Fatalf("os.Open(%q) error %s", file, err)
    }
    defer f.Close()

    cfg := xz.ReaderConfig{SingleStream: true}
    r, err := xz.NewReaderConfig(f, cfg)
    if err != nil {
        t.Fatalf("xz.NewReaderConfig(f, %+v) error %s", cfg, err)
    }
    defer r.Close()

    files := 0
    tr := tar.NewReader(r)
    for {
        h, err := tr.Next()
        if err != nil {
            if err == io.EOF {
                break
            }
            t.Fatalf("tr.Next error %s", err)
        }
        files++
        t.Logf("header: %s", h.Name)
    }
    // Read the trailing tar zero padding so that the whole first xz stream is consumed.
    n, err := io.Copy(io.Discard, r)
    t.Logf("%d bytes discarded", n)
    if err != nil {
        t.Fatalf("io.Copy(io.Discard, r) error %s", err)
    }

    r, err = xz.NewReaderConfig(f, cfg)
    if err != nil {
        t.Fatalf("xz.NewReaderConfig(f, %+v) error %s", cfg, err)
    }
    defer r.Close()

    tr = tar.NewReader(r)
    for {
        h, err := tr.Next()
        if err != nil {
            if err == io.EOF {
                break
            }
            t.Fatalf("tr.Next error %s", err)
        }
        files++
        t.Logf("header: %s", h.Name)
    }

    if files != 2 {
        t.Fatalf("read %d files; want %d", files, 2)
    }
}
LSChyi commented 1 year ago

I see, but I think it is still worth discussing adding the SplitStream option while keeping SingleStream ignoring any content that follows when there are multiple streams, because:

  1. The BSD xz man page and the Linux man page both clearly specify for the --single-stream option:

    Decompress only the first .xz stream, and silently ignore possible remaining input data following the stream.

  2. The SingleStream option in v0.5 also clearly specifies that "The SingleStream parameter requests the reader to assume that the underlying stream contains only a single stream." So it keeps the same behavior and stays backward compatible.
  3. Since the rewrite branch creates the config object via a factory function, we can validate the config when incompatible options are set (e.g. both SplitStream and SingleStream enabled); see the sketch after this list.
  4. The option changes the behavior only when it is true, and since the zero value is false, it is fully backward compatible; the needed code change is small, as in this PR (adding the if statement before the continue line in ).
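
For illustration of point 3, a minimal sketch of such a validation; the SplitStream field and the Verify method are assumptions for this sketch, not the actual rewrite-branch API:

// Sketch only: shows how a factory/verify step could reject incompatible
// options. SplitStream and Verify are hypothetical here.
package xzcfg

import "errors"

type ReaderConfig struct {
	SingleStream bool // decode only the first stream
	SplitStream  bool // proposed: report EOS at each stream boundary
	// further fields omitted
}

// Verify rejects incompatible option combinations before a reader is built.
func (cfg ReaderConfig) Verify() error {
	if cfg.SingleStream && cfg.SplitStream {
		return errors.New("xz: SingleStream and SplitStream are mutually exclusive")
	}
	return nil
}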

I can help add the test and the option in the rewrite branch; what do you think?

ulikunitz commented 1 year ago

Again, thanks for pointing out the issue and even for the offer to help out. I will mention your contribution in the release notes for v0.6.0.

Here are my reasons for keeping the behavior I have now implemented for v0.6.

In contrast to standalone .xz files, when the .xz file format is used as an internal part of some other file format or communication protocol, it usually is expected that the decoder stops after the first Stream, and doesn't look for Stream Padding or possibly other Streams.

The two options would definitely overlap.

Many thanks again for reporting the issue; I would not have identified it myself. I'm convinced that changing the behavior of the SingleStream option is the right resolution.