Closed LSChyi closed 1 year ago
I still have to test it, but I believe you can achieve the same by using the SingleStream flag. The only difference would be that you would get the error io.EOF.
Thank you for the suggestion. However, with the SingleStream flag enabled I get the error xz: unexpected data after stream. The content of the first stream is returned, but I cannot get the contents of the second stream.
Here is the code snippet; I hope I did not do something stupid:
package main

import (
	"bytes"
	"io"
	"log"
	"os"

	"github.com/ulikunitz/xz"
)

func main() {
	content, err := os.ReadFile("combined.xz")
	if err != nil {
		log.Fatal(err)
	}
	reader, err := xz.ReaderConfig{SingleStream: true}.NewReader(bytes.NewBuffer(content))
	if err != nil {
		log.Fatal(err)
	}
	// Try to read both streams, one per iteration.
	for i := 0; i < 2; i++ {
		buf := new(bytes.Buffer)
		if _, err := io.Copy(buf, reader); err != nil {
			log.Println(err)
		}
		log.Println(buf.String())
	}
}
and the output of running the snippet:
2023/06/12 19:42:54 xz: unexpected data after stream
2023/06/12 19:42:54 test1000644 000765 000024 00000000006 14441601250 012470 0ustar00lschyistaff000000 000000 test1
2023/06/12 19:42:54 xz: unexpected data after stream
2023/06/12 19:42:54
Hi, if you don't use SingleStream you can read both tar files in the following way:
// TestReadCombinedStream addresses issue https://github.com/ulikunitz/xz/pull/54
func TestReadCombinedStream(t *testing.T) {
	const file = "testdata/combined.tar.xz"
	f, err := os.Open(file)
	if err != nil {
		t.Fatalf("os.Open(%q) error %s", file, err)
	}
	defer f.Close()
	r, err := xz.NewReader(f)
	if err != nil {
		t.Fatalf("xz.NewReader error %s", err)
	}
	defer r.Close()
	br := bufio.NewReader(r)
	files := 0
	// Read the first tar archive.
	tr := tar.NewReader(br)
	for {
		h, err := tr.Next()
		if err != nil {
			if err == io.EOF {
				break
			}
			t.Fatalf("tr.Next error %s", err)
		}
		files++
		t.Logf("header: %s", h.Name)
	}
	// We have to jump over zero bytes. Option -i of tar.
loop:
	for {
		p, err := br.Peek(1024)
		if err != nil {
			t.Fatalf("br.Peek(%d) error %s", 1024, err)
		}
		for i, b := range p {
			if b != 0 {
				br.Discard(i)
				break loop
			}
		}
		br.Discard(1024)
	}
	// Read the second tar archive.
	tr = tar.NewReader(br)
	for {
		h, err := tr.Next()
		if err != nil {
			if err == io.EOF {
				break
			}
			t.Fatalf("tr.Next error %s", err)
		}
		files++
		t.Logf("header: %s", h.Name)
	}
	if files != 2 {
		t.Fatalf("read %d files; want %d", files, 2)
	}
}
This uses the rewrite branch; I will make any updates there. The readers now need to be closed, because the reader supports parallel decompression.
Use go get github.com/ulikunitz/xz@v0.6.0-alpha.2. If you have any issues, let me know.
Many thanks for pointing the issue out. I have fixed the behavior for SingleStream in v0.6.0-alpha.3 and added an additional test case. Since it changes the behavior of the code, I will not publish it in v0.5, but v0.6.0 will have the more reasonable behavior.
The workaround is the approach I pointed out above. It should work in v0.5 as well. I will close the pull request.
Here is the new test code using the SingleStream flag in v0.6.0:
func TestReadSingleStream(t *testing.T) {
	const file = "testdata/combined.tar.xz"
	f, err := os.Open(file)
	if err != nil {
		t.Fatalf("os.Open(%q) error %s", file, err)
	}
	defer f.Close()
	cfg := xz.ReaderConfig{SingleStream: true}
	r, err := xz.NewReaderConfig(f, cfg)
	if err != nil {
		t.Fatalf("xz.NewReaderConfig(f, %+v) error %s", cfg, err)
	}
	defer r.Close()
	files := 0
	tr := tar.NewReader(r)
	for {
		h, err := tr.Next()
		if err != nil {
			if err == io.EOF {
				break
			}
			t.Fatalf("tr.Next error %s", err)
		}
		files++
		t.Logf("header: %s", h.Name)
	}
	// we need to read trailing zeros
	n, err := io.Copy(io.Discard, r)
	t.Logf("%d bytes discarded", n)
	if err != nil {
		t.Fatalf("io.Copy(io.Discard, r) error %s", err)
	}
	r, err = xz.NewReaderConfig(f, cfg)
	if err != nil {
		t.Fatalf("xz.NewReaderConfig(f, %+v) error %s", cfg, err)
	}
	defer r.Close()
	tr = tar.NewReader(r)
	for {
		h, err := tr.Next()
		if err != nil {
			if err == io.EOF {
				break
			}
			t.Fatalf("tr.Next error %s", err)
		}
		files++
		t.Logf("header: %s", h.Name)
	}
	if files != 2 {
		t.Fatalf("read %d files; want %d", files, 2)
	}
}
I see, but I think it is still worth discussing adding the SplitStream option while keeping SingleStream so that it ignores any following content when the input contains multiple streams, because:
The --single-stream option clearly specifies that:
Decompress only the first .xz stream, and silently ignore possible remaining input data following the stream.
The SingleStream option in v0.5 also clearly specifies that "The SingleStream parameter requests the reader to assume that the underlying stream contains only a single stream.", so it keeps the same behavior and is backward compatible.
SplitStream and SingleStream are enabled).
continue line in ).
I can help add the test and the option in the rewrite branch. What do you think?
Thanks again for pointing out the issue and for the offer to help out. I will mention your contribution in the release notes for v0.6.0.
Here are my reasons for keeping the behavior I have now implemented for v0.6.
I think the new behavior of the SingleStream option actually matches the Linux man page for --single-stream, because the man page requires silently ignoring possible remaining input data. The old implementation did not ignore remaining data. Clients can easily implement the old behavior themselves, so we don't need two different options for it.
You are right that the old behavior is documented. This documentation has been changed.
The changes between v0.5 and v0.6 are significant enough that backward compatibility is not really an issue.
A paragraph from section 2, Overall Structure of .xz File, of the XZ format specification is another argument for why the v0.5 behavior was wrong. The new behavior actually supports the spec; the old behavior couldn't.
In contrast to standalone .xz files, when the .xz file format is used as an internal part of some other file format or communication protocol, it usually is expected that the decoder stops after the first Stream, and doesn't look for Stream Padding or possibly other Streams.
The need to check that the options do not contradict each other actually shows that the two options would add complexity without major benefits. As I stated, checking for extra data is trivial.
Rob Pike stated the following design principle in his talk introducing Go:
Keep concepts orthogonal. A few orthogonal features work better than a lot of overlapping ones.
The two options would definitely overlap.
Many thanks again for reporting the issue. I would not have identified it myself. I'm convinced that changing the behavior of the SingleStream option is the right resolution.
Hi folks, I propose to add an option SplitStream in the ReaderConfig. When this option is enabled, the Read call will not try to append content from the next stream. When reading reaches the end of the current xz stream, the Read call would return an error EOS (End of Stream).
There are some everyday use cases: for example, people can concatenate multiple .tar.xz files into one .xz file. Without an indication that the current stream is drained, callers cannot feed the correct content to the tar reader.
Users could run the following command to create a concatenated .xz file from multiple .tar.xz files:
, then decompress the combined xz file with