ulikunitz / xz

Pure golang package for reading and writing xz-compressed files

missing match limit, was "lzip" #42

Open nitzanb-jfrog opened 3 years ago

nitzanb-jfrog commented 3 years ago

Is it possible to work with lzip files, which are also known to use the LZMA compression algorithm?

ulikunitz commented 3 years ago

A simple test with lzip and gxz shows that it doesn't work out of the box. I have had only a few minutes to look into it, but since the documentation at http://www.nongnu.org/lzip/safety_of_the_lzip_format.html talks about a CRC-32 checksum, it appears that lzip files are not pure LZMA streams but may embed those streams in a frame to support CRC-32 checksums.

tansy commented 2 years ago

It is possible to make it work, as the file format is very simple and "static", and yes, lzip uses LZMA. The problem is that it does not use the same lzma_lzma1_decoder and lzma_dict structures, or the same probability table layout. So to make it work one would have to adapt the probability table and the "surrounding" structures.

ulikunitz commented 2 years ago

Hi, I reviewed the file format specification in the link I provided in my last comment.

LZIP appears to be a wrapper format for LZMA streams with fixed properties that uses the EOS marker. It should be possible to write an implementation using the lzma package. You could try it yourself. (I'm currently working on a faster reimplementation of the lzma package. LZMA is working again and I'm currently working on LZMA2, so I cannot promise I will look into lzip.)
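For reference, the lzip member layout described in that specification is indeed a thin frame around a raw LZMA stream. The sketch below parses the 6-byte header; `parseLzipHeader` is an illustrative helper name, not part of this package, and the layout comments follow the lzip format description linked above.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// lzip member layout (per the lzip format description):
//   header : magic "LZIP" (4) | version (1) | coded dict size (1)
//   body   : raw LZMA stream with fixed properties (lc=3, lp=0, pb=2)
//   trailer: CRC32 of uncompressed data (4) | data size (8) | member size (8)

// parseLzipHeader returns the dictionary size encoded in the 6th header
// byte: bits 0-4 hold the base-2 log of the size, bits 5-7 a number of
// sixteenths to subtract from it.
func parseLzipHeader(hdr []byte) (dictSize uint32, err error) {
	if len(hdr) < 6 || !bytes.Equal(hdr[:4], []byte("LZIP")) {
		return 0, errors.New("not an lzip header")
	}
	if hdr[4] != 1 {
		return 0, fmt.Errorf("unsupported lzip version %d", hdr[4])
	}
	b := hdr[5]
	ds := uint32(1) << (b & 0x1F)
	ds -= (ds / 16) * uint32(b>>5)
	return ds, nil
}

func main() {
	// 0x95 encodes 2 MiB minus 4/16 of it, i.e. 1.5 MiB (lzip level -2).
	ds, err := parseLzipHeader([]byte{'L', 'Z', 'I', 'P', 1, 0x95})
	fmt.Println(ds, err) // → 1572864 <nil>
}
```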

tansy commented 2 years ago

That's sweet, thanks. Well, I don't know Go. I come from the C world and came here looking for a different answer about xz. Anyway, I tried to find these probability tables but couldn't, that's one thing; and two, there is no Makefile or INSTALL instructions, so unless one is a prophet there is no easy way to compile it.

ulikunitz commented 2 years ago

If you want to learn Go, you can start here: https://go.dev/doc/

You can install the gxz binary with

$ go install github.com/ulikunitz/xz/cmd/gxz@latest

But this is only a test binary I have implemented for testing. To use the lzma package you would only need to insert

import (
     "github.com/ulikunitz/xz/lzma"
)

in your Go program. The go command will handle the rest. All of this is pretty standard and doesn't need explanation for Go programmers.

ulikunitz commented 2 years ago

The probability tables are in the state structure (lzma/state.go).

tansy commented 2 years ago

If I'm not mistaken this is the table: fastpos_table.c. As I said, I don't know Go, so how does that relate to it? I'm far from understanding it, especially in a foreign language (I mean Go). And it is foreign to me.

But, regardless, you have no Makefile or any build scripts that would allow a user to compile it. Well, I downloaded Go, tried the instructions from the README, and nothing happened. Without that, how do you expect anyone to make any changes to the project, which is quite big? There are 79 .go files, and not all of them belong to the same "subproject". No one is going to investigate which file is what and create a build script for such a big project for the sake of a little change they would like to introduce and test.

PS. I also made a test and converted lz to lzma and lzma to lz by contriving a header and copying the LZMA stream, without making a footer. And the result surprised me: xz decompressed the LZMA stream taken from the lz file, and lzip decompressed the LZMA stream taken from the xz file. If those probabilities were indeed different, the decompressor would fail at some point, wouldn't it?

Attached converter script: lz-to-lzma.zip

ulikunitz commented 2 years ago

fastpos_table is not a probability table. What it does is explained in fastpos.h. I do the calculation for the match distance in line 78 of distcodec.go without using a fastpos table. I use an nlz32 function, which computes the number of leading zero bits of a 32-bit unsigned integer. The comment in fastpos.h references the bsr machine instruction, and what I'm doing is comparable to that approach. Actually, I did the implementation simply from the specification; I only looked at the reference implementations that were part of the specification.

The implementation for nlz32 is in bitops.go. In my update I will use the LeadingZeros32 function from the math/bits package of the Go standard library. The package didn't exist when I wrote the code. The function call will be translated by the compiler directly into a machine instruction. This will be faster than any table lookup.
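To illustrate the idea: the LZMA distance slot can be derived from the position of the highest set bit, which math/bits provides directly. This sketch mirrors the approach, not the exact code in distcodec.go:

```go
package main

import (
	"fmt"
	"math/bits"
)

// distSlot computes the LZMA distance-slot code for a match distance
// using bits.LeadingZeros32 instead of a fastpos table lookup. The
// compiler lowers LeadingZeros32 to a single machine instruction on
// most targets (e.g. x86 BSR/LZCNT), so no table is needed.
func distSlot(dist uint32) uint32 {
	if dist < 4 {
		return dist
	}
	msb := uint32(31 - bits.LeadingZeros32(dist)) // index of highest set bit
	return (msb << 1) | ((dist >> (msb - 1)) & 1)
}

func main() {
	for _, d := range []uint32{0, 3, 4, 6, 8, 12} {
		fmt.Println(d, distSlot(d))
	}
}
```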

Go doesn't require any build scripts. All that is needed is a go.mod file; the rest is handled by the go command. As I wrote, this is natural for a Go developer. Rust doesn't have build scripts either; it has the cargo command and the Cargo.toml file.

I have to correct myself: you cannot do lzip directly with my lzma package. You would need something comparable to your lz-to-LZMA converter, because my lzma reader and writer require the LZMA header. So I need to export a reader and writer for the raw LZMA stream without the header.

sorairolake commented 7 months ago

Hi.

I created a package which supports reading and writing lzip-compressed streams using the github.com/ulikunitz/xz/lzma package.

sorairolake commented 7 months ago

@ulikunitz At the moment, is it possible to set the match length limit for lzma.Writer when compressing? The dictionary size and this limit are used to set the compression level of the lzip command.^1

Level | Dictionary size | Match length limit
----- | --------------- | ------------------
-0 | 64 KiB | 16 bytes
-1 | 1 MiB | 5 bytes
-2 | 1.5 MiB | 6 bytes
-3 | 2 MiB | 8 bytes
-4 | 3 MiB | 12 bytes
-5 | 4 MiB | 20 bytes
-6 | 8 MiB | 36 bytes
-7 | 16 MiB | 68 bytes
-8 | 24 MiB | 132 bytes
-9 | 32 MiB | 273 bytes

So I think it would be useful to be able to configure this.
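Until such an option exists, only the dictionary-size half of each lzip level can be mapped onto the lzma package (via lzma.WriterConfig's DictCap field); the match length limit has no counterpart yet. A sketch of the level table, with values taken from the lzip manual:

```go
package main

import "fmt"

// lzipLevel mirrors one row of the lzip manual's compression-level table.
// DictSize could be fed into lzma.WriterConfig.DictCap; MatchLimit
// currently has no counterpart in the lzma package.
type lzipLevel struct {
	DictSize   int // bytes
	MatchLimit int // bytes
}

// lzipLevels[i] holds the parameters for lzip level -i.
var lzipLevels = [10]lzipLevel{
	{64 << 10, 16},
	{1 << 20, 5},
	{3 << 19, 6}, // 1.5 MiB
	{2 << 20, 8},
	{3 << 20, 12},
	{4 << 20, 20},
	{8 << 20, 36},
	{16 << 20, 68},
	{24 << 20, 132},
	{32 << 20, 273},
}

func main() {
	for lvl, p := range lzipLevels {
		fmt.Printf("-%d: dict %d B, match limit %d B\n", lvl, p.DictSize, p.MatchLimit)
	}
}
```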

ulikunitz commented 7 months ago

Hi,

Thanks for implementing the lzip format. I will reference it in the README in the future.

Regarding the question about such a limit: I have not implemented a match limit. It looks like it is intended to accept a worse compression ratio in exchange for faster compression. A small match limit makes repeats more probable, which could improve the compression ratio again.

I will leave the issue open to reflect the match limit issue. Note that I will support it only for the new implementation v0.6.0, which you can find in the rewrite branch.