pierrec / lz4

LZ4 compression and decompression in pure Go
BSD 3-Clause "New" or "Revised" License

No way to get uncompressed data len #134

Closed xakepp35 closed 2 years ago

xakepp35 commented 3 years ago

In the example we have the following odd code block:

// Allocated a very large buffer for decompression.
out := make([]byte, 10*len(data))
n, err = lz4.UncompressBlock(buf, out)

What is 10*len(data)? Why is it not 9, or 11? I am tight on memory: will it fail with a factor of 9? Of 8? Of 7? Where is the API to determine the exact size of the buffer required for decompression?!

greatroar commented 3 years ago

Here's a function (not very well tested) for computing the decompressed size of a block:

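// UncompressedSize returns the decompressed size of the LZ4 block src,
// or -1 if src is malformed. It needs the "encoding/binary" import and
// minMatch, the minimum match length of the LZ4 block format (4).
func UncompressedSize(src []byte) (size int64) {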
    for len(src) > 0 {
        b := int64(src[0])
        src = src[1:]

        lLen := b >> 4
        if lLen == 0xF {
            for {
                if len(src) == 0 {
                    return -1
                }
                add := int64(src[0])
                lLen += add
                if lLen < 0 {
                    return -1
                }
                src = src[1:]
                if add != 0xFF {
                    break
                }
            }
        }

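        // The literals follow the length field; skip them so the next
        // bytes read are the match offset, rejecting truncated input.
        if lLen > int64(len(src)) {
            return -1
        }
        src = src[lLen:]
        size += lLen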

        switch len(src) {
        case 0:
            return size
        case 1: // No space for a 16-bit offset.
            return -1
        }

        offset := int64(binary.LittleEndian.Uint16(src))
        if offset == 0 {
            return -1
        }
        src = src[2:]

        mLen := b & 0xF
        if mLen == 0xF {
            for {
                if len(src) == 0 {
                    return -1
                }
                add := int64(src[0])
                mLen += add
                if mLen < 0 {
                    return -1
                }
                src = src[1:]
                if add != 0xFF {
                    break
                }
            }
        }
        mLen += minMatch
        size += mLen // A match also contributes to the decompressed size.
    }

    return -1
}

I've been thinking of submitting this as a PR, but haven't got round to it yet. In particular, it doesn't validate offsets and doesn't handle preset dictionaries ("linked blocks").
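The two length loops in the function above decode the same kind of field, so they could be factored into one helper. The following is only a sketch: `readExtLen` is a hypothetical name, not part of pierrec/lz4, and it ignores integer overflow on very long inputs.

```go
package main

import "fmt"

// readExtLen decodes one LZ4 length field. nibble is the 4-bit value taken
// from the token; src holds the bytes that follow it. It returns the decoded
// length, the number of extension bytes consumed, and false if src ends
// before the field is complete.
// Hypothetical helper for illustration; it is not part of pierrec/lz4.
func readExtLen(nibble int, src []byte) (length, used int, ok bool) {
	length = nibble
	if nibble != 0xF {
		return length, 0, true // short form: the nibble is the whole length
	}
	for used < len(src) {
		add := int(src[used])
		used++
		length += add
		if add != 0xFF {
			return length, used, true // a byte below 255 ends the field
		}
	}
	return 0, used, false // input ended before the terminating byte
}

func main() {
	// Token 0xF0 has a literal-length nibble of 15, so extension bytes
	// follow: 15 + 255 + 10 = 280.
	token := byte(0xF0)
	n, used, ok := readExtLen(int(token>>4), []byte{0xFF, 0x0A})
	fmt.Println(n, used, ok) // prints "280 2 true"
}
```

The same call with `int(token & 0xF)` would decode the match-length nibble, to which the minimum match length still has to be added.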

pierrec commented 3 years ago

When using LZ4 block compression, there is no easy way to get the size of the uncompressed data; it is left to the user of the format to handle that in whatever way suits them. The typical way is to prefix the compressed data with that size.
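As a rough illustration of the size-prefix approach, here is a minimal sketch using only the standard library. The 4-byte little-endian layout and the names `addSizePrefix` and `splitSizePrefix` are assumptions made for this example; they are not part of the LZ4 format or of this package.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// prefixLen is the size of the length header (an assumed, custom layout).
const prefixLen = 4

// addSizePrefix returns block preceded by rawLen, the uncompressed size,
// stored as a 4-byte little-endian integer.
func addSizePrefix(rawLen uint32, block []byte) []byte {
	out := make([]byte, prefixLen+len(block))
	binary.LittleEndian.PutUint32(out, rawLen)
	copy(out[prefixLen:], block)
	return out
}

// splitSizePrefix reverses addSizePrefix: it returns the stored uncompressed
// size and the compressed block that follows the header.
func splitSizePrefix(buf []byte) (rawLen uint32, block []byte, err error) {
	if len(buf) < prefixLen {
		return 0, nil, errors.New("buffer too short for size prefix")
	}
	return binary.LittleEndian.Uint32(buf), buf[prefixLen:], nil
}

func main() {
	// Stand-in for the bytes a block compressor would return:
	// token 0x10 (one literal, no match) followed by the literal 'a'.
	block := []byte{0x10, 'a'}
	framed := addSizePrefix(1, block)

	rawLen, rest, err := splitSizePrefix(framed)
	fmt.Println(rawLen, len(rest), err) // prints "1 2 <nil>"
}
```

With the recovered size, the caller can allocate `make([]byte, rawLen)` and pass it as the destination to `lz4.UncompressBlock(block, out)` instead of guessing a multiplier. If the input is untrusted, the prefix should be checked against a sane upper bound before allocating.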

I have attempted to do what @greatroar proposes, but it makes decompression pretty much twice as slow.

In general though, it is better to use the LZ4 frame format than the raw block one.

greatroar commented 3 years ago

I'm not proposing to do this before decompressing. It might be a useful freestanding function for some applications, but I must admit I wouldn't use it myself.

pierrec commented 2 years ago

Please use a custom format if needed. LZ4 will only support the standard LZ4 format as per the reference implementation.