ronanh / intcomp

Fast integer compression library
Apache License 2.0
81 stars 4 forks source link

Function signatures #14

Open pascaldekloe opened 1 year ago

pascaldekloe commented 1 year ago

Encoding as uint32 or uint64 slices is a bit cumbersome. We can use byte slices instead without any loss of performance. Go even optimizes the binary.ByteOrder into native operations.

Decoding needs an error return. The Uncompress functions are panic bombs now. This is bad for performance, e.g, pull request #13.

The output destination should always be the first argument, like a C copy or fprintf, and like a Go io.Copy or fmt.Fprintf. Append functions are better of by naming their behaviour explicitly, e.g., strconv.AppendInt, utf8.AppendRune or time.AppendFormat.

DeltaBin and DeltaVar describe the internal mechanism, which is not useful to the API reader/user.

The trailing-zero optimization is just as useful on signed and unsigned integers.

I propose the following API.

const BlockSize32, BlockSize64 = 128, 256
func AppendEnc64[T uint64|int64](dst []byte, values ...T) []byte
func AppendBlockEnc64[T uint64|int64](dst []byte, values ...T) (out []byte, remain []T)
func AppendDec64[T uint64|int64](dst []T, src []byte) (out []T, err error)
func AppendBlockDec64[T uint64|int64](dst []T, src []byte) (out []T, remain []byte, err error)
ronanh commented 1 year ago

Yeah, had the same thoughts.

This whole thing started as a hack since I couldn't find what I needed in other libraries. Then realised that it would not make much sense in having this kind of stuff as close source and decided to publish it with little changes as I already had spent too much time on it. It came as a surprise that it got attention that fast, and now feel stuck with the initial public API.

But it if we choose to change it, better sooner than later.

About encoding to []byte, of course it's much nicer. The java lib which I took inspiration did this, and I thought it was for performance reason, so kept with it. It would be interesting to compare if it's really the case.

I chose to make DeltaBin public since I thought that it may be useful to someone to use the binary encoding (with all the code generation) without the block encoding part that is specific to this library. Does that make sense?

pascaldekloe commented 1 year ago

Publishing makes sense. You definitely want testers and second opinions on something like this.

The v1 tag is unfortunate indeed. You could switch the namespace to github.com/ronanh/deltacomp, with a block sub-directory/package for the header stuff. Good idea to separate, yes.

Just remove the tags first! It will not affect the content on pkg.go.dev. There's a migrate/rename option in GitHub, which preserves the history, and it redirects the current location(s). Flip the go.mod declaration and good to go? 🙃

Bit handling in Java is a disaster, partially due to big-endian on SPARC, and mostly because optimization never made it as a priority there. 🤭 Go switched to native "word moves" a few years ago. I am not sure how ARM64 deals with memory alignment, now I come to think of it. Needs some testing indeed.

I would love it if you drop copyrights and the Apache library in favour of CC0. In that way I can reuse the code in my code generator. No hard feelings if you don't.

pascaldekloe commented 1 year ago

Yup, the alignment holds us back, especially since binary append can only handle one word at a time. I'd go for unsafe conversion from integer to byte slices, which results in native endianness. That way we get bytes for free.

https://gist.github.com/pascaldekloe/9af7d2bb1a1bc88e162037bbc213f374?permalink_comment_id=4523936#gistcomment-4523936

pascaldekloe commented 1 year ago

The output should always apply 64-bit words, regardless of the input width. This is too much of an overhaul perhaps. We got something here that works. Perhaps I can try an alternative version, without copyrights, and share back with you that way?

ronanh commented 1 year ago

Thanks for the benchmarks. Can there be memory alignment issues if make unsafe conversion from ints to []byte? Since we don't control the allocation of compressed buffers I don't think there's any guarantee that it's 8 bytes aligned? See the Counter example trick here to force alignment on []byte.

Conversely, what are the drawbacks of uint slice for compressed data, appart the aesthetics aspect?

pascaldekloe commented 1 year ago

Simply produce and error on unaligned input. This should never happen as we read and consume full words only. A warning in the function documentation should be enough.

if unsafe.Pointer(&in[0]) & 7 != 0 {
        return errors.New("incomp: unaligned input denied")
}

How'd you store or read a []uint64? 😉

ronanh commented 1 year ago

Fine for checking that input is aligned. What I don't get is how do you ensure that the input you supply is properly aligned. Does allocating a multiple of 8 bytes does the trick? What happens when you read serialised data?

pascaldekloe commented 1 year ago

New allocs are always aligned. Just won't work when people continue with custom slices.

pascaldekloe commented 1 year ago

Maybe uint64 slices are not so bad after all, with Go's alignment guarantees in place.

https://pkg.go.dev/github.com/pascaldekloe/wordpack@v0.2.0/pack64#Write

Only problem is making that header fit without too much waste.

jeschkies commented 3 months ago

I just came across this nice library and was wondering what's the best way to write the compressed slice to a byte buffer.

ronanh commented 3 months ago

I just came across this nice library and was wondering what's the best way to write the compressed slice to a byte buffer.

binary package provides a way to serialize primitive types to a byte slice. If you don't need portability converting slices using unsafe is faster.

If you like exploring stuff, I have created another repo ronanh/compexperiments which takes a different approach and uses same type of compression. It's mostly useful for in-memory compressed slices which can be read and modified (append) without decompressing the whole buffer. It's a working library but is unfortunately undocumented, I had plans to create a proper lib but never found the time for that yet.