jeromefroe / tsz-rs

A crate for time series compression based upon Facebook's Gorilla whitepaper
https://docs.rs/tsz/
MIT License
86 stars 15 forks source link

The length of the number of leading zeros probably should be 5 bits #13

Open lemolatoon opened 2 months ago

lemolatoon commented 2 months ago

I found in this crate, the number of leading zeros is encoded in 6 bits where the control bit is '11'. However, according to Gorilla's paper, it should be encoded in 5 bits.

(b) (Control bit ‘1’) Store the length of the number of leading zeros in the next 5 bits, then store the length of the meaningful XORed value in the next 6 bits. Finally store the meaningful bits of the XORed value.

The corresponding codes are here. https://github.com/jeromefroe/tsz-rs/blob/b3e2dce64707c42c10019d9159f88a0f594458af/src/encode/std_encoder.rs#L132-L138 https://github.com/jeromefroe/tsz-rs/blob/b3e2dce64707c42c10019d9159f88a0f594458af/src/decode/std_decoder.rs#L143

Is there a specific reason to use 6 bits for the leasing zeros in this crate?

jeromefroe commented 2 months ago

Hi @lemolatoon! It's been a while since I've worked on this crate so I don't recall for sure, but I think I may have just misread the paper and used 6 bits instead of 5 when implementing it 🤦

lemolatoon commented 2 months ago

@jeromefroe Thanks for your reply! I understand the circumstance. I could write some patches to fix this, but it would be a big breaking change. I will keep it. If you think this should be fixed, I'll send a Pull Request, so I would like you to let me know.

jeromefroe commented 2 months ago

If you want to fix it, I'd be more than happy to accept the pull request. It would just have to be released as a new version since as you noted it would be a breaking change.

lemolatoon commented 1 month ago

I was exercising around the Gorilla Compression. I found that If we have more than or equal to 32 bits of leading zeros, we cannot encode it in 5 bits. In that case, we should treat it as 31 bits of leading zeros, and the rest of zeroes as significant bits (in other words, meaningful bits). I found another implementation of Gorilla, and they do like that.

https://github.com/panagiotisl/chimp/blob/main/src/main/java/fi/iki/yak/ts/compression/gorilla/Compressor.java#L81-L84

I an thinking about writing patches and open pull requests for this issue when I am free.

Thank you!