lemire / FastPFor

The FastPFOR C++ library: Fast integer compression
Apache License 2.0
849 stars 124 forks source link

Input length in Simple8b and Simple16 codecs #69

Open gustingonzalez opened 4 years ago

gustingonzalez commented 4 years ago

At the end of decoding in the specified codecs, some checks are performed to verify that the pointer of the input data is less than the estimated data to read: https://github.com/lemire/FastPFor/blob/224fe919760c073566334dde6f510769aff41beb/headers/simple8b.h#L640 https://github.com/lemire/FastPFor/blob/224fe919760c073566334dde6f510769aff41beb/headers/simple16.h#L748

Although this makes sense in debug mode to check consistency, the nvalue parameter is enough to decode. So, if this check does not perform another task than the mentioned, (1) maybe the len parameter could be removed; or (2) the assertions could be modified to allow len = 0, which is useful when the data len is not known , e.g.:

assert(len == 0 || in64 <= finalin64)

lemire commented 4 years ago

(1) maybe the len parameter could be removed;

The assert is meant to check something that ought to be true. Why remove it?

Note that you can remove asserts at compile-time.

the assertions could be modified to allow len = 0

Can you elaborate? Are you saying that you could have len == 0 and in64 > finalin64 as a valid result?

gustingonzalez commented 4 years ago

Consider the Simple8b codec. On decoding, the finalin variable is computed using endin as: https://github.com/lemire/FastPFor/blob/224fe919760c073566334dde6f510769aff41beb/headers/simple8b.h#L511

On the other hand, endin variable is computed as: https://github.com/lemire/FastPFor/blob/224fe919760c073566334dde6f510769aff41beb/headers/simple8b.h#L495

Anyway, these variables are not involved in the decoding process, that is performed by using only the nvalue parameter. This means that the decode will works regardless of the value of endin and finalin variables.

More specifically, if we don't know the input lenght, but we know the nvalue, we could specify the len as 0. So, the decoding process will be correct anyway. However, the endin variable would have been computed at first as: endin = in + len => endin = in + 0 => endin = in

Since endin is equal to the in pointer in its initial state, and at the end of the decoding the in pointer has a greater value, the involved assert will fails.

lemire commented 4 years ago

@gustingonzalez Please submit a PR. In your PR, document that "len = 0" means "len is unknown". I will merge it and you will be credited for the change.