Closed daniel-j-h closed 3 years ago
@daniel-j-h
Pull requests invited!
Small pull request at https://github.com/lemire/streamvbyte/pull/33
I appreciate your detailed response :raised_hands: I was looking into https://github.com/iiSeymour/pystreamvbyte to play around with, and there we create an `np.empty` sized to the estimated maximum compressed bytes. The numpy array always gets allocated (just not initialized), which is why I thought computing the exact number of bytes required would be a great addition to the C version here to begin with.
We don't know the required output memory upfront, so we use a function returning the worst-case memory required:
https://github.com/lemire/streamvbyte/blob/635d1c5ea63a1304762bba3c3e2e1154e9c83348/include/streamvbyte.h#L28-L35
but when we are encoding small integers (or small deltas), most if not all values often fit into a single byte. In these cases we still have to allocate the full four data bytes per integer upfront, whereas a single byte per integer would suffice.
There are use cases where I'd like to allocate, say, only 1 GB instead of 4 GB rather than throwing away 3 GB immediately after encoding.
Should this library provide a two-pass approach, where a first pass computes the exact compressed size and a second pass performs the actual encoding?
This two-pass approach might be slower in terms of runtime, but we can reduce the allocations required for data bytes by a factor of four in the best case.
Users can write their own version (summing up the bytes required per input item), but having a function in the library would be great for convenience and would allow for efficient implementations in the future. Thoughts?