uwiger / sext

Sortable Erlang Term Serialization
Apache License 2.0
106 stars 52 forks source link

Big ints don't sort properly #5

Closed reiddraper closed 8 years ago

reiddraper commented 11 years ago

Riak uses sext for secondary indexes to create binary keys for leveldb. A user noticed that some integer 2i queries were not returning results when they were expected to. We seem to have narrowed the issue down to sext. Here's an example, in an erlang shell:

76> TA = 1000000000000.
1000000000000
77> TB = 2000000000000.
2000000000000
78> TA < TB.
true
79> sext:encode(TA) < sext:encode(TB).
false

You can provoke this bug by using this patch to the EQC tests. h/t to @seancribbs and @vagabond for helping track this down.

Vagabond commented 11 years ago

So, to dig a little deeper with that example:

37> sext:encode(TA).
379: encode_int(1000000000000, none)
432: Bl = [232,212,165,16,0]
434: Bb = <<232,212,165,16,0>>
<<11,244,117,52,177,8,0,8,0>>
38> sext:encode(TB).
379: encode_int(2000000000000, none)
432: Bl = [1,209,169,74,32,0]
434: Bb = <<1,209,169,74,32,0>>
<<11,128,244,117,52,169,4,0,8,0>>

TA is represented with 40 bytes, TB is represented with 41. This means that the first byte in the binary is smaller for TB, even though the binary is longer. Binaries in erlang seem to ignore length for the purposes of sorting, so as soon as it sees that 244 > 128, the comparison returns TA > TB.

One suggestion I had would be to store the number of bits needed to represent the number as the byte right after the tag byte, then your sort order would be a little better, although numbers with a bit size > 255 would be problematic, as would backwards compatability.

uwiger commented 11 years ago

Try git pull https://github.com/uwiger/sext uw-bignum-encode

INCOMPATIBILITY: changed encoding of bignums

Bignums were encoded in a way that didn't preserve the sorting property.

To fix this, I have prepended a length indicator. In order to be
able to read legacy-encoded bignums, I made use of the fact that
the first byte returned in the return value from encode_big1/1
was never > 254. Thus, 255 is used to indicate the new format. I.e.
<<11, 255, ...length indicator..., ...encoded bignum...>>

The length indicator gives the number of bytes, and will usually
be one byte long. If the number of bytes is > 127, the length
indicator will be encoded as a sequence of 7-bit "septets", where
each except the last is tagged with 1 in the MSB.

The decode function removes the size indicator, and also recognizes
that if the first byte is =< 254, there is no size indicator,
and decodes a legacy bignum.

This means that a data set can be converted by simply decoding and
encoding each value.

For further backwards compatibility, the function
sext:legacy_encode_bignum/1 encodes a bignum in the old format.
reiddraper commented 11 years ago

Thanks, I can confirm this passes EQC + my new EQC changes at #6. Woot.

uwiger commented 11 years ago

The change has been merged into master and labeled 1.0 (there are no more bugs now ;-)

2013/3/5 Reid Draper notifications@github.com

Thanks, I can confirm this passes EQC + my new EQC changes at #6https://github.com/uwiger/sext/issues/6. Woot.

— Reply to this email directly or view it on GitHubhttps://github.com/uwiger/sext/issues/5#issuecomment-14453672 .

engelsanchez commented 11 years ago

Hi Ulf. Engel Sanchez from Basho here. I'm one of the people working on the effort to seamlessly fix sext encoded data on our next bug fix release. It seems we are going to need a way to encode an erlang term with big ints encoded using the legacy encoding. The function to do the legacy encoding on a standalone big int is not quite enough for what we need to do. I've read the sext code and understand enough to send you a PR, but in case you prefer to prepare one yourself, please do :). Otherwise, I'll be sending you a PR later today.

uwiger commented 11 years ago

I'm going to sleep now, so feel free. It should be pretty straightforward.

Ulf Wiger http://ulf.wiger.net/

7 mar 2013 kl. 20:50 skrev Engel Sanchez notifications@github.com:

Hi Ulf. Engel Sanchez from Basho here. I'm one of the people working on the effort to seamlessly fix sext encoded data on our next bug fix release. It seems we are going to need a way to encode an erlang term with big ints encoded using the legacy encoding. The function to do the legacy encoding on a standalone big int is not quite enough for what we need to do. I've read the sext code and understand enough to send you a PR, but in case you prefer to prepare one yourself, please do :). Otherwise, I'll be sending you a PR later today.

— Reply to this email directly or view it on GitHub.

jrwest commented 11 years ago

PR for changes #7

benoitc commented 8 years ago

shouldn't it be closed?

uwiger commented 8 years ago

Indeed. Thanks.