ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Clarify the representation of 'huge' #30

Closed rickb777 closed 11 years ago

rickb777 commented 11 years ago

'huge' would appear to be a base-10 representation of the number as a string.

Being awkward, I might prefer base-36 because it is more compact, and might thereby produce a spec-compliant but incompatible implementation.

The representation of 'huge' needs to be less ambiguous.

ghost commented 11 years ago

Rick, any ambiguity in the spec should be eradicated with prejudice :)

Can you let me know where in the Type Reference (the updated section of the spec) the ambiguity around HIGH PRECISION (aka "HUGE") is? http://ubjson.org/type-reference/value-types/#string

As to the suggestion of Base-36: it is certainly appealing that it is more compact, but I would reject it for two reasons:

  1. Not immediately intuitive. Not everyone is familiar with base conversions. Also limits the readability of a UBJ dump. If compressibility is an issue, you could also store the base36 encoded value as a STRING?
  2. From what I can tell of Base-36, it is a numeric representation of a number and is bounded by a range of values. The HIGH PRECISION type was meant to represent absurdly large numeric values. I don't see how we can represent that with a numeric type unfortunately.

Let me know if I missed something!

rickb777 commented 11 years ago

The ambiguity is simply that it is not clear whether the string is a base-ten representation of the decimal value of the huge number, although that is probably what would be assumed. It is also unclear whether 'huge' is a decimal number that allows a fractional part and/or an exponent part, or just a big integer.

The obvious assumption would be that 'huge' is like Java's BigDecimal class (an unlimited-precision decimal number) converted to a decimal string and encoded as a UTF-8 byte sequence.

However, it would be equally valid to assume that 'huge' is instead like Java's BigInteger (an unlimited-size integer) converted to a decimal string and encoded as a UTF-8 byte sequence.
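
To make the two readings concrete, here is a minimal Java sketch, assuming only that the payload of 'huge' is a base-10 string encoded as UTF-8 bytes; the surrounding marker/length framing is omitted, and the class and method names are mine, not the spec's:

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

public class HugeEncodings {

    // Reading 1: 'huge' as an unlimited-precision decimal (BigDecimal),
    // serialised via its base-10 string form, then encoded as UTF-8 bytes.
    static byte[] encodeAsBigDecimal(BigDecimal value) {
        return value.toPlainString().getBytes(StandardCharsets.UTF_8);
    }

    // Reading 2: 'huge' as an unlimited-size integer (BigInteger),
    // also serialised as a base-10 string in UTF-8.
    static byte[] encodeAsBigInteger(BigInteger value) {
        return value.toString(10).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Representable under reading 1 only: a value with a fractional part.
        BigDecimal withFraction = new BigDecimal("3.14159265358979323846264338327950288");
        System.out.println(new String(encodeAsBigDecimal(withFraction), StandardCharsets.UTF_8));

        // Fine under either reading: a 157-digit integer (2^521 - 1).
        BigInteger hugeInt = BigInteger.valueOf(2).pow(521).subtract(BigInteger.ONE);
        System.out.println(new String(encodeAsBigInteger(hugeInt), StandardCharsets.UTF_8));
    }
}
```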

Unfortunately, representing 'huge' as a string means a significant expansion in the space required over the binary form. For example, a 32-bit integer takes up to ten ASCII characters - an expansion from 4 to 10 bytes.

So this could be mitigated by supporting both huge and a new intAny. An example use-case for intAny might be cryptography, and the representation would be .

Meanwhile, hugeDecimal would simply be the decimal (i.e. base-10) representation with an optional fraction and/or exponent.

(Aside - the base36 suggestion is rather tongue in cheek - good idea perhaps but not really what people expect. It's very easy to do in Java, for a bit of fun!)
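
(For what it's worth, the base-36 round trip really is only a couple of lines in Java; a throwaway sketch, purely to illustrate the aside above:)

```java
import java.math.BigInteger;

public class Base36Aside {
    public static void main(String[] args) {
        BigInteger value = new BigInteger("123456789012345678901234567890"); // 30 decimal digits

        // Convert to base-36 and back; the base-36 form is noticeably shorter,
        // but it only works for integers and is far less readable in a dump.
        String base36 = value.toString(36);
        BigInteger roundTripped = new BigInteger(base36, 36);

        System.out.println(base36 + " (" + base36.length() + " chars, down from "
                + value.toString(10).length() + ")");
        System.out.println(roundTripped.equals(value)); // prints: true
    }
}
```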

So to summarise my suggestion,

kxepal commented 11 years ago

Base-36 is just a string codec. Why not use something better, like lzma? (:

To be serious, base-36 is not acceptable, since the HIPREC value should follow the JSON number type specification, which allows values like -1.93E+190, and I don't feel it's rational to apply additional transformations to them.
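
For reference, the JSON number grammar (as given on json.org) is easy to check against such values; a rough Java sketch, where the regex is my own transcription of that grammar rather than anything taken from the UBJSON spec:

```java
import java.util.regex.Pattern;

public class JsonNumberCheck {
    // JSON number grammar: optional minus, integer part with no leading zeros,
    // optional fraction, optional exponent.
    private static final Pattern JSON_NUMBER =
            Pattern.compile("-?(0|[1-9][0-9]*)(\\.[0-9]+)?([eE][+-]?[0-9]+)?");

    static boolean isJsonNumber(String s) {
        return JSON_NUMBER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isJsonNumber("-1.93E+190"));   // true: fraction and exponent allowed
        System.out.println(isJsonNumber("2.71828182845904523536028747135266249775724709")); // true
        System.out.println(isJsonNumber("3t8qcdy4ymw3sqkzknvix")); // false: base-36 digits are not a JSON number
    }
}
```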

rickb777 commented 11 years ago

Base36 was a joke, man!

ghost commented 11 years ago

WONTFIX

@rickb777 I understand your point, but I believe this to be an optimization for a very small corner case of the UBJSON specification. Use of high precision values (aka "huge") is expected to be very infrequent. As you pointed out, there are more optimized ways to store these values, but I don't want to clutter the spec for a 3% use-case optimization.

As for the clarification of the specification itself, the spec for HIGH PRECISION literally says that the format follows the JSON spec's requirement for the number type -- whatever JSON dictates, we dictate here. Ambiguity might be an unfortunate side effect of this (e.g. base36 vs base10 -- using your previous joke), but I am not going to try and reduce that scope in the UBJSON spec.