ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Value length using Numeric Types #59

Closed Miosss closed 9 years ago

Miosss commented 9 years ago

ubjson.org states: "length (OPTIONAL): A positive, integer numeric type specifying the length of the following data payload." It also says: "There are 8 numeric types in Universal Binary JSON and are defined as:", and all of those types follow.

First of all, specifying a length only applies to strings and high-precision numbers (which are effectively strings too). Even if $# from containers also applies, the case remains the same. Given that, a length expressed as a float does not make sense: even if 36.56 effectively means 36, it is redundant, unnecessary, and complicates the code a bit. Most of all, it confuses everybody. Moreover, high-precision numbers are classified as numeric types too, which could be quite devastating if implementations that are not bulletproof get abused. And strings longer than 9EiB are not likely, I suppose.

I think that each value length should only be expressed as int8, int16, int32 or int64. I do not know what to think about uint8: should a string of length 200, for example, be described using uint8 or int16? If one chooses uint8, then what about the value 10? Is it int8 or uint8?

breese commented 9 years ago

The quote refers to "positive, integer numeric type". Floating-point numbers do not qualify as integers.

I do not understand your point about which integer type should be used. The length encoder is free to choose the integer type it sees fit. The length decoder must handle all integer types.
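
The decoder side of that rule can be sketched as follows. This is a minimal illustration, not code from any UBJSON library: `readLength` and `INTEGER_READERS` are hypothetical names, the marker letters are the ones listed in the UBJSON type reference, and accepting a zero length is an assumption of the sketch. Floats (`d`, `D`) and high-precision numbers (`H`) are rejected because they are not integer types.

```javascript
// Readers for the five UBJSON integer markers. DataView reads big-endian
// by default, which matches the byte order UBJSON uses.
const INTEGER_READERS = {
  i: (view, pos) => ({ value: view.getInt8(pos), size: 1 }),      // int8
  U: (view, pos) => ({ value: view.getUint8(pos), size: 1 }),     // uint8
  I: (view, pos) => ({ value: view.getInt16(pos), size: 2 }),     // int16
  l: (view, pos) => ({ value: view.getInt32(pos), size: 4 }),     // int32
  L: (view, pos) => ({ value: view.getBigInt64(pos), size: 8 }),  // int64 (BigInt)
};

// Hypothetical helper: read a length prefix that starts at `offset`.
// Returns the length and the offset of the first payload byte.
function readLength(view, offset) {
  const marker = String.fromCharCode(view.getUint8(offset));
  const read = INTEGER_READERS[marker];
  if (!read) {
    throw new Error(`'${marker}' is not an integer type and cannot express a length`);
  }
  const { value, size } = read(view, offset + 1);
  if (value < 0) {
    throw new Error("length must not be negative");
  }
  if (value > Number.MAX_SAFE_INTEGER) {
    throw new Error("length is too large for this sketch");
  }
  return { length: Number(value), next: offset + 1 + size };
}

// [U][200]: the encoder chose uint8, and the decoder accepts it.
const sample = new DataView(new Uint8Array([0x55, 200]).buffer);
console.log(readLength(sample, 0)); // { length: 200, next: 2 }
```

Whichever integer type the encoder picks, the caller only sees a plain length, so the choice of type stays an encoder-side detail.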

Miosss commented 9 years ago

Wow, I must have been blind for a moment. So the floats are OK, but what about high-precision numbers?

And since we chase maximum space optimization, shouldn't the encoder be forced to use the smallest possible type? Maybe it shouldn't; I'm confused now.

kxepal commented 9 years ago

@Miosss http://ubjson.org/type-reference/value-types/#numeric-sign-min-max TL;DR: as per the IEEE 754 specification.

dmitry-ra commented 9 years ago

Proposed algorithm for value length detection in the current draft implementation (https://github.com/dmitry-ra/ubjson-test-suite):

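function findSuitableNumericType(number, optimizeFloats, dataView) {
    // NaN and +/-Infinity have no numeric representation here, so they map to null.
    if (!isFinite(number))
        return Types.Null;

    if (isInteger(number)) {
        // Try the integer types from narrowest to widest.
        if (number >= MinInt8 && number <= MaxInt8)
            return Types.Int8;

        if (number >= MinUInt8 && number <= MaxUInt8)
            return Types.UInt8;

        if (number >= MinInt16 && number <= MaxInt16)
            return Types.Int16;

        if (number >= MinInt32 && number <= MaxInt32)
            return Types.Int32;

        // Only values above MaxInt32 and up to MaxUInt32 are emitted as int64;
        // every other integer falls through to high-precision.
        if (number > MaxInt32 && number <= MaxUInt32)
            return Types.Int64;

        return Types.HighNumber;
    } else {
        if (optimizeFloats) {
            // Use float32 when the value survives a float32 round trip unchanged.
            var strNumber = number.toString();
            dataView.setFloat32(0, number);
            var str32 = dataView.getFloat32(0).toString();
            if (str32 === strNumber)
                return Types.Float32;
            // A high-precision string shorter than 6 characters takes fewer
            // bytes than a 9-byte float64.
            if (strNumber.length < 6)
                return Types.HighNumber;
        }
        return Types.Float64;
    }
}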

It converts 0.25 as "[d][0.25]" (float32, 5 bytes). It converts 0.3 as "[H][i][3][0.3]" (high-precision, 6 bytes). It converts 0.314159 as "[D][0.314159]" (float64, 9 bytes)

demo: http://dmitry-ra.github.io/ubjson-test-suite/json-converter.html#{%22a%22:0.25,%22b%22:0.3,%22c%22:0.314159}
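
The byte counts quoted above can be checked with a short sketch. It assumes the same decision rule as the proposed function, but `encodedFloatSize` is a hypothetical name, and `Math.fround` is swapped in for the `DataView` round trip (both test whether the value is exactly representable as a float32). The threshold of 6 follows from the sizes: a high-precision value costs 1 marker byte, 2 bytes for an int8 length, and one byte per character, so it beats the 9-byte float64 only when the string is shorter than 6 characters.

```javascript
// Hypothetical helper: marker and encoded size in bytes for a finite,
// non-integer number, following the rule discussed in this thread.
function encodedFloatSize(number) {
  // Exactly representable as float32: [d] + 4 payload bytes.
  if (Math.fround(number) === number) {
    return { type: "d", bytes: 1 + 4 };
  }
  const text = number.toString();
  // Short decimal string: [H][i][len] + one byte per character.
  if (text.length < 6) {
    return { type: "H", bytes: 1 + 2 + text.length };
  }
  // Otherwise fall back to float64: [D] + 8 payload bytes.
  return { type: "D", bytes: 1 + 8 };
}

console.log(encodedFloatSize(0.25));     // { type: 'd', bytes: 5 }
console.log(encodedFloatSize(0.3));      // { type: 'H', bytes: 6 }
console.log(encodedFloatSize(0.314159)); // { type: 'D', bytes: 9 }
```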

kxepal commented 9 years ago

Bikeshedding: the uint8 check had better come before the int8 one, because values in 0..255 are more common in the wild than values in -128..127, especially when you are dealing with strings, huges, and other sized things. (:
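
For lengths specifically, that ordering could look like the sketch below. `lengthMarker` is a hypothetical name, and the sketch assumes lengths are non-negative safe integers. Because a length is never negative, int8 adds nothing once uint8 is checked first.

```javascript
// Hypothetical helper: pick the marker of the narrowest integer type that
// can hold a length, checking uint8 before any signed type.
function lengthMarker(length) {
  if (!Number.isSafeInteger(length) || length < 0) {
    throw new RangeError("length must be a non-negative safe integer");
  }
  if (length <= 0xff) return "U";        // uint8: 0..255
  if (length <= 0x7fff) return "I";      // int16: up to 32767
  if (length <= 0x7fffffff) return "l";  // int32: up to 2147483647
  return "L";                            // int64
}

console.log(lengthMarker(10));  // U
console.log(lengthMarker(200)); // U
console.log(lengthMarker(256)); // I
```

The ordering does not change the encoded size: both int8 and uint8 take one payload byte, so it only decides which marker is written for values in 0..127.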

dmitry-ra commented 9 years ago

Agreed. But in all examples we see int8 type for this range of values (4, 7, 8, 5 for example): http://ubjson.org/type-reference/

[i][4][name][S][i][16][monalisa octocat]
[i][7][company][S][i][6][GitHub]
[i][4][blog][S][i][23][https://github.com/blog]
[i][8][location][S][i][13][San Francisco]
[i][5][email][S][i][18][octocat@github.com]

Miosss commented 9 years ago

@kxepal OK, so the only problem left for me is that it would be cleaner to state that high-precision numbers are not allowed as a value-length specification. As for int8/uint8, I wrote a little about it in #56.

ghost commented 9 years ago

@Miosss I think you have a great point about the doc - clarity added - http://ubjson.org/#data_format