ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Fixed-length N-Dimensional Arrays for UBJSON #61

Closed Steve132 closed 9 years ago

Steve132 commented 9 years ago

Fixed-length N-Dimensional Arrays for UBJSON

So, a LOT of the dramatic changes to UBJSON that we are currently discussing seem to stem primarily from Issue https://github.com/thebuzzmedia/universal-binary-json/issues/43

In that issue, basically, as a way of handling dense multi-dimensional data inside an STC, a complex recursive schema header was proposed.

I pointed out that a simple "multi-dimensional" array marker would allow for dense multi-dimensional data to be stored in an array contiguously without the complexities of a recursive schema.

Eventually, however, the conversation devolved to debating various schemas and it was decided that debating the merits of more complicated schemas or repeating headers (or other issues) would be moved to other issue numbers.

Issue #43 was resolved by saying that "Yes, STCs Can be embedded into other STCs"

However, this resolution did not actually address the original use case from @edgar-bonet which inspired ALL of this discussion: being able to store dense multi-dimensional data inside STCs, which would allow him to store his ND arrays from JSON and his application efficiently.

Motivation:

Dense multidimensional arrays of contiguous numbers are incredibly useful for many of the most important use cases of UBJSON, including image transport, sensor data, matrices, scientific data, etc.

If this data requires a parsing step, or has intermediate markers, then it becomes much more difficult to use as a block of memory or as a mem-mapped array in a file.

Including an optional marker to specify multi-dimensional data would allow clients to preserve the structure of multi-dimensional data in JSON or other markup formats while also allowing for extremely efficient parsing and use.

Proposal:

This proposal builds on draft 12 (current draft with nested STCs). No other drafts should be considered.

Instead of a '#' marker, a '@' marker may be found in the stream for a container header. If a '@' marker is found, the next byte is an 8-bit integer describing the number of dimensions that container has. (This is intentional, as an array with more than 255 dimensions seems absurd on its face, and making this number huge could be a way for a malicious data producer to attack an implementation. Making this number always 8-bit saves us one byte, and no serious application will ever have a >255-d array.)

Following the dimensionality indicator number (D), there will then be D UBJSON integer values (dim[0]->dim[D-1]) that give the dimensions of the array.

The dimensions are ordered in 'increasing' order, meaning that dim[0] describes how many values to skip in the linear data to get to the end of the first line, dim[0]*dim[1] is the number of values to skip to get to the end of the first 'square', dim[0]*dim[1]*dim[2] is the number of values to skip to get to the end of the first 'cube', etc.

Example: A 720p bitmap image of 8-bit pixels

[[][$][U][@][3][i][3][I][1280][I][720]
    [23][23][42] [23][13][12] [13][13][42] ...Repeat 1280 sets of 3
    [43][43][42] [43][43][12] [23][53][32] ...Repeat 1280 sets of 3
    ...Repeat 720 rows.

...no closing marker needed

It's important to note that the binary representation of the payload is DIRECTLY the binary layout expected by most image applications.

One can embed these kinds of structures and large matrices directly into scientific programs or memory. They can be loaded efficiently with NumPy. They can even directly be memory-mapped for big scientific datasets.
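
For concreteness, here is a minimal NumPy sketch of the memory-mapping claim. It assumes a hypothetical file frame.ubj holding exactly the 720p example above, whose header ([[][$][U][@][3][i][3][I][1280][I][720]) happens to occupy the first 13 bytes; those names and offsets are assumptions for this example only:

import numpy as np

# 13 header bytes, then 3*1280*720 payload bytes. The header dims are
# fastest-varying first (3, 1280, 720), so the C-order shape seen by numpy
# is the reverse of the header order.
HEADER_BYTES = 13
pixels = np.memmap("frame.ubj", dtype=np.uint8, mode="r",
                   offset=HEADER_BYTES, shape=(720, 1280, 3))
print(pixels[0, 0])  # the first RGB triple, e.g. [23 23 42]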

This can be added simply to existing parsing code: compute the size of the STC array as the product of all the dimensions, then call the code associated with a linear fixed-length STC as normal to parse the data into a linear array.

Advantages:

By embedding support for multi-dimensional arrays directly into UBJSON, we automatically standardize the protocol for multi-dimensional data across applications that use UBJSON.

Applications that require multi-dimensional data will not have to develop (possibly non-standard) application layers to support it, thus increasing compatibility over many different kinds of data formats.

For example, without this proposal, matlab's UBJSON implementation might choose to encode a multi-dimensional array in an application-specific way like this:

[{]
    [i][4]["Dims"]
        [[][$][I][#][i][3][3][1280][720]
    [i][4]["Data"]
        [[][$][U][#][l][2764800]
            [23][23][42][23][13][12][13][13][42]....
[}]

Whereas Numpy or an ImageMagick writing tool might choose to encode a multidimensional array in its OWN way:

[{]
    [i][8]["channels][I][3]
    [i][7]["columns"][I][1280]
    [i][4]["rows"][I][720]
    [i][4]["pixels"]
        [[][$][U][#][l][2764800]
            [23][23][42][23][13][12][13][13][42]....
[}]

It's clear that these applications' data types should be compatible by default. Although we cannot FORCE them to use this feature, if ND arrays were supported natively in UBJSON then applications would be more likely to use them, thus increasing the overall compatibility of many applications.

It also has an advantage in that it allows us to support VERY efficient conversions to/from JSON or other markup formats that use multi-dimensional data, WITHOUT changing the data structure or requiring application-level schemas like the ones defined above.

In @edgar-bonet's case, he wanted an automated tool to recognize [ [23.1,32.0],[24.1,54.0] ] and convert it properly to an efficient STC in contiguous memory with no extra markers, and back if converted to/from JSON. This use case was never resolved.

Disadvantages:

It requires yet another 'type' overload for [] which may rub some people the wrong way, when we already have code to handle '$' or '#'

It adds a small amount of code complexity.

Arguably the data structure should be an application-level decision.

There's no direct "N-D array" type in JSON, only arrays of arrays.

It's not at all clear what should be done in the case of an object. Up until now the headers for arrays and objects have been unified. They should obviously still retain shared code to parse the header, but that leaves an open question about what to do if the parser encounters [{][$][S][@][5]. What does a 5-dimensional key-value store even mean? More on this later.

Implementation notes:

The array type of the parser can still internally store linear memory, but have overloaded index operators that look up elements in that linear internal storage (like this, in C):

size_t ubjr_ndarray_index(const ubjr_array_t* arr, const size_t* indices)
{
    //multi-dimensional array to linear array lookup
    size_t cstride = 1;
    size_t cdex = 0;
    uint8_t i;
    uint8_t nd = arr->num_dims;
    const size_t* dims = arr->dims;
    for (i = 0; i<nd; i++)
    {
        cdex += cstride*indices[i];
        cstride *= dims[i];
    }
    return cdex;
}

Or like this in python

def __call__(self,*indices):
    cstride=1
    cdex=0
    for k,v in enumerate(self.dims):
        cdex+=cstride*indices[k]
        cstride*=v
    return self.data[cdex]

u=ubjson.Array()
u(13,42) #n-d array

In FACT, in python numpy already does this automatically, so if you want to use numpy, then the parser would just look like

ndims = ubjson.readByte()
dims = [ubjson.nextInt() for _ in range(ndims)]
np.asarray(ubjson.scanLinearArrayData(prod(dims)), dtype=int).reshape(tuple(reversed(dims)))  # dims are fastest-varying first

If the parser API is a streaming API, then just stream the linear data as normal.

What about Objects?

As I said, it is an open question what to do with objects.

There are three possibilities I see:

The first possibility is to interpret a 'multi-dimensional object' as an object that expects multiple keys for each value. For example, to store geographic data, one could have two (String) keys for longitude and latitude:

[{][$][U][@][2]
    [i][6]["32.423"][i][6]["231.42"][123]
    [i][6]["12.423"][i][6]["131.42"][16]

I strongly dislike this idea. It has no direct mapping to JSON and seems weird.

The second possibility is to have the @ marker basically be ignored. The dimensions are still parsed, but the product of the dimensions is interpreted as the size to use for a fixed-size linear object, just as if it had been specified with '#' (see the sketch after the three possibilities).

The third possibility is to make a @ marker invalid for objects.
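
To make the second possibility concrete, here is a minimal Python sketch of the header-parsing branch. The stream methods read_marker/read_uint8/read_int are hypothetical helpers, not part of any existing parser:

from math import prod

def read_container_count(stream):
    # '#' keeps its current meaning; under the second possibility an '@' on an
    # object simply collapses to prod(dims) key/value pairs, as if '#' was used.
    marker = stream.read_marker()              # hypothetical helper
    if marker == '#':
        return stream.read_int()               # one UBJSON integer
    if marker == '@':
        ndims = stream.read_uint8()            # always an 8-bit dimension count
        dims = [stream.read_int() for _ in range(ndims)]
        return prod(dims)                      # flatten to a linear count
    raise ValueError("expected '#' or '@' in container header")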

Can this be implemented?

Yes, there is an implementation of this proposal (for arrays only) in the current version of UBJ

kxepal commented 9 years ago

I'll port some features from #50 as questions here:

[["Draft-8", false], ["Draft-9", false], ["Draft-10", false], ["Draft-11", true], ["Draft-12", true], ["Release", false]]

Is it possible to encode it using an ND UBJSON array?

kxepal commented 9 years ago

And a more interesting question: what about an array of objects?

Miosss commented 9 years ago

Instead of a '#' marker, a '@' marker may be found in the stream for a container header. If a '@' marker is found, the next byte is an 8-bit integer describing the number of dimensions that container has.

So basically @Steve132 proposes bringing the binary equivalent of in-memory C arrays to UBJSON.

Example: A 720p bitmap image of 8-bit pixels

AFAIK, STCs (which you defend) in Draft 12 enable exactly the same thing. Just flatten your 3D image bytes into a stream of bytes and encode it as a $# array of uint8. While the status of STC hasn't changed yet, you can even cast your entire array (even a 3D one) to a uint8 array without any conversions.

Goals?

I think we are now discussing many options that interfere with each other. Maybe I don't see it precisely, so could anyone explain to me, or shall we think together about what we are trying to do: we now have quite a decent version that allows some optimizations, yet is still equivalent to JSON.

What are we trying to do next?

So. UBJSON is universal binary JSON, so compatibility with JSON is goal number 1, as stated on the site. Therefore the main focus should be on optimizing common usages of JSON (as there aren't such widely spread usages of UBJSON yet). What is JSON mostly used for? Strings. They do not optimize well (at all).

If we want to optimize UBJSON further, we should focus on trying to optimize strings and objects. Objects are important in this, because the common construct is an array of objects where all objects are of the same type: they have identical keys and the same value types. This is what @kxepal called "Records".

Now to the original issue. Optimizing mostly arrays isn't the way to go. This is not personal, but I am quite interested in why such large scientific data should use UBJSON instead of some binary dump or other standard. Yet I think this is not the case where JSON is most popular, and because of that, optimizations limited only to STArrays, even if multidimensional, aren't crucial.

I can agree that repeatable headers may not give many benefits yet still complicate many things. I stepped away from the discussion to think about it more clearly. I cannot agree, though, that those schemas, or typespec, are a completely bad idea. I wrote about my misunderstanding and misuse of this concept, but if there is any way to go now, then it should probably be that (Records optimizations, and maybe true STC along with it; how is still open).

Sorry for the philosophy part (again), but if @thebuzzmedia could explain to me why and where my understanding of the topic is wrong, I would be grateful.

Miosss commented 9 years ago

By the way, looking at popular APIs that use JSON (REST etc.), you will see that arrays of objects (Records) are common.

Steve132 commented 9 years ago

Firstly, I'd like to say that the right way to understand this proposal is to notice that it really amounts to four small things:

Are strings allowed for dimension elements?

No. Why would they be? A string is not a length. Strings are not allowed as lengths of fixed-size arrays after # markers, so strings are not allowed as lengths of dimensions after @ markers.

Strings ARE allowed as array elements, of course, just like they are for a # array

How to specify booleans? Say I have the following structure:

This is a 2-D array containing both strings and booleans. Therefore the array is MIXED TYPE.
Therefore, it loses some (but not all) of the performance benefits.

Remember that this proposal just adds metadata to the way we would encode an equivalent linear array.

Imagine the following 1-D array

    ["Draft-8", false, "Draft-9", false, "Draft-10", false, "Draft-11", true, "Draft-12", true, "Release", false]

you'd encode it in the current UBJSON like

[[][#][i][12]   //Mixed-type fixed-length 1-D array (because the elements mix types)
    [S][i][7]["Draft-8"]  //string
    [F]                        //boolean
    [S][i][7]["Draft-9"]
    [F]
    [S][i][8]["Draft-10"]
    [F]
    [S][i][8]["Draft-11"]
    [T]
    [S][i][8]["Draft-12"]
    [T]
    [S][i][7]["Release"]
    [F]

Under this proposal, if you had the following 2-D array,

[["Draft-8", false], ["Draft-9", false], ["Draft-10", false], ["Draft-11", true], ["Draft-12", true], ["Release", false]]

You would encode it as

[[][@][2][i][2][i][6]   //Mixed-type fixed-length 2-D array (because the elements mix types)
    [S][i][7]["Draft-8"]  //string
    [F]                        //boolean
    [S][i][7]["Draft-9"]
    [F]
    [S][i][8]["Draft-10"]
    [F]
    [S][i][8]["Draft-11"]
    [T]
    [S][i][8]["Draft-12"]
    [T]
    [S][i][7]["Release"]
    [F]

Simple. It's the same as a 1-D array but with different metadata that allows standardized interpretation as a 2-D array.
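
For what it's worth, the round trip back to the JSON array-of-arrays is just a regrouping of the flat element list by the dimensions. A small illustrative sketch (not taken from any implementation):

def unflatten(flat, dims):
    # dims[0] varies fastest, per the proposal, so regroup one dimension at a time
    for size in dims[:-1]:
        flat = [flat[i:i + size] for i in range(0, len(flat), size)]
    return flat

flat = ["Draft-8", False, "Draft-9", False, "Draft-10", False,
        "Draft-11", True, "Draft-12", True, "Release", False]
print(unflatten(flat, [2, 6]))
# [['Draft-8', False], ['Draft-9', False], ..., ['Release', False]]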

Say I have a 100x100x100x100 numpy array and I want to encode it using this optimization. How will I ever know which UBJSON type to use in the header without iterating over each dimension?

Respectfully, I think this question is somewhat misguided or misphrased or something, because it makes no sense to me.

If you have a 100x100x100x100 numpy array of float32s, then you encode it as

[[][$][d][@][4][i][100][i][100][i][100][i][100]  //fixed-size 100x100x100x100 array of type 'd'
     [32.5f][122.4f][0.1f] ... //100 M more 4-byte float values continue, just like a linear array

It's simple.

If you are trying to figure out the 'minimum size integer type' that can be used for your data, then yes, you'd have to do it the same way you'd do it in the linear case (remember, this is just metadata for a linear array): iterate through the array and find the global maximum of all your data. I can tell you that almost nobody actually does this in practice, but it's no worse than the linear 1-D case if you did want to do it.
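
If you did want that scan, a rough sketch of what it looks like (the marker letters are the usual UBJSON integer types; treat this as illustrative only):

def smallest_int_marker(values):
    # Same scan you would do for a 1-D '$#' array; the '@' metadata changes nothing.
    lo, hi = min(values), max(values)
    if 0 <= lo and hi <= 255:
        return 'U'          # uint8
    if -128 <= lo and hi <= 127:
        return 'i'          # int8
    if -2**15 <= lo and hi <= 2**15 - 1:
        return 'I'          # int16
    if -2**31 <= lo and hi <= 2**31 - 1:
        return 'l'          # int32
    return 'L'              # int64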

And more interesting question is what about array of objects?

A linear array of objects is allowed according to the current spec.

[[][$][{][#][i][9]
    [{]
        ...
    [}]
    [{]...[}]
    ...//7 more objects

So an ND-array of objects is ALSO allowed (remember, the only change is a metadata change)

[[][$][{][@][2][i][3][i][3]
    [{]
        ...
    [}]
    [{]...[}]
    ...//7 more objects

Simple.

On to @Miosss questions:

AFAIK, STCs (which you defend) in Draft 12 enable exactly the same thing. Just flatten your 3D image bytes into a stream of bytes and encode it as a $# array of uint8. While the status of STC hasn't changed yet, you can even cast your entire array (even a 3D one) to a uint8 array without any conversions.

Yes, I mentioned this with my two hypothetical examples from matlab or numpy. Obviously flattening arrays is possible at the application level already.

The advantage here is a standard way to encode this metadata so that applications have a higher chance of being cross-compatible, and so that tools can encode/decode N-D arrays to/from other formats correctly without being informed of the application-level schema.

If we want to optimize UBJSON further, we should focus on trying to optimize strings and objects. Objects are important in this, because the common construct is an array of objects where all objects are of the same type: they have identical keys and the same value types. This is what @kxepal called "Records".

Now to the original issue. Optimizing mostly arrays isn't the way to go. This is not personal, but I am quite interested in why such large scientific data should use UBJSON instead of some binary dump or other standard. Yet I think this is not the case where JSON is most popular, and because of that, optimizations limited only to STArrays, even if multidimensional, aren't crucial.

In my opinion, this is exactly my point. Web-based services that are currently using text-based JSON will continue to use text-based JSON, or store their stuff in databases, because UBJSON simply CAN'T optimize dynamic-length objects and strings much better than JSON. Strings and objects aren't that compressible in a binary format, so why would I start to use UBJSON when I run a web service? It's not supported by any of my tools or APIs and provides little benefit if I have mostly strings, objects and dynamic types.

What UBJSON CAN optimize better is BINARY and NUMBERS. 122.03f is 7 bytes in JSON, 4 bytes in UBJSON. 100423312 is 9 bytes in JSON, 4 bytes in UBJSON. Where we are going to get big gains from having a binary format is in datasets that have lots and lots of numbers and other binary data and not much else.

Who generates these datasets? Who is our target market? I think you know the answer ;)

Yes JSON has mostly objects and strings. This is because JSON is comparatively good at objects and strings. Pretty much as good or better than UBJSON will ever be, because a string in binary looks pretty much the same as a string in text.

large scientific data should use UBJSON instead of some binary dump or other standard.

Because there IS no standard, because nothing is good at handling binary data and lots of numbers in a simple and fast way. Data interchange for bigdata is nearly impossible because people use their own made-up-whatever standards, which is terrible.

Big-Data uses non-standard binary dump formats because nothing like UBJSON exists for them and they can't use JSON because it's BAD at what they need.

By the way, looking at popular APIs that use JSON (REST etc.), you will see that arrays of objects (Records) are common.

Yep, and we currently handle that case as well as JSON and this proposal does nothing to change it.

I can agree that repeatable headers may not give many benefits yet still complicate many things. I stepped away from the discussion to think about it more clearly. I cannot agree, though, that those schemas, or typespec, are a completely bad idea. I wrote about my misunderstanding and misuse of this concept, but if there is any way to go now, then it should probably be that (Records optimizations, and maybe true STC along with it; how is still open).

Let's keep discussions of those proposals (which are orthogonal to this one) to their own issue #'s :)

kxepal commented 9 years ago

@Miosss

AFAIK, STCs (which you defend) in Draft 12 enable exactly the same thing. Just flatten your 3D image bytes into a stream of bytes and encode it as a $# array of uint8. While the status of STC hasn't changed yet, you can even cast your entire array (even a 3D one) to a uint8 array without any conversions.

Good point. Also, an image isn't a good example: you will want to store it as a binary string of int8 and work with it not at the binary level, but via some library, which will accept a path to a file, a file object, or a binary string.

What is JSON mostly used for? Strings. They do not optimize well (at all).

Well, strings in UBJSON are optimized well enough if you work with Unicode text: JSON requires any character outside the spec to be encoded in \uxxxx notation, which means 6 bytes per character, while UBJSON uses only 2-4 bytes because of UTF-8. Moreover, with STCs, binary strings are not a problem any more, while they have always been a pain for JSON. I think we solved that issue not perfectly, but well enough to move on.
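
For example, with a JSON encoder that escapes non-ASCII characters (Python's json module does so by default), the difference is easy to see; a quick illustrative check:

import json

s = "й"                              # one Cyrillic character
print(json.dumps(s))                 # "\u0439" -> the escape alone is 6 bytes
print(len(s.encode("utf-8")))        # 2 bytes as raw UTF-8, which is what UBJSON stores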

Arrays of similar objects are another problem to focus on, as well as arrays of JSON numbers. Pretty real cases which UBJSON could win.

kxepal commented 9 years ago

@Steve132

Simple. It's the same as a 1-D array but with different metadata that allows standardized interpretation as a 2-D array.

Comparing the 1-D and 2-D array definitions for mixed content, what's the point of this proposal then? All I see in your examples is that you are trying to save a few bytes for a very specific and rare case without solving the real problem of having nested containers with a similar data schema. What I'd like to see is the ability to specify element types for each dimension.

Say I have a 100x100x100x100 numpy array and I want to encode it using this optimization. How will I ever know which UBJSON type to use in the header without iterating over each dimension?

Respectfully, I think this question is somewhat misguided or misphrased or something, because it makes no sense to me.

If you have a 100x100x100x100 numpy array of float32s, then you encode it as...

Sorry, but in Python there is no float32, just as there is no int8/int16/intX. There are just int and float, as in many dynamic languages. This optimization makes no sense for them, since they don't know how to use it right because there is no 1:1 match between UBJSON types and language types.

meisme commented 9 years ago

I think @Steve132 has a very good point when he says that there's no point in optimizing specifically for the JSON use case, because JSON works perfectly well for those applications. I don't see them moving over. Rather, it's the binary part that must be leveraged.

@kxepal that part was about arrays in numpy, and numpy does support float32.

Steve132 commented 9 years ago

What I'd like to see is the ability to specify element types for each dimension.

This makes no sense. A multi-dimensional array does not have different types for different dimensions. I'm sorry, but please explain this better.

If I have a table of 256 x 256 floats, then there is ONE data type. FLOAT. There are TWO dimensions.

Consider Java

Type[][] typevar = new Type[100][100];

That declares a multi-dimensional array of 100x100 "Type" objects. In what way is there a different type along each dimension?

In Python there is no float32, just as there is no int8/int16/intX. There are just int and float, as in many dynamic languages.

Yes there is. In numpy they have float32 vs float64 and int8, int16, int64, etc. as the types of the underlying arrays, for performance. In vanilla Python the default "int()" type is 32-bit and can be promoted to a bignum transparently, and the float() type is ALWAYS a float64 (https://docs.python.org/2/library/stdtypes.html#typesnumeric)

This optimization makes no sense for them since they don't know how to use it right because of non 1:1 UBJSON to language type match.

It makes perfect sense. If you are using numpy it's natural and basically already supported. If you are using vanilla Python then floats are float64 and ints are int32.

Besides, your criticism about there not being a 1:1 type match applies to ALL of UBJSON. UBJSON has int8, int16, int32, int64, float32 etc. ALREADY; this proposal doesn't add them. If that's somehow a problem for Python then it would be a problem regardless of this proposal. In fact, this proposal doesn't change the underlying types UBJSON supports at all. UBJSON ALSO supports 1-D arrays, which I assume you don't have a problem with either, and this proposal doesn't change any aspect of that.

It's not right to think of this as an optimization. It's not. Like I said, this proposal just adds metadata to the 1-D fixed-size array that we already have. Any parsing or data optimization concerns you have with this proposal would ALSO apply to 1-D arrays equally.

All I see in your examples is that you are trying to save a few bytes for a very specific and rare case without solving

It doesn't save any bytes vs the 1-D case, and N-D data is absolutely not rare. I'd argue that N-D binary data is probably significantly more common than 1-D binary data 'in the wild'

The goal here isn't datastream efficiency. The goal here is cross-compatibility. If most applications that use UBJSON use it for large streams of binary data (which is basically the use case that UBJSON is designed for), and most large streams of binary data are linearized N-D arrays with N-D metadata at the application level (which they are), then there's a compatibility advantage to allowing that metadata to be standardized.

without solving the real problem of having nested containers with a similar data schema.

I agree. This proposal is not designed to address a 'schema specification for nested containers' It has nothing to do with that. It's completely orthogonal.

Steve132 commented 9 years ago

@kxepal

I think I can answer your 'encoding' question easily by asking YOU an identical question.

Suppose I have a 1-D array of 100000000 floats. How would I write it using an $# array using the current spec?

Whatever your answer is to that question, the answer to your question of how you would write it under this proposal is the same.

kxepal commented 9 years ago

@meisme

@kxepal that part was about arrays in numpy, and numpy does support float32.

As well, Python has typed arrays, which are rarely used. What I was pointing out is that it doesn't make sense for built-in types. NumPy itself is a third-party scientific tool. If we're going to make UBJSON science-friendly, I wonder why we ignore RDF in the first place, and other more common structures.

@Steve132

What I'd like to see is the ability to specify element types for each dimension. This makes no sense. A multi-dimensional array does not have different types for different dimensions. I'm sorry, but please explain this better.

If I have a table of 256 x 256 floats, then there is ONE data type. FLOAT. There are TWO dimensions.

Which float? float32 or float64? A single type everywhere or not? What I'd like to see is what I proposed in #50:

[[][#][i][5][[][S][F][]]
[i][7][draft-7][10.1]
[i][7][draft-8][16.32]
[i][7][draft-9][11.12]
[i][8][draft-10][14.32]
[i][8][draft-11][20.12]
[]]

A 2D STC array which could apply optimization for arbitrary mixed types in nested arrays. Here we have a profit from defining the dimensions and the element types in each. Again, what's the profit from just specifying the number of dimensions for arrays?

In Python there is no float32, just as there is no int8/int16/intX. There are just int and float, as in many dynamic languages.

Yes there is. In numpy they have float32 vs float64 and int8, int16, int64, etc. as the types of the underlying arrays, for performance. In vanilla Python the default "int()" type is 32-bit and can be promoted to a bignum transparently, and the float() type is ALWAYS a float64 (https://docs.python.org/2/library/stdtypes.html#typesnumeric)

numpy isn't a part of the standard library. Do you suggest making it a required dependency? As for int / float sizes: do you suggest that for Python I should always use int32 / float64 even for numbers in the int8 range? I wonder what your recommendation will be for JavaScript, where there are no integers or floats, just numbers. Because of this, a library for such languages should just check which UBJSON number type the value fits and only after that encode it properly.

This optimization makes no sense for them, since they don't know how to use it right because there is no 1:1 match between UBJSON types and language types.

It makes perfect sense. If you are using numpy it's natural and basically already supported. If you are using vanilla Python then floats are float64 and ints are int32.

See the questions above. Do you suggest that Python encode, say, the number 8 as int32? Sorry, nobody will use UBJSON in this case, because in JSON it'll be the same 1 byte, not 4.

Besides, your criticism about there not being a 1:1 type match applies to ALL of UBJSON. UBJSON has int8, int16, int32, int64, float32 etc. ALREADY; this proposal doesn't add them. If that's somehow a problem for Python then it would be a problem regardless of this proposal. In fact, this proposal doesn't change the underlying types UBJSON supports at all. UBJSON ALSO supports 1-D arrays, which I assume you don't have a problem with either, and this proposal doesn't change any aspect of that.

See above.

It doesn't save any bytes vs the 1-D case, and N-D data is absolutely not rare. I'd argue that N-D binary data is probably significantly more common than 1-D binary data 'in the wild'

Could you provide us any examples?

The goal here isn't datastream efficiency. The goal here is cross-compatibility. If most applications that use UBJSON use it for large streams of binary data (which is basically the use case that UBJSON is designed for), and most large streams of binary data are linearized N-D arrays with N-D metadata at the application level (which they are), then there's a compatibility advantage to allowing that metadata to be standardized.

Cross-compatibility with what? Could you expand this with examples?

Suppose I have a 1-D array of 100000000 floats. How would I write it using an $# array using the current spec?

With the current spec there's no way, except to use the type with the highest capacity and hope that all the floats will fit in float64. It would be a pity if they don't. That's why I proposed the multiheader feature, to optimize numbers of different sizes within a single array. I remind you again that in JSON there are no integers or floats, there are just numbers.

Steve132 commented 9 years ago

numpy isn't a part of the standard library. Do you suggest making it a required dependency?

Your original question was "Say I have a 100x100x100x100 numpy array and I want to encode it using this optimization. How will I ever know which UBJSON type to use in the header without iterating over each dimension?"

The question specified numpy, so my answer included numpy. It also included not-numpy. No it doesn't have to be a required dependency. That's up to the implementation.

Which float? float32 or float64? A single type everywhere or not? See the questions above. Do you suggest that Python encode, say, the number 8 as int32? Sorry, nobody will use UBJSON in this case, because in JSON it'll be the same 1 byte, not 4.

ALL of these questions are red herrings. The answer is "do what you do for a 1-D array". This proposal is the SAME as the current $# except with a different kind of metadata. All of these questions about types and encodings and whether or not it's possible to implement in a specific language can be resolved by simply examining the 1-D case for $#.

I'll say that more simply. The answer to 'What do I do about X data?' is ALWAYS "do what you do for the $# case".

With the current spec there's no way, except to use the type with the highest capacity and hope that all the floats will fit in float64. It would be a pity if they don't.

Ok so do that here. Question answered. Again, the answer of "How do I do X data with this proposal?" is always "How do you do X data with $#"

I remind you again that in JSON there are no integers or floats, there are just numbers.

But in UBJSON there ARE static types...so we already have to deal with that. Questions about dealing with different datatypes have nothing to do with this proposal.

Could you provide us any examples?

Images, audio, matrices, 3D models, sensor data, position data, accelerometer data, Netflix ratings, similarity scores, heightmaps, trajectories, AI neural-network parameters, depth data from a Kinect... and many, many others.

Cross-compatibility with what? Could you expand this with examples?

Currently, we support linear 1-D data. Applications working with N-D data will store it as 1-D and design their own metadata to describe how to unlinearize that 1-D data. I already gave examples in the original proposal of how they might do this.

The benefit of including standard metadata for N-D data is that applications that use it will be interoperable without needing to be aware of the application-level linearization parameters. For example, numpy and matlab both currently have their own proprietary optimized binary formats to store 2-D arrays of floats. These formats aren't compatible with each other. However, numpy and matlab could both have an 'export to ubj' option that would provide a common interchange format for datasets.

If a 3D point cloud program knew to expect a ubj file containing at least one 3xn floating-point array, you could write a script in matlab or Java that would populate a 3xn array, export it to ubj, and import it into the point cloud program. Boom. Interoperability as good as (or better than!) JSON, but with binary data for big datasets.

Without this proposal, each toolkit would likely define its own 'schema' for N-D metadata around a linear stream, and those schemas are not likely to be compatible.

kxepal commented 9 years ago

Images, audio, matrices, 3D models, sensor data, position data, accelerometer data, Netflix ratings, similarity scores, heightmaps, trajectories, AI neural-network parameters, depth data from a Kinect... and many, many others.

Sorry, but nobody works with images, audio and 3D models at the binary level as with ND arrays, because there are special libraries which know how to handle such data properly. And they commonly need to receive a 1D stream of binary data to go ahead. Sensor data is very different: it may match an ND array, it may not. Coordinates are fixed pairs/triples of floats. Ratings are table data, which actually returns us to the array-of-objects issue. So again, are there any real examples for that feature?

Without this proposal, each toolkit would likely define its own 'schema' for N-D metadata around a linear stream, and those schemas are not likely to be compatible.

Without this proposal each toolkit will explicitly use an array-of-arrays structure without reinventing the wheel. A completely interoperable, simple and clean solution. So what are ND-arrays trying to solve? I already noted that they don't try to optimize anything, so there must be some other purpose.

meisme commented 9 years ago

I have seen all of those, images, audio tracks and 3D models, represented as ND arrays in programs and scripts I've worked with. Particularly Matlab is fond of representing everything as an ND array. And there are many other forms of data you work with that don't have a standardized file format and that are being kept in ND arrays. You can't really be saying that such data doesn't exist or isn't common.

The array-of-arrays structure has the disadvantage that for many applications it will have to be copied to different memory (or a raw file) to be contiguous and allow easy access. I do agree with @Steve132 that such applications are likely to flatten the data to be able to access it contiguously.
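
A small NumPy sketch of that difference (illustrative only): with array-of-arrays the decoder hands back nested lists, so getting a contiguous block costs a copy; with the '@' form the payload already is that flat buffer and can be wrapped without copying:

import numpy as np

rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]      # decoded array-of-arrays
contiguous = np.array(rows, dtype=np.uint8)   # extra copy into one flat buffer

payload = bytes(range(1, 10))                 # stands in for the 9 payload bytes
view = np.frombuffer(payload, dtype=np.uint8).reshape(3, 3)   # zero-copy view
assert (contiguous == view).all()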

@kxepal What would you say is the goal of UBJSON? What is the purpose of the format?

kxepal commented 9 years ago

@meisme the goal of UBJSON is to provide a binary format which is in a transitive relationship with JSON and compatible with it, while solving its obvious design flaws for handling real-world data.

What you are trying to do now is to solve the Matlab / Numpy / ImageMagick issue of storing the same ND-array data in different proprietary formats, which shares nothing with JSON.

kxepal commented 9 years ago

I have seen all of those, images, audio tracks and 3D models, represented as ND arrays in programs and scripts I've worked with. Particularly Matlab is fond of representing everything as an ND array. And there are many other forms of data you work with that don't have a standardized file format and that are being kept in ND arrays.

They may be stored however they want to be stored at the binary level. I don't care about that. What I do care about is that the image or audio library is able to correctly load that format and work with it. Why it's important for UBJSON to handle this issue I don't understand. Are we going to push UBJSON into Matlab? Do the Matlab people agree with us?

meisme commented 9 years ago

I am not trying to design a new format for any language in particular.

What's happening here is that I'm reading you as saying "ND arrays don't occur in the wild, and so they're not worth considering". I'm arguing that they absolutely do occur, and that we should have the discussion.

In fact I haven't even decided if I support this proposal's way of solving the problem yet.

Steve132 commented 9 years ago

Sorry, but nobody works with images, audio and 3D models at the binary level as with ND arrays,

Yes, yes they absolutely do. ALL of them do. I can go find you source code if you want.

because there are special libraries which know how to handle such data properly.

How do you suppose those libraries work?

And they commonly need to receive a 1D stream of binary data to go ahead.

Which they immediately reinterpret as an ND array.

Sensor data is very different: it may match an ND array, it may not.

All my sensor data does. All sensor data I've ever seen does.

Coordinates are fixed pairs/triples of floats.

[[][$][d][@][2][i][3][I][10000] ...//10000 fixed triples of floats

Without this proposal each toolkit will explicitly use an array-of-arrays structure without reinventing the wheel.

No, they won't, because an array of arrays is not well supported in UBJSON, it's not contiguous memory, it's not random access. It's useless to them.

What they WILL do is linearize their data to a 1-D data stream, and store it along with some metadata to describe how to unlinearize it.

So what are ND-arrays trying to solve? I already noted that they don't try to optimize anything, so there must be some other purpose.

It's trying to make it so that when applications linearize their data it remains standardized and that tools can convert those arrays to/from json/xml/other and still retain the ND-array structure.

Are we going to push UBJSON into Matlab?

http://ubjson.org/libraries/

obvious design flaws for handling real-world data.

The only way JSON is 'flawed' is in the way it handles number-heavy arrays of data. For strings and objects JSON is about as good as you can reasonably get. A binary format's strength is the way it handles number-heavy arrays of contiguous data efficiently. UBJSON in my opinion steps in to fill in that gap and fill in that use case for people who have large number-heavy streams.

obvious design flaws for handling real-world data.

If real-world use cases for large number-heavy streams include mostly N-D array data that has been linearized down by the application and stored with some metadata, then having a standard way to express that metadata is worth discussing, to increase interoperability and common client use cases.

Steve132 commented 9 years ago

Check it out, there's even already people doing exactly what I'm saying they will do:

https://github.com/fangq/jsonlab/blob/master/saveubjson.m#L246

As far as I can tell, mat2ubjson(randn(50,50)) outputs something like

[{]
    [i][11]["_ArrayType_"][S][i][7]["float64"]
    [i][11]["_ArraySize_"]
        [[][$][i][#][i][2]
            [i][50]
            [i][50]
    [i][11]["_ArrayData_]
        [[][$][D][#][U][2500]
            [1.0][0.5]...//2498 more float64 random numbers
[}]

This is what the matlab2ubjson implementer CURRENTLY does (disclaimer, I didn't write that code and have never seen that code before 10 minutes ago..it was just obvious that's what they would do so I knew it would be there)

Now, this schema for ND arrays is absolutely not standard and could not be read as an ND array by anything else. In contrast, under this proposal the matlab2ubjson implementer could just output this instead:

[[][$][D][@][2][i][50][i][50]
    [1.0][0.5]...//2498 more float64 random numbers

Which could be read by any UBJSON implementation.
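
To illustrate the difference, here is a hedged NumPy sketch: the key names follow the reconstruction above and may not match jsonlab exactly, and the dims-fastest-first reshape is an assumption based on the proposal's ordering rule. A reader of the wrapper has to special-case the magic keys, whereas the '@' form decodes generically:

import numpy as np

def decode_wrapper(obj):
    # application-specific schema: the reader must know these particular keys
    dtype = np.dtype(obj["_ArrayType_"])
    shape = tuple(obj["_ArraySize_"])
    return np.asarray(obj["_ArrayData_"], dtype=dtype).reshape(shape)

def decode_at_header(dims, flat, dtype):
    # '@' proposal: dims come from the header, fastest-varying first
    return np.asarray(flat, dtype=dtype).reshape(tuple(reversed(dims)))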

kxepal commented 9 years ago

And they commonly need to receive a 1D stream of binary data to go ahead.

Which they immediately reinterpret as an ND array.

And you believe that if you pass them an ND array instead of a 1D binary stream they'll work? Do you have any API examples of such?

Without this proposal each toolkit will explicitly use an array-of-arrays structure without reinventing the wheel.

No, they won't, because an array of arrays is not well supported in UBJSON, it's not contiguous memory, it's not random access. It's useless to them.

What they WILL do is linearize their data to a 1-D data stream, and store it along with some metadata to describe how to unlinearize it.

What's wrong with an array of arrays? It does have random access (arr[5][10]), it can have contiguous memory (as you allocated it). And they couldn't be useless for them, as they represent the same data structure as your ND-arrays do. Or would you like to say that an array of arrays of floats isn't equivalent, as data, to a 2D array of floats?

So what are ND-arrays trying to solve? I already noted that they don't try to optimize anything, so there must be some other purpose.

It's trying to make it so that when applications linearize their data it remains standardized and that tools can convert those arrays to/from json/xml/other and still retain the ND-array structure.

Again, what does this solve? The data remains linear in UBJSON, but for the application it's still an ND-array and it's not linear. So you didn't change anything but the UBJSON format. So why?

Are we going to push UBJSON into Matlab?

http://ubjson.org/libraries/

I meant UBJSON becoming a part of Matlab. So are we just trying to solve some specific case which only makes sense for the Matlab library?

obvious design flaws for handling real-world data.

The only way JSON is 'flawed' is in the way it handles number-heavy arrays of data. For strings and objects JSON is about as good as you can reasonably get. A binary format's strength is the way it handles number-heavy arrays of contiguous data efficiently. UBJSON in my opinion steps in to fill in that gap and fill in that use case for people who have large number-heavy streams.

JSON is flawed in many ways: binary data, Unicode text, arrays of repeated data, lack of streaming, an awkward number definition. Hopefully UBJSON has already solved most of these problems.

obvious design flaws for handling real-world data.

If real-world use cases for large number-heavy streams include mostly N-D array data that has been linearized down by the application and stored with some metadata, then having a standard way to express that metadata is worth discussing, to increase interoperability and common client use cases.

You still didn't provide any real case for this. It's really cool that you can turn a pair of coordinates into a 2D array. If I take Erlang, there I'll say thank you for providing a nicer way to encode/decode proplists. But these cases are too small or too language-specific to make real sense, especially from the point of view of JSON data.

UPDATE: however, on second thought, I'm not sure it's worth using a 2D array for proplists rather than just a streaming object.

Steve132 commented 9 years ago

And you believe that if you'll pass them ND array instead of 1D binary stream they'll work? You have any API examples of such?

Yes.

glTexImage* family of functions, and the equivalent in DX

All of matlab functions. (in particular, fread in matlab will do this)

All of numpy functions (in particular, np.array.fromfunction will do this if you provide a generator)

FFTW

https://github.com/nothings/stb/blob/master/stb_image_write.h

http://en.wikipedia.org/wiki/MeshLab

OpenCL

kxepal commented 9 years ago

https://github.com/fangq/jsonlab/blob/master/saveubjson.m#L246 As far as I can tell, mat2ubjson(randn(50,50)) outputs something like ...

Oh, wait. That's not a UBJSON issue at all. In the same way I could turn any STC array into an object like this in any language, just because that's how I wrote the encoder. I'll check the code for more info, thanks for the pointer.

Steve132 commented 9 years ago

In the same way I could turn any STC array into an object like this in any language, just because that's how I wrote the encoder.

Yes, you could. However, you usually would not because there's no need to. That's my point. If you want to represent a 1-D array, we have a standardized way of doing that that would make writing all the custom metadata pointless. No encoder would do that.

However, as you can see, because we don't have a standard way of representing an N-D array, the implementer of the encoder had to invent his OWN way of encoding the metadata wrapping around the 1-D linearized version of the data. Doing this created a non-standard wrapper that other programs would choke on (even if they accepted N-D arrays).

Furthermore, this non-standard wrapper required more space to encode.

kxepal commented 9 years ago

However, as you can see, because we don't have a standard way of representing an N-D array, the implementer of the encoder had to invent his OWN way of encoding the metadata wrapping around the 1-D linearized version of the data. Doing this created a non-standard wrapper that other programs would choke on (even if they accepted N-D arrays).

I strongly believe that the motivation was something different, or this man likes overengineering, or something else. I need to check the reasons for such a solution.

P.S. It's getting late for me to continue our conversation and keep focused on all the details. Thanks a lot for the brainstorming and the various references, I'll check them all on the weekend. Hope I was not very annoying today (:

Steve132 commented 9 years ago

I strongly believe that the motivation was something different, or this man likes overengineering, or something else. I need to check the reasons for such a solution.

The reason for the solution is that if you want to serialize an N-D array into a linear format, you have to keep the dimensions around in the stream otherwise you can't reconstruct it. You basically HAVE to do it this way :). It's how I would do it at least.

Hope I was not very annoying today (:

Same! I'm not trying to come across as belligerent or aggressive, just passionate and I have done a lot of thinking on this already :)

meisme commented 9 years ago

What's wrong with an array of arrays? It does have random access (arr[5][10]), it can have contiguous memory (as you allocated it). And they couldn't be useless for them, as they represent the same data structure as your ND-arrays do. Or would you like to say that an array of arrays of floats isn't equivalent, as data, to a 2D array of floats?

An array of arrays would look like this

[[]
    [[][$][U][#][i][3][1][2][3]
    [[][$][U][#][i][3][4][5][6]
    [[][$][U][#][i][3][7][8][9]
[]]

If you want to feed the data from the ND array to some other code (expecting either an ND array or a 1D stream), you'll have to rewrite it to another memory location so you can get it contiguous without the [[][$][#] (or [[]...[]]) markers. With the proposal, you would have

[[][$][U][@][2][U][3][U][3]
    [1][2][3]
    [4][5][6]
    [7][8][9]

If you have the UBJSON in memory (or a memory-mapped file), you can simply pass a pointer to the 10th byte.
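
In NumPy terms the same trick is a zero-copy view at that offset. A small sketch using the 3x3 example above, with the header bytes written out by hand under the layout shown (assumption: the message sits in one bytes object):

import numpy as np

# 9 header bytes ([[][$][U][@][2][U][3][U][3]) followed by the 9 payload bytes
msg = b"[$U@\x02U\x03U\x03" + bytes([1, 2, 3, 4, 5, 6, 7, 8, 9])

grid = np.frombuffer(msg, dtype=np.uint8, offset=9).reshape(3, 3)  # starts at the 10th byte
print(grid[1, 2])  # 6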

I absolutely support there being a way of expressing an N-dimensional array with contiguous data in UBJSON. I'm just wondering if this is the way to do it, or if it would make more sense to use a typeschema definition like we've talked about elsewhere, so you can express an array with mixed types in contiguous memory as well.

Miosss commented 9 years ago

It requires yet another 'type' overload for [] which may rub some people the wrong way, when we already have code to handle '$' or '#'

I think that all the container-optimizing proposals are more of an 'optimization option' than a semantic overload of the container type.

The STC header does not change the meaning of the array itself, it just gives a pre-parsing hint about what is actually stored in this particular array.

If the ND proposal is accepted, then I would see it as an option, but still, regarding arrays of arrays:

Question: Is your ND optimized array like:

[[]
    [$][i][@][2][i][3][i][3]
    [11]
    [12]
    [13]
    [21]
    [22]
    [23]
    [31]
    [32]
    [33]
[]]

is semantically equivalent to:

[[]
    [[][$][i][#][i][3]
        [11]
        [12]
        [13]
    []]
    [[][$][i][#][i][3]
        [21]
        [22]
        [23]
    []]
    [[][$][i][#][i][3]
        [31]
        [32]
        [33]
    []]
[]]

? I mean, do you see the ND array as a 'metadata hint', or shall the parser indeed produce an array of identical arrays (same number of elements and element types)?

Or is it identical to:

[[][$][i][#][i][9]
    [11]
    [12]
    [13]
    [21]
    [22]
    [23]
    [31]
    [32]
    [33]
[]]

so to the flat 1D array? I mean, in terms of the format the parser should produce.

kxepal commented 9 years ago

If you have the UBJSON in memory (or a memory-mapped file), you can simply pass a pointer to the 10th byte.

What's the sense of passing a pointer to some arbitrary byte number? Working with ND-arrays assumes passing matrix coordinates like arr[1][2][3], not a single byte number like arr[6]. If you work with an ND-array as with a 1D-array then you probably picked the wrong data type from the start. Why should UBJSON try to fix such mistakes?

I'm just wondering if this is the way to do it, or if it would make more sense to use a typeschema definition like we've talked about elsewhere, so you can express an array with mixed types in contiguous memory as well.

I see the problem that this proposal closes the door on a typeschema definition for mixed-type containers: there is no way to add that without changing the ND-array header, so it'll again break compatibility. From my position, the solution shouldn't be too specific, but general by its own nature, to handle the other problems which UBJSON / JSON face today.

Miosss commented 9 years ago

@kxepal Do I correctly understand your conception here:

the 3-byte-per-pixel image of 1280x720 could also be described as (this is NOT block notation, too many braces ;) ):

[#1280[#720[#3[i]]] <- typespec 
1 <- first byte
2 <- second byte
3 <- third byte -> first RGB triple ended
4
...
]

or in block notation:

[[]
    [#][I][1280] <- 1280 elements of type:
        [[][#][I][720] <- array of 720 elements of type:
            [[][#][i][3] <- array of 3 elements of type:
                [U] <- byte
            []]

        []] <- typespec ends

[1]
[2]
[3]
[4]
...
[]]

I think I modified your proposal a bit, because I specified [#] multiple times to ensure a constant size for each dimension. So it basically became:

array with 1280 elements of type: array with 720 elements of type: array of 3 integers

I think that something like that produces the same description as @Steve132's original example.

I just do not understand why such a generalized construct like typespec, of course discussed more and maybe changed (polished) in a few places, is not suitable for @Steve132. I think that the ND array is just an edge case of what @kxepal proposed, and the latter's proposal is much more generic (allows object optimizations etc.) while still preserving STC compliance.

And this is so even without repeatable headers (which are not necessary for this, I think).

The parsing of such a header is only a bit harder, I think (due to more options in the typespec).

Steve132 commented 9 years ago

I mean in the format that parser should produce.

What data types the parser returns to the client is an implementation detail that is part of the API choice for the parser, not part of the spec. The reason is that what data types are available is language- and API-dependent.

If I was implementing this, in python I'd return a numpy array. In UBJ it returns a linear array augmented with metadata (or, more accurately, the UBJ array type is always an N-D array in contiguous memory, but on $# the array size is Nx1x1x...). If you WANTED to return this as native arrays of arrays in the API implementation, there's no reason in the spec why you couldn't.
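
A sketch of what that API choice might look like (purely illustrative; dims, flat and dtype are assumed to come from whatever parser you already have):

import numpy as np

def to_api_value(dims, flat, dtype):
    a = np.asarray(flat, dtype=dtype)
    if not dims:                              # plain '$#' header: stay 1-D
        return a
    return a.reshape(tuple(reversed(dims)))   # '@' header: dims are fastest-first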

Working with ND-arrays assumes passing matrix coordinates like arr[1][2][3], not a single byte number like arr[6]. If you work with an ND-array as with a 1D-array then you probably picked the wrong data type from the start. Why should UBJSON try to fix such mistakes?

No, it doesn't. In pretty much ANY actual work with ND-arrays they implement them this way (as contiguous linearized 1-D arrays). Matlab does it, numpy does it, LAPACK does it, BLAS does it, OpenCL does it, images do it.

NO actual applications or libraries I have ever worked with do what you said (using native N-D arrays). The reason is simple: it's WAY slower and harder to work with and the memory isn't guaranteed to be contiguous.

I think that something like that produces the same description as @Steve132's original example.

I just do not understand why such a generalized construct like typespec, of course discussed more and maybe changed (polished) in a few places, is not suitable for @Steve132.

Because it introduces a LOT of weird corner cases and lookahead rules that make correct parsing a LOT more complicated, introduces a LOT of performance overhead in the parser, is a DRAMATIC change in code over previous versions of UBJSON (vs this change, which is just a few extra lines in the header parser), and is completely unlike not just JSON but also XML or any other format that we would likely want tools compatibility with.

It doesn't even save us that much space... In the example https://github.com/thebuzzmedia/universal-binary-json/issues/61#issuecomment-66746984 of using typeschema for a 3-D array, it's MANY MANY more bytes than this proposal.


Steve132 commented 9 years ago

Small correction that bugged me,

[[]
    [#][I][1280] <- 1280 elements of type:
    [[][#][I][720] <- array of 720 elements of type:
        [[][#][i][3] <- array of 3 elements of type:
        [U] <- byte
        []]

    []] <- typespec ends

[1]
[2]
[3]
[4]
...
[]]

Would actually have to be

[[]
    [#][I][720] <- 720 elements of type:
    [[][#][I][1280] <- array of 1280 elements of type:
        [[][#][i][3] <- array of 3 elements of type:
        [U] <- byte
        []]

    []] <- typespec ends

[1]
[2]
[3]
[4]
...
[]]

Because the data I was representing originally was an x-major-order array of pixels... so it would be 720 rows of [1280 pixels of [3 bytes]] if we used this convention.

meisme commented 9 years ago

What's the sense of passing a pointer to some arbitrary byte number? Working with ND-arrays assumes passing matrix coordinates like arr[1][2][3], not a single byte number like arr[6]. If you work with an ND-array as with a 1D-array then you probably picked the wrong data type from the start. Why should UBJSON try to fix such mistakes?

In low-level programming (C/C++) a pointer is equivalent to an array (and is the way you pass arrays between functions). The [] operator just calculates an offset from the initial pointer. And there's not much difference between an ND array and a 1D array: arr[y][x] means pretty much the same as arr[y*X_LEN + x]. The first notation only works for arrays with a known size at compile time.
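
The same index arithmetic, written out in Python for a 3x4 grid just to make the equivalence concrete:

Y_LEN, X_LEN = 3, 4
nested = [[y * X_LEN + x for x in range(X_LEN)] for y in range(Y_LEN)]
flat = [v for row in nested for v in row]

assert all(nested[y][x] == flat[y * X_LEN + x]
           for y in range(Y_LEN) for x in range(X_LEN))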

I see the problem that this proposal closes the door on a typeschema definition for mixed-type containers: there is no way to add that without changing the ND-array header, so it'll again break compatibility. From my position, the solution shouldn't be too specific, but general by its own nature, to handle the other problems which UBJSON / JSON face today.

Yes, the choice is between this and a typeschema way of doing it. You can't have both.

Miosss commented 9 years ago

Because the data I was representing originally was an x-major-order array of pixels... so it would be 720 rows of [1280 pixels of [3 bytes]] if we used this convention.

Well this is irrelevant to the concept behind the example.

What data types the parser returns to the client is an implementation detail that is a part of the API choice for the parser, not part of the spec. The reason is that what data types are available is language and API dependant.

So the ND array in UBJSON has no fixed interpretation? The ND array has this additional information, yet it is up to the implementation to decide whether it is a 3x3 array or 1x9? Then what is it for?

What is the JSON equivalent of an ND array?

In pretty much ANY actual work with ND-arrays they implement them this way (as contiguous linearized 1-D arrays). Matlab does it, numpy does it, LAPACK does it, BLAS does it, OpenCL does it, images do it.

So. If every language/application encodes an ND array as a 1D array, then why incorporate ND into UBJSON? What for, if a 1D array is sufficient, since every "known" language uses that interpretation?

What does this ND header really do? Is this just an STC extended with metadata containing a dimension specification which can be discarded by the parser? It does not improve efficiency at all (the STC header style does all the optimization). It adds no useful information (no direct translation to JSON; no precise description of the resulting type (3x3, 1x9, ...)).

0S-1E+0U+0P

meisme commented 9 years ago

So the ND in UBJSON has no fixed interpretation? The ND array has this additional information, yet it is up to the implementation to decide whether it is a 3x3 array or a 1x9? Then what is it for?

The interpretation is clear, it's a 3x3 array. There's no question about that.

A library API can choose to offer the data as a 1x9 if it has a special need to do so, but that doesn't have anything to do with the data interchange format itself. Just like a python library is likely to treat all integers as int and all floats as float.

What is the JSON equivalent of an ND array?

2D is an array of arrays, 3D is an array of arrays of arrays, and so on and so forth. For example, a 2x3 array corresponds to [[1,2,3],[4,5,6]] in JSON.

So: if every language/application encodes an ND array as a 1D array, then why incorporate ND into UBJSON? What for, if a 1D array is sufficient and each "known" language uses that interpretation?

The applications don't encode an ND array as a 1D array. They store the ND arrays in contiguous memory.

The reason to support it is to add interoperability. If there's no way to express an ND array in contiguous memory in UBJSON, then each of these implementations will linearise the data and store the metadata in a library-specific way. The problem with that is that the linear array is secretly an ND array, but there's no standard-compliant way of expressing that without paying for it performance-wise.

What does this ND header really do?

It allows you to define a multidimensional array in contiguous memory.

Is it just an STC extended with metadata containing a dimension specification, which can be discarded by the parser?

Any metadata can be discarded by a parser if that makes more sense for that specific application. We're not designing a standard library. All the STC metadata is discarded when you convert to JSON, for instance.

It does not improve efficiency at all (the STC header style does all the optimization).

It does improve efficiency. As it stands, if a parser needs ND array data and wants to create valid UBJSON that is readable by other implementations, it will have to represent it as an array of arrays. When it reads that back, it will have to copy the data to a new memory location to get it contiguous.

Additionally, storing the data as per Draft 11 looks like this

[[]
    [[][$][U][#][U][200]...
    [[][$][U][#][U][200]...
    [[][$][U][#][U][200]...
[]]

While with the new proposal it would look like this

[[][$][U][@][2][U][200][U][3]
    ...
    ...
    ...

Which means that you save at least 5 bytes for every inner sub-array header you no longer need, i.e. at least (product of all dimensions except the innermost) * 5 bytes, making it more space efficient as well.
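As a rough way to see where that bound comes from, here is a small Python sketch (a hypothetical helper, not from any implementation; the default of 6 bytes per sub-array header matches the [[][$][U][#][U][200]-style headers above and grows with the integer width of each count):

    from functools import reduce
    from operator import mul

    def nested_header_overhead(dims, header_bytes=6):
        # dims are ordered innermost-first, as in the [@] proposal.
        # Draft 11 nesting needs one header for every sub-array at every level
        # above the innermost; count those sub-arrays level by level.
        total_headers = 0
        for level in range(1, len(dims)):
            total_headers += reduce(mul, dims[level:], 1)
        return total_headers * header_bytes

    # 720p RGB example: one header per pixel triple (720*1280) plus one per row (720).
    print(nested_header_overhead([3, 1280, 720]))   # ~5.5 million bytes of pure header overhead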

It adds no useful information (no direct translation to JSON; no precise description of the resulting type (3x3, 1x9, ...)).

I've already addressed this, it's not true.

Miosss commented 9 years ago

The interpretation is clear, it's a 3x3 array. There's no question about that.

What data types the parser returns to the client is an implementation detail that is a part of the API choice for the parser, not part of the spec. The reason is that what data types are available is language and API dependent.

I asked which UBJSON would be the equivalent. Steve answered that this is an API implementation detail.

My entire post is based on this answer: that this is not a fixed structure but only a hint. You must have noticed that.

The applications don't encode an ND array as a 1D array. They store the ND arrays in contiguous memory.

That is what I meant. Wrong choice of words.

It allows you to define a multidimensional array in contiguous memory.

UBJSON is big-endian, and still many machines are little-endian. So you still have to do a conversion (not just a cast!) for each value in the array if your parser just returns you a pointer to this STC (maybe ND) array inside the UBJSON message. Am I right?

It does improve efficiency. As it stands, if a pars ...

It only improves the parsing of nested STC arrays of numbers. Typespec does precisely the same thing and much more.

Because it introduces a LOT of weird corner cases and lookahead rules that make correct parsing a LOT more complicated,

ND implemented in typespec is as easy as your proposal,

introduces a LOT of overhead in the parser for performance

In performance? No. Complexity? A bit. You have a more general structure to parse, but typespec is also intended more for bigger messages (like your scientific, huge arrays) than for small objects, so performance is not dramatically lower.

, is a DRAMATIC change over previous versions of UBJSON in the code (vs this change which is just few extra lines in the header parser)

Not a change. An extension. The only change it could be is to replace STCs as they are known with typespec. But this is not the right issue to discuss that in.

and is completely unlike not just JSON, but also unlike XML or any other format that we would likely want tools compatibility with.

Well, XML is not a reference point here. And you still do not get one crucial point: typespec is a binary encoding optimization. It does not change the semantics of the resulting message at all. It is just like an STC. STCs allow only one type to be specified, while typespec allows a mixture of types (and also nested containers etc.). So a typespec-described message has precisely one translation to Draft 11 UBJSON, and so to JSON as well.

You are still trying to optimize STCs, and in fact only STC arrays. They are great even now, but only for binary/numbers/bools. JSON mostly uses objects. Typespec gives you the chance to have STCs, ND STCs, even single-typed ones if you want. But it also gives many more opportunities to optimize not only your large matlab/python collections of numbers, but objects as well. Your ND does almost nothing for objects.

meisme commented 9 years ago

I asked which UBJSON would be the equivalent. Steve answered that this is an API implementation detail.

My entire post is based on this answer: that this is not a fixed structure but only a hint. You must have noticed that.

You asked what the parser should produce, which could mean both the API and the UBJSON. I believe there was simply a misunderstanding between the two of you there. I did notice that, but I decided to give a full reply to your post anyway since you did bring up some other points I could address.

UBJSON is big-endian, and still many machines are little-endian. So you still have to do a conversion (not just a cast!) for each value in the array if your parser just returns you a pointer to this STC (maybe ND) array inside the UBJSON message. Am I right?

Yes, that's true, and a very good point. x86 and x64 are little endian, are they not?

It only improves the parsing of nested STC arrays of numbers. Typespec does precisely the same thing and much more.

Yes. Which is why the question is "do we need typespec (with their increased complexity), or is this sufficient?"

Miosss commented 9 years ago

Yes, that's true, and a very good point. x86 and x64 are little endian, are they not?

Yes they are. According to Wikipedia, Intel, AMD, many ARMs, etc. are little-endian, while OS X and network byte order (and so UBJSON) are big-endian. Why the network byte order is big-endian while the most popular platforms are little-endian is beyond my knowledge. Therefore the conversion will have to be done quite often where desktops etc. are concerned.
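For example, a minimal Python/numpy sketch of that conversion step (illustrative values, not tied to any particular UBJSON library): the big-endian payload is byte-swapped once into native order, which is a copy pass rather than a free cast.

    import numpy as np

    payload = bytes([0x00, 0x01, 0x00, 0x02, 0x00, 0x03])   # three big-endian int16 values: 1, 2, 3
    be_view = np.frombuffer(payload, dtype='>i2')            # zero-copy view with a big-endian dtype
    native = be_view.astype('=i2')                           # copy + byte swap into native byte order
    print(native.tolist())                                   # [1, 2, 3] on any host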

Which is why the question is "do we need typespec (with their increased complexity), or is this sufficient?"

Yes, I feel it, but the discussion is spread across so many issues. Some reorganization would be great.

Steve132 commented 9 years ago

Yeah, I'd like to clarify that what it represents is unambiguous in the format. It's an N-D array.

What I thought you were asking is "what data type should the parser return after a parse" which is implementation defined

ghost commented 9 years ago

My thanks to @Steve132 for formalizing his thinking here in a really concise way, and to @kxepal, @meisme and @Miosss for a very in-depth discussion.

I am starting to see a very clear divide in our ability to move the format forward, with arguments falling into one of two buckets.

The repeating header discussion felt like a clear-as-day decision to me, until we started to consider the high performance processing considerations that @Steve132 pointed out time and time again.

Same goes for this discussion... the constant mantra of "contiguous memory, contiguous memory".

I get it, I definitely get it, but I also feel like we are at a point where a decision needs to be made - either UBJSON is going to be an optimized, big-data format with all sorts of binary data optimizations and hangs out at the same bars that Protobufs does or it's going to be a more general case/flexible language who is best friends with JSON and XML and they hang out on the weekends.

I'll be honest, when I initially defined UBJSON, my heart always intended it to be more in line with JSON and XML - sort of a "JSON++" -- it is only after @Steve132 showed up with his fancy haircut and Italian leather shoes that I realized the fit UBJSON had in the data world -- this wasn't something I saw happening initially.

Damn my eyes, but I still (in my gut) feel like repeatable headers, the [C] marker and supporting prefixed container-schemas are all good ideas.

@Steve132 I've read your proposal, twice, and it's brilliant... but I don't know if this is how I intended the spec -- it feels "out of character" in a way.

I also feel like there IS a way to make those more flexible changes I mentioned (repeatable headers, and prefixed schemas) and still allow you to efficiently represent your data, but I get this sense that you have a very specific way of how you want this modeled and so this seems more like a binary decision than I think it really is. But notice I used "I think" -- I could be wrong, this might be as black and white as it seems...

I have been mulling this over since the discussion started and am going to continue to do so through the holidays... I need to go back to basics on paper and see if I can come up with something that solves both the heart and intent of UBJSON along with the real-world, larger scale requirements.

I want to reiterate very loudly here though THANK YOU to all of you... we all have a very limited view of the world on our own, it is only together that we really can come together and start to understand what the elephant looks like.

AnyCPU commented 9 years ago

Hello, guys!

I have read this discussion. At the moment I am working a little bit with the Postgres binary protocol, which is used in the transport layer. What can I say: Postgres has integers (in big-endian), n-length strings and other simple types that are very similar to UBJSON.

Also, Postgres has a mechanism for user-defined/registered types. For such types a user can define their own encode/decode function. So Postgres has a base layer of binary protocol with extension compatibility.

About arrays from the Postgres point of view: some day, for example, the db driver may produce an OutOfMemory error or something else. Memory usage optimization can be done in a very specific encode/decode function. The transport and storage representations can be very different in such a case.

The main idea is that UBJSON must act as a base layer. The first versions of UBJSON were good enough in terms of efficiency and output size.

Regards, Michael.

kxepal commented 9 years ago

@AnyCPU hi! Have you looked at their binary JSON implementation?

AnyCPU commented 9 years ago

Hello, @kxepal )

Short answer - no.

Long answer.

A UBJSON document can be easily stored in Postgres. Requirements for basic support:

- UBJSON can be applied to the transport layer and to existing types, and used as a first-class type.

- Postgres is extensible and simple enough.

I have built a bridge between an internal type system and the Postgres binary protocol in a commercial product.

So, everything is possible.

Regards, Michael.

kxepal commented 9 years ago

@AnyCPU sounds cool. Hope you can open-source your solution - that would be awesome!

Steve132 commented 9 years ago

it's going to be a more general case/flexible language who is best friends with JSON and XML and they hang out on the weekends.

I actually like that it's already basically this. Like I said leading up to this, I'm very very happy with UBJSON as of standard 12. I think it's perfect. I just like this one feature.

However, I'd like to point out one thing: this is what @kxepal said his opinion is on what the purpose of UBJSON is:

the goal for UBJSON is to provide a binary format which is in a transitive relation with JSON and compatible with it, while solving its obvious design flaws for handling real-world data.

I actually very much agree with his answer. The question is this: What "design flaws" is he referring to? What real-world data is he referring to? From what I can see, JSON has only one design flaw relating to data: it is inefficient at handling large numerical datasets. In ALL other respects, JSON is already basically optimal. UBJSON does not beat it significantly for object encoding, for string encoding, or for any other kind of non-numeric encoding. The only "design flaw" in JSON that UBJSON does better on is that being binary allows static-sized numeric values to be efficiently stored and parsed. Large numeric data is the ONLY kind of "real-world data" that UBJSON wins on vs. JSON, and JSON is already significantly entrenched in standard library implementations, in tools, in interchange formats, in REST APIs...so unless you need large, static-sized numeric values, there's no reason not to use it.

If we walk into the bar where JSON, UBJSON, and XML are hanging out, and we ask "Which one of you should we use?" then people who need markup (mostly unstructured data with some tags) will choose XML, people who have large amounts of scientific data will choose UBJSON, and everyone else will choose JSON (and they should), because it's objectively better than the alternatives (including UBJSON) if you don't have scientific data and instead have lots of strings and objects and records and want to interface with databases and websites.

I also feel like there IS a way to make those more flexible changes I mentioned (repeatable headers,

Repeatable headers cannot, by definition, coexist with static-length containers without performance penalties. Repeatable headers == dynamic length. Full stop. Now it might be a tradeoff worth making, but I'm fairly confident about this :).

and prefixed schemas) and still allow you to efficiently represent your data,

Prefixed schemas would work fine and be efficient, but they would be fantastically complicated and a radical departure from the status quo, which I already believe is very good. It would also be a radical departure from JSON, so much so that I don't think we could have JSON in the name. Maybe UBSCHEMES is a better name?

but I get this sense that you have a very specific way of how you want this modeled and so this seems more like a binary decision than I think it really is. But notice I used "I think" -- I could be wrong, this might be as black and white as it seems...

Nope, I don't have any marriage to anything except simplicity and efficiency. I got involved with UBJSON, started using it and wrote the C implementation because the status quo had simplicity and efficiency in abundance and I was impressed with it. I'm not married to a particular data format, but I really really like what UBJSON is now and I'd like to avoid messing with a format that works. I wanted to add the N-D array proposal because it's the smallest possible proposal change that adds the most-important feature to our largest use case. It's small because it changes no datatypes, no parser steps, no datastructures in the parser...just a small change to the header for metadata. It's the most important feature because N-D data is EXTREMELY common (among the most common kinds of data in the world) and the point of JSON is interoperability; and it targets our largest use case because, as I already explained, large binary numeric datasets are our primary audience, since they are the only compelling reason to actually use UBJSON over JSON.

I have been mulling this over since the discussion started and am going to continue to do so through the holidays... I need to go back to basics on paper and see if I can come up with something that solves both the heart and intent of UBJSON along with the real-world, larger scale requirements.

:D

kxepal commented 9 years ago

@Steve132 , no need to speak about me in the third person. I'm not dead, but here and listening (:

the goal for UBJSON is to provide a binary format which is in a transitive relation with JSON and compatible with it, while solving its obvious design flaws for handling real-world data.

I actually very much agree with his answer. The question is this: What "design flaws" is he referring to?

You probably deal too much with JSON (: Here is my list:

  1. No integers, no floats, only numbers. Should you use a float type to handle the number 0? How would you compare the next two objects: {"a": 0} and {"a": 0.0} - are they equal or not? And why? And from which point of view? (See the short Python illustration after this list.)
  2. Mess with strings. By following the JSON spec, all characters which are not explicitly defined as allowed should be encoded in \uXXXX escape notation. However, people and libraries tend to violate this rule by using raw UTF-8 strings instead (or some other codepage). This behaviour (for the UTF-8 case) became a de-facto standard, so a library which strictly follows the spec might fail in the real world.
  3. Undefined behaviour for repeated object keys. What should you do with the object {"a": 1, "a": 2} - throw an exception, let the last definition win, let the first definition win, keep both? All of these are valid; that's the fun part.
  4. No way to work with containers as a stream. Say you have a huge array of objects, or just a big object of some data, and you want to decode it. In fact, you have to read it all into memory before you start decoding. That may hurt, especially if you need to send such data over the network. There you have two strategies: all-or-nothing, praying that the connection stays stable and reliable (ha-ha-ha), or sending it in chunks and merging on the other side. For the array-of-objects case there is a more-or-less elegant solution: invent a line-based protocol and stream each array element one by one. Either way, you end up with hacks and workarounds to make it work with a smaller footprint. I'm aware of ijson, but that's still a hack and a non-standard, much more complicated way to work with JSON.
  5. No binary data support. No comment needed there.
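For points 1 and 3, a quick illustration with Python's standard json module (one library's behaviour, used only as an example of how differently parsers may act):

    import json

    # Point 3: duplicate keys are accepted silently here and the last one wins;
    # other parsers may keep the first, keep both, or raise an error.
    print(json.loads('{"a": 1, "a": 2}'))                                    # {'a': 2}

    # Point 1: 0 and 0.0 decode to different types (int vs float), even though
    # JSON itself only has "number".
    print(type(json.loads('{"a": 0}')["a"]), type(json.loads('{"a": 0.0}')["a"]))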

Speaking about the application layer, there are no obvious ways to encode datetime and decimal values, or to deal with data tables, arrays of objects, etc. If you're dealing with the web, then you should also care about number overflow on the JavaScript side and protect your big numbers by passing them as strings. This causes everyone to use their own solution, inventing their own "forks" of the spec, with the well-known fate.

Someone might tell you that the lack of annotations / comments and multiline strings are flaws too, but in most cases those people are trying to use JSON for configuration files. They are doomed already; no need to waste time on them.

What real-world data is he referring to?

REST APIs and plain JSON APIs, and various JSON-related specs (JSON-LD, JSON-Path etc.). If you need a specific example, let's take a look at a JSON-RPC / REST API. Here we have two kinds of things: collections and elements. An element is some object with some set of fields; let's pick up this nice example. A collection is an array of similarly-typed things: it could be a list of strings, numbers, or elements. The last one is interesting: what could be easier than just returning an array of objects like the one above, right? Well, if you need to return thousands of them and process them fast, you have a big problem.

As I noted, JSON lacks streaming, so if you try to fetch 100500 TwitterTimeline examples you'll need: 1) enough time to fetch the data and 2) enough memory to process it. To fix point 1) you'll have to invent some chunking or pagination for your API, which means either: a) stepping away from a standard JSON API, or b) forcing your clients to make more requests to your service -> forcing your service to consume more resources (pagination doesn't save you from the OOM issue) -> degrading the quality of service you provide.

After inventing a line-based protocol all these issues go away and you seem happy, but it's not the end: your service uses too much traffic to serve your clients. Traffic costs money; the more data you send, the more time it takes to receive it. Compression is a solution, but everyone has to pay CPU resources for that. Then you start to optimize, and what do you see? For your timeline collection you send, every time, for every element, the same object keys! Why not send them only once and then send only the actual data? This saves a lot of bytes, but the decision is hard to make since it requires completely changing the output format. In fact, if we're going to change the format, why not pick something better, something binary and compact? At this step most people move away from JSON to MessagePack and similar formats and become completely happy. And by the way, processing MessagePack is faster and cheaper than JSON.

--- snip ---

That's why I disagree with you about Large numeric data is the ONLY kind of "real-world data" that UBJSON wins on vs. JSON - that's simply not true. Just as it's not true that ND-arrays are EXTREMELY common in the real world. The examples that you gave me before are good, but they represent quite low-level APIs of specific data domains, or apply to specific data like GPS coords. The real world is a bit more diverse.

--- continue ---

It would also be a radical departure from JSON so much so that I don't think we could have JSON in the name. Maybe UBSCHEMES is a better name?

Headers with a container schema are a sort of optimization, nothing more. They are no more radical a departure from JSON than ND-arrays, strong typing, unbounded containers and UTF-8 strings (: This feature doesn't try to take the place of Protobuf (though such a type-contract feature is a good side effect), but optimizes the very common case for every API: a collection of similar elements. The repeating of these headers is needed to fix the number-overflow issue, but as a side effect it lets you optimize containers with mixed data. A pretty good use-case example: stream aggregation. On input come multiple streams of different, but similar per stream, objects, and they are all passed to the client in a single stream. An obvious real-world example: a cluster with a front-end proxy.

I wanted to add the N-D array proposal because it's the smallest possible proposal change that adds the most-important feature to our largest use case. It's small because it changes no datatypes, no parser steps, no datastructures in the parser...just a small change to the header for metadata.

Well, the ND-array feature causes changes in: 1) header handling; 2) creating a special context for it, since you need to generate ND-array output, not something else: you cannot just expand the ND-array header as a macro into a plain old typed and sized array; 3) processing the data in the defined way: read N elements; apply context; shift dimension; yield array; repeat; 4) it may be abused as a sort of zip bomb if you're doing preallocation, so you have to apply additional restrictions to the header, which means violating the spec.

Steve132 commented 9 years ago

Since you need to generate ND-array output, not something else: you cannot just expand the ND-array header as a macro into a plain old typed and sized array.

Yes, actually, you can. Check the code I have in ubj for the array type, this is exactly how I implement it.

Process the data in the defined way: read N elements; apply context; shift dimension; yield array; repeat.

No you don't have to do this. You just input/output the entire array like you would the linear array case. It's only a metadata change.

4) It may be abused as a sort of zip bomb if you're doing preallocation, so you have to apply additional restrictions to the header, which means violating the spec.

I don't know what you're talking about here...unless I'm confused, this concern would equally apply to the linear size-case...this proposal doesn't change that.

No way to work with containers as a stream. Say you have a huge array of objects, or just a big object of some data, and you want to decode it. In fact, you have to read it all into memory before you start decoding. That may hurt, especially if you need to send such data over the network.

I'm not saying you're wrong here, but could you clarify this a bit? I've seen and implemented streaming one-pass JSON parsers that work over sockets and arbitrary io streams that don't need to completely parse the entire array into memory...I'm possibly confused about this.

I still feel like my point stands: if a user wants to know which format is best for REST data and web APIs with lots of structured objects and strings, JSON is still the better choice over UBJSON. Even though JSON may have those few specification flaws you've mentioned, library implementations are mature enough that most people don't deal with them in the real world (I use JSON OFTEN and I've never encountered them). Since those flaws don't matter much, and the size of the data stream for UBJSON and JSON is similar for that kind of data, there's no compelling reason to adopt UBJSON.

In contrast, dense numeric scientific data that is used in research, in games, in multimedia, in AI, in signal processing, in filetype storage, in machine learning, etc. has a billion and a half different completely incompatible binary formats that are all basically just "collections of named N-D arrays of floats and ints with some metadata". UBJSON is basically the killer app for this domain, because it would allow unifying these different applications into a common parser interchange format, which REALLY REALLY excites me.

kxepal commented 9 years ago

Since you need to generate ND-array output, not something else: you cannot just expand the ND-array header as a macro into a plain old typed and sized array.

Yes, actually, you can. Check the code I have in ubj for the array type, this is exactly how I implement it.

Well, your implementation is a bit different from the proposal. You assume a very small number of dimensions, for instance, and you have a special structure which is ready to work with dimensions. Now try to do the same with plain old common arrays, without custom types (: That's the case for scripting languages, where native structures are preferred over custom ones.

Process the data in the defined way: read N elements; apply context; shift dimension; yield array; repeat.

No you don't have to do this. You just input/output the entire array like you would the linear array case. It's only a metadata change.

That causes a problem with invalid data: how would you handle the case where the ND-array doesn't contain enough data for its dimensions? The output wouldn't be a linear array, but the same ND-array defined in the language's native structures.

It may be abused as a sort of zip bomb if you're doing preallocation, so you have to apply additional restrictions to the header, which means violating the spec.

I don't know what you're talking about here...unless I'm confused, this concern would equally apply to the linear size-case...this proposal doesn't change that.

Try specifying an anomalously high number of dimensions, each sized by some sort of int64, while preallocating everything in a single shot to reduce memory fragmentation. You'll likely fail early, somewhere in the middle of header processing.

No way to work with containers as a stream. Say you have a huge array of objects, or just a big object of some data, and you want to decode it. In fact, you have to read it all into memory before you start decoding. That may hurt, especially if you need to send such data over the network.

I'm not saying you're wrong here, but could you clarify this a bit? I've seen and implemented streaming one-pass JSON parsers that work over sockets and arbitrary io streams that don't need to completely parse the entire array into memory...I'm possibly confused about this.

To work with JSON as a stream you have two options: take a yajl-based parser or cut the JSON string into fragments. The first causes serious API changes, and not many people actually use it despite that. The second way is more common since it's plain stupid and simple, but it's a sort of hack, not a part of the format specification. Compare this with UBJSON, where streaming arrays/objects are native: everything works in the standard way out of the box.

Steve132 commented 9 years ago

Well, your implementation is a bit different from the proposal. You assume a very small number of dimensions, for instance, and you have a special structure which is ready to work with dimensions.

In the proposal I specifically said that the max number of dimensions is 255. I think in UBJ I put that at 8 for no reason, but I fixed it. In either case, 8 and 255 are both small fixed-size constants.

That causes a problem with invalid data: how would you handle the case where the ND-array doesn't contain enough data for its dimensions? The output wouldn't be a linear array, but the same ND-array defined in the language's native structures.

How do you handle the case where a LINEAR array contains less data than the STC header claims it has? It's an error, or you fill the rest of the preallocated space with zeros.

Just do the same thing here. Again, with this proposal, it piggy-backs so heavily on the linear-case that again, however you handle the linear case is however you handle this. Remember, it's only a metadata change.

Try specifying an anomalously high number of dimensions, each sized by some sort of int64, while preallocating everything in a single shot to reduce memory fragmentation. You'll likely fail early, somewhere in the middle of header processing.

The max number of dimensions is 255 for this reason. I said that in the proposal specifically to counter this effect. The maximum number of bytes in the dimension list is 255*9. In the linear case, if you try to import a $# array of size 2^63 the implementation is responsible for erroring, and this proposal has no impact on that.
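A minimal sketch of the kind of guard being described (a hypothetical helper, not spec text): the same sanity check an implementation needs for a linear [#] count, applied unchanged to the [@] header before any preallocation.

    from math import prod

    def check_nd_header(dims, elem_size, bytes_remaining, max_dims=255):
        # Reject hostile headers before allocating anything.
        if len(dims) > max_dims:
            raise ValueError("too many dimensions")
        expected = prod(dims) * elem_size
        if expected > bytes_remaining:
            raise ValueError("header promises more data than the input contains")
        return expected   # safe number of payload bytes to read/allocate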

you have a special structure which is ready to work with dimensions.

The 'special structure' is just the array type for the linear array required in C (memory+length) with the additional metadata (memory+length+dimensions).

Now try to do the same with plain old common arrays, without custom types (: That's the case for scripting languages, where native structures are preferred over custom ones.

I can't imagine why this would be a problem in any language. In Python, numpy arrays do everything in this proposal by default, and they are the 'native' big-data type. However, a parser that parsed to a list of lists of lists of ... of lists would be trivial to write (if inefficient).

Here it is:

def parsetolistoflists(linearstreamgenerator, dims):
    # dims are ordered innermost-first (as in the proposal), so the
    # outermost dimension is dims[-1]
    if len(dims) == 0:
        return next(linearstreamgenerator)
    else:
        return [parsetolistoflists(linearstreamgenerator, dims[:-1])
                for _ in range(dims[-1])]
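For example (illustrative values only), six flat values with dims given innermost-first as [3, 2] come back as two rows of three:

    flat = iter([1, 2, 3, 4, 5, 6])
    print(parsetolistoflists(flat, [3, 2]))   # [[1, 2, 3], [4, 5, 6]]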

There's also no intrinsic reason custom types are bad either, and I can't think of a language that wouldn't make this custom type trivial. In python for example, you can make operator overloads.

This proposal does not restrict ANY of the above options to the implementor.

kxepal commented 9 years ago

I can't imagine why this would be a problem in any language. In Python, numpy arrays do everything in this proposal by default, and they are the 'native' big-data type.

UBJSON is not a format for big data; it's more general. So trying to force the use of data structures specific to the big-data world isn't a very good idea.

The basic use case for ND-arrays is CSV data dumps. But that's a tabular data format which isn't suitable for stream processing. However, working with CSV lines as arrays also isn't handy, especially if there is a header where each field name is defined, so we end up with objects as the unit of data processing, not arrays.

There's also no intrinsic reason custom types are bad either, and I can't think of a language that wouldn't make this custom type trivial. In python for example, you can make operator overloads.

For operator overloading you need some very serious reasons. Most of these overloads are either strictly tied to the domain of data they belong to (math), or are attempts by students to demonstrate their language knowledge, which almost always produces worse and unintuitive behaviour. The ability to do something doesn't mean that you should do it.

Steve132 commented 9 years ago

UBJSON is not a format for big data; it's more general. So trying to force the use of data structures specific to the big-data world isn't a very good idea.

My point is that I reject your assertion that it's 'unnatural' to handle nd-array data in python or other dynamic languages, because many many existing toolkits do it, and you can easily do it with numpy, with native lists of lists, or with a custom array type, with very very little code.

The basic use case for ND-arrays is CSV data dumps.

Respectfully, I strongly STRONGLY disagree with this. The basic use case for ND arrays is pretty much any numeric binary data ever, which as I've said is pretty much the basic use case for UBJSON (UBJSON doesn't really compress strings and objects). CSV data dumps are an extremely limited scope of UBJSON application compared to the potential.

But that's a tabular data format which isn't suitable for stream processing.

If the linear $# data format is streamable as you assert, then the ND $@ data format is streamable.

For operator overloading you need some very serious reasons. Most of these overloads are either strictly tied to the domain of data they belong to (math)

I think overloading __call__() or __getitem__() to do ND-array lookups is incredibly common, a very valid reason to overload, done by every single container class in Python and by many other libraries, and simple and readable. Furthermore, multi-dimensional arrays count as "(math)" in my opinion.
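As a concrete sketch of that kind of overload (the NDView name and layout here are hypothetical, not from ubj or numpy), a flat buffer plus the [@] dimension list is all the state needed for constant-time indexed lookups:

    class NDView:
        # Read-only N-D view over contiguous linear data, using the proposal's
        # innermost-first dimension order (dims[0] is the fastest-varying axis).
        def __init__(self, flat, dims):
            self.flat = flat
            self.dims = dims
            self.strides = [1] * len(dims)
            for k in range(1, len(dims)):
                self.strides[k] = self.strides[k - 1] * dims[k - 1]

        def __getitem__(self, idx):
            # idx is a tuple ordered like dims, e.g. view[c, x, y] for the 720p example
            offset = sum(i * s for i, s in zip(idx, self.strides))
            return self.flat[offset]

    # pixels = NDView(raw_bytes, [3, 1280, 720]); pixels[0, 10, 5] is channel 0 of pixel x=10, y=5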

The ability to do something doesn't mean that you should do it.

Sure. My point isn't to say that an implementor of this standard HAS to do that, but to point out that the implementor of the standard in a language such as python has a myriad of options available to them. Operator overloading is one, using existing libraries is another, using native list types is still another.

I do enjoy your comments, though; they've helped me think clearly about my intentions when I respond.