
Fixed-length N-Dimensional Arrays for UBJSON #61

Closed. Steve132 closed this issue 9 years ago.

Steve132 commented 9 years ago

Fixed-length N-Dimensional Arrays for UBJSON

So, a LOT of the dramatic changes to UBJSON that we are currently discussing seem to stem primarily from issue #43: https://github.com/thebuzzmedia/universal-binary-json/issues/43

In that issue, a complex recursive schema header was proposed as a way of handling dense multi-dimensional data inside an STC.

I pointed out that a simple "multi-dimensional" array marker would allow dense multi-dimensional data to be stored contiguously in an array without the complexities of a recursive schema.

Eventually, however, the conversation devolved into debating various schemas, and it was decided that debating the merits of more complicated schemas or repeating headers (or other issues) would be moved to other issue numbers.

Issue #43 was resolved by saying that "Yes, STCs can be embedded into other STCs."

However, this resolution did not actually address the original use case of @edgar-bonet which inspired ALL of this discussion: being able to store dense multi-dimensional data inside STCs, which would let him store the ND arrays from JSON and his application efficiently.

Motivation:

Dense multidimensional arrays of contiguous numbers are incredibly useful for many of the most important use cases of UBJSON, including image transport, sensor data, matrices, scientific data, etc.

If this data requires a parsing step, or has intermediate markers, then it becomes much more difficult to use as a block of memory or as a memory-mapped array in a file.

Including an optional marker to specify multi-dimensional data would allow clients to preserve the structure of multi-dimensional data in JSON or other markup formats while also allowing for extremely efficient parsing and use.

Proposal:

This proposal builds on draft 12 (current draft with nested STCs). No other drafts should be considered.

Instead of a '#' marker, a '@' marker may be found in the stream for a container header. If a '@' marker is found, the next byte is an 8-bit unsigned integer giving the number of dimensions that container has. (This is intentional: a 256-dimensional array seems absurd on its face, and allowing this number to be huge could be a way for a malicious data producer to attack an implementation. Making it always 8-bit also saves us one byte, and no serious application will ever have a >255-d array.)

Following the dimensionality count (D), there will be D UBJSON integer values (dim[0]..dim[D-1]) giving the dimensions of the array.

The dimensions are ordered 'innermost first': dim[0] is how many values to skip in the linear data to reach the end of the first line, dim[0]*dim[1] is how many values to skip to reach the end of the first 'square', dim[0]*dim[1]*dim[2] to reach the end of the first 'cube', and so on.
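To make that ordering concrete (a worked reading of the above, not extra wire format): for a 3-dimensional array, the element at indices (i0, i1, i2) sits at linear offset

    i0 + dim[0]*i1 + dim[0]*dim[1]*i2

i.e. the stride of each axis is the product of all lower-numbered dimensions, so dim[0] varies fastest in the stream.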

Example: A 720p bitmap image of 8-bit pixels

[[][$][U][@][3][i][3][I][1280][I][720]
    [23][23][42] [23][13][12] [13][13][42] ...Repeat 1280 sets of 3
    [43][43][42] [43][43][12] [23][53][32] ...Repeat 1280 sets of 3
    ...Repeat 720 rows.

...no closing marker needed

It's important to note that this binary representation of the data is DIRECTLY the representation expected by most image applications.

One can embed these kinds of structures and large matrices directly into scientific programs or memory. They can be loaded efficiently with NumPy. They can even be memory-mapped directly for big scientific datasets.
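For instance, here is a minimal NumPy sketch of that memory-mapping claim, assuming the 720p example above has been written to a file "frame.ubj" (its header -- the bytes [ $ U @ 3 i 3 I 1280 I 720 -- occupies 13 bytes; the file name and offset are assumptions for this example):

import numpy as np

raw = np.memmap("frame.ubj", dtype=np.uint8, mode="r", offset=13)
img = raw[:3 * 1280 * 720].reshape((720, 1280, 3))  # dim[0] fastest, so the shape reverses in C order
pixel = img[360, 640]  # the 3 channel values of the center pixel, with no parsing step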

This can be added easily to existing parsing code: simply compute the size of the STC array as the product of all the dimensions, then call the code associated with a linear fixed-length STC as normal to parse the data into a linear array.
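A sketch of that parser change (the stream helpers read_byte()/read_int() are hypothetical names, not from any particular implementation):

from functools import reduce
from operator import mul

def read_container_count(stream):
    # dispatch on the header marker that follows the '$' type marker
    marker = stream.read_byte()
    if marker == ord('@'):
        ndims = stream.read_byte()                        # raw 8-bit dimension count
        dims = [stream.read_int() for _ in range(ndims)]  # D UBJSON integers
        return reduce(mul, dims, 1), dims                 # element count = product of dims
    elif marker == ord('#'):
        count = stream.read_int()
        return count, [count]

Either way, the existing fixed-length STC code then parses that many elements exactly as before.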

Advantages:

By embedding support for multi-dimensional arrays directly into UBJSON, we automatically standardize the protocol for multi-dimensional data across applications that use UBJSON.

Applications that require multi-dimensional data will not have to develop (possibly non-standard) application layers to support it, thus increasing compatibility over many different kinds of data formats.

For example, without this proposal, MATLAB's UBJSON implementation might choose to encode a multi-dimensional array in an application-specific way like this:

[{]
    [i][4]["Dims"]
        [[][$][I][#][i][3][3][1280][720]
    [i][4]["Data"]
        [[][$][U][#][L][2764800]
            [23][23][42][23][13][12][13][13][42]....
[}]

Whereas a NumPy or ImageMagick writing tool might choose to encode a multi-dimensional array in its OWN way:

[{]
    [i][8]["channels"][I][3]
    [i][7]["columns"][I][1280]
    [i][4]["rows"][I][720]
    [i][6]["pixels"]
        [[][$][U][#][L][2764800]
            [23][23][42][23][13][12][13][13][42]....
[}]

It's clear that these applications' data types should be compatible by default. Although we cannot FORCE them to use this feature, if ND-arrays were supported natively in UBJSON then applications would be more likely to use them, thus increasing the overall compatibility of many applications.

It also allows us to support VERY efficient conversions to/from JSON or other markup formats that use multi-dimensional data, WITHOUT changing the data structure or requiring application-level schemas like the ones defined above.

In @edgar-bonet's case, he wanted an automated tool to recognize [ [23.1,32.0],[24.1,54.0] ] and convert it to an efficient STC in contiguous memory with no extra markers, and back again when converting to/from JSON. This use case was never resolved.
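Under this proposal, that fragment could be encoded directly as, for example (a sketch using float32 'd' values; dim[0]=2 is the inner row length):

[[][$][d][@][2][i][2][i][2]
    [23.1][32.0][24.1][54.0]

...and converting back to JSON needs only the header, since the payload is the same four numbers in the same order.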

Disadvantages:

It requires yet another 'type' overload for the '[' container marker, which may rub some people the wrong way when we already have code to handle '$' and '#'.

It adds a small amount of code complexity.

Arguably the data structure should be an application-level decision.

There's no direct "N-D array" type in JSON, only arrays of arrays.

It's not at all clear what should be done in the case of an object. Up until now the headers for arrays and objects have been unified. They should obviously still share code to parse the header, but that leaves an open question: what should the parser do if it encounters [{][$][S][@][5]? What does a 5-dimensional key-value store even mean? More on this later.

Implementation notes:

The array type of the parser can still store linear memory internally, but provide overloaded index operators to get at the data (like this, in C):

size_t ubjr_ndarray_index(const ubjr_array_t* arr, const size_t* indices)
{
    //multi-dimensional array to linear array lookup
    size_t cstride = 1;
    size_t cdex = 0;
    uint8_t i;
    uint8_t nd = arr->num_dims;
    const size_t* dims = arr->dims;
    for (i = 0; i<nd; i++)
    {
        cdex += cstride*indices[i];
        cstride *= dims[i];
    }
    return cdex;
}

Or like this in Python:

def __call__(self, *indices):
    # method on the parser's array type: N-D indices -> linear index
    cstride = 1
    cdex = 0
    for k, v in enumerate(self.dims):
        cdex += cstride * indices[k]
        cstride *= v
    return self.data[cdex]

u = ubjson.Array()
u(13, 42)  # n-d array lookup

In FACT, in Python, NumPy does this automatically, so if you want to use NumPy, the parser would just look like:

ndims = ubjson.readByte()
dims = [ubjson.nextInt() for _ in range(ndims)]
flat = np.frombuffer(ubjson.scanLinearArrayData(prod(dims)), dtype=int)
arr = flat.reshape(dims, order='F')  # dim[0] varies fastest in the stream

If the parser API is a streaming API, then just stream the linear data as normal.
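For example, a streaming parser might expose the payload element by element (same hypothetical helpers as above); a consumer that wants the N-D structure back regroups the stream into sized portions using dims:

def stream_ndarray(stream, dims, count):
    # yield the linear payload one element at a time
    for _ in range(count):
        yield stream.read_element()  # hypothetical typed-element reader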

What about Objects?

As I said, it is an open question what to do with objects.

There are three possibilities I see:

The first possibility is to interpret a 'multi-dimensional object' as an object that expects multiple keys for each value. For example, to store geographic data, one could have two (string) keys for longitude and latitude:

[{][$][U][@][2]
    [i][6]["32.423"][i][6]["231.42"][123]
    [i][6]["12.423"][i][6]["131.42"][16]

I strongly dislike this idea. It has no direct mapping to JSON and seems weird.

The second possibility is to have the @ marker essentially be ignored: the dimensions are still parsed, but the product of the dimensions is interpreted as the size of a fixed-size linear object, just as if it had been specified with '#'.
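Under that reading, for example, the header

    [{][$][U][@][2][i][2][i][3]

would parse identically to [{][$][U][#][i][6], i.e. a fixed-size object of 2*3 = 6 key-value pairs.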

The third possibility is to make a @ marker invalid for objects.

Can this be implemented?

Yes, there is an implementation of this proposal (for arrays only) in the current version of UBJ.

kxepal commented 9 years ago

> My point is that I reject your assertion that it's 'unnatural' to handle nd-array data in python or other dynamic languages, because many many existing toolkits do it, and you can easily do it with numpy, with native lists of lists, or with a custom array type, with very very little code.

Ok, I must admit I didn't think about that point carefully.

> > The basic use-case for ND-arrays is CSV data dumps.
>
> Respectfully, I strongly STRONGLY disagree with this. The basic use case for ND arrays is pretty much any numeric binary data ever.

Sure, but if you look at the databases around, a quite popular question is "how do I import my CSV into it, or dump my data out?". I'm not a fan of this format, but I also have to deal with it every time I need to exchange data with some database-like system, or just because a manager composes all the data in Excel and I need to process it somehow without dealing with the Excel format itself. That's real life and real problems, so this use case is as familiar to me as it is to most people.

> If the linear $# data format is streamable as you assert, then the ND array format is streamable

Streamable, yes, but only in sized portions; otherwise we lose all the meaning of the dimension thing.

> > For operator overload you need some very serious reasons. Most of these overloads are very strict to the domain of data they belong to (math)
>
> I think overloading __call__() or __getitem__() to do nd-array lookups is incredibly common, and a very valid reason to do overloading; it is done by every single container class in Python and by many other libraries, and is simple and readable. Furthermore, multi-dimensional arrays count as "(math)" in my opinion.

Actually they don't: __call__() turns an object into a functor and makes it usable in place of a function. __getitem__() looks more suitable, but a custom implementation gives no benefit over the default list one (for the ND-array case). What you most likely want is to enhance your object with the operations that apply to arrays and matrices. But that's what numpy takes care of.

kxepal commented 9 years ago

Ok, this issue has run to over 50 comments now. Sorry, @Steve132, the use case you insist on is unfamiliar to me, to my experience and the experience of other people I know, and to any problem domain I could imagine. I would suggest that both of us stop this discussion, since we obviously couldn't reach any consensus, but would just continue an endless discussion about details related and unrelated to your proposal. Honestly, I tried to find your point of view and look at the ND-arrays feature from it, but failed.

I would recommend drawing a bottom line under all this discussion: summarize all the points of your proposal and invite everyone else to drop their own opinion, without falling into another 50-comment discussion (:

Also, this proposal needs to provide a very solid "Why?" list of reasons, with use cases where it is needed and applicable, weighed against all the worst cases. Obviously, this optimization feature is not from a common problem domain, so if it is accepted it needs an explanation of how and when to apply it, lest we turn UBJSON into something like BSON (in the sense of a format with a lot of features which I don't understand why I need and should pay for). For instance, @Steve132, you gave a good example with numpy and matlab, and the image one looks good too - these are good, but they need to be disclosed in detail under that "Why?" header. However, it would be awesome to find something more familiar to the average UBJSON user.

Personally, I'm -0.5 on this, but won't round it to -1 since we need to hear other people's opinions. Who knows, maybe ND-arrays really are a great feature that everyone is missing. But, again, personally I couldn't get it, since I'm living in the world of streams and records, not ND-arrays where N is greater than 3.

Steve132 commented 9 years ago

> I would suggest that both of us stop this discussion, since we obviously couldn't reach any consensus, but would just continue an endless discussion about details related and unrelated to your proposal.

I agree we can let it drop. My point throughout our discussion is merely that it's easy to implement in a myriad of different ways, none of which are at all surprising to the API client or difficult to implement.

> Also, this proposal needs to provide a very solid "Why?" list of reasons, with use cases where it is needed and applicable, weighed against all the worst cases.

The "why" is simple, and I'll break it down.

1) The main reason to use UBJSON over JSON is that it provides an incredible performance and ambiguity-resolution benefit for numeric data. People who use numeric data NEED something like UBJSON because nothing like it exists for them, whereas people who use strings, records, and objects are largely going to continue to use JSON, because UBJSON doesn't provide enough of a performance benefit to outweigh the lack of integration. This is an unfortunate but true fact.

2) Among people who use numeric data, the VAST majority of them find their numeric data takes the form of large multi-dimensional static-typed arrays of known length.

3) These people currently have two options as of draft 12:
a) use nested STCs, which they won't do because it imposes a strong performance penalty, adds a data-processing step, and injects extra bytes into their data that they don't need; or
b) use an application-level protocol to define a linear STC nested inside an object that describes the metadata, such as:

 [{] 
    [i][4]["dims"]  [[][i][3][L][1280][L][720][]]
    [i][4]["data"]  [[][$][d][#][L][2764800]
        [3.2][12.3][231.2][12.32]....
[}]

We already see option b) showing up in the wild for UBJSON users.

4) JSON is aimed at removing application-level protocols to structure and interpret the data. By using JSON, different applications can understand each other's data easily without knowing the protocols used to encode it. This is the great strength of JSON: it removes the need for custom parsers and structure encoders and just allows completely different apps to share strings and maps and arrays with no hassle. In my mind this is the great strength of UBJSON as well.

5) This proposal allows applications that use UBJSON to store numeric nd data to be default structurally compatible with no extra effort or no application-level "protocol" agreement. If you support UBJSON, you can understand an ND array.

In summary: if 1) a plurality of UBJSON users are using numeric data (because that's what it's good for), 2) a plurality of UBJSON users are using nd-array data, 3) those users will naturally develop application-specific protocols to communicate nd-dimension metadata for a linear array (which is basically all the same kind of information anyway), and 4) UBJSON should allow compatibility between applications that are sharing the same kind of information, then 5) UBJSON should have a standard way to specify the metadata from 3), so that users are discouraged from developing their own protocols on top of the spec and applications can share ND arrays transparently and simply.

I imagine a world where tools to do image processing and tools to do signal processing and tools to do compression and tools to save 3D map files with compressed images embedded within them as textures are all written by completely different people but can still read and write each other's data. I imagine a world where numpy, scikit, and matlab can store the configuration options for an algorithm as floats in an object along with the datasets for inputs and outputs in a single UBJSON file, and read and write them compatibly without any knowledge of each other. I imagine a world where people use UBJSON to store the constants and domain parameters that describe different encryption algorithms and keys, and email them to each other rather than every single program having a different binary format. I imagine the bitcoin blockchain being stored (and parsable) directly as UBJSON instead of as http://2.bp.blogspot.com/-DaJcdsyqQSs/UsiTXNHP-0I/AAAAAAAATC0/kiFRowh-J18/s1600/blockchain.png . I imagine a world where I can download the matrix of similarities between movies in the amazon database with a REST API call, and it returns a giant 2-D matrix in UBJSON that I can automatically import into matlab or C++ or python or ruby with a single pipe command, without implementing a single line of binary parsing code.

Miosss commented 9 years ago

> I imagine a world where ...

I doubt that will ever happen, just as I doubt there will ever be only one country, or one web browser.

The biggest issue we struggle with here, I think, is the precise purpose, the main target, for which UBJSON is (being) designed. In fact, the more we discuss various propositions, the less convinced I get about what benefits UBJSON has over JSON.

Binary data is by far the most interesting thing, and it does work quite well now. @kxepal's arrays of objects seem to be the second shot. And from my perspective that one is a much more general construct than ND: ND is in fact a strict subset of the functionality that "typespec" gives.

Therefore, in my opinion typespec is the thing to discuss, as it includes ND as well. And if typespec is to be dropped entirely, then Draft 12 can become the standard (maybe after some minor discussion around STCs, and maybe not even that).

Steve132 commented 9 years ago

> ND is in fact a strict subset of the functionality that "typespec" gives.

This is true. On the other hand, typespec is a radical departure from our current paradigm, a radical departure from JSON's paradigms, and a radical departure in terms of implementations (much of the current implementations would have to be redesigned to accommodate it)... and depending on what we do it might break backwards compatibility with existing UBJSON files.

My question is this: given that this proposal plus draft 12 is, yes, a subset of the typespec proposal, which would allow more functionality, are all of the sacrifices I listed worth that additional functionality?

ghost commented 9 years ago

@Steve132 I see your point about this thing - ND arrays - and if we provide some specialized handling for it, there is a real opportunity for UBJSON to make inroads into big data applications.

The application of UBJSON certainly excites me, but this addition (as defined) and the concept of ND-arrays in general feel very forced to me -- I think @kxepal hinted at this (the comments around a general format), and @Miosss said something that made me throw on the brakes and take a step back...

typespec (essentially @kxepal's idea of 'schemas' for UBJSON) potentially has all the wins we want here for ND arrays, as well as massive wins for objects (like MASSIVE size wins).

It seems clear to me that we only get to make a change/addition/modification like this once, and we shouldn't do it without bottoming out on that idea first.

If we go down that path and find schemas/typespec impossible to rectify and get into the spec in any intelligent way, then fine, we tried, and we can come back out of that discussion and see if we should adopt any variations of what we discussed.

This happened with binary support... the discussion took 2 years, but the final mods to the spec were spiritually in line with what we wanted, and functionally in line as well.

Lastly, to your very specific question: yes, I think it is worth the pain to discuss typespec and possibly semi-break the spec now if it means setting it up for success for the next 15-20 years untouched.

I'm going to close this proposal and ask that we move to discussing typespec - if we need to circle back and reopen/repropose the ND approach, I am fine with that, but for now I'd like to see if we can make the golden typespec work :)

fangq commented 5 years ago

@Steve132, I noticed this thread a while ago, but did not chime in because I had a hard time convincing @rkalla of a similar proposal back in 2013, in this google group post:

https://groups.google.com/d/msg/universal-binary-json/tgMCEbOmhes/s7JlCl58hvQJ

so I decided to go ahead and implement it in my own version, bearing the risk of making the output non-standard.

In fact, the approaches we took were very similar. The "annotated array" you found in mat2ubjson was not the proposal I made; instead, if you call

saveubjson('',rand(50,50))

you can see that the output looks like

[[]
  [$] [D] [#] [[] [$] [U] [#] [2] [50] [50] []] [50*50*sizeof(double)]
[]]

in comparison, your proposal gives

[[]
  [$] [D] [@] [2] [U] [50] [U] [50] [50*50*sizeof(double)]
[]]

which is pretty much the same thing - I used a 1D array (in plain or optimized form) to store the dimension vector, and you used [@] [len] [int][val]... to store it.

I totally agree with you on how important it is to have intuitive and efficient support for N-D arrays, especially for people working with image data and MATLAB. Anyway, I added my "non-standard" N-D array syntax to a format specification I just drafted - the optimized N-D array syntax is the only major deviation from the ubjson specs - in the end, it does not matter what name it is called, as long as it is clearly defined somewhere.

Steve132 commented 9 years ago

Check it out, there's even already people doing exactly what I'm saying they will do:

https://github.com/fangq/jsonlab/blob/master/saveubjson.m#L246

As far as I can tell, mat2ubjson(randn(50,50)) outputs something like

[{]
    [i][11]["_ArrayType_"][S][i][7]["float64"]
    [i][11]["_ArraySize_"]
        [[][$][i][#][i][2]
            [i][50]
            [i][50]
    [i][11]["_ArrayData_]
        [[][$][D][#][I][2500]
            [1.0][0.5]...//2498 more float64 random numbers
[}]

This is what the mat2ubjson implementer CURRENTLY does (disclaimer: I didn't write that code and had never seen it before 10 minutes ago... it was just obvious that's what they would do, so I knew it would be there).

Now, this schema for ND arrays is absolutely non-standard and could not be read as an ND array by anything else. In contrast, under this proposal the mat2ubjson implementer could just output this instead:

[[][$][D][@][2][i][50][i][50]
    [1.0][0.5]...//2498 more float64 random numbers

Which could be read by any UBJSON implementation.

fangq commented 5 years ago

Just want to make a correction - in both my original google-group proposal and the JData specification draft, I mistakenly inserted the closing bracket ([]]) in my format description. I want to clarify that this is not what I actually implemented in JSONLab. An array with an optimized header shall not have the closing bracket, as required by Draft 12.

In the above comment, the output for

saveubjson('',rand(50,50))

shall look like

[[]
  [$] [D] [#] [[] [$] [U] [#] [2] [50] [50] [50*50*sizeof(double)]

notice the removal of the two closing brackets. Alternatively, it can be written as follows (as permitted by the JData spec):

[[]
  [$] [D] [#] [[] [U] [50] [U] [50] []] [50*50*sizeof(double)]