ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Backporting BJData features to UBJSON - ND array, u/m/M/h markers and handling of NaN #109

Closed fangq closed 2 years ago

fangq commented 4 years ago

@Steve132, Re your comments at https://github.com/Steve132/ubj/pull/8 and #108

I think the right step forward for the project would be to agree on an array spec, agree on other changes, and move forward to a 1.0 spec, along with reference implementations in as many languages as possible and a test suite.

Again, I am happy to assist on backporting the new syntax to UBJSON so we do not have to maintain two separate specs.

The main additions made in BJData Draft 1 include:

  1. an optimized ND array container (via # followed by a 1D array object): https://github.com/OpenJData/bjdata/blob/Draft_1/Binary_JData_Specification.md#optimized-n-dimensional-array-of-uniform-type

  2. new dedicated markers for unsigned integers and half-precision floating-point numbers: https://github.com/OpenJData/bjdata/commit/95544dc8a999cec1d019e22f24fccc70bbdf5073

  3. instead of converting NaN/+-infinity to null, BJData keeps their respective IEEE 754 binary forms: https://github.com/OpenJData/bjdata/commit/f3d338f75b8ebfacda6f5a2315eb393c40a3118b#diff-d94a316d99f805938f1320db036a84a5L174

I will explain the rationales for each of my additions, and we can discuss/approve/disapprove in this thread.
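
To make addition 1 concrete, here is a hypothetical byte-level encoder for a 2x3 int8 array under the Draft 1 syntax (the # count is followed by an optimized 1-D uint8 array holding the dimensions, then the row-major payload). This is an illustration of the construct described above, not a reference implementation.

```python
# Hypothetical encoder for the BJData Draft 1 ND-array construct:
# a 2x3 int8 array becomes  [ $ i #  [ $ U # U <2>  <2> <3>  <payload>
import struct

def encode_nd_int8(values, dims):
    out = bytearray(b"[$i#")               # typed array of int8; count follows
    out += b"[$U#U" + bytes([len(dims)])   # dims as an optimized 1-D uint8 array
    out += bytes(dims)                     # Nx, Ny, ... (no closing "]" needed)
    out += struct.pack(f"{len(values)}b", *values)  # row-major elements
    return bytes(out)

print(encode_nd_int8([1, 2, 3, 4, 5, 6], [2, 3]).hex(" "))
# 5b 24 69 23 5b 24 55 23 55 02 02 03 01 02 03 04 05 06
```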

Steve132 commented 4 years ago

Some reasons why I don't like your ND array spec: 1) the way you use # is ambiguous with the count specifier; 2) the way you use [ seems to add an extra character where it's not needed; 3) it explicitly disallows multidimensional arrays of non-fixed or structured type (an unusual but reasonable use case that I think UBJ should support); 4) it seems to somewhat violate the TLV 'convention' we use.

But yes, I think we should work together on merging the specs if possible.


fangq commented 4 years ago

For the ND array construct, I am aware that @Steve132 has a similar proposal using the construct below:

[[] [$] [type] [@] [1-byte Ndim] [Nx type] [Nx] [Ny type] [Ny] ... [serialized ND array in column-major order]

The optimized ND array container used in BJData was originally proposed in this mailing list post back in 2013. It is done via # followed by a UBJSON 1D integer array object, then the row-major serialized array elements, for example:

[[] [$] [type] [#] [[] [$] [Nx type] [#] [Ndim type] [Ndim] [Nx Ny Nz ...] [Nx*Ny*Nz*...*sizeof(type)]
  or
[[] [$] [type] [#] [[] [Nx type] [Nx] [Ny type] [Ny] [Nz type] [Nz] ... []] [Nx*Ny*Nz*...*sizeof(type)]

The main reasons that led to my preference for the #+[ construct are:

  1. it does not require handling a new marker in an existing parser; instead, one can just call the array parsing function to read out the dimensions when #+[ is encountered. In my opinion, this requires the least amount of work for parser updates (a sketch follows this list).

  2. it does not limit the maximum dimension count to 255 the way the @ construct does.

  3. the row-major byte order aligns with the conventions of the majority of programming languages, such as C, C++, Python, JavaScript typed arrays, etc.; therefore, the data do not require a transpose when saving/reading.

  4. it feels intuitive, because for an ND array the count is not a single number but a 1D vector.

  5. this construct also aligns with the general template construct I am about to propose - allowing $+{ or #+{ to represent payload-less object templates, similar to what I proposed for MessagePack: https://github.com/msgpack/msgpack/pull/267#issuecomment-598007006

  6. last but not least, the #+[ construct has been used in JSONLab since 2013, which currently has accumulated about 45,000 downloads from the MATLAB File Exchange and about 1,000 clones from GitHub per week. Changing this syntax would render all previously generated files incompatible.
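
A minimal sketch of point 1, assuming the #+[ syntax above and handling only the markers the demo needs (U for uint8): the key line is the recursive call that reads the dimension vector with the very same routine that reads an ordinary typed 1-D array. Helper names are hypothetical.

```python
import io

def read_u8(s):
    assert s.read(1) == b"U"          # uint8 marker, then one payload byte
    return s.read(1)[0]

def read_typed_u8_array(s):
    # stream is positioned just past the opening "[" of a typed array
    assert s.read(1) == b"$" and s.read(1) == b"U" and s.read(1) == b"#"
    cm = s.read(1)
    if cm == b"[":                    # ND form: "#"+"[" means the dims are
        dims = read_typed_u8_array(s) #   a 1-D array; reuse this same routine
        n = 1
        for d in dims:
            n *= d
        return dims, list(s.read(n))  # row-major payload
    s.seek(-1, io.SEEK_CUR)           # classic form: single integer count
    return list(s.read(read_u8(s)))

# 2x3 uint8 array: dims [2,3] encoded as [$U#U<2>, then six payload bytes
body = b"$U#[$U#U\x02\x02\x03" + bytes([1, 2, 3, 4, 5, 6])
print(read_typed_u8_array(io.BytesIO(body)))  # ([2, 3], [1, 2, 3, 4, 5, 6])
```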

fangq commented 4 years ago

1) The way you use # is ambiguous with the count specifier.

I don't think so - in Draft 12, # is supposed to be followed by one of the integer markers U, i, I, l, L, so it can in no way be confused with [.

3) It explicitly disallows multidimensional arrays of non-fixed or structured type (which is an unusual but reasonable use case that I think UBJ should support)

For extending to template objects, I reserved $+{ or #+{ for this purpose, which will also capitalize on the existing object parsing functions to handle the following structure without needing a specialized parsing routine. Similar to the ND array extension, I expect this to be much easier to implement from a parser's standpoint.

Steve132 commented 4 years ago

it does not require handling a new marker in an existing parser; instead, one can just call the array parsing function to read out the dimensions when #+[ is encountered.

That's actually a good argument against doing this. Parsing the array as an array like you've done here in code is a bug, because it actually allows people to use nd-arrays (and arrays of strings or objects or floats) as the dimension specifier. We don't want parsers to be encouraged to re-use the existing array parsing code, because if they do that's incorrect and leads to bugs. Really what your spec says is that you need to either implement most of the array parser twice (once for the general version, and again for the constrained version) OR you need to add a 'constrain to 1d array' flag to the code.
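
Concretely, whichever route an implementer picks, the dimension header ultimately needs a constraint check along these lines. This is a minimal sketch with hypothetical names, run after a reused generic array parser; the alternative is to fold the same checks into a dedicated constrained parser.

```python
# Hypothetical post-hoc validation of an ND dimension header that was read
# with a reused, generic array parser. A conforming reader must reject any
# header that is not a flat list of non-negative integers.
def check_nd_dims(dims):
    if not isinstance(dims, list) or not dims:
        raise ValueError("ND header must be a non-empty 1-D array")
    for d in dims:
        if not isinstance(d, int) or isinstance(d, bool) or d < 0:
            raise ValueError("ND header entries must be non-negative integers")
    return dims

check_nd_dims([4, 10, 3])       # ok
# check_nd_dims([{}, {}, {}])   # raises: the buggy case described above
```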

it does not limit the maximum dimension to 255 as the @ construct

I'd say that, in practice, this limitation isn't an actual problem. I've never seen a 256-dimensional array anywhere in actual usage for any problem domain. But sure.

the row-major byte order aligns with the conventions for the majority of programming languages, such as C, C++, python, Javascript Typed array etc, therefore, the data do not require transpose when saving/reading.

I actually completely agree with this.

it feels intuitive because for an ND array, its count is not a single number but a 1D vector.

I understand your argument here.

One concern that I have: in UBJ I'd want to relax the restriction that an ND-array must have a type. An array can have a count but not a type in UBJSON. To me it seems like it's reasonable to do an ND array of unknown things.

This construct also aligns with the general template construct I am about to propose - by allowing $+{ or #+{ to represent payload-less object templates [...]

$+{ already has a meaning in UBJSON. It's a fixed array where each item is an object. The only available place for extension is #{

I think we can come together to find a middle-ground spec here that doesn't break backwards compatibility for you in 99% of cases but fixes some of these problems.


Steve132 commented 4 years ago

"For extending to template objects, I reserved $+{ or #+{ for this purpose, "

${ already has a meaning in UBJSON. (It's an array of objects). Keep in mind that an array of objects is NOT always going to be a template.

An ND-array of dynamic type does not always have to be typed, and it doesn't always have to be a template. The template use case is a separate case.


Steve132 commented 4 years ago

I think I can draw up a proposed ND-array spec for this that doesn't break backward compatibility for you 99% of the time but fixes these issues. Give me a day or so.


edgar-bonet commented 4 years ago

For the record, I implemented my own proposal for N-dimensional arrays (#43) that does not require extra syntax. It just adds the notion that, in a strongly-typed array, all the bytes that come before the first data field form the array's type specifier. Type specifiers can then be parsed recursively in a very natural way:

  • [[][$][d][#][i][2] = array of type float32 and length 2
  • [[][$] [[][$][d][#][i][2] [#][I][3200] = array of type (array of type float32 and length 2) and length 3200

This is efficient for the kind of data I deal with, which contains lots of very long and narrow matrices of float32, just like in this example. Binary JData would be about as efficient for me, at the cost of the extra complexity that comes with adding new syntax. The current UBJSON specification does not permit storing this efficiently.
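
A rough sketch of the recursive type-specifier parse, assuming only the markers used in the bullets above (i = int8 count, I = int16 count, d = float32; big-endian per UBJSON) and hypothetical helper names:

```python
import io
import struct

def read_count(s):                    # "i" = int8, "I" = int16 (big-endian)
    m = s.read(1)
    if m == b"i":
        return s.read(1)[0]
    if m == b"I":
        return struct.unpack(">h", s.read(2))[0]
    raise ValueError(f"count marker {m!r} not handled in this sketch")

def read_type(s):
    m = s.read(1)
    if m == b"[":                     # the element type is itself a typed array
        assert s.read(1) == b"$"
        inner = read_type(s)          # recurse for the nested element type
        assert s.read(1) == b"#"
        return ("array", inner, read_count(s))
    return m.decode()                 # scalar marker, e.g. "d" = float32

# [[][$] [[][$][d][#][i][2] [#][I][3200]: 3200 rows of 2 float32 values
hdr = io.BytesIO(b"[$[$d#i\x02#I\x0c\x80")
assert hdr.read(2) == b"[$"
elem = read_type(hdr)
assert hdr.read(1) == b"#"
print(elem, read_count(hdr))          # ('array', 'd', 2) 3200
```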

Steve132 commented 4 years ago

@edgar-bonet

We sort of went over this at the time, but can you clarify semantically what makes your proposal different from the two we've already sketched here? It seems like your proposal, the way you are using it, provides the same semantics as the other two proposals, but it's much more complicated to parse (you have to look ahead to parse the entire array template type recursively, and then pop back up the stack to get the resulting N-D dimensions).

It also (worryingly, to me at least) seems to imply that it's a more generic 'schema' that would allow people to insert templated arrays of arrays into the stack structure, but that's not the case. For example,

[ [ ] [$] [ [ ][#][[][i][2] [#][3200] = array of type (array of type (dynamic array) with length 2) with length 3200


fangq commented 4 years ago

We don't want parsers to be encouraged to re-use the existing array parsing code, because if they do that's incorrect and leads to bugs.

I'm not entirely sure what the main concern against using recursive calls is. In a way, we are already doing this in almost all existing parsers, because JSON/UBJSON array and object constructs can be embedded in one another. If you are worried about recursion depth and stack-overflow risks, then the same concerns should apply to regular array/object parsing, not specifically to this size construct.

The only way to avoid stack overflow altogether is to change the entire parser structure, using an iterator of some sort to convert the recursion into loops.

Really what your spec says is that you need to either implement most of the array parser twice (one for the general version, and again for the constrained version) OR you need to add a 'constrain to 1d array' flag to the code.

If I just want to prevent it from being abused via deep recursion, adding a simple flag, like the one I just added to JSONLab, will likely eliminate the risk, and the required change is quite minimal and local (a sketch follows the link):

https://github.com/fangq/jsonlab/commit/86efe89e734331d2133092f9a40e40560554d538
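
A minimal sketch of what such a "constrain to a flat integer list" flag could look like; iter_values and the flag name are hypothetical stand-ins for a parser's existing element iterator, not jsonlab's actual code:

```python
# Hypothetical flag on a generic array parser, in the spirit of the
# jsonlab commit linked above: the same code path serves both ordinary
# arrays and the ND dimension header.
def parse_array(stream, iter_values, as_dims_header=False):
    items = []
    for value in iter_values(stream):          # yields decoded elements
        if as_dims_header and not isinstance(value, int):
            raise ValueError("dimension header may only contain integers")
        items.append(value)
    return items
```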

edgar-bonet commented 4 years ago

@Steve132 wrote:

We sort of went over this at the time

Indeed. I am commenting mostly for @fangq, who was not there at the time.

It seems like your proposal the way you are using it provides the same semantics as the other two proposals

Of course. This is all about providing a representation of the most common forms of multidimensional arrays that is more efficient without altering their semantics.

it's much more complicated to parse

I disagree. Recursion makes it easy. Recursion is what makes the JSON data model both simple and powerful: containers hold values which can themselves be containers which hold values which...

you have to lookahead [...]

Yes, but never more than one byte:

  1. When the type parser sees [ or {, it reads the following byte in order to see whether it is either $ or #, which would indicate an optimized representation of the container. If it is neither (non optimized container), it “unreads” that byte (with the equivalent of ungetc()) before calling the value parser for reading the first contained element.

  2. When reading a non-optimized container, the value parser has to read one byte past the current element in order to check whether it is the container's closing delimiter. If it is not, it unreads the byte before calling itself recursively to read the next element.

But note that this has nothing to do with my proposal, which does not introduce any extra lookahead requirements.
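
For illustration, here is that one-byte unread sketched in Python, using BytesIO.seek() as the equivalent of ungetc(); this is a sketch, not edgar-bonet's actual implementation:

```python
import io

def peek_marker(s):
    b = s.read(1)
    s.seek(-len(b), io.SEEK_CUR)     # "unread": the next parser sees it again
    return b

s = io.BytesIO(b"[$d#i\x02")
assert s.read(1) == b"["             # container opens
if peek_marker(s) in (b"$", b"#"):   # a single byte of lookahead
    print("optimized container, next marker:", s.read(1))
```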

For example [ [ ] [$] [ [ ][#][[][i][2] [#][3200]

I can't visually parse your example. [#][[][i][2] seems to imply that the count field is an array, which is legit in fangq's proposal but not in mine.

Steve132 commented 4 years ago

I don't have a problem with recursion in general. Recursion in general is great, especially for a parser.

I have two separate arguments about two separate proposals: The first argument is about this claim of your proposal "The implementer can re-use the same parser". My claim is that they cannot.

[[] [$] [{] [#] [U] [3] [{][}][{][}][{][}]

This is a valid 1-D array specifier (in particular, it's a 1-D array of 3 empty objects). It is NOT a valid N-D array dimension header according to your spec (the type is object, not int). Therefore you can't simply use the standard array parser unmodified to parse the N-D array header. You have to have two array parsers: one that works generically, and another that is constrained to the N-D array header spec (e.g., only integers, only one-dimensional, only TCV).

But an implementer who re-uses the 1-D array parser code unmodified to parse your N-D array header would produce a broken implementation, because it would accept the above as a valid ND array header. (in fact, your patch to ubjr actually has this bug). Fixing the bug requires implementing a special case version of the array parser that only reads a valid N-D array header and not any array. So it's avoidable.

But there's no code reuse here. Reusing the code actually creates a bug. So since you can't reuse the code, code reuse isn't a feature of your design. What's more, it encourages implementers to introduce the same bug that you made (because it superficially looks like it works).

The second argument is about @edgar-bonet's implementation, which basically boils down to:

"What's easier to parse and understand: 'floats, 3 dimensions: 4 rows, 10 columns, 3 channels' or '( ( ( ( float ) 3 ) 10 ) 4 )'?" They both work, but to me the first one is a lot easier to parse and validate.


Steve132 commented 4 years ago

@edgar-bonet If I want to read my ND array as an ND array, under your proposal I have to parse the entire header recursively all the way down to get the type, then go all the way back up and populate my dimensions array in reverse order.

( ( ( ( float ) 3 ) 10 ) 4 ) vs "float, 3 dimensions, 4,10,3"


Steve132 commented 4 years ago

To be clear, I don't necessarily think the bug caused by people wanting to re-use the array code is so severe as to veto using the JData standard for the syntax. As long as we warn implementers in the spec not to do it, it's easy enough to avoid. But the JData standard does use more characters as a result as well... so it's worth considering.

edgar-bonet commented 4 years ago

@Steve132 wrote:

[[] [$] [{] [#] [U] [3] [{][}][{][}][{][}] is a valid 1-D array of 3 empty objects

My understanding is that, under the current spec, it is not. Either the array is untyped and the values have to be spelled in full:

[[] [#][U][3] [{][}][{][}][{][}]

or the array is typed and the type (a single byte) has to be removed from the values:

[[] [$][{] [#][U][3] [}][}][}]

What's easier to parse and understand [...]

I personally find “((float 3) 10) 4” easier, but that may be just me. I can understand that humans may find “floats, 3 dimensions: 4 rows 10 columns 3 channels” easier. :-)

Steve132 commented 4 years ago

or the array is typed and the type (a single byte) has to be removed from the values:

You're right. My bad. The point remains that

[[] [#][U][3] [{][}][{][}][{][}]

is a valid array but not a valid N-D array integer dimension header, so the code cannot be reused unmodified to parse both.

I personally find “((float 3) 10) 4” easier, but that may be just me. I can understand that humans may find “floats, 3 dimensions: 4 rows 10 columns 3 channels” easier. :-)

That's basically my argument. I'm a mere human, and I don't know anyone who prefers your way (no offense); it certainly works, but its meaning is obscure to me. Parsing it properly requires stack-style recursion, while mine and @fangq's do not.


fangq commented 4 years ago

@edgar-bonet, thanks for your proposal - what this tells us is that we DO NEED to include a native ND array header in the spec because it is such an important data type to support.

@Steve132 - Re your argument regarding whether it is appropriate to reuse an array parser

You can do it in more than one way, for sure - as you suggested, one can certainly write a dedicated 1D array parser, which just needs a few extra lines (involving a loop), and one can do this for either the @-based header or the #[-based header. However, I don't see why it is a bad idea to reuse the generic array parser (as in three of my existing implementations) and then test the shape/values of the returned dimensions. The second approach is easier to program and easier to extend in the future - for example, if we later extend this dimension vector to 2D, perhaps we can handle tensors or other types of data-partitioning schemes. There is no right or wrong here, just a trade-off: writing more code vs. less code with possible overheads, or progressively detecting a non-conformant data structure as you read vs. loading the data first and then validating it as a whole object.

I agree that my ubj patch does not "validate" the array size vector, but that can certainly be added.

@edgar-bonet, I think your proposal is one of many ways to add this, but the downside is the need to use a stack frame for each additional dimension, which comes with a bigger risk of stack overflow and slower performance.

fangq commented 4 years ago

@Steve132, for some reason, I feel this discussion has steered towards what the best practice for writing a parser is, which is not really the focus of this thread - we just define a format and let parser writers come up with parsers that work with it. Whether a parser is 100% compliant, has a robust implementation, or follows best practices is really up to the parser writers. A good thing about JSON and UBJSON is that the format is so simple that writing a parser is not that difficult, which gives people space to write their own flavors with different trade-offs (complexity vs. utility).

I am looking forward to hearing about your proposal for a BJData-compatible array construct. I don't mind if you list both constructs as valid UBJSON formats, as long as these new features (which really come from the research needs of my community) are supported, and subsequently supported by the related parsers.

On a related note - I really like the Markdown version of the UBJSON spec that the py-ubjson team wrote, which I used as the starting point for writing the BJData spec. It takes advantage of the convenient formatting supported by GitHub and allows people to collaboratively develop a document. Compared with the HTML version of the UBJSON spec, I feel the Markdown version is much more efficient, easier to read, and easier to revise. I really wish UBJSON would take this new form for its continued development and future discussions (especially given that the ubjson.org website is not functional).

Steve132 commented 4 years ago

we just define a format and let parser writers come up with parsers that work with it.

Part of a standard is making sure that it's easy to implement without bugs. It does matter.

I don't mind if you list both constructs as valid UBJSON formats, as long as these new features (which really come from the research needs of my community) are supported, and subsequently supported by the related parsers.

I probably won't include multiple constructs that do the same thing in UBJSON. But I do think that a lot of the other issues you've brought up are important and worth looking into. I think a big open question is how much we can avoid fragmentation in this space: are you willing to standardize on UBJSON, to avoid fragmentation, if it ends up being modified to sufficiently meet your needs (e.g., eliminating the BJData spec)?

It takes advantage of the convenient formatting supported by GitHub and allows people to collaboratively develop a document. Compared with the HTML version of the UBJSON spec, I feel the Markdown version is much more efficient, easier to read, and easier to revise.

I agree 100%. Another big issue I'm currently working on is versioning. I think we need to make an "alpha" version (spec 12), a "beta" version (spec 13), and a "1.0" version (including whatever we end up deciding).

fangq commented 4 years ago

@Steve132, any update on this?

Steve132 commented 4 years ago

Yeah I'm organizing my notes into a document.


fangq commented 3 years ago

@Steve132, @rkalla, and other UBJSON contributors and maintainers - I am glad to share an update:

I just received 5-year funding support from the NIH (National Institutes of Health) to develop a new-generation neuroimaging data exchange platform named NeuroJSON (http://neurojson.org, https://github.com/NeuroJSON). The goal of the project is to develop a series of specifications that map major neuroimaging data files, standardized by the BIDS project (https://bids.neuroimaging.io/), to a unified JSON/BJData based exchange format, to gain human readability, easy parsing, reuse, and integration with NoSQL databases (such as MongoDB, CouchDB).

As I mentioned above, the BJData spec (Binary JData) was largely based upon UBJSON, with the extensions described in this ticket.

I see the updated UBJSON website now has a placeholder for UBJSON 2.0 - it would be wonderful if my proposed extensions above could be reviewed and incorporated into the UBJSON spec. It would be mutually beneficial to join forces so that our user communities can enjoy the work from both initiatives.

By the way, we are recruiting a full-time programmer to drive the NeuroJSON project, develop new parsers/libraries/specifications, and incorporate those into prominent neuroimaging software such as FieldTrip, SPM, FreeSurfer, BIDS, etc. If any UBJSON developers/contributors are interested in joining this effort, please check out our job posting:

https://careersmanager.pageuppeople.com/879/ci/en-us/job/507821/scientific-programmer

rkalla commented 3 years ago

Qianqian, congratulations!!

I started to take a stab at rethinking 2.0 while on a sabbatical between careers. It's a huge breaking refinement, however, and will go slowly, so I'm not sure it would work with your timelines.

We will support however we can, keep us posted!

Riyad


to-miz commented 2 years ago

I think number 2 is a sensible extension to UBJSON, since missing unsigned types can introduce incompatibilities in some programming languages or platforms. Relatedly, I don't see the spec state which binary representation signed integers use (I am assuming two's complement; this should definitely be added to the spec), which is technically not a given for all platforms, although true in practice.
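
A small illustration of both points using Python's struct module: the same bytes reinterpret between signed (two's complement) and unsigned, and a value such as 2**64 - 1 has no lossless integer encoding without an unsigned 64-bit marker (UBJSON's widest integer, L, is a signed int64):

```python
import struct

v = 2**64 - 1                          # needs BJData's "M" (uint64) marker
packed = struct.pack(">Q", v)          # ">q" (signed int64) would overflow here
print(packed.hex())                    # ffffffffffffffff
print(struct.unpack(">q", packed)[0])  # -1: the same bytes, read as signed
```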

I don't think the extensions in 1 and 3 should be added to UBJSON; I think UBJSON should remain universal and as compatible with JSON itself as possible. One of the main advantages of UBJSON over other binary formats is that it can be a drop-in replacement for JSON, because they are so similar, and this should be kept, with the same limitations that JSON has. You gain smaller transfers and faster parsing, but keep compatible representations of the data.

This is why I don't think an extension for N-D arrays makes sense: it is too specific and thus not universal, as well as not really compatible with JSON. An N-D array should know how to serialize itself to UBJSON/JSON and how to deserialize and validate itself; this should not be part of the serialization format. It imposes a kind of schema on a schemaless format. It is also limiting and not useful in a general sense: small N-D arrays can be serialized in a linear fashion, but for big matrices, for instance, a better format would probably store and serialize them sparsely, and N-D arrays don't help there.

Although UBJSON is a binary format, I think it is more important to keep compatibility with JSON, since JSON is more widespread. As such, in my opinion, UBJSON should not be extended to enable the serialization of specific binary data at all, since JSON itself is not suitable for that. UBJSON makes it more convenient to store binary data than JSON does, but that should not be the reason to use it; there are better binary formats for that.