EBNF specification for ubjson

ghost commented 8 years ago

While preparing for https://github.com/tbuitenhuis/zgio I’ve been writing an EBNF specification of ubjson. I’ve been asked to post it here.

Once it has been cleaned up and the remaining questions are answered, this should be useful to others writing ubjson parsers and generators. I intend to get it to that state around the time of the next draft, or sooner. Right now these are just my notes, so help spotting misunderstandings would be appreciated.

; UBJSON specification draft [???version???]
; copyright information / freedom to share / no warranty goes here

document = object ;

; [OPINION] If segments of UBJSON are stored inside another file format that has a way to detect damage, it’s acceptable to
; omit the initial "{", but it must be added in front when the UBJSON is extracted to a separate file or stream.
; -> [PARSER OPTION] use <document = objectcontent> instead.

item = object | array | string | number | singlebyte ;

; noop is not an item
noop = "N" ;

singlebyte = boolean | null ;

boolean = "T" | "F" ;

; null, also used to encode infinity
null = "Z" ;

number = "i", ivalue
       | "U", Uvalue
       | "I", Ivalue
       | "l", lvalue
       | "L", Lvalue
       | "d", dvalue
       | "D", Dvalue
       | "H", Hvalue
       ;

string = "S", Svalue
       | "C", Cvalue
       ;

; The comments on strong typing will explain the reason for the <*value> rules.
ivalue = ? int8-be ? ;
Uvalue = ? uint8-be ? ;
Ivalue = ? int16-be ? ;
lvalue = ? int32-be ? ;
Lvalue = ? int64-be ? ;
dvalue = ? IEEE 754 binary32 ? ; [INVESTIGATE] is be/le conversion needed?
Dvalue = ? IEEE 754 binary64 ? ; [INVESTIGATE] same
Cvalue = ? ASCII character ? ;

; <Svalue> and <Hvalue> are below <count>.

;
;
; Lengths
; =======

count = "i", ? int8-be: value >= 0 ?
      | "U", ? uint8-be ?
      | "I", ? int16-be: value >= 0 ?
      | "l", ? int32-be: value >= 0 ?
      | "L", ? int64-be: value >= 0 ?
      ;
; [OPINION] "i" should be accepted in <count> but "U" should be produced instead.
;
; 32 bit systems must not cast int64 down to int32 before calling malloc: the truncated size would cause buffer overflows.
; [OPINION] they may fail instead, with an out of memory error or similar.

Hvalue = count, ? decimal number represented by ASCII string of length int(count) matching 
                  [ "-" ],
                  ( "0" | ( digit - "0", { digit } ) ),
                  [ ".", digit, { digit } ],
                  [ ( "e" | "E" ), [ "+" | "-" ], digit, { digit } ]
                ? ;
; By definition int(count) must be > 0 here. It might be a good idea to check that before trying to parse the high precision number.
; Parsers that cannot interpret high precision numbers do not have to verify them, they should fail when they encounter one
; unless:
; [PARSER OPTION] pass high precision numbers as-is
; [PARSER OPTION] skip high precision numbers (WILL NOT IMPLEMENT)

Svalue = count, ? UTF-8 string represented by int(count) bytes ? ;

;
;
; Strong Typing
; =============
;
; Containers can have a type, which is represented by one of the following characters. For each, there is a matching
; <$(character)value> rule that can be used in the container content (and which is reused in some other places).

type = "i" | "U" | "I" | "l" | "L" | "d" | "D" | "H" | "S" | "C" | "{" | "[" ;

; Typed containers cannot contain typed containers themselves, so we need a way to disable the type system. Parsers may of
; course achieve this by some other means than the grammar-level duplication used here.
; <simple$(container)>s contain <simpleitem>s instead of <item>s, and a <simpleitem> can never be a full featured <object> or <array>.
; [QUESTION] Are additional "#" meant to be allowed in contained <object>s and <array>s, unlike additional "$" ?

; [value
arrayvalue = { simpleitem, { noop } }, "]"
           | "#", { noop }, count, { noop }, ?int(count)? * ( simpleitem, { noop } )
           ;

; {value
objectvalue = { Svalue, { noop }, simpleitem, { noop } }, "}"
            | "#", { noop }, count, { noop }, ?int(count)? * ( Svalue, { noop }, simpleitem, { noop } )
            ;

simpleitem = simpleobject | simplearray | item - ( object | array ) ;

simplearray = "[", { noop }, arrayvalue ;

simpleobject = "{", { noop }, objectvalue ;

; Finally, the full featured containers.
; [OPINION] zero length containers should be produced without a <count> and <type>.

array = "[", { noop }, arraycontent ;

arraycontent = { item, { noop } }, "]"
             | "#", { noop }, count, { noop }, ?int(count)? * ( item, { noop } )
             | "$", { noop }, type, { noop }, "#", { noop }, count, { noop }, ?int(count)? * ?$(type)value?, { noop } 
             | "$", { noop }, singlebyte, { noop }, "#", { noop }, count, { noop }
             ;

object = "{", { noop }, objectcontent ;

objectcontent = { Svalue, { noop }, item, { noop } }, "}"
              | "#", { noop }, count, { noop }, ?int(count)? * ( Svalue, { noop }, item, { noop } )
              | "$", { noop }, type, { noop }, "#", { noop }, count, { noop }, ?int(count)? * ( Svalue, ?$(type)value?, { noop } )
              | "$", { noop }, singlebyte, { noop }, "#", { noop }, count, { noop }, ?int(count)? * ( Svalue, { noop } )
              ;

; [OPINION] as in JSON, keys must not be repeated, but ignoring that issue might be acceptable for stream parsers.
; -> [PARSER OPTION] no key reuse checking
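To make the <count>, <Svalue> and <Hvalue> rules above more concrete, here is a minimal Python sketch of how a parser might read them. The function names (read_exact, read_count, read_svalue, read_hvalue) are hypothetical and not part of any ubjson library. The sketch assumes big-endian integers as in the <*value> rules, rejects negative counts, and passes high precision numbers through as text after checking them against a regular expression that mirrors the <Hvalue> body.

```python
import io
import re
import struct

# Marker -> (struct format, payload size) for the <count> alternatives (big-endian).
COUNT_FORMATS = {
    b"i": (">b", 1),
    b"U": (">B", 1),
    b"I": (">h", 2),
    b"l": (">i", 4),
    b"L": (">q", 8),
}

# Mirrors the <Hvalue> body: optional sign, integer part without leading
# zeros, optional fraction, optional exponent.
HVALUE_RE = re.compile(rb"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def read_exact(stream, n):
    """Read exactly n bytes or fail; never return a short read."""
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("unexpected end of input")
    return data


def read_count(stream):
    """Read a <count>: an integer marker followed by a non-negative value."""
    marker = read_exact(stream, 1)
    if marker not in COUNT_FORMATS:
        raise ValueError(f"invalid count marker: {marker!r}")
    fmt, size = COUNT_FORMATS[marker]
    (value,) = struct.unpack(fmt, read_exact(stream, size))
    if value < 0:
        raise ValueError("negative count")
    return value


def read_svalue(stream):
    """Read an <Svalue>: a count followed by that many UTF-8 bytes."""
    return read_exact(stream, read_count(stream)).decode("utf-8")


def read_hvalue(stream):
    """Read an <Hvalue> and return it as-is (text) after validating its syntax."""
    length = read_count(stream)
    if length == 0:
        raise ValueError("empty high precision number")
    raw = read_exact(stream, length)
    if HVALUE_RE.fullmatch(raw) is None:
        raise ValueError("malformed high precision number")
    return raw.decode("ascii")
```

For example, read_svalue(io.BytesIO(b"U\x05hello")) yields the string "hello", while a count such as b"i\xff" (int8 value -1) is rejected.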

ghost commented 8 years ago

I think I misunderstood how no-op is used. It has now been fixed(?) above.

Note I decided to allow no-op anywhere it would not cause ambiguity. This allows the sender of a message to send keep-alive no-ops while it’s collecting all items in a container to count them and to see if they’re all the same type.

Note2: the grammar in its current state contradicts the "no data payload" example because it does not allow no-op as a type. If no-op really is meant to be considered a type, expect an issue calling that a bug :) .
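As a rough illustration of the keep-alive idea, the following Python sketch shows a producer that emits one no-op per item while it is still collecting, and only then writes "#", the count and the items. Both function names are hypothetical, and the sketch assumes the items arrive already encoded and that there are at most 255 of them, so the count fits in a "U".

```python
NOOP = ord("N")


def counted_array_with_keepalive(item_source):
    """Yield the chunks of a counted array, sending no-ops while collecting.

    item_source is an iterable of already-encoded items (hypothetical input).
    """
    yield b"["
    collected = []
    for encoded in item_source:
        collected.append(encoded)
        yield b"N"  # keep-alive: the count is not known yet
    if len(collected) > 255:
        raise ValueError("sketch only supports counts that fit in 'U'")
    yield b"#U" + bytes([len(collected)])
    for encoded in collected:
        yield encoded


def skip_noops(buf, pos):
    """Return the index of the first byte at or after pos that is not a no-op."""
    while pos < len(buf) and buf[pos] == NOOP:
        pos += 1
    return pos
```

A receiver would call skip_noops only at positions where the grammar expects a marker. Inside the raw payload of a typed container the byte 0x4E is ordinary data, which is why no-ops cannot be allowed there.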

ErikMcClure commented 8 years ago

For the sake of helping me ensure my parser did everything it was supposed to, I wrote up this context-free grammar using ANTLR3 syntax. It focuses entirely on the rules governing interpreting a raw byte stream, and considers no-op to be a type, because this is apparently what the standard calls for. I'm putting it up here so people can compare different interpretations of the standard to help identify weaknesses (like the no-op issue) or clarifications that should be made (like in #73). This grammar explicitly allows nested typed containers, because I could not find anything in the specification saying otherwise. Note that ANTLR cannot generate a parser out of this because the optimized containers are ambiguous without additional context.

grammar ubjson;

fragment
BYTE : ('\u0000'..'\u00FF');

fragment
CHAR :  ('\u0000'..'\u007F');

ubjson : object ;

byte1 : BYTE ;
byte2 : byte1 byte1 ;
byte4 : byte2 byte2 ;
byte8 : byte4 byte4 ;
bytes : BYTE* ;
integertype : 'i' byte1 | 'U' byte1 | 'I' byte2 | 'l' byte4 | 'L' byte8 ;
id : integertype bytes ; 
numerictype : integertype | 'd' byte4 | 'D' byte8 | 'H' id ;
stringtype : 'S' id ;
valuetype : numerictype | 'Z' | 'N' | 'T' | 'F' | 'C' CHAR | stringtype ;

bareobject : (id type)* '}' | '#' integertype (id type)* | '$' type '#' integertype (id baretype?)* ;
barearray : type* ']' | '#' integertype type* | '$' type '#' integertype baretype* ;
baretype : id | bytes | CHAR | barearray | bareobject ;

object : '{' bareobject ;
array : '[' barearray ;
//object : '{' (id type)* '}';
//array : '[' type* ']'; 

containertype : object | array ;
type : valuetype | containertype ;
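The context dependence mentioned above can be sketched in a few lines of Python: in an optimized array the payload bytes carry no markers, so they can only be interpreted using the type and count read earlier. The function decode_typed_array and its tables are hypothetical names. The sketch handles fixed-width numeric types only, ignores no-ops, and assumes big-endian encoding for floats as well as integers (the first grammar in this thread still marks float byte order as something to investigate).

```python
import struct

# Big-endian struct formats for the fixed-width numeric type markers.
# Float byte order is an assumption of this sketch.
FIXED_FORMATS = {
    "i": ">b", "U": ">B", "I": ">h", "l": ">i", "L": ">q",
    "d": ">f", "D": ">d",
}
INTEGER_MARKERS = "iUIlL"


def decode_typed_array(buf):
    """Decode '[', '$', type, '#', count, payload; return (values, end index)."""
    if buf[:2] != b"[$":
        raise ValueError("not a typed array")
    type_marker = buf[2:3].decode("latin-1")
    if type_marker not in FIXED_FORMATS:
        raise ValueError("sketch only handles fixed-width numeric types")
    if buf[3:4] != b"#":
        raise ValueError("typed array requires a count")

    # The count is itself an integer marker followed by its value.
    count_marker = buf[4:5].decode("latin-1")
    if count_marker == "" or count_marker not in INTEGER_MARKERS:
        raise ValueError("invalid count marker")
    count_fmt = FIXED_FORMATS[count_marker]
    pos = 5 + struct.calcsize(count_fmt)
    if pos > len(buf):
        raise ValueError("truncated count")
    (count,) = struct.unpack_from(count_fmt, buf, 5)
    if count < 0:
        raise ValueError("negative count")

    # The payload has no markers: type and count alone decide how to read it.
    fmt = FIXED_FORMATS[type_marker]
    width = struct.calcsize(fmt)
    end = pos + count * width
    if end > len(buf):
        raise ValueError("truncated payload")
    values = [struct.unpack_from(fmt, buf, pos + k * width)[0] for k in range(count)]
    return values, end
```

In the input b"[$i#U\x03\x01N\xff" the middle payload byte is the character "N", yet it decodes as the int8 value 78 rather than a no-op. That is the kind of decision a context-free grammar over the raw bytes cannot make.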

Steve132 commented 8 years ago

This part of the spec

; Typed containers cannot contain typed containers themselves, so we need a way to disable the type system.

That's not true.
