ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Focus UBJSON on being a "streaming JSON" transport format #64

Closed. MikeFair closed this issue 9 years ago.

MikeFair commented 9 years ago

I think you guys are doing a fantastic job at driving forward the different considerations and I commend you guys for such amazing work. I started reading a couple days ago on the STC/ND Array topic because that's one area I could actively benefit from.

I think I can lend some weight to some of the discussions to help bring some of these items home to a resolution but I realized one of the unspoken/unresolved questions seems to be "What actually is UBJSON really going to be used for?"

One claim is: UBJSON is a simple and direct translation of JSON. While that's clear, concise, and clearly doable when it comes to defining boundaries, I don't see it being as useful an addition to the landscape as "UBJSON is a binary-capable format for streaming JSON between endpoints".

It maintains the consistency of being a JSON translation, but it's also more than a strict translation: it's a transport format, so two endpoints can easily speak JSON to each other in a compressed and optimal way. Which, I think, is actually the real point of creating the spec in the first place.

Some proposals to consider as a streaming protocol:

1) Array slices: "Here are elements 6,000 through 8,000 of that 500,000 element array we nicknamed x8C4ED"
2) Typespecs: "Here is another instance of object type #DC2167"
3) Requests: "Please transmit elements 15 through 30 of x8C4ED"

Is a streaming protocol really a good fit for UBJSON's future?

a) There's been some consideration of UBJ as a DB backend file format. Cutting to the meat of it, the features required to optimize the file format for a database backend would end up conflicting with core design principles of the spec; and without optimizing for data structure searching, seeking, and iterating it would make for a subpar database backend, and ultimately get replaced with something that is good at that (hence it would become a separate spec). So UBJSON is not intending to be an on-disk, access-optimized format for applications.

b) For storing JSON data in files, simply compressing full-blown static JSON is not that bad. Plus I could also use any of the other binary JSON formats that already exist if that's what I want/need to do. So UBJSON doesn't need/want to be just a JSON file format for long-term storage.

What's left?
c) Transmitting JSON between systems either on the same computer or across the network. Specifically, a data encapsulation format for RPC mechanisms. This quickly evolves into a streaming type protocol where the format is enabling the transmission of smaller binary optimized segments of a larger JSON data space.

d) Stealing a chapter from EXI (Efficient XML Interchange), which I consider to contain lots of genius, UBJSON ought to focus on being a good format for other binary representations to easily convert their data directly from/to for transmission, skipping the actual expanded JSON representation part. If somewhere in the processing pipeline a UBJSON interpreter does not exist, then it just expands to full-blown JSON and continues. It's not just the data bloat of on-the-wire JSON that is really holding back adoption, it's the conversion from binary to JSON and then back to binary. UBJSON is an opportunity to "fix that". Which also gives UBJSON a place of its own beyond just a JSON translation.

e) That said, once applications can easily convert their binary format into UBJSON, the very next thing to do is send it across a pipe or wire somewhere. And that means streaming it. If UBJ focuses on being a really great "real time JSON streaming format"; then it can really add lots of value to the landscape. As a developer I start with the easy standard JSON support, and then I can optimize the UBJ support. That's a clear development roadmap for me that helps me tweak within the protocol to optimize for my use cases over time.

Thank you for the consideration.

ghost commented 9 years ago

@MikeFair Interesting perspective... can we take it a step further, not asking for micro-details, but at a high level let's say we all agree that it should be a baseline format like you are proposing - what kinds of changes would you recommend for the spec to meet this need? (just verbal description is fine)

Just trying to get the creative juices flowing around this idea...

MikeFair commented 9 years ago

I can think of several, but at a high level they would be around two areas:

1) Managing/Compositing a set of larger JSON documents/objects from many smaller messages
2) Creating format layouts for common in-memory storage patterns

For example, the very first thoughts I had along these lines stemmed from the ND Array stuff @Steve132 mentioned. Having worked with transporting extremely large matrices myself, @Steve132's observation that the way it actually works is usually a 1D in-memory array with a header/metadata on how to interpret it is exactly right. Whether it's a dense matrix or a sparse matrix (and sparse matrices haven't even been considered yet), we transfer a "layout" along with a long 1D stream. To extend this to a streaming format, we would need to include array slice indices for each dimension.

So in addition to providing the array lengths, the indices of what "section" of the array is being represented would be included. I could then "stream" many of these array slices over time, possibly out of order, and possibly with redundant overlapping slice indices to replace what was there before.
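To make that concrete, here is a minimal sketch of what a slice message might carry; the field names (`id`, `dims`, `slice`, `data`) are invented for illustration and are not spec syntax. The sender says which region of the big array the payload covers, and the receiver copies it into place, last write winning:

```python
import numpy as np

def make_slice_message(array_id, full_shape, start, stop, big_array):
    """Hypothetical slice message: which array, which region, and the raw values."""
    return {
        "id": array_id,                # nickname for the large array, e.g. "x8C4ED"
        "dims": list(full_shape),      # full array lengths per dimension
        "slice": [start, stop],        # element range carried by this message
        "data": big_array[start:stop].tolist(),
    }

def apply_slice_message(local_array, msg):
    """Receiver side: drop the payload into the addressed region (last write wins)."""
    start, stop = msg["slice"]
    local_array[start:stop] = msg["data"]

# Stream a 500,000-element array in out-of-order, possibly overlapping chunks.
src = np.arange(500_000, dtype=np.float64)
dst = np.zeros_like(src)
for start in (6_000, 0, 8_000):
    stop = start + 2_000
    apply_slice_message(dst, make_slice_message("x8C4ED", src.shape, start, stop, src))
```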

Also, building up objects by compositing them. Along the same lines, say I have a structure: User -> Accounts -> Years -> Months -> Transactions. That's easily a huge JSON doc. If the protocol/format were "streamable", then when the user first logged in I could send something like: {"UserNameDoc": { "user": { "field1": "sometext", "fieldN": "somedata", "Accounts": ... } }}

then later I can send: {"UserNameDoc" +: { "user" +: { "Accounts": [ {"Account1": {"AccountName": "SomeName", "Transactions": [{"2014": [{"Jan": ...}, {"Feb": ...}, {"Mar": ...}, ...]}, ...] , ...} , {"Account2":, ...} , ...] } }}

Then even later to delete the second element of the Accounts array ("Account2") I could send: {"UserNameDoc" -: {"user" -: {"Accounts" -: [2]}}}

So in this way, I can control which JSON snippets get applied to which JSON documents as a series of diffs, using UBJSON as a series of binary diffs.

So the extensions to the format would be adding internal document addresses to the top level header objects, and in my example using "..." to specify the "I haven't sent this to you, but there is more stuff there" parts. The receiver side could then use the document addressing to request those parts be filled in as the application demands.
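A rough, plain-Python sketch of what applying such composited diffs could look like on the receiving side; the path addressing and the "assign"/"add"/"remove" operation names are hypothetical stand-ins for the +:/-: markers above:

```python
def apply_patch(doc, path, op, value=None):
    """Walk to the parent of the addressed node, then assign, merge into, or delete it."""
    node = doc
    for key in path[:-1]:
        node = node.setdefault(key, {})
    last = path[-1]
    if op == "assign":                         # replace (or create) the addressed node
        node[last] = value
    elif op == "add":                          # merge new fields/elements into it
        if isinstance(node.get(last), list):
            node[last].extend(value)
        else:
            node.setdefault(last, {}).update(value)
    elif op == "remove":                       # delete it (simplified: the whole node)
        node.pop(last, None)

doc = {}
apply_patch(doc, ["UserNameDoc", "user"], "assign", {"field1": "sometext"})
apply_patch(doc, ["UserNameDoc", "user", "Accounts"], "assign", [])
apply_patch(doc, ["UserNameDoc", "user", "Accounts"], "add", [{"AccountName": "SomeName"}])
apply_patch(doc, ["UserNameDoc", "user", "Accounts"], "remove")
```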

MikeFair commented 9 years ago

My last example covered a couple of streaming use cases; for the simplified binary transformation part I'll use dense matrices, as that's on everyone's brains atm.

I've done lots of work in the financial space and have experience with discussing the needs of shipping those large matrices regularly. There's a lot of overlap with data transfer requirements in the scientific, sensor, or financial data pipeline/workflow space.

ND Arrays can be in either a column-based or a row-based configuration, and both are required; the distinction matters for processing speed. In @Steve132's original proposal there was a byte for dimensionality. I think that's probably a good idea, but this idea proposes that it either be a signed 8 bits, with 0 through 127 meaning row-based (the traditional layout) and the negative range meaning column-based (the rotated layout), or that a new flag be added to describe the layout, leaving the dimension byte unsigned. This way, whichever way the application stores it becomes an easy method for putting it on the wire. I agree with @Steve132 that standardizing how to represent these things is a good idea.

In TCP/IP they standardized the endian format on the wire; I feel this was a mistake, they should have used a bit to describe which endian it was. That way there's only a mismatch (and therefore a conversion) when transferring between heterogeneous systems, but homogeneous systems don't suffer an overhead just because they used TCP/IP. The application authors can then decide whether they want to pretranslate the data or not (for instance, an Intel machine serving data to lots of ARM platform clients might take on the overhead locally to put the bits in the endian format the ARM clients use).

@Steve132's observation that the 1D array can be laid out contiguously in memory for dense arrays makes a huge difference. Being able to describe whether those dimensions are column-based or row-based drives the point home. When I have a native column-based representation like in Matlab, I can send the UBJSON header to describe the array, then do a straight memory dump of the array itself, finishing off by sending any closing tags.

I must use a format that UBJSON has documented, but I can use any documented representation.

This means I can represent the same dense array in either row based or column based form; whichever makes the most sense to me.
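As a sketch of the idea (the header fields below are invented, not current spec syntax): the same dense matrix can be shipped as one contiguous 1D dump in either row-major or column-major order, with the header telling the receiver how to interpret the bytes. numpy makes the round trip easy to demonstrate:

```python
import sys
import numpy as np

def dump_dense(matrix, order="C"):
    """Describe the matrix in a header, then dump its contiguous 1D buffer as-is."""
    header = {
        "dims": list(matrix.shape),                     # e.g. [3, 4]
        "layout": "row" if order == "C" else "column",  # C = row-based, F = column-based
        "dtype": str(matrix.dtype),
        "byteorder": sys.byteorder,                     # flag native endianness, don't convert
    }
    return header, matrix.tobytes(order=order)

def load_dense(header, payload):
    """Rebuild the matrix from the 1D dump using the layout described in the header."""
    order = "C" if header["layout"] == "row" else "F"
    flat = np.frombuffer(payload, dtype=header["dtype"])
    return flat.reshape(header["dims"], order=order)

m = np.arange(12, dtype=np.float64).reshape(3, 4)
hdr, buf = dump_dense(m, order="F")        # column-based, the way Matlab stores it
assert (load_dense(hdr, buf) == m).all()
```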

Another point: when it comes to large matrices, they can be either sparse or dense; supporting the more popular space-optimized sparse matrix formats would be a big boost to the matrix-using-and-loving communities.

MikeFair commented 9 years ago

Another feature of creating a "streamable" JSON document format is splitting the document up into pieces.

I can send the whole document all at once to one endpoint in one transfer, or because I can split it up I can send different pieces of the document to different endpoints and wait for them to send their individual pieces back (think map/reduce).

Every piece of the pipeline is dealing with the same JSON document, but they aren't all necessarily dealing with the same parts of it; nor have they all necessarily received the whole document.

This also works for optimizing typespecs. Rather than having to resend the typespec header with every document, I can send it once early in the stream and then continue referencing it later on.

This is a trick similar to what the NX Protocol used to get huge reductions when compressing the over-the-wire version of the X Protocol. They took a library call that typically takes many parameters (for instance, something like a large object in JSON) and gave it a number. They then created memorized instances of that call in which only certain fields vary. Then, instead of sending the whole list of all fields, they used their precached instances and varied only the fields they needed.

For example let's say I sent the following user object and instructed it to be memorized as xCCCC. xCCCC = {"user": { "user_name": "MikeFair" , "a_bunch_of_other_fields_that_belong_to_user": "with_their_values" , "last_seen": "an hour ago"}}

Now I can send: xCCCC : {"last_seen": "2 minutes ago"} The far side would recall the cached object, replace the "last_seen" field in the object with what I just sent, and then act as if I just submitted the whole "user" object.
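A toy sketch of that caching trick; the cache id "xCCCC" and the message shapes are made up for illustration. The receiver memorizes the full object once, and later messages only carry the fields that changed:

```python
template_cache = {}

def define_template(cache_id, full_object):
    """Memorize the full object under a short id (sent once, early in the stream)."""
    template_cache[cache_id] = dict(full_object)

def apply_update(cache_id, changed_fields):
    """Recall the cached object, overlay only the changed fields, return the full object."""
    obj = dict(template_cache[cache_id])
    obj.update(changed_fields)
    template_cache[cache_id] = obj
    return obj

define_template("xCCCC", {"user_name": "MikeFair", "last_seen": "an hour ago"})
full = apply_update("xCCCC", {"last_seen": "2 minutes ago"})
# `full` now behaves as if the whole "user" object had been resent.
```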

MikeFair commented 9 years ago

I suppose I'll finish the examples/thoughts with two more use cases and a comment:

I've used Neo4j, CouchDB, (other NoSQL products) and they all use JSON as their lingua franca. Consequently, I've had to run many experiments where I convert SQL resultsets or spreadsheet data to JSON so I can upload it to a NoSQL database for performance and applicability testing.

It takes time to get comfortable with testing out new toolsets before I can comfortably request others commit to doing the same and I don't think I'm alone there. Which means that at least for the reasonably foreseeable future "table shaped resultsets" are going to be prime sources for JSON representation (aka arrays of relatively flat objects).

Think about a typical Node.js endpoint: get JSON request at endpoint, query a sql (SQLite) database, return results; take JSON doc upload to endpoint, commit to SQL database, return results.

These are primarily CRUD applications.

However these endpoints are typically implementing a larger document structure. Because SQL is largely joins behind the scenes, the endpoint URL gives the document context, and the JSON submitted typically represents a sub-document that belongs in that/those locations.


Along these same lines, applications like Couch could implement sub-document viewing permissions using a protocol like this.

Using the incomplete data streaming mechanisms, Couch would have a formalized mechanism for hiding sub-parts of a JSON document based on something like permissions. If I asked to operate on a part of the document that I didn't have the appropriate permission for, then it's easy for Couch to respond with an appropriate message saying in essence "you don't have access to that section" or "that section does not exist [for you]".

The specifics of these features are obviously something that Couch would need to tackle, not UBJSON; what UBJSON is providing is a sub-document addressing and snippet scheme that simultaneously enables transmitting compressed binary representations of those document snippets.

What's really being requested is a way to "discuss" syncing a document between two locations/endpoints. A lot like GIT for JSON.

What I'm suggesting is that the thing we actually need is a binary DIFF representation for JSON. Other people can then take that format and make bigger/better things (like GIT does out of DIFF).

While I focused a lot on the use cases and the scope of the proposed focus, to bring this back to concrete reality and summarize the changes, I think the main modifications to the spec are:

1) Allow more header metadata that says "the following binary snippet should be [added to/removed from] this named item and at this location within that named item". The moral equivalent of line numbers and +/- in DIFF.

2) Allow for unspecified portions of an object. Could be simply a second version of the NOOP tag.

3) Allow for memorizing an STC template with or without data (for example: 'Object Type xF7E62 = { "f1": ..., "f2": ..., "fN": ... }') and then later on reference just the binary values ('Type xF7E62 = {"Joe", 6, ..., "Bar", []}') and fill in any unspecified portions of the object. This is very much what @kxepal is proposing; I'm simply saying that it would also work across multiple subsequent documents instead of always having to be in every document transmission.

4) At least for matrix-type arrays, support both column-based and row-based dimensions, use the header to describe the 1D array approach when defining these matrices, and consider direct support for a few of the various space-optimized sparse matrix formats (see the sketch below). Ideally something similar could be done for C++/Java/.NET/Python objects, but I don't know if those languages share common techniques. I do know that marshaling sparse matrices to dense and then transferring them over the wire is a non-starter regardless of how good the compression techniques are (if you're still able to fit your matrix in RAM as a dense matrix, then it probably isn't actually that large a matrix by large-matrix standards).
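As one illustration of point 4: a sparse matrix can be carried without densifying it using the common coordinate (COO) layout, i.e. a small header plus three flat 1D arrays. The header field names below are invented for the example; scipy is only used to generate and check the data:

```python
from scipy.sparse import coo_matrix, random as sparse_random

def dump_sparse_coo(m):
    """Header describes shape and format; three flat 1D arrays carry the actual data."""
    coo = m.tocoo()
    header = {"dims": list(coo.shape), "format": "coo", "nnz": int(coo.nnz)}
    return header, coo.row.tolist(), coo.col.tolist(), coo.data.tolist()

def load_sparse_coo(header, rows, cols, vals):
    return coo_matrix((vals, (rows, cols)), shape=tuple(header["dims"]))

m = sparse_random(10_000, 10_000, density=1e-4, format="csr")    # ~10k nonzeros out of 10^8
hdr, r, c, v = dump_sparse_coo(m)
assert (load_sparse_coo(hdr, r, c, v).tocsr() != m).nnz == 0     # round-trips exactly
```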

Again, thank you for the consideration.

Miosss commented 9 years ago

one of the unspoken/unresolved questions seems to be "What actually is UBJSON really going to be used for?"

Well this is the most important question I have been struggling with and even mentioned it some times : )

In the issue title you referred to transport protocols - and this is precisely how I see UBJSON: as a protocol for transporting data with as little metadata as possible; in addition to that, due to its huge popularity and ubiquity, JSON became the "parent" for this protocol. I see UBJ as a binary, efficient equivalent of JSON, designed by people who understand more than just plain strings : )

But this discussion is rarely picked up by others, and I am less and less certain over time about what we are trying to achieve.

Nevertheless, having presented my approach, I think that you, @MikeFair, are over-interpreting the "transport" part. I see it more like the 4th layer of the ISO stack, where "transport" refers to transporting data only. I see references here to such protocols as the ASN.1 family (BER, CER/DER, etc.), XML (often misused as a transport protocol), JSON (as a transport protocol), and lots of other more-or-less standard protocols.

What you speak about is more like a communication protocol - you include in the protocol the way two endpoints can communicate. It sounds more like standardizing dialects between parties across the whole range of possible systems.

To try to visualise this, let's try this:

In my vision I create a container for transporting, let's say, milk or fuel on the truck. That's it.

And your proposals try to include the definitions of how to drive this truck, whether it drives on left or right side of the road, etc.

If we still think of UBJ as a transport protocol in my sense, then I think your proposals are adding semantics, which I have often written of as undesirable.

If UBJ shall evolve into something more, then I think it would be great to consider what you propose.

The problem is, I still do not know where we are heading : )

@thebuzzmedia, one more time, please, elaborate on our goals : )

Steve132 commented 9 years ago

In my vision I create a container for transporting, let's say, milk or fuel on the truck. That's it.

And your proposals try to include the definitions of how to drive this truck, whether it drives on left or right side of the road, etc.

YEEEESSSSS

MikeFair commented 9 years ago

Thanks @Miosss, you're right about mentioning this of course, and I'm sorry I didn't acknowledge you for it; it was your question that gave me the gumption to go ahead and propose it. And in my use case discussion I may have confused what is in UBJ and what isn't.

So to clarify: to meet the ends of being a great wire transfer format, at a high level, UBJ is both a binary JSON translation AND a binary JSON diff format. Further, UBJ should not attempt to define the "one way" to represent JSON containers; it should instead define multiple layouts that are already commonly used by real-world applications while remaining consistent with JSON semantics.

By doing this, it is expected that applications will reduce overhead by being able to choose a layout that's either close to their own representation or is close to the representation directly used (usable) by the destination.

And in the real world, the most efficient transport methodology is streaming a series of UBJ-formatted messages that use a binary layout close to what my application already uses, consistently differencing an existing document so that only the portions that have changed/are changing, or just the portions the client needs to do its work, are transmitted.

And while it does extend the description of what UBJSON is, that's OK, because it's the description that fits the requirements of where UBJ is intended to be: on the wire.

MikeFair commented 9 years ago

I also want to clarify that I am not advocating UBJ as a protocol.

DIFF is a format. It is a format that describes how to transform a text document. It is useful for tracking the changes of a text document through time and synchronizing the contents of multiple text documents stored in different locations.

This is what I'm describing as the actual intention/motivation behind UBJ. The reason for making a binary JSON isn't about making JSON smaller; we make JSON smaller because that makes it more efficient to transfer (to more efficiently create a copy of the JSON document in another place) and one really effective way to make JSON smaller is using binary. Ask yourself, which is the cart and which is the horse? Does UBJ exist because we needed to compress our JSON documents to conserve bits on a disk, or does it exist because transferring fewer bytes over the wire is a more effective and palatable mechanism to create/maintain a copy of a JSON document in another location?

I think the answer is that we wanted a way to efficiently transfer a JSON document over the wire. So efficiently transferring JSON documents is the horse. "Efficiently" as I'm thinking about it means keeping both CPU utilization and the count of actual on-the-wire bytes transferred low.

And it's from there I derived the rest of the proposals...

Steve132 commented 9 years ago

This is what I'm describing as the actual intention/motivation behind UBJ.

Yeah, and in my opinion you have this completely and totally wrong. UBJ is a container and a serialization format, not a diff format. It's just not a diff format and it never was supposed to be.

It's a way of representing data in a structured and consistent way to not have to write parsers. Leave the underlying transport compression and chunking and error correction and diffing to the tools and protocols that do that.

Does UBJ exist because we needed to compress our JSON documents to conserve bits on a disk, or does it exist because transferring fewer bytes over the wire is a more effective and palatable mechanism to create/maintain a copy of a JSON document in another location?

Both/neither. These are BOTH good reasons to have a binary hierarchical container format for structured data, but neither of them are the ONLY reason to have the format.

XML doesn't exist as a way of efficiently copying data. You CAN represent data as XML and then copy it, but it's a data format. XML doesn't exist as a way to transfer data on a wire. You CAN transfer XML data on a wire, but that's the job of the transfer protocol (like HTTP), not the job of XML.

I think the answer is that we wanted a way to efficiently transfer a JSON document over the wire.

This might be one compelling use case, but it's not the only use case, and it's not the primary reason UBJSON exists, and protocol-layer stuff absolutely should not be in a serialization format standard imho.

Just my two cents.

Steve132 commented 9 years ago

one of the unspoken/unresolved questions seems to be "What actually is UBJSON really going to be used for?"

Well this is the most important question I have been struggling with and even mentioned it some times : )

My answer to this question is "UBJSON will be used as a serialization protocol to serialize data where the bytestream for the JSON for that stream is bigger than it should be by a factor large enough to justify switching to a format (UBJSON) that is not well supported. For me, that factor is 4x"

In other words, I imagine that the average developer will attempt to serialize data to JSON using the ecosystem of standard tools and libraries for JSON already in his development environment, and if he discovers that the data stream is 2x larger than he expects, he won't care, he'll just use it because it's easy, standard, and well-supported. If he discovers that the data stream is 4x bigger or more (or takes 50% longer to parse than he expects), it is only at that threshold that he/she will search for alternatives, finding UBJSON, and then integrating it because it already closely maps to his underlying abstractions and existing serialization code.

This is why I am SO hesitant to divorce the standard too far from JSON (like with the schema proposal): because even though, yes, it might be good, the average developer will just seek out something else (like BSON) which is not as good but is simpler for him to grasp at a glance and matches better with his existing JSON application serialization code.

My argument for the nd-arrays comes from saying "Ok, so, this hypothetical developer who is searching for alternatives because his JSON stream is 5x bigger than it should be. Who is he? What does he want? What does his app look like?" When I ask that question, the answer is clear: Because JSON does a really really good job at representing strings and objects without much overhead, if he's dissatisfied it means that his application must be using something else: Well, what would he be using that would cause the stream to be so large? OH. Numbers. Arrays of numbers. That's pretty much it too, there's not really much of a userbase of people who are unsatisfied with JSON unless they have lots and lots of small objects or arrays of lots and lots of numbers.

Where do arrays of numbers show up? Oh, images, 3D models, databases, machine learning, mapping, etc. Hmm...ALL of these are ND applications...ALL of them. Maybe that could be a way to help our hypothetical customer?

MikeFair commented 9 years ago

@Miosss

What you speak about is more like communication protocol - you include in the protocol the way two endpoints can communicate. It sounds more like standardizing dialects between parties in all range of possible systems.

It's still just a format, but yes, the format would be for standardizing the many "dialects" applications speak, because "translation" out of those dialects is a big portion of the processing inefficiencies.

So for the dialects part, the format would be for describing how to translate these binary dialects into JSON, without actually making them do it. Allowing them to use JSON as a common intermediate representation, but not requiring the applications actually generate any JSON strings.

To try to visualise this, let's try this: In my vision I create a container for transporting, let's say, milk or fuel on the truck. That's it.

Right, and while the analogy gets a bit strange here, what I'm pointing out is that sometimes it's MILK, and sometimes it's FUEL; which both translate to the JSON "LIQUID". Rather than make the application transfer its Specialized MILK into the JSON LIQUID container, ship it to the other side, so it can most likely get transferred back into the Specialized MILK; let them put their Specialized MILK wholesale into a JSON LIQUID container almost as-is, then put a label on the LIQUID container that says: "THIS LIQUID CONTAINER HAS MILK IN IT".

If a recipient needs/uses MILK, then it's easy; you might even be able to reference it directly using zerocopy techniques. If a recipient needs something other than MILK, like FUEL; it knows how to interpret the MILK as LIQUID so the MILK can be translated to FUEL.

Conversely, if a MILK-based application that is the source of the data knows that its recipients will primarily want FUEL, then it can translate MILK to FUEL on its side, put the FUEL into the JSON LIQUID container, put on the label "THIS LIQUID CONTAINER HAS FUEL IN IT", and then ship it.

It's still a container format; a container format with labels and predefined binary representations...

I suppose in this analogy, what I'm saying is that since the most common thing UBJ is going to be used for is shipping, then make UBJ be a format for describing JSON "CRATES".

If UBJ shall evolve into something more, than I think it would be great to consider what you propose.

Thanks, and agreed, the crates part of the proposal likely fits within the existing spec/scope of UBJ; the DIFF idea, however, is a new, extended thing, and it might be asking for too much of a deviation from the current ideals (even if I think it's the right thing to do).

These ideas stem from acknowledging that applications are working with some other structure that's not JSON.

So the typical copy workflow will be:

1) Convert private structure to JSON strings (this is slow/expensive)
2) Parse/encode as UBJ (this is not bad)
3) Ship/Copy (this will be what it is, and container size is the only part in UBJ's control)
4) Parse/decode back to JSON strings (again not bad)
5) Reconvert into some private structure (this is slow/expensive)

If an application wants to get around the inefficiencies of 1/2 and 4/5 then they have to write their own custom code for this part. If they're going to have to write their own code anyway, then a UBJ library isn't actually helping them much. The UBJ syntax could be useful so they don't have to do that kind of thinking; but I doubt that's actually what UBJ intended to be.
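For reference, the five-step workflow above looks roughly like this in Python, assuming the py-ubjson package (`ubjson.dumpb`/`ubjson.loadb`) for steps 2 and 4; steps 1 and 5 are the application's own conversions and stay expensive no matter how good the wire format is:

```python
import json
import ubjson  # py-ubjson, assumed to be installed

# Stand-in for the application's private structure.
record = {"user": {"user_name": "MikeFair", "scores": list(range(1000))}}

text = json.dumps(record)              # 1) convert private structure to JSON strings (slow/expensive)
wire = ubjson.dumpb(json.loads(text))  # 2) parse/encode as UBJ (not bad)
# 3) ship/copy `wire` ... then, on the other side:
received = ubjson.loadb(wire)          # 4) parse/decode back to plain objects (not bad)
# 5) reconvert `received` into some private structure (slow/expensive again)
assert received == record
```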

Thanks

Miosss commented 9 years ago

RIght, and while the analogy gets a bit strange here, what I'm pointing out is that sometimes it's MILK, and sometimes it's FUEL; which both translate to the JSON "LIQUID". Rather than make the application transfer it's Specialized MILK into the JSON LIQUID container, ship it to the other side, so it can most likely get transferred back into the Specialized MILK; let them put their Specialized MILK wholesale into a JSON LIQUID container almost as-is, then put a label on the LIQUID container that says: "THIS LIQUID CONTAINER HAS MILK IN IT".

I find this part of the analogy better than my original : )

But still, I believe UBJSON is indeed only a container for data. Completely semantics-agnostic, without any labels (MILK, FUEL, etc.). Look at ASN.1 - it has plenty of types (around 30), including many, many string types. I feel that the main strong point of JSON is its very limited (yet powerful and sufficient) type set. UBJ does not extend it in any manner (not even in your metadata-driven approach) - it just enhances JSON where possible, for example for arbitrary binary data.

ghost commented 9 years ago

@MikeFair I really appreciate you pulling and stretching this conversation in an extreme direction - some interesting ideas in what you proposed (I especially thought the idea of using ... and +/- to indicate late additions was clever), but I think others summed it up well that this is beyond the scope of what was intended for UBJSON. To be clear, I think there is real value to what you are proposing, but there is too big a gap between my original intent (what we have now) and what you are proposing... the way you describe a lot of the functionality here, in my mind at least, assumes a lot of supporting infrastructure on top of this new format -- parsers and state-remembering types, resolving late-bound segments with IDs, etc... is it cool? Damn right it is... does it fundamentally change the developer engagement model with UBJSON? Absolutely :)

@Miosss To your question "what is our goal here" - you are exactly right, let's get clear on what the 'north star' is here and I think @Steve132 summed it up well:

"UBJSON will be used as a serialization protocol to serialize data where the bytestream for the JSON for that stream is bigger than it should be by a factor large enough to justify switching to a format (UBJSON) that is not well supported. For me, that factor is 4x"

This is exactly why I created UBJSON; this is my intent for the specification. You'll notice the spec has become laser-focused on the huge efficiency wins (gaps) to be considered before we throw a wax label on this spec and call it "DONE" - I'm trying to understand any misses that will cost us years down the road. All the lower-hanging fruit has been tackled (fortunately).

@Steve132 Your walk-through of the thought process of what UBJSON is missing and how you came to the ND/numbers conclusion was eloquent and to the point. I appreciate that because I think it connects the dots of your reasoning well.

MikeFair commented 9 years ago

@thebuzzmedia, thanks, I appreciate you taking the time to really consider it too.

The concept of transmitting only the deltas can be a huge space savings (the real goal here), so if it's not put directly in the spec, have a "best practices" guide, and in there recommend that people wrap the delta instructions in an object. Something like: "doc.patch" = { "type"="assign", // or add or remove "key"="JsonVariableName", "value"=... }

And then test/check for that on the other side.

That said, rather than keep the concept out entirely, consider making it super simple:

1) UBJ is just encoding the message, not executing it. Some JavaScript interpreter or JSON helper library on the other side actually executes it.

2) UBJ only encodes these replacement messages: "[=][S][=][value]", "[=][S][+][value]", and "[=][S][-][value]", where [S] is the JavaScript-based identity of the section to be replaced.

3) Use [_] to mean undefined (...) at this time.

Call it "done". :)

I'm expecting each platform already has a good JSON document library that can actually execute the replacements; especially the JavaScript engine in browsers, where this kind of operation makes perfect sense, and for the really big data stores too.

If a UBJ library wants to include sending the command to such a library, it isn't precluded from including it; it's just never going to be required to do so.

MikeFair commented 9 years ago

@thebuzzmedia

As for the really big data: aside from long numerical arrays, arrays of objects are also really horrendous space/speed offenders, as they don't pack well, have lots of repetitive strings, and switch type context often (preventing slurping up large amounts of same-typed data at once).

I've got a couple of proposals I'm expecting to put forward on both those issues.

ghost commented 9 years ago

@MikeFair for diff I think there are two existing solutions that we won't significantly do better than by adding it to the spec -- straight binary diffing algorithms, or just representing JSON Patch (RFC 6902) with UBJSON. Because this problem has solutions, I want to leave it in the background for now.
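For context, a JSON Patch (RFC 6902) document is itself just an array of plain JSON objects, so it can be carried in UBJSON like any other value. A small example:

```python
# A JSON Patch (RFC 6902) document: an ordered list of operation objects.
patch = [
    {"op": "replace", "path": "/user/last_seen", "value": "2 minutes ago"},
    {"op": "add",     "path": "/user/Accounts/-", "value": {"AccountName": "SomeName"}},
    {"op": "remove",  "path": "/user/Accounts/1"},
]
# Being ordinary arrays, objects, and strings, `patch` can be serialized with UBJSON
# as-is and applied on the far side by any RFC 6902 implementation.
```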

To your second point, yep, this is my focus at the moment... arrays of objects are hugely wasteful.

I am trying to formulate an optional header 'type' that can appear at the beginning of arrays or objects that describes the types of all the values at the least and in the case of objects, would define both the labels AND the types, so the body would only contain values.

It's basically a schema, heavily inspired by Alex's proposal, just with formatting more akin to what we currently have.

I'm still working through the recursive cases.
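To make the header idea concrete, here is a rough sketch (not the actual proposed syntax) of what a one-time header buys for an array of flat objects: the labels and types are stated once, and each element then contributes only its values:

```python
# Without a header: every element repeats every label.
rows = [
    {"id": 1, "name": "a", "balance": 10.5},
    {"id": 2, "name": "b", "balance": -3.25},
]

# With a hypothetical header: labels (and types) appear once; the body is values only.
header = {"labels": ["id", "name", "balance"], "types": ["int32", "string", "float64"]}
body = [[1, "a", 10.5], [2, "b", -3.25]]

def expand(header, body):
    """Reconstruct the original array of objects from the header plus value rows."""
    return [dict(zip(header["labels"], row)) for row in body]

assert expand(header, body) == rows
```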

MikeFair commented 9 years ago

@thebuzzmedia

Presenting UBJ with JSONPatch data sounds good enough. :)

ghost commented 9 years ago

@MikeFair Good deal! I do appreciate the discussion though; every new idea moves the spec in interesting directions, all of which have been better than what I originally defined.

I'll close this issue, but please re-open if you want to continue discussion along this path.