nanosai / rion-ops-java

RION Ops for Java is a toolkit for reading and writing RION encoded data. RION is a compact, fast binary data format.

43 stars 7 forks source link

Specify "shared values" #2

Open hvbtup opened 4 years ago

hvbtup commented 4 years ago

The specification mentions the extension field types copy, back-ref and forward-ref (as not finished). This could turn out to be an outstanding feature, but not even an idea of how this could work is given.

Consider the following use cases, which is quite common:

In a table (for example the results of a SQL query), very often the value of a column is the identical to the same column in the previous row.

With the existing solution the value has to be repeated nevertheless. This wastes band-width and memory when processing/reading the object (for example in Java, with N rows where column X has only M<=N distinct values (where often M<<N), it would create N string objects, many of them with the same value, where M would be enough.

On the application side, this could be optimized for example by using a magic value like utf-8 "" (two double quotation marks) or a special int64-value (meaning: see previous row), but then the creator and the consumer of the table would have to agree on this magic value and take care of it in the code.

I propose that inside a table, the extended type back-ref with a null value officialy means: Take the value from the same column in the previous row.

Also it is clear that by using back-references of any kind, the consumer has to keep the previous objects in memory to a certain amount. In case of my proposal for using a back-ref to the previous row in a table, it suffices to keep the previous row (and the objects it created) in memory. The same for the writer: It can automatically create back-references to the previous row if it still knows the (data source for) the content of the previous row.

The idea could maybe be extended a bit further for arrays ob objects, but I think it is specially useful to avoid data duplication in tables, saving band-width and memory.

For generic back-references, things would be more complicated, e.g. one would have to invent something XPath-like to specify what it referenced - or one would need the tagging concept like CBOR.

jjenkov commented 4 years ago

Hey, thank you for the detailed feedback!

We decided to postpone the exact definition of the Copy and Reference fields in order to get RION v. 1.0.0 out. The reason was, that while we had some ideas for how these fields should work, they were not ideal, and could definitely need some further analysis, so we don't just throw some half-baked specification out there, which would later be really difficult to change because of backwards compatibility issues. After all, Copy and Reference are probably less used than the rest of the fields we have defined, so we thought postponing their definition was acceptable.

Having said that, I agree with you that these two types of fields can be really useful. The Copy field was intended for copying a previous field, thus creating a new "object" (if not a primitive value) having the same value as the copied field. The Reference field was intended for being able to reference the exact same value as listed previously, for instance when serializing an object graph with circular references - or just multiple references to the same object in the graph.

Our first approach was to let Copy and Reference contain a relative byte index value pointing N bytes "back" to the RION field to copy or reference. However, that requires that you know the index of the field to copy or reference. That again means, that you need to save that index when you write that field the first time, and at that time you might not know that this field will be referenced or copied later on. A crude solution would be to store the index of all fields / objects written, but that would slow down writing quite a lot.

Your solution of having some kind of "repeat field from previous row" mechanism in tables is a good idea, but what if the rows consists of 2 sets of repeated values, mixed in between each other? E.g.

1 #abc #123

2 #def #456

3 #abc #123

4 #def #456

I guess one solution could be to specify the index of a specific row to get the field value from, although this again complicates the solution.

Alternatively we have recently investigated the idea of using tagging like CBOR does, like you also suggests, but we have not completed the analysis yet.

Anyways, Copy and Reference can potentially be really, really useful and help save a lot of bandwidth, but they need to be implemented in a way that doesn't cost a lot when writing them - especially if you do not actually use Copy and Reference in your specific use case.

hvbtup commented 4 years ago

Yes, a general and automatic back-ref optimization while writing would be nice to have for saving band-width, but this would slow down writing and require more memory while writing - and also for reading, because objects that have been read could not be released until the whole RION object is processed, because they could be back-refed later on. So I think such a general solution is not reasonable.

But the special case of "value from previous row is repeated" in tables happens probably so often in real-world data exchange, that handling it with a special code makes sense.

Think of SQL queries for example: In most cases, the results are sorted by a handful of columns. These columns would benefit from this optimization. Or if you look at CSV files. Maybe one column contains a date. Quite often, several subsequent rows share the same date even if the data is not explicitly sorted by the date column.

In your example the shared values are not in subsequent rows. As you said, this would complicate things, because it wouldn't be clear how many of the preceeding rows the consumer has to keep in memory. That's why I propose the not-quite-so-powerful optimization for back-refs to one previous row only: For reader and writer it's clear then that the exactly one previous row has to be kept in memory. So the memory overhead is neglectible.

I haven't yet thought about what this enhancement means for the API - would it require an API change?

hvonbargen commented 4 years ago

Hmm... I thought a bit more about it while walking the dog.

It is clear that creator and consumer of a RION object have to use a slightly more complex program logic when back-references are used. On the other hand back-references are very useful for reducing size and memory overhead in many cases and are needed for cyclic data structures.

Luckily, the encoding as specified now does not specify what to do if the data contains illegal values. An illegal value would be e.g. a UTF8 string that ends after the first byte of a two-byte codepoint like Ä, or a tiny Boolean with a vale other than 0, 1 or 2 (null, false, true).

So, I propose the following:

Add to the specification that creator and consumer of a RION object have to agree about the handling of illegal values and that when just transferring or saving a RION object it must be transferred/saved as-is, including any illegal values.

Add support for writing illegal values (as a byte[]) to the writer code and for reading illegal values to the reader code (for the reader, this may already be possible, I don't know).

Then we could take my use-case as an example in the documentation for how this could be used: We use the illegal Boolean value 3 to indicate that the value is the same as... .. in the same column of the previous row (for tables) ... In the same property of the previous object (for arrays of objects). This way, the "same as above" indicator only takes a single byte (which is optimal), and from looking at the yet-to-be-written example code for reading and writing it is clear that this magic value is not part of the standard and that both ends have to take it into account in the writing/reading logic (while the pattern is always the same).

hvbtup commented 4 years ago

Related to this proposal are two other proposals:

First idea: Specify the usage of Tiny values 3 to 15

Old: Code 1 Encoding=Tiny, the 4 length-length bits describe a boolean value: 0 (null), 1 (true) or 2 (false) New: Code 1 Encoding=Tiny, the meaning depends on the 4 length-length bits. The values 0, 1, 2 mean a boolean value as before. The values 3 to 7 are not specified, they can be used for other purposes (see the example for usinge it as a as "same-as-above" indicator). The values 8 to 15 are reserved for future use.

Second idea: Introducing markers: A marker sets a mark (sic!), it is not normal data. For example, a 3x2 (+ header row) with a marker on the 1st column of the second data row may look like this (the * shows the marker):

Name   Age
Alice   25
*Bob    28
Bob     31    # someone else

This could be encoded as

TABLE 3 # number of rows Key "Name" # column 1 header Key "Age" # column 2 header UTF-8 "Alice" # value for cell A1 Int64 25 # value for cell B1 Marker # A tiny marker UTF-8 "Bob" # value for cell A2 Int64 28 # value for cell B2 UTF-8 "Bob" # value for cell A3 Int64 28 # value for cell B3

The mark belongs to the next data field. For example, a marker can be used to specify that the next data item may be a target of a back-reference.

There are tiny markers and named markers. The table example above used a tiny marker.

Tiny markers: Length-Length: meaning: 0 NULL boolean value 1 TRUE boolean value 2 FALSE boolean value 3 to 5: Not specified, can be used for private purposes. This is considered a value. (see the example for using 3 as a as "same-as-above" indicator). 6 to 7: Not specified, can be used for private purposes. This is considered a marker. 8 to 11: Reserved for future use. This is considered a value. 12 to 15: Reserved for future use. This is considered a marker.

The yet unassigned type code 8 is for named markers. Except from the type code, the encoding is the same as Key-Short (so the marker has a "name" which may be up to 15 bytes long).

If a RION object contains any markers at all, the very first thing after the RION length must be a tiny marker with the value 15 ( the byte value 1F). This informs a reader that it has to take care of markers while reading the object.

Markers could be used for back-references in a more flexible manner than the "same-as-above" example:

A relative back-reference would have an int64-positive value. The meaning is "refer to the n-th last marker", e.g. 0 is the last marker before this object, 1 is the second-last marker etc. This way the numbers tend to be very small on average compared to using the index from the first marker.

In the table example above, we could use the back-reference 0 instead of UTF-8 "Bob" in cell A3.

A named back-reference would have a Key-Short value. It references the marker with the given name.

jjenkov commented 4 years ago

Sorry about the late reply!

Originally I was thinking about using RionFieldTypes 8 and 9 for Copy and Reference. These two numbers in the RION Field Type series (0 to 15) are still unused.

The idea was, that the "body" of these fields would contain a positive number which was to be subtracted from the beginning of the Copy or Reference field, to the start index of the field to Copy or Reference.

This implementation would leave it to the user (or API) to figure out how to detect what fields to copy or reference and their indices.

As alternative I was thinking of having an "Identifier" Field like CBOR has a label field, but with an identifier as body instead of a semantic meaning as in CBOR. This ID could then be referenced later on.

I am still on the fence, since I have not found an elegant way to detect objects that are repeated. In any case, somehow the writer of the RION fields must keep track of RION field indices of fields that might be referenced later on. Whether to insert an Identifier field in front of them, or store their indices in a Map somewhere I am also still undecided about.

Two advantage of using a relative index as back-reference to an earlier field instead using Identifier fields and reference fields are:

1) If it turns out never to be necessary to back-reference any previous field, no unnecessary data is added to the RION field stream. This means smaller RION streams. I believe that there is no difference in performance whether the writer needs to keep a Field-to-Identifier or a Field-to-Index Map. Both identifiers and indices would probably be int's in most cases.

An Identifier might be a smaller number to embed in a Reference field later - especially if the RION stream is long and you need to back-reference far. However, that saving would probably be lost because an explicit Idenfier field was needed in the RION stream too. An Identifier field + a Reference field as opposed to just a Reference field.

2) When reading the referenced RION Field it is most likely faster to jump directly back to the given index, rather than first having to store the Identifier indices in a Map when encountered in the RION stream, and later have to lookup the index of a given Identifier in that same Map.

One issue I have been pondering about though is, if it would be possible to only have a single field which could be either a Back-Copy or a Back-Reference, as to only use 1 of the 2 free RION Field Types. However, doing so would limit e.g. a Copy back reference to only up to 8 bytes in length, and a Reference up to 7 (=> If Length Length 9 to 15 - subtract 8 from Length Length and assume a Reference instead of Copy).

One issue I have been thinking about is, that it does not make much sence to reference earlier fields if the field value is very small. Simply copying the field is more efficient then (e.g. a small Int of say 13 (1 byte) ) than encoding a relative index (e.g. 4096 bytes backwards). Of course, this might not be possible if you really need to back-reference a specific field while sending an object graph across the network. Then you truely need to indicate that it's a reference to the same object, not just a copy.

jjenkov commented 4 years ago

One final alternative / variation would be to have a Label field (not an Identifier field) which changes the semantic meaning of the following field. For instance,

Label[1] #Int64[449]

Label[2] #Int64[449]

The Label field with value 1 could mean "the following field is a back copy field". The Label field with value 2 could mean "the following field is a back reference field".

Etc.

However, there is also the field extension mechanism for that purpose. Extended fields - RION Field Type 15. It just works differently, but perhaps the field extension mechanism should be changed to the one described above... not sure. It's a matter of achieving the most compact encoding possible.

hvbtup commented 4 years ago

Sorry for the late reply. I will try out some of the ideas in my code and then inform you about the experience.

As a first step, I am in the process of defining the LION format (an acronym for Language-Independent Object Notation and it sounds cool :-). The draft version of this specification uses the RION format with additions for back-references, but it is not yet stable.

I wrote a not-yet-performance-optimized and not yet API-stable implementation of a limited subset in Oracle PL/SQL. Today, this only supports boolean, integers (up to 32bit), strings and tables. I tested this with writing and reading a simple structure first, then I created a BLOB with a table of 8000 rows with 8 columns with this PL/SQL LION writer and managed to parse it with the RionReader implementation. So far, so good. It turned out that the naive PL/SQL implementation is too slow to be useful, but this is probaly because it directly reads and writes very small portions of the BLOB, so I am sure this can be improved Alternatively, I could use Java inside the database. Anyway, it enables me to store data-structures in a serialized form inside the DB in a format that is more compact then XML or JSON. OTOH the Java code is very fast 👍.

While developing the PL/SQL code, I noticed the following: The RION format specification is very good and the java implementation is very straightforward (well, Java purists might say it is a little bit too straightforward - I am none of those). A a few points could be improved IMHO:

When first reading, I thought that a RION object would contain an over-all length info at the beginning, but this is not the case. Only the items inside the RION objects have a length info.
An example of reading and writing a table would be useful. It was not clear to me that it is actually very easy by using moveInto/moveOutOf.
An example of skipping an item would be useful.
Regarding tables, the docs say that the first line SHOULD contain the column names as (short) KEYs. It should be noted that this is strictly necessary if the reader does not know in advance how many columns the table has (in other words: omitting the keys would violate the "self-describing" rule), and that this implies that the first column of the first data row cannot contain a (short) KEY.

As said before, I will try out some of my proposals this winter, then tell you about the outcome, and share the code.

jjenkov commented 4 years ago

LION :-D ... fantastic name and abbreviation :-D ... We should have come up with that ;-)

No no ! The RION Object field also has length bytes in the beginning - just like the other fields!

You are right, there are missing a few examples in the documentation! Also how to write RION Array fields. Actually, I am not sure if there is even a need for an explicit Array type of field... it is intended for lists / arrays of the same kind of RION Fields (e.g. an array of integers). But we have no restrictions on the fields inserted into an Array field.

You are right - in theory the reader could know the column names and count per row already, and thus field names (KEY fields) could be left out. We also allow for that optimization for RION Object fields - when messages need to be more compact. Then only the value fields are sent across the network. No KEY fields.

If you want - I could spend some time expanding the docs with more examples etc. for how to write RION Table and Array fields, and also how to read them again.

The reason I haven't done that yet is, that RION Ops is part of a suite of toolkits / ecosystem designed for implementing distributed systems, using asynchronous communication, non-blocking IO, flexibility in architecture choices etc. etc. I am currently working on the thread execution model for this (POC working) and I was planning to document that before going back to RION Ops docs :-) But I can switch that around if it helps you!

jjenkov commented 4 years ago

Regarding performance, I've spent a lot of time thinking about performance and compactness of both RION and of the code reading and writing it. It should be reasonably fast. Not saying it cannot be made any faster. Maybe there are some techniques that I don't know of that could improve speed. But it's quite decent currently.

hvbtup commented 4 years ago

Probably you misunderstood me:

The performance of the Java implementation is excellent!

The performance of my PL/SQL implementation is a nightmare, but that's caused by the way I use LOBs and can be dramatically improved (though it won't reach Java's performance).

E.g. I have a class that contains a structure like this: Map<String, Map<Tuple<String, String>, Map<String, Object>>>. This map contains entries for UI properties, a total of 20000 Properties. The majority of the property values are Strings.

And the Tuple was really an EclipseLink entity with the special property that the first element is always the same value as the outermost key (as the data was sorted and grouped that way).

This was serialized/deserialized with the standard Java serializer. Deserialization with the standard Java Serializer took about 2000 ms on my machine (and I don't know what size the serialized data had, but probably "big").

I decided to write a special writeObject and readObject for the standard Serializer, and inside these methods, I used the RION format. Well, it took some lines of coding for the 3-level map, but it was worth the effort:

The deserialization with the RION Reader took only 13 ms for the same data (and the serialized data was only about 500 KB in size)

jjenkov commented 4 years ago

Hey - didn't see this message until now! Cool :-) Well, like I wrote earlier, we've spent a long time analyzing and benchmarking - both during the design of RION itself, and the RION Ops classes too.

jjenkov commented 4 years ago

By the way, did you see that I have added a RionObjectWriter and RionObjectReader which can write Java objects to RION and read RION into Java objects in version 0.7.0 ? These classes use Java Reflection to serialize and deserialize Java objects. They are a bit easier to work with than trying to do it all by yourself. Unfortunately, using Reflection is a bit slower ( x 0.5 at least) of using the raw RionWriter and RionReader. But it is so much easier :-D

These Reflection based classes still have some limitations though, which I hope to overcome in the future.

Also, I have created a separate issue about adding a Reference field which can be used to represent a reference from a field somewhere back to another field earlier in the same RION block. This is the first step in adding "cyclic object graph" support. Something you don't get with either CSV, XML, JSON, YAML, MessagePack, CBOR, Avro, Protobuf or Amazon ION.

Let me know if you are still playing with RION, as we are ramping up the RION efforts from 2020 and forward.