kstenerud / concise-encoding

The secure data format for a modern world
https://concise-encoding.org

Clarifications re chunking #19

Closed: kengruven closed this issue 3 years ago

kengruven commented 3 years ago

I'm implementing chunking, and the CBE spec seems ambiguous on a couple points.

First, what are the units of the "length" field?

It looks like the 2nd item here is a mistake, and the 3rd item is true if you round up the size for bit arrays. (Is that the only array/chunked type which isn't a multiple of 8?)

Second, strings (and string-like objects, like resource identifiers) are chunked, and the "length" is the number of octets. Arrays can be split into chunks at any point between elements, but not within an element. Since the elements of strings are octets (for the purposes of length), does that mean that strings can be split into chunks between any two bytes?

Empirically, with "enctool" today, the answer seems to be No, you can't split a codepoint across chunks. But I don't know if this is by design or by accident. I also haven't tested combining characters or anything else like that.

My test cases: [03 01 91 05 c3 b6 02 5a] works, while [03 01 91 03 c3 03 b6 02 5a] fails.
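If I'm reading the spec right, the chunk header is a ULEB128-style value whose low bit is the continuation flag and whose remaining bits hold the element count. Under that assumption, the working test case decodes like this (a rough sketch, not a conformant decoder):

```python
def decode_chunk_header(data, offset):
    """Read a ULEB128-style chunk header: low bit = continuation flag,
    remaining bits = element count. (Sketch based on my reading of the
    CBE spec; not authoritative.)"""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):
            break
    return value >> 1, bool(value & 1), offset

# The working test case: [03 01 91 05 c3 b6 02 5a]
doc = bytes([0x03, 0x01, 0x91, 0x05, 0xC3, 0xB6, 0x02, 0x5A])
offset = 3        # skip the leading bytes before the first chunk header
text = b""
while True:
    count, more, offset = decode_chunk_header(doc, offset)
    text += doc[offset:offset + count]  # for strings, 1 element == 1 byte
    offset += count
    if not more:
        break
text.decode("utf-8")  # the two chunks reassemble to "öZ"
```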

kstenerud commented 3 years ago

Hi Ken,

The base idea is that the length represents the number of elements since the decoder can just do a simple calculation to get the byte length from the element length (1:1 for 8-bit elements, 2:1 for 16-bit elements, etc).
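As a sketch, that byte-length calculation looks like this (assuming element sizes are expressed in bits; the rounding only matters for bit arrays):

```python
def chunk_byte_length(element_count, element_size_bits):
    """Bytes occupied by a chunk's payload, given its element count:
    1:1 for 8-bit elements, 2:1 for 16-bit elements, and bit arrays
    round up to whole bytes. (Illustrative sketch.)"""
    return (element_count * element_size_bits + 7) // 8
```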

This breaks down with string data since each UTF-8 character can be a different size, so I solved that by saying that all strings have an "element size" of 1 byte. There should be no problem splitting characters across chunks; that would be a bug.

Bit arrays have an element size of 1 bit, so each byte holds 8 elements, with any remaining unused bits in the last byte cleared. Actually, re-reading this part of the spec, I think I forgot to require that in a bit array, every chunk except the last hold a multiple of 8 elements. It absolutely should; otherwise piecing the array back together in the decoder would be a nightmare (imagine the bit shifting needed to concatenate a 1023-bit chunk to another 1023-bit chunk to form the final array!). I'll work on clearing up the spec on these points.
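A sketch of the decoder-side reassembly that the multiple-of-8 rule makes trivial (illustrative only; the `(element_count, data_bytes)` chunk representation here is hypothetical, not from the spec):

```python
def reassemble_bit_chunks(chunks):
    """Concatenate bit-array chunks into (payload_bytes, total_bits).
    Each chunk is (element_count, data_bytes). When every non-final
    chunk holds a multiple of 8 elements, reassembly is plain byte
    concatenation with no bit shifting. (Illustrative sketch.)"""
    payload = bytearray()
    total_bits = 0
    for i, (count, data) in enumerate(chunks):
        is_last = (i == len(chunks) - 1)
        if not is_last and count % 8 != 0:
            raise ValueError("non-final bit chunk must hold a multiple of 8 elements")
        assert len(data) == (count + 7) // 8  # payload rounds up to whole bytes
        payload += data
        total_bits += count
    return bytes(payload), total_bits

# Two chunks: 16 bits, then 3 bits (unused bits in the last byte cleared)
bits, n = reassemble_bit_chunks([(16, b"\xff\x0f"), (3, b"\x05")])
```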

So basically:

Also, please beware that I made what I pray are my LAST breaking changes to the spec recently:

All entries in plane 2 have been moved around to support short-form arrays. Bit and unsigned 8-bit arrays have been moved to plane 1 so that they don't incur so much overhead. Identifiers were never supposed to have a full type field of their own, since they're always part of another object and are always strings.

kstenerud commented 3 years ago

Actually, the more I think about it, maybe it's better to require string chunks to always end on a complete character. It certainly simplifies the decoder for the general use case, at the cost of the encoder possibly needing to buffer, depending on how it fetches the string data. On the other hand, letting characters be split means far less potential for a rejected document if the user screws up the character-edge detection when sending one. Fully UTF-8-aware software is difficult to get right...

kstenerud commented 3 years ago

BTW I've added a new feature to enctool to make binary repro cases easier to talk about and reproduce:

Enctool's convert mode now supports interpreting its input as text-encoded byte values: -x for hex-encoded input, or the more general -t, which supports decimal and 0xff-style hex. For example:

$ echo "03 00 ff" | ./enctool convert -x -sf cbe -df cte
c0
-1
$ echo "0x03, 0x00, 0xff" | ./enctool convert -t -sf cbe -df cte
c0
-1
$ echo "3, 0, 255" | ./enctool convert -t -sf cbe -df cte
c0
-1
$ echo "3 0 0xff" | ./enctool convert -t -sf cbe -df cte
c0
-1
kengruven commented 3 years ago

Right, I think there's no universal "better" here.

If we say that strings can break anywhere, then the encoder is easier (it can buffer and split anywhere in the byte stream, including after UTF translation), but the decoder is harder (it has to buffer bytes and make sure it passes only complete codepoints to the UTF translator). If we say that strings can only break between codepoints, then the encoder is harder (it has to know a bit more about Unicode), but the decoder is easier. Maybe. It depends on your system's architecture, and it might make no difference at all, depending on how you build your CBE library and which system interfaces you have available.

AFAICT, the only benefit to chunking is for encoders who don't know how large a string/array is going to be. Decoders who encounter a gigantic chunk are always welcome to process only part of it at a time, if they wish.

So dealing with multiple chunks is purely optional for encoders (they're always welcome to just output one big chunk), while it's mandatory for decoders (it's a feature of the format). On that basis, I'm inclined to bias towards preferring decoder simplicity.

That is, if you want to write a string in multiple chunks, it's up to you to only break on codepoints. To read strings, you're guaranteed that each chunk contains only complete codepoints.
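The encoder-side work this asks for is small. A sketch of finding a split point that doesn't cut a codepoint, by backing up past UTF-8 continuation bytes (illustrative, not from the spec):

```python
def safe_split_point(buf, limit):
    """Return the largest split point <= limit that does not cut a
    UTF-8 codepoint in half. Continuation bytes match 0b10xxxxxx, so
    we back up until we land on a codepoint's first byte. (Sketch of
    the encoder-side work the codepoint-boundary rule requires.)"""
    if limit >= len(buf):
        return len(buf)
    cut = limit
    while cut > 0 and (buf[cut] & 0xC0) == 0x80:
        cut -= 1
    return cut

data = "aö".encode("utf-8")   # b'a\xc3\xb6': 3 bytes, 2 codepoints
safe_split_point(data, 2)     # backs up to 1, since cutting at 2 would split "ö"
```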

I don't think this will be burdensome for the vast majority of encoders, anyway. If you're streaming a lot of text, your internal interface probably accepts Strings or Characters, not Bytes. There are situations I can imagine which would be hindered by this, but they're uncommon, and there are always workarounds.

kstenerud commented 3 years ago

Yeah that makes sense.

Also, enctool convert now has -X (output in hex chars aa bb cc) and -C (output C-style 0xaa, 0xbb, 0xcc). I got tired of manually converting when generating test data :P

kengruven commented 3 years ago

That's a good idea. I'm stealing it!

kengruven commented 3 years ago

As of 8b063022, the spec requires chunks to end on a character boundary, and says "1 element per 1 byte" for strings, so I'm calling this resolved. Thanks.