Peeking at the version string needed to parse JSON/CBOR/MGPK lengths (without an incremental parser) not quite clear in current spec

daidoji commented 7 months ago

The current implementations of CESR parsers peek at the version string in order to determine the length of the serialized payload (for JSON/CBOR/MGPK) in order to read off the stream deterministically (giving length to all elements making this a true TLV spec for all codes which is probably good for the implementers anyways).

However, maybe I'm a slow reader but I've been through this spec 10-20 times by now and didn't realize this was happening until now the way things are currently written (hadn't even really looked at these part of the implementations). It wasn't until I looked at the implementations that I realized this was happening. Before I was going to incremental parsers as I thought that was the original intent of the spec.

After that fact jumped out at me things became clear but I feel like there's room for more clarity in the spec.

Improvements or fixes for this issue might be:

Special JSON/CBOR/MGPK sections describing various implementation choices of incremental parsing or peeking at the version string in order to read off the stream as well as the intent and thought behind sniffing the version string vs adding another set of codes.
Noting that in order to be a full TLV we essentially need the version string to be the first field as mentioned in the annex without our choice to add a corresponding code
Moving the version string and SAID sections out of the annex if these are normative requirements for parsing JSON/CBOR/MGPK in the current scheme
Doing anything else that signals the spec author's (intention) and make it easier for implementers to realize that this is part of the stream and these are normative requirements.

This is related to #60 but not necessarily exactly the same.

daidoji commented 7 months ago

I had a thought a little while later that maybe the normative requirement is that CBOR/MGPK/JSON be parsed incrementally OR a sniffing method similar to the SAID calculation must be specified by genus.

SmithSamuelM commented 7 months ago

@daidoji How is this paragraph that begins the section on version string not explicitly telling the implementer exactly what you are confused about?

Non-CESR serializations, namely, JSON, CBOR, and MGPK when interleaved in a CESR Stream shall have a Version String as their first field with field label, v (lower case "v"). The Version String field value enables the Stream parser to use a regular expression parser to determine the type and length of the interleaved serialization.

This is more explicit than "incremental parser".

This is already been explained in more detail in the cold start section above

A parser merely needs to examine the first Tritet (3 bits) of the first byte of the Stream start to determine which one of the seven it is. When the first Tritet is a Count Code, then the remainder of the Count Code itself will include the additional information needed to parse the attached group. When the first Tritet indicates its JSON, CBOR, or MGPK, then the mapping's first field must be a Version String that provides the additional information needed to parse the associated encoded field map serialization fully.

daidoji commented 7 months ago

@SmithSamuelM of course I agree completely. However, it took me a few read-throughs in order to arrive at that conclusion. This version string section is buried at the bottom of the document in the annex and doesn't seem to relate to the main specification at all. The rest of the document we refer to CBOR, MGPK, JSON as if all possible serializations in those formats are supported in the CESR stream but due to the cold start rules and how they fit into the TLV scheme don't quite make sense until you find the paragraph you pointed out.

Hence my inquiry about incremental parsers (that was my first guess for a while and a wrong one for sure).

I think for the reader we can make this clearer but I wasn't quite sure how to put together something in a PR so I made this issue instead. I can attempt a revision if you'd like.

SmithSamuelM commented 7 months ago

I think that cross references between to the two sections will aid the reader. So for example, at the end of the cold start section paragraph I quote above, add a sentence that says something like See the xxx annex for the syntax of the version string field.

And in the version field field section in the annex in clude a reference back to the cold start section. such as. See the xxx section for how a stream parser might employ the version string field to extract JSON, CBOR MGPK.

I was trying not to divert the users attention from solving the cold start problem by introducing the full details of the version string and also not redundantly explaining the cold start solution when detailing the version string format.

But for a first reader the two are not well connected so adding cross references at least gives the reader a hint of how they are related.

daidoji commented 7 months ago

Sounds good to me. I think it would help a lot for the first time reader.

SmithSamuelM commented 7 months ago

fixed with #73

trustoverip / tswg-cesr-specification

Peeking at the version string needed to parse JSON/CBOR/MGPK lengths (without an incremental parser) not quite clear in current spec #64