iris-edu / mseed3-evaluation

A repository for technical evaluation and implementation of potential next generation miniSEED formats

structure of "other stuff" #1

Open crotwell opened 7 years ago

crotwell commented 7 years ago

Within the old strawman, there was a transition from blockettes to having just the essential fields in the mseed header, plus an "other stuff" section for things that might be useful but are not absolutely essential. The format was just simple ~-delimited strings, allowing things like name=value~

I feel like there should be a more structured way of doing this. Two ideas: the first is MessagePack. It is binary and pretty simple, and would allow objects/arrays in addition to name-value pairs. It has lots of language bindings, and writing a writer or parser feels easy enough to me. http://msgpack.org/

The second idea, as I mentioned at the meeting, would be to just use JSON. At least you get a lot of structure and self-description, plus separation between numbers and non-numbers. http://json.org/

I prefer MessagePack, given that mseed is binary and so maybe the other stuff should be as well, but not strongly. JSON is pretty nice, and if we are going to have strings anyway, the extra overhead seems small.
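To illustrate what either option buys over ~-delimited strings, here's a rough sketch using Python's stdlib json (the header names here are made up for illustration, not from any spec; the same structure would round-trip through MessagePack as well):

```python
import json

# Hypothetical extra headers -- names are illustrative, not from any spec.
extra = {"FDSN": {"clock_quality": 100, "leap_second": False},
         "vendor": {"firmware": "1.2.3"}}

# Old strawman style: everything is a string, no nesting, no types.
flat = "clock_quality=100~firmware=1.2.3~"

doc = json.dumps(extra)
parsed = json.loads(doc)

# Types survive the round trip: 100 is still an int, False still a bool.
print(parsed["FDSN"]["clock_quality"] + 1)   # 101, no string-to-int casting
```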

Thoughts?

crotwell commented 7 years ago

@andres-h

In which aspects is CDDL better than JSON schema? If CBOR is just a binary representation of JSON, one could use JSON schema for CBOR and vice versa.

CBOR is not just a binary JSON. It uses similar ideas, but has more structure. For example, JSON only has "number", but CBOR has floats, ints, etc. Similar to how you wouldn't use XML Schema to validate JSON, the schema language and the format need to be tightly related.
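To make the int/float distinction concrete, here's a hand-rolled sketch of how CBOR (per RFC 7049's major types) encodes 42 as an integer vs. 42.0 as a float64; a real application would use a CBOR library rather than this toy encoder:

```python
import struct

# Minimal hand-rolled CBOR encoding, following RFC 7049 major types.
# A sketch only, limited to small unsigned ints and float64.
def cbor_uint(n):
    # major type 0; values >= 24 need an extra byte (additional info 24)
    return bytes([n]) if n < 24 else bytes([0x18, n])   # assumes n < 256

def cbor_float64(x):
    # major type 7, additional info 27: 0xfb + big-endian IEEE-754 double
    return b"\xfb" + struct.pack(">d", x)

print(cbor_uint(42).hex())       # 182a -- unambiguously an integer
print(cbor_float64(42.0).hex())  # fb4045000000000000 -- unambiguously a float
# JSON would serialize both under a single "number" type.
```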

chad-earthscope commented 7 years ago

Did some more reading, CDDL does not appear to be that useful for extra headers.

Bummer. I don't think we'd want to use it if it couldn't be used for the whole thing. Two schema languages sounds terrible. I wouldn't be opposed to JSON Schema as an expression of the schema, but we would have to explicitly state that the non-normative conversion to JSON must be used for validation.

Also, stumbled into some more information on the messagepack cbor relationship.

Yeah, read through that too, definitely not warm fuzzies.

Just some numbers of tagged messages on stackoverflow:

An interesting perspective; it's definitely not as popular from that POV, but it's also newer. There are a lot of implementations, so a lot of folks have spent time on it, implying that it's not just a small group pushing their idea: http://cbor.io/impls.html

Even though it seems to have some traction, there is some risk that it will die on the vine, whereas Protobuf has already matured and will not be going away anytime soon.

A group that invented an alternative called ION did a comparison including MessagePack, CBOR and Protobuf: http://tutorials.jenkov.com/ion/ion-vs-other-formats.html http://tutorials.jenkov.com/ion/ion-performance-benchmarks.html

Those are worth reading.

andres-h commented 7 years ago

Some links: deleted

We are referencing that thread now :D


crotwell commented 7 years ago

Github is being a little too clever for its own good! :(

I edited and deleted that link as I don't think we want to be part of that larger conversation.

@andres-h maybe delete the link in your quote as well?

chad-earthscope commented 7 years ago

I edited and deleted that link as I don't think we want to be part of that larger conversation.

Agreed.

andres-h commented 7 years ago

OK, no problem, link deleted. Let's see if the reference disappears too ;)

andres-h commented 7 years ago

So the guy hijacked MessagePack and named the format after himself (C. BORmann). Wow.

chad-earthscope commented 7 years ago

In the description comparing/contrasting ION with other formats at http://tutorials.jenkov.com/ion/ion-vs-other-formats.html is this:

ION, CBOR, MessagePack and JSON are all self describing, meaning you don't need an external schema to read them. This is essential for a network protocol where intermediate nodes may have to route messages along to other nodes. According to Protobuf's own docs you cannot see where one Protobuf message ends and the next begins, meaning Protobuf is not fully self describing. You can see where the individual Protobuf fields start and end, but not the full message.

The fact that Protobuf is not fully self describing makes it unsuitable as a network protocol message format (although you could route Protobuf messages inside other types of messages). That a data format is self describing also means that it is possible to convert a file of these formats to a textual format (JSON is already textual) to see what is actually stored in the file.

That gives me pause about Protobuf. I'm not sure if those are weaknesses for how we might want to use it or not, maybe @andres-h has worked around them somehow?

andres-h commented 7 years ago

On Tuesday 2017-07-18 23:40, Chad Trabant wrote:

In the description comparing/contrasting ION with other formats at http://tutorials.jenkov.com/ion/ion-vs-other-formats.html is this:

 ION, CBOR, MessagePack and JSON are all self describing, meaning you don't need an external
 schema to read them. This is essential for a network protocol where intermediate nodes may
 have to route messages along to other nodes. According to Protobuf's own docs you cannot see
 where one Protobuf message ends and the next begins, meaning Protobuf is not fully self
 describing. You can see where the individual Protobuf fields start and end, but not the full
 message.

 The fact that Protobuf is not fully self describing makes it unsuitable as a network protocol
 message format (although you could route Protobuf messages inside other types of messages).
 That a data format is self describing also means that it is possible to convert a file of
 these formats to a textual format (JSON is already textual) to see what is actually stored in
 the file.

That gives me pause about Protobuf. I'm not sure if those are weaknesses for how we might want to use it or not, maybe @andres-h has worked around them somehow?

This is exactly the strength of Protobuf!!!

This means, when you concatenate two Protobuf messages, the result is a Protobuf message! That's why SeedLink can transfer individual blockettes for low latency!

That's why we need the record header that contains record length!

This is not a theory, but works in practice as demonstrated by my Python and Javascript implementations!

Thanks to Protobuf, it is trivial to convert the whole miniSEED record to a Javascript object (JSON), which makes handling the data in Javascript super easy. No more messing with DataView, CBOR, etc...
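A toy sketch of the concatenation property (hand-rolled tag/length framing in the style of the Protobuf wire format, single-byte varints for brevity; not an official library):

```python
# Toy tag/length framing in the style of the Protobuf wire format
# (wire type 2, length-delimited); illustrative only.
def encode_field(field_number, payload):
    assert field_number < 16 and len(payload) < 128   # keep varints 1 byte
    return bytes([(field_number << 3) | 2, len(payload)]) + payload

def decode_fields(buf):
    fields, i = [], 0
    while i < len(buf):
        key, length = buf[i], buf[i + 1]
        fields.append((key >> 3, buf[i + 2:i + 2 + length]))
        i += 2 + length
    return fields

msg_a = encode_field(1, b"STA")   # one field ("blockette")
msg_b = encode_field(2, b"NET")   # another, produced later
merged = msg_a + msg_b            # plain byte concatenation

print(decode_fields(merged))      # [(1, b'STA'), (2, b'NET')]
```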

chad-earthscope commented 7 years ago

This is exactly the strength of Protobuf!!!

OK, great, I guess; it's confusing because what you said is the opposite of what they wrote. They implied that Protobuf was not self describing and therefore could not be converted to a textual format. But maybe this isn't a problem when the schema is prescribed like we are doing with miniSEED.

Thanks to Protobuf, it is trivial to convert the whole miniSEED record to Javascript object (JSON), which makes handling the data in Javascript super easy. No more messing with DataView, CBOR, etc...

I think that can be done for CBOR too with: https://github.com/paroga/cbor-js. What is unique to Protobuf in this regard?

crotwell commented 7 years ago

I'll say again, the problem with Protobuf for the extra headers is that unless you have the .proto file that corresponds to those particular extra headers, all that you know is an integer field number, a byte length, and an array of bytes. Protobuf does not carry enough information with it to do what I called "parsing without meaning". You can't even determine the type of an unknown field, i.e. float, int, or string. Also, Protobuf is simply not a key-value format, because the keys are not part of the format.
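The "parsing without meaning" point can be sketched by decoding one Protobuf-style field by hand (the field key is a varint of `(field_number << 3) | wire_type`); without the .proto file, the payload's type is unknowable:

```python
# Sketch: decode one length-delimited Protobuf-style field by hand to show
# what a reader learns *without* the .proto file. Illustrative only.
def read_varint(buf, i):
    """Decode a base-128 varint starting at buf[i]; return (value, next_index)."""
    shift = value = 0
    while True:
        b = buf[i]; i += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, i
        shift += 7

def parse_unknown_field(buf):
    """Without a schema you recover only field number, wire type, raw bytes."""
    key, i = read_varint(buf, 0)
    field_number, wire_type = key >> 3, key & 0x07
    length, i = read_varint(buf, i)          # wire type 2: length-delimited
    return field_number, wire_type, buf[i:i + length]

# Field 17, wire type 2, payload b"abc": key = (17 << 3) | 2 = 138 -> 0x8a 0x01
raw = bytes([0x8A, 0x01, 0x03]) + b"abc"
print(parse_unknown_field(raw))   # (17, 2, b'abc') -- string? nested message? unknowable
```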

Moreover, if two vendors or data centers both decide to add an extra header and both add field number 17 to their extended .proto schema file, they have created two incompatible miniSEED records. The only way around this would be a global registry, and I very strongly object to creating a long-term archive format that requires a dynamic external global integer registry.

Protobuf looks really nice as an over-the-wire format where you have significant control over both ends, which is what it was designed for, but not as a global long-term archiving format.

For the stuff that must be there, protobuf is too much, we should just lay out the bytes in a fixed header. For the stuff that is extra, protobuf is too little, we need a self-describing key-value store.

andres-h commented 7 years ago

On Thursday 2017-07-20 01:29, Chad Trabant wrote:

But maybe this isn't a problem when the schema is prescribed like we are doing with miniSEED.

Exactly.

 Thanks to Protobuf, it is trivial to convert the whole miniSEED record to Javascript object
 (JSON), which makes handling the data in Javascript super easy. No more messing with
 DataView, CBOR, etc...

I think that can be done for CBOR too with: https://github.com/paroga/cbor-js. What is unique to Protobuf in this regard?

  1. I was talking about the whole record. You can do it with extra headers only, not with SimpleHeader.

Moreover:

  1. With Protobuf you can transfer partial records, because each field (blockette) is a Protobuf message on its own.

  2. With Protobuf you can append fields (blockettes) to the record without modifying existing data (and invalidating CRC).

  3. With Protobuf you can add more fields without using extra headers as a hack (https://github.com/iris-edu/mseed3-evaluation/issues/21#issuecomment-313936532).

  4. With Protobuf you save space, because you don't repeat the field names over and over again.

  5. Protobuf is so simple you can implement your own parser in a day. You can include the encoding description in miniSEED standard like it is done with Steim encoding.

Would be funny to see the new miniSEED standard with 53-page CBOR RFC in the appendix :D
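Item 4 above can be sketched with a toy comparison: JSON repeats every key string per record, while tag-based framing (the toy encoder below is illustrative, not real Protobuf) spends one byte per field tag:

```python
import json

def tag_encode(fields):
    """Toy tag+length framing (Protobuf-like): 1-byte tag, 1-byte length."""
    out = b""
    for tag, value in fields:
        payload = str(value).encode()
        out += bytes([tag, len(payload)]) + payload
    return out

records = [{"sample_rate": 100.0, "channel": "HHZ"}] * 50
as_json = json.dumps(records).encode()
as_tags = b"".join(tag_encode([(1, r["sample_rate"]), (2, r["channel"])])
                   for r in records)

print(len(as_tags) < len(as_json))   # True -- key repetition dominates JSON size
```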

chad-earthscope commented 7 years ago
  1. I was talking about the whole record. You can do it with extra headers only, not with SimpleHeader.

Not unique to Protobuf.

  2. With Protobuf you can append fields (blockettes) to the record without modifying existing data (and invalidating CRC).

Appending data items is not unique to Protobuf. Whether it invalidates the CRC depends on where the CRC is located and what it is calculated over. If adding information to a record does not "invalidate" the CRC, then it was not a CRC for the whole record, which is a whole other discussion.
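For example, using CRC-32 from Python's stdlib as a stand-in for whatever CRC the record would actually carry:

```python
import zlib

# CRC-32 as a stand-in for the record's CRC; byte strings are illustrative.
record = b"fixed-header" + b"field-A"
crc_whole = zlib.crc32(record)            # CRC over the *whole* record

appended = record + b"field-B"            # append a new field after the fact
print(zlib.crc32(appended) == crc_whole)  # False: a whole-record CRC must change
```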

  3. With Protobuf you can add more fields without using extra headers as a hack (#21 (comment)).

Not unique to Protobuf.

  4. With Protobuf you save space, because you don't repeat the field names over and over again.

Not unique to Protobuf.

  5. Protobuf is so simple you can implement your own parser in a day. You can include the encoding description in miniSEED standard like it is done with Steim encoding.

I agree this would be valuable. I'm not sure it's unique to Protobuf though.

Would be funny to see the new miniSEED standard with 53-page CBOR RFC in the appendix :D

With an RFC it's a stable document and we can just refer to it, I would think. The CBOR RFC is quite comprehensive, with lots of details beyond the encoding (mapping to/from JSON, etc.). In contrast, the Protobuf Internet-Draft (not an RFC) is spartan. Complexity is important, but the page count of those two documents is a poor basis for technical evaluation.

It may not matter, but the internet standards draft for Protobuf expired in 2013; there is no record of review or iteration, and it never made it to RFC.

@andres-h From what I understand from your comments above, your reasoning conflates your preferred information structuring (schema) with your currently preferred encoding (Protobuf) in a way that makes them hard to separate. Of course you can approach this however you'd like, but you should understand that this makes it more difficult for folks to evaluate the relative merits of the different design decisions, e.g. Protobuf.

andres-h commented 7 years ago

It may not matter but the internet standards draft from Protobuf expired in 2013, there is no record of review or iteration, never made it to RFC.

Yes, I know that. Apparently Google was not interested in pushing it. Maybe the encoding is just so trivial that an RFC did not make sense.

From what I understand from your comments above your reasoning conflates your preferred information structuring (schema) with your currently preferred encoding (Protobuf) in a way that makes it hard to separate.

Really? My proposal is the so-called "chunks" format, aka option #3 in the white paper. I've suggested two encodings, one of which is Protobuf. The description of the chunks format is independent of encoding, apart from some remarks. I wrote most of the text before I was an advocate of Protobuf, after all.

Protobuf just fits the chunks format perfectly, but it's true that what I wrote is not unique to Protobuf. It applies to the chunks format regardless of encoding, except that you can use standard toolsets with Protobuf (a record [w/o archive header] is a valid Protobuf message).

Go on with CBOR, I'm not holding you back :)

chad-earthscope commented 7 years ago

It may not matter but the internet standards draft from Protobuf expired in 2013, there is no record of review or iteration, never made it to RFC.

Yes, I know that. Apparently Google was not interested in pushing it. Maybe the encoding is just so trivial that RFC did not make sense.

This is not just for you. It is a public record of a conversation that may be visited by others wanting to understand the evaluation, merits and problems of various approaches. Which is why...

Protobuf just fits with the chunks format perfectly, but it's true that what I wrote is not unique to Protobuf.

clarifying that when you write "Protobuf" you actually mean "Option 3" is important.