iris-edu / mseed3-evaluation

A repository for technical evaluation and implementation of potential next generation miniSEED formats

structure of "other stuff" #1

Open crotwell opened 7 years ago

crotwell commented 7 years ago

Within the old strawman, there was a transition from blockettes to having just the essential stuff in the mseed header, plus an "other stuff" section for things that might be useful but were not absolutely essential. The format was just simple ~ delimited strings, allowing things like name=value~

I feel like there should be a more structured way of doing this. Two ideas: the first is MessagePack. It is binary, pretty simple, and would allow objects/arrays in addition to name-value pairs. There are lots of language bindings, and writing a writer or parser feels easy enough to me. http://msgpack.org/

The second idea, as I mentioned at the meeting, would be to just use JSON. At least you get a lot of structure and self-description, plus separation between numbers and non-numbers. http://json.org/

I prefer MessagePack, though not strongly, given that mseed is binary and so maybe the other stuff should be as well. JSON is pretty nice, and if we are going to have strings anyway, the extra overhead seems small.
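For a rough sense of that overhead, here is a minimal sketch (assuming the third-party msgpack package; the "TQ", "QI" and "Detector" names are just invented examples):

import json
import msgpack  # assumes the third-party msgpack package is installed

# invented example of an "other stuff" block
extra = {"TQ": 100, "QI": "D", "Detector": {"Type": "Murdock", "SNR": 20.5}}

as_json = json.dumps(extra, separators=(",", ":")).encode("utf-8")
as_msgpack = msgpack.packb(extra)

print(len(as_json), len(as_msgpack))         # the binary encoding is somewhat smaller
print(msgpack.unpackb(as_msgpack) == extra)  # and it round-trips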

Thoughts?

andres-h commented 7 years ago

I'd also consider http://ubjson.org/

Maybe JSON, messagepack and/or UBJSON could be additional encoding types for generic records.

For the "other stuff" I'd still prefer something similar to blockettes that are well defined and can be interpreted by other people. Otherwise we need schemas, namespaces, etc.

(I'd like to avoid distinction between "essential" and "other stuff" in the first place.)

A nice feature of UBJSON (and probably messagepack) that could be adopted is variable length encoding of numbers.

One drawback of JSON-type formats is that tags need to be repeated and take a lot of space, while a blockette has just a number. It would be possible to compress JSON with gzip or similar, but that takes additional CPU power.
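As a minimal sketch of that trade-off (the key names are invented): on a single small record the savings from general-purpose compression are modest, while the CPU cost is paid on every record written and read.

import json
import zlib

# invented extra headers; the tag names would be repeated in every record
headers = {"TimingQuality": 100, "QualityIndicator": "D", "DetectorType": "Murdock"}
raw = json.dumps(headers).encode("utf-8")
packed = zlib.compress(raw)

# zlib carries its own overhead, so short inputs compress poorly
print(len(raw), len(packed))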

Regards, Andres.

crotwell commented 7 years ago

I also looked at ubjson, and there are also bson, jsonb and a couple of others. Messagepack felt better to me as a good combination of simplicity, active use and development, lots of language bindings and flexibility, but I am not wedded to it. I just prefer some more structure than simple ~ strings.

There seem to be 3 levels in some sense: no structure, parsing structure without meaning, and parsing with meaning.

~ strings are pretty much no structure; the fixed header and blockettes are parsing with meaning. For essential stuff I feel you might as well lock it down, but I like the idea of the middle road for everything not truly essential, where the specification defines enough structure for basic parsing of the record, but doesn't lock down the meaning of the structure to the point that the really hard FDSN approval process has to be used for any new type of item stored.

krischer commented 7 years ago

I think any of the binary formats would be acceptable; text-based JSON is IMHO too large, too slow to parse, and it also feels "clunky" to add a text-based block into a binary format. Of these, MessagePack is probably the biggest and most used, and I'd choose it based on that virtue alone. I've also used it before and, well, it did what it was supposed to do I guess.

An added advantage of any of these formats that can store nested key-value pairs is that they could be fed almost directly to any of the many object based NoSQL databases (newer PostgreSQL, MongoDB, ...).

Regardless of which format is chosen in the end: It should only be one - there is little advantage for having multiple of these and it will just complicate a lot of things.

I also feel like "parsing structure without meaning" is the only realistic way forward. Any kind of added semantics would require some form of approval process and I honestly can't see that happening in a reasonable time frame. We could force a namespace field for each block of information in "other" though and "parsing with meaning" could then happen over time without requiring changes to the core format.

There is also another "family" of serialization formats out there - things like protobuf (https://developers.google.com/protocol-buffers/), Avro (http://avro.apache.org/docs/current/), and thrift (http://thrift.apache.org/). Of these Avro also always stores the schema with the data which might be worth a look. Something like protobuf might also be an alternative to the fixed header but that is another discussion.

andres-h commented 7 years ago

On 06/22/2017 09:59 AM, Lion Krischer wrote:

I also feel like "parsing structure without meaning" is the only realistic way forward. Any kind of added semantics would require some form of approval process and I honestly can't see that happening in a reasonable time frame. We could force a namespace field for each block of information in "other" though and "parsing with meaning" could then happen over time without requiring changes to the core format.

But this means only I can read the data that I produce, because nobody else knows the semantics...

There is also another "family" of serialization formats out there - things like protobuf (https://developers.google.com/protocol-buffers/), Avro (http://avro.apache.org/docs/current/), and thrift (http://thrift.apache.org/). Of these Avro also always stores the schema with the data which might be worth a look. Something like protobuf might also be an alternative to the fixed header but that is another discussion.

I've used protobuf for data that is written and read by the same program, but I don't know how stable the protobuf format is -- there are already several versions.

The stability is a problem with any external format. For example, there might be a new version of messagepack after a few years and we would be stuck with an obsolete messagepack version.

crotwell commented 7 years ago

One additional advantage of having some type of "structured other" is that it would be easier to define a standard lossless mapping for all of the existing blockettes and existing other items, so that there is a specific mseed2 to mseed3 transition path. It could be done with ~ delimited strings I guess, but it would not be nearly as clean.
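As a rough illustration of such a mapping, two common mseed2 blockettes might translate to something like this (the key names are invented, not part of any draft):

import json

# blockette 100 (sample rate) and blockette 1001 (data extension), mapped
# field-by-field into a structured "other stuff" block
other = {
    "Mseed2": {
        "Blockette100": {"ActualSampleRate": 40.000001},
        "Blockette1001": {"TimingQuality": 100, "Microseconds": 250, "FrameCount": 7},
    }
}
print(json.dumps(other, indent=2))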

chad-earthscope commented 7 years ago

As you'll see in the 20170622 draft I posted in #2, I've left all the extra headers in a tilde-delimited structure. Then there are reserved extra headers that are defined by the FDSN. The vast majority of these carry the information in SEED 2.4 data record blockettes (thus providing forward compatibility), and they are not all that complex. The concept is that a creator can use the reserved headers and conform to their approved meaning, and also include arbitrary headers with very few limitations (tildes are special, and the headers must be UTF-8) as desired. So a very simple structure with agreed semantics for reserved headers, while allowing for nearly-arbitrary additions.
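A minimal sketch of reading and writing that tilde-delimited structure ("QI" and "TQ" follow the draft's reserved headers; "MYNET:Flag" is an invented arbitrary header):

def parse_extra(text):
    # keep the headers as an ordered list of (name, value) pairs,
    # since order can carry grouping information
    headers = []
    for item in text.split("~"):
        if item:
            name, _, value = item.partition("=")
            headers.append((name, value))
    return headers

def serialize_extra(headers):
    # tildes are special and may not appear in names or values
    return "".join("{}={}~".format(name, value) for name, value in headers)

extras = parse_extra("QI=D~TQ=100~MYNET:Flag=1~")
assert serialize_extra(extras) == "QI=D~TQ=100~MYNET:Flag=1~"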

I agree with @andres-h that stability of an external format is something to seriously consider. While I would guess that JSON will be at least readable in 10 or 20 years I'm much less sure about MessagePack, etc. They would be more compelling if they were reference-able internet standards (e.g. RFC). On the other hand, tilde-delimited UTF-8 strings are very, very likely going to be readable by Python 42 or whatever languages are being used in the future, and that has a lot of value.

I also question how much structure is really needed, i.e. do we really need JSON-level structure, at the data record level? For what real case?

One of the main drivers for me has been simplicity, aka make it as easy to parse as possible. I'm leaning away from embedding JSON, further packed or not, unless it's demonstrated to address something real.

andres-h commented 7 years ago

On Friday 2017-06-23 23:06, Chad Trabant wrote:

As you'll see in the 20170622 draft I posted in #2, I've left all the extra headers in a tilde-delimited structure. Then there are reserved extra headers that are defined by the FDSN. The vast majority of these carry the information in SEED 2.4 data record blockettes (thus providing forward compatibility), and they are not all that complex. The concept is that a creator can use the reserved headers and conform to their approved meaning, and also include arbitrary headers with very few limitations (tildes are special, and the headers must be UTF-8) as desired. So a very simple structure with agreed semantics for reserved headers, while allowing for nearly-arbitrary additions.

Arbitrary additions in archived data are garbage after 20 years, though, because nobody will know the semantics anymore.

I agree with @andres-h that stability of an external format is something to seriously consider. While I would guess that JSON will be at least readable in 10 or 20 years I'm much less sure about MessagePack, etc. They would be more compelling if they were reference-able internet standards (e.g. RFC). On the other hand, tilde-delimited UTF-8 strings are very, very likely going to be readable by Python 42 or whatever languages are being used in the future, and that has a lot of value.

I thought tilde-delimited strings became obsolete 30 years ago :)

Seriously, flattening blockettes into keys-values is IMO a regression compared to MS2. Not only do textual keys and ASCII encoding waste a lot of space, I also don't see how you are going to maintain the keys in the forthcoming decades when many more are added and existing ones require revision.

My suggestions:

MS2BLK200 = <data>

WFDATA = <sample rate> <encoding> <length> <data> <number of samples>

This way you can drop non-applicable fields (eg., sample rate, encoding, number of samples) in case of non-waveform records. Moreover, there can be several types of data, for example data that is sampled at varying time intervals, multi-dimensional data, etc. (remember that we want to extend mseed beyond seismology).

Definition of transient data payload sub-header probably belongs to a different standard (real-time transfer protocol).

Regards, Andres.

andres-h commented 7 years ago

OK, suppose I want to have my arbitrary addition. What would be the options with numeric keys and binary values (a.k.a. chunks)?

  1. "OPAQUE" resp. "TILDE_SEPARATED_LIST" type could be reserved for this:

TILDE_SEPARATED_LIST = mykey1=myvalue1~mykey2=myvalue2

  1. Different key ranges could be reserved for different organizations. For example, 0..999999 could be reserved for FDSN, 1000000..1999999 could be reserved for IRIS, 2000000..2999999 could be reserved for GFZ and so on. So I can register my local addition at GFZ.

  2. 2^128..2^129-1 range could be reserved for UUID-based keys, so I can just generate an UUID for my addition and take note myself.

Variable length integers could be encoded the same way as in UTF-8 and Protobuf: the 7 least significant bits are used for data, and the most significant bit tells that more bytes follow. 2000000 takes 21 bits, i.e., only 3 bytes with this encoding. A UUID-based key would take 19 bytes, which is still less than many suggested textual keys.

Regards, Andres.

andres-h commented 7 years ago

MS2BLK200 = <data>

Here's a little Python program to illustrate what I mean above.

I do think most MS2 blockettes should be deprecated in MS3.

#!/usr/bin/env python3

# https://github.com/fmoo/python-varint/blob/master/varint.py

import io

def encode_varint(number):
    buf = b''
    while True:
        towrite = number & 0x7f
        number >>= 7
        if number:
            buf += bytes(((towrite | 0x80),))
        else:
            buf += bytes((towrite,))
            break
    return buf

def decode_varint(stream):
    shift = 0
    result = 0
    while True:
        i = ord(stream.read(1))
        result |= (i & 0x7f) << shift
        shift += 7
        if not (i & 0x80):
            break
    return result

def encode_chunk(key, data):
    return encode_varint(key) + encode_varint(len(data)) + data

def decode_chunk(chunk_data):
    stream = io.BytesIO(chunk_data)
    key = decode_varint(stream)
    data_len = decode_varint(stream)
    data = stream.read(data_len)
    return (key, data)

# key for MS2BLK200 (chunk type)
MS2BLK200_TYPE = 200

# length of blockette 200 (minus type and next blockettes byte number)
# according to SEED 2.4 manual
MS2BLK200_LEN = 4+4+4+1+1+10+24

# data
MS2BLK200_DATA = b"DATA"*int(MS2BLK200_LEN/4)

chunk_data = encode_chunk(MS2BLK200_TYPE, MS2BLK200_DATA)

print("chunk data:", chunk_data)
print("length of blockette in MS2:", 2+2+MS2BLK200_LEN)
print("length of chunk in MS3:", len(chunk_data))

print("decoded chunk:", decode_chunk(chunk_data))

# timing quality
tq_draft20170622 = "TQ=100"

TQ_TYPE = 1
TQ_DATA = bytes((100,))
tq_chunk = encode_chunk(TQ_TYPE, TQ_DATA)

print("length of timing quality in draft20170622:", len(tq_draft20170622))
print("length of timing quality chunk:", len(tq_chunk))
chad-earthscope commented 7 years ago

On 06/22/2017 09:59 AM, Lion Krischer wrote:

I also feel like "parsing structure without meaning" is the only realistic way forward. Any kind of added semantics would require some form of approval process and I honestly can't see that happening in a reasonable time frame. We could force a namespace field for each block of information in "other" though and "parsing with meaning" could then happen over time without requiring changes to the core format.

But this means only I can read the data that I produce, because nobody else knows the semantics...

For your arbitrary headers that is true. But it is not true for your use of reserved, defined headers and the rest of the format. I see user-defined headers as important for two reasons: a) they allow data generators to high-grade the data with important metadata that solves current real-world needs, such as a network operator getting non-standard flags, etc., without being a problem for the format beyond potential bloat. It's important that they can do this without going through a time-consuming FDSN approval process. b) they provide a way for headers to be prototyped and later proposed for standardization as a reserved header.

Arbitrary additions in archived data are garbage after 20 years, though, because nobody will know the semantics anymore.

True. There is some similarity with the minor but non-zero complexity we are adding to support stream-ability: it is useless in an archive, and the cost of it will be paid for many years after the data has been streamed. Alas, we are making a format that addresses multiple goals. Data generators and even users have asked for the ability to add flags, etc. to the data for legitimate operational reasons.

A thought: future data centers may choose to strip the non-reserved headers for older data if they wish. Not something I'd advocate for personally, but I do not see allowing arbitrary headers as a real problem; they can always be removed.

Seriously, flattening blockettes into keys-values is IMO a regression compared to MS2.

We have a fundamental disagreement here. These are the reasons I think binary-encoded blockettes are a mistake. As used by miniSEED 2, blockettes suffer from the lumping problem, where many fields, perhaps even unrelated fields, are lumped together into the same blockette. The very common example of why this is a problem is blockette 1001, but it applies to the other blockettes as well. As an example, if you want to add microsecond resolution, you need to add blockette 1001, so you then are required to specify a timing quality, which you may not know, and now totally bogus values are added with no way to qualify them. A solution may be to break up every field into a blockette, which then invokes more of the overhead of the structuring.

Also consider that data records typically come in small pieces, 512 to 4096 bytes, this is on the order of or smaller than many typical network packet lengths. How much metadata do we really want to pack into each of those records? Anything that is not specific to the record suffers from significant redundancy and probably belongs in higher-level metadata. So not much, IMO. Given that, the mechanism to store headers does not need to be focused on efficiency as the primary goal; instead, simplicity, readability and extensibility can be favored.

Tilde-separated strings are trivial even compared to binary blockette structures, and I contend that that is valuable. Also, even if I do not know the semantics of your use of non-reserved UTF-8 headers, I can at least "see" them, maybe even guess as to their use. Not ideal, but better than binary where all is opaque and lost.

I'm not totally crazy about the tilde-delimited header concept; it certainly has its issues. But when considering what we may want to put into small packets of data, allowing defined and arbitrary headers, extensibility, ease of use, future-proofing, and, to some degree, clarity, UTF-8 strings seem like a decent balance.

Not only waste textual keys and ASCII encoding a lot of space, I don't imagine how you are going to maintain the keys in the forthcoming decades when many more are added and existing ones require revision?

Through the FDSN. Namespaces for non-reserved headers would go a long way to avoid conflicts for any of the text-based suggestions. This is the same issue regardless of the form (UTF-8 strings or binary blockettes or JSON or packed JSON or ...), perhaps I've missed the point.

andres-h commented 7 years ago

On Saturday 2017-07-01 22:59, Chad Trabant wrote:

 Seriously, flattening blockettes into keys-values is IMO a regression compared to MS2.

We have a fundamental disagreement here. These are the reasons I think binary-encoded blockettes are a mistake. As used by miniSEED 2, blockettes suffer from the lumping problem, where many fields, perhaps even unrelated fields, are lumped together into the same blockette. The very common example of why this is a problem is blockette 1001, but it applies to the other blockettes as well. As an example, if you want to add microsecond resolution, you need to add blockette 1001, so you then are required to specify a timing quality, which you may not know, and now totally bogus values are added with no way to qualify them.

That is not a fundamental design flaw of blockettes as such, but a bad implementation of blockette 1001.

Many values do belong together and splitting them up into unrelated extra headers is equally wrong.

What about having multiple blockettes 200 in a record, or 200 (generic detector) and 201 (Murdock detector) at the same time, both of which share the same headers? You can't even unambiguously determine detector type if a single detector is used.

Moreover, a lot of data is just binary, and I'm sure a lot of base64 encoding will be used as a workaround...

A solution may be to break up every field into a blockette, which then invokes more of the overhead of the structuring.

The overhead would be much less than the extra headers that you propose...

For example, blockette 201 (Murdock detector) has 12 fields, so assuming an average field name length of 10 bytes (very optimistic!), the field names alone take 120 bytes. Encoding ints and floats as text takes at least 3 times more space than binary encoding, so the size of the blockette would jump from 60 bytes in MS2 to 300+ bytes in MS3.

Using the chunk encoding I proposed, the size would be 59 bytes (3+60-4) when lumping the fields together into one chunk and 92 bytes (3*12+60-4) when using an individual chunk for each field.
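A quick way to check those figures (a sketch only, assuming a 2-byte key varint and a 1-byte length varint per chunk, as in the earlier Python example):

def varint_size(n):
    # bytes needed with the 7-bits-per-byte scheme described earlier
    size = 1
    while n > 0x7f:
        n >>= 7
        size += 1
    return size

MS2_LEN = 60              # blockette 201 size in mseed2
PAYLOAD = MS2_LEN - 4     # minus blockette type and next-blockette offset

lumped = varint_size(201) + varint_size(PAYLOAD) + PAYLOAD
split = 12 * (2 + 1) + PAYLOAD   # 12 chunks, each with a 2-byte key + 1-byte length

print(lumped, split)      # 59 and 92, matching the figures above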

Also consider that data records typically come in small pieces, 512 to 4096 bytes, this is on the order of or smaller than many typical network packet lengths. How much metadata do we really want to pack into each of those records?

Exactly... I think repeating the 20-character field names over and over again is just stupid. And a textual field name is not much more descriptive than a number, except when using super long field names.

 Not only waste textual keys and ASCII encoding a lot of space, I don't imagine how you are
 going to maintain the keys in the forthcoming decades when many more are added and existing
 ones require revision?

Through the FDSN. Namespaces for non-reserved headers would go a long way to avoid conflicts for any of the text-based suggestions. This is the same issue regardless of the form (UTF-8 strings or binary blockettes or JSON or packet JSON or ...), perhaps I've missed the point.

I mean what if Murdock detectors 2.0 and 3.0 are implemented (maybe not the best example, but you get the idea)? Maybe Quanterra releases a new datalogger and wants to have a new version of their timing quality data. The field names will be a total mess in the end.

chad-earthscope commented 7 years ago

Perhaps we have very different ideas of how many extra headers would be in common use in the posted draft specification. I'm thinking there would not be many in each record. The logic goes that the commonly used blockettes in mseed2 (100, 1000, 1001) are no longer needed, and other blockettes are either unused or only occur at very long intervals (e.g. events and calibrations). Probably the most common extra headers will be "QI=D~" (for data converted from mseed2) and "TQ=###~"; at 5 and 7 bytes, they are not big. A relatively low number of non-complex headers is what led me to focus on simplicity, readability and some level of extensibility over trying to make it as small as possible or structured beyond what is needed. By allowing user-defined headers, we'll certainly end up with more, but I do not see the justification for the apparent fear that there will be explosive growth that will be a "total mess".

That is not a fundamental design flaw of blockette as such, but a bad implementation of blockette 1001.

I'm sure we'll get it right this time! ;)

Many values do belong together and splitting them up into unrelated extra headers is equally wrong.

True, some of them are in groups. Field grouping is the best reason I can think of for a more complex structure like JSON, but grouping can be done with prefixes/namespaces also.

I disagree that it is equally wrong though, plus the extra headers are related by prefixes to some degree. In my opinion all of the mseed2 blockettes suffer from bloat (over-lumping) one way or another. Over-lumping is very tempting when creating fixed-length groups of fields and I'm not optimistic we would get it right if we did it again. The mseed2 blockettes are a mess to be learned from, and improved upon, not adopted.

What seems ideal are groups of fields where some can be left out when they are unknown or inappropriate.

What about having multiple blockettes 200 in a record, or 200 (generic detector) and 201 (Murdock detector) at the same time, both of which share the same headers?

The headers are in sequence; if we specify which event header in a group comes first, they are grouped. Something like JSON would cleanly delineate the groups if we think more definition of the grouping is needed and we are willing to accept JSON.

You can't even unambiguously determine detector type if a single detector is used.

Good catch, we need a detector type field. In fact, this is a good candidate to start a group of related event detection headers.

chad-earthscope commented 7 years ago

Variable length integers could be encoded the same way it is done in UTF-8 and Protobuf: 7 least significant bits are used for data, most significant bit tells that more bytes follow. 2000000 takes 21 bits, eg., only 3 bytes with this encoding. An UUID-based key would take 19 bytes, which is still less than many suggested textual keys.

A custom, non-standard encoding for integers would be a big setback regarding ease of use and portability. This is not the way to save space.

andres-h commented 7 years ago

On 07/03/2017 08:55 AM, Chad Trabant wrote:

Perhaps we have very different ideas of how many extra headers would be in common use in the posted draft specification. I'm thinking there would not be many in each record. The logic goes that the commonly used blockettes in mseed2 (100, 1000, 1001) are no longer needed, and other blockettes are either unused or only occur at very long intervals (e.g. events and calibrations). Probably the most common extra headers will be "QI=D~" (for data converted from mseed2) and "TQ=###~"; at 5 and 7 bytes, they are not big. A relatively low number of non-complex headers is what led me to focus on simplicity, readability and some level of extensibility over trying to make it as small as possible or structured beyond what is needed. By allowing user-defined headers, we'll certainly end up with more, but I do not see the justification for the apparent fear that there will be explosive growth that will be a "total mess".

You don't see the big picture. We want to extend the format to other domains to be able to also archive non-seismological data, plus we of course want to continue supporting detection blockettes and add more stuff.

As I already mentioned, one thing that I very much want to add is an instrument ID in every record, which would allow one to verify the correctness of the response. It happens way too often that StationXML data is incorrect, resulting in incorrect research results.

I'm dreaming of a situation where you download mseed data and the software automatically fetches the correct response based on the instrument ID, like Windows downloads drivers when you plug in a new device. One could use StationXML in addition to double-check.

There are instruments like the Nanometrics Meridian, where you can even download dataless files from the embedded web interface. Probably there will be more such instruments in the future. There could also be a plug-and-play protocol that allows the digitizer to automatically recognize the sensor (we are talking about forthcoming decades). In this case, the digitizer could easily embed the correct instrument ID in data packets.

Similar to USB VID/PID, the instrument ID would consist of:

sensor vendor ID (2 bytes)
sensor product ID (2 bytes)
sensor preset (1 byte)
digitizer vendor ID (2 bytes)
digitizer product ID (2 bytes)
digitizer preset (1 byte)

Those could be grouped as two chunks (sensor and digitizer), taking 14-18 bytes total.

Using extra headers, this would probably be something like:

EIDA:SensorVID=1234~ EIDA:SensorPID=1234~ EIDA:SensorPreset=12~ EIDA:DigitizerVID=1234~ EIDA:DigitizerPID=1234~ EIDA:DigitizerPreset=12~

This takes 131 bytes!
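A small sketch of that comparison (the chunk keys 40 and 41 are placeholders, and single-byte key/length varints are assumed):

import struct

def chunk(key, payload):
    # <key><length><value>, with single-byte varints sufficing here
    return bytes((key, len(payload))) + payload

sensor = struct.pack(">HHB", 1234, 1234, 12)      # vendor ID, product ID, preset
digitizer = struct.pack(">HHB", 1234, 1234, 12)

binary = chunk(40, sensor) + chunk(41, digitizer)
text = ("EIDA:SensorVID=1234~EIDA:SensorPID=1234~EIDA:SensorPreset=12~"
        "EIDA:DigitizerVID=1234~EIDA:DigitizerPID=1234~EIDA:DigitizerPreset=12~")

print(len(binary), len(text))   # 14 bytes vs 131 bytes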

As you see, I prefixed the headers with EIDA, so the fact that IRIS is against any innovation is not an argument here.

I'm just sad that we miss the chance to make mseed a great format and instead fuck it up completely.

Andres.

andres-h commented 7 years ago

On 07/03/2017 09:11 AM, Chad Trabant wrote:

Variable length integers could be encoded the same way it is done in
UTF-8 and Protobuf: 7 least significant bits are used for data, most
significant bit tells that more bytes follow. 2000000 takes 21 bits,
eg., only 3 bytes with this encoding. An UUID-based key would take
19 bytes, which is still less than many suggested textual keys.

A custom, non-standard encoding for integers would be a big setback regarding ease of use and portability. This is not the way to save space.

WTF is "custom, non-standard encoding"? WE create the standard!

You can also say that Steim2 is custom, non-standard integer encoding. And varint is much more widely used than Steim2.

PS. Forget about using a UUID as a key directly -- using keys larger than 2^32 is indeed not a good idea for portability reasons; besides, using a 1-byte ID + UUID takes 17 bytes instead of the 19 bytes suggested above.

chad-earthscope commented 7 years ago

WTF is "custom, non-standard encoding"? WE create the standard!

An alternate way to represent integers, one of the most fundamental data types in computing and nicely portable (besides byte order), needs justification beyond what will likely be a small savings in space.

You can also say that Steim2 is custom, non-standard integer encoding. And varint is much more widely used than Steim2

Beyond seismology, Steim encodings are exactly that; their use is justified by precedent, the significant space they save, and the lack of other options. We have also been asked, and I agree, to consider alternate compressors that are standardized and used beyond seismology. As far as I understand varint usage, it is great for things like databases (where the front-end user never sees it) and transient formats/containers like Protocol Buffers. I haven't seen it much in archive formats, but maybe that's just me.

Similar to USB VID/PID, the instrument ID would consist of:

sensor vendor ID (2 bytes)
sensor product ID (2 bytes)
sensor preset (1 byte)
digitizer vendor ID (2 bytes)
digitizer product ID (2 bytes)
digitizer preset (1 byte)

This is the first time you have written of this idea in that level of detail, that I remember. I like it and agree that we should find an efficient way to be able to add such identifiers.

Another idea: you have put forward many concepts that are very different from what is in the 20170622 draft. At this point I can't tell if you are talking about just the extra headers/other stuff or if you are still talking about "all blockettes" for the entire record. Instead of wedging each of them into the draft concept, perhaps you should create an alternate draft that is more of a complete vision of how you think the whole thing should look, so that it can be evaluated.

andres-h commented 7 years ago

Another idea: you have put forward many concepts that are very different from what is in the 20170622 draft. At this point I can't tell if you are talking about just the extra headers/other stuff or if you are still talking about "all blockettes" for the entire record. Instead of wedging each of them into the draft concept, perhaps you should create an alternate draft that is more of a complete vision of how you think the whole thing should look, so that it can be evaluated.

All my ideas were summarized in my first post after your draft, 9 days ago:

  • Make keys variable length (binary) integers instead of strings. This allows an unlimited number of keys without wasting space. Frequently used (standard) keys should have smaller numbers to save space. Numeric keys can be mapped to symbolic textual names in software -- this way you are free to revise the symbolic names as needed, have compatibility aliases, etc., without affecting archived data.

  • Allow binary values.

  • Allow inclusion of MS2 blockettes without any modification:

MS2BLK200 = <data>

  • Consider using the same mechanism for waveform data, eg.:

WFDATA = <sample rate> <encoding> <length> <data> <number of samples>

This way you can drop non-applicable fields (eg., sample rate, encoding, number of samples) in case of non-waveform records. Moreover, there can be several types of data, for example data that is sampled at varying time intervals, multi-dimensional data, etc. (remember that we want to extend mseed beyond seismology).

I was hoping that we could converge, but you are probably right that I should write my own draft. However, without a mandate from ORFEUS or EIDA I don't see much point in doing that.

I did send some of my ideas to Reinoud, but I have no idea what he is doing with the white paper. I haven't got any notes from the meeting either.

crotwell commented 7 years ago

I am sympathetic to the argument that keeping the miniseed spec small and tightly focused is a good thing, but I also feel that structure in the extras is worth something. In particular, looking at the FDSN extra header keys, the amount of grouping via prefixing seems clunky. For example, compare: TimeException=20170703T123456Z~TimeExceptionCount=4~TimeExceptionType=BadStuffGoingOn with a style with some form of object/containment, say JSON: {"TimeException":{"Time":"20170703T123456Z","Count":4,"Type":"BadStuffGoingOn"}}

I am not so worried about the byte count, but rather that the time, count and type go together. Using containment { ... } achieves this naturally, whereas grouping by prefix seems fragile to me. In other words, it adds the complexity that the keys are no longer just keys: order matters. So imagine code that parses a record; a natural thing to do would be to store all the extra headers in a hashmap, but then order is lost. The prefix containment idea means that extra work is shifted from the data creation end to the data parsing end, and does it in a way that requires knowledge of which prefixes represent grouping and which are independent. If containment is not really needed, then ~ key value is OK, but as soon as you need containment you really need the format to support that.
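A sketch of the difference, using the field names from the example above:

import json

# containment: the grouping survives a plain dict
nested = json.loads(
    '{"TimeException":{"Time":"20170703T123456Z","Count":4,"Type":"BadStuffGoingOn"}}'
)
print(nested["TimeException"]["Count"])

# prefix grouping: nothing in the data itself says these three keys form one
# group, and a hashmap of headers loses the order that implied it
flat = {
    "TimeException": "20170703T123456Z",
    "TimeExceptionCount": 4,
    "TimeExceptionType": "BadStuffGoingOn",
}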

On top of that, any concern that JSON might not be parsable 30 years from now is laughable. Yes, any given json library might cease to exist, and maybe the entire rest of the world gives up on json and moves on to something else, but if we can write code to parse mseed, we can write code to parse json. The entire spec is a few pages, just include it directly as an appendix. Messagepack is slightly more complex, but even it could be added as an appendix if we wanted.

FYI, json is a standard, just via ECMA instead of IEEE, and just 5 pages with big figures: https://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf I think it is kind of cool it is spec number 404.

The binary vs text question is still a good one, and I am not convinced which is better. I like messagepack, but the main advantage of messagepack over json is just that it is binary. Adding structure feels more important to me than the particular representation.

For another use case, a useful feature of the SAC format is the ability to store predicted travel times a la taup_setsac. And storing the phase name, ray parameter and time together is really helpful. So if there was a taup_setmseed, then maybe it would do something like this in json:

{"taup":{"model":"iasp91", "event":{"time":"20170629T123400.78Z","lat":42.1,"lon":-15.2,"depth":103.4}, "station":{"lat":25.8,"lon"-46.1}, "arrivals":[ {"phase":"P","time":"20170629T123456.78Z","rayParam":442.83},{"phase":"S","time":"20170629T123556.78Z","rayParam":656.83},{"phase":"PP","time":"20170629T123586.78Z","rayParam":856.83},{"phase":"SS","time":"20170629T123656.78Z","rayParam":956.83} ]}

Obviously this would not be something used by a datalogger or likely even a data center, but is really useful to an end user and would be horribly complex to do in key=value~ style.

All this said, if the decision to just do key=value~ is the way it goes, we should at least make sure there is nothing that prevents the value from being json!

chad-earthscope commented 7 years ago

Yeah, OK. Given the recent discussion about grouping/containment and avoiding conflicts of potentially arbitrary stuff, I'm much more on board now with the idea that we probably need more structure than the tilde-separated strings (and sequence and prefix) can provide without being clunky and fragile.

Since this seems to be a sticking point, I suggest everyone rank the discussed options for including optional/extra headers thus far in order of preference. I see these as:

  1. Bare JSON

  2. Packed JSON (MessagePack, UBJSON, CBOR, ...)

  3. Tilde-separated strings

  4. Binary structures (blockettes/chunks)

Above is more or less my order of preference at the moment. I am also less concerned about size than getting it right in other ways. I also think we should pick one and not support multiple.

Bare JSON provides the structure we need and allows optional/inappropriate/unknown entries of a group to be left out. Conceptually, JSON seems like the right direction. The cost is that it requires a secondary, non-trivial parser and it's relatively large in volume.

Packed JSON addresses the size of JSON at the cost of adding another non-trivial parser. I'm still on the fence about whether it's wise to use this for an archive format. Beyond a "Last modified" date, the specification is not versioned in the place everyone points to (https://github.com/msgpack/msgpack/blob/master/spec.md). Oddly, there are references elsewhere that the current version is 5? If we went this way, I think we'd need to do wholesale adoption and copy and paste the spec into our own spec as suggested. As has been pointed out, if MessagePack moves on to new versions in the future, our community will need to maintain the older libraries or write new ones as needed.

Tilde-separated is really simple, parsing is trivial. Seems that at least 3 of you think it's too simple, and I'm near convinced too.

Binary structures are small. Downside is that everything is custom, we invent every detail and maintain every parser. Also, the blockettes either suffer from forced-required fields and potential bloat if grouped or we break each field into a separate entry which has the same grouping issues as tilde-separated strings. Small size and total control are tempting, but worth it?

If we go with JSON of some flavor we'd probably want to define the fields in http://json-schema.org/, or are there alternatives?

Anyone willing to try and create such a JSON Schema document that defines mseed3 extra fields?

PS. The comment about not being standard was about MessagePack, not JSON itself. I agree that JSON is a standard and longevity is not a concern.

andres-h commented 7 years ago

Problem with any schema-less formats (JSON, MessagePack, Tilde separated UTF-8) is that field names waste a lot of space due to the record-oriented structure of mseed where fields must be repeated in every record.

It could be better in the case of a format where you, e.g., add the timing quality field only when the timing quality changes.

It could be OK for an exchange format, where you receive the data and throw it away (and the data is zipped during transfer), but not for a storage or archive format, which mseed is (also) supposed to be.

Extensions must be properly documented anyway. Suppose I'm analysing 20-year-old data and find {"TimeException":{"Time":"20170703T123456Z","Count":4,"Type":"BadStuffGoingOn"}}. I could try to guess what it means, but that is not scientific.

Another problem with JSON-type formats is that it is difficult to add things in the processing chain. This is even true with MS2 blockettes, because you have to modify "next blockette's byte number", which means you have to parse all blockettes from the beginning to add a new blockette to the end.

The "chunks" that I suggested do not have that problem.

I'm not very familiar with JSON Schema, but does it allow namespaces or something, such that different manufacturers and organizations can independently add their own parts?

Binary structures are small. Downside is that everything is custom, we invent every detail and maintain every parser. Also, the blockettes either suffer from forced-required fields and potential bloat if grouped or we break each field into a separate entry which has the same grouping issues as tilde-separated strings. Small size and total control are tempting, but worth it?

IMHO, yes.

crotwell commented 7 years ago

Right now I think I agree with your ranking.

The one thing I think is critical is that the parsing of the file not require custom code for user defined storage. I worry the custom binary structures require this.

Just FYI, CBOR is another binary JSON-like format, http://cbor.io/ and it has an RFC (https://tools.ietf.org/html/rfc7049). I am not sure how widespread it is, but it might be more future-proof than messagepack via the RFC, while giving some of the same advantages.
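A quick sketch, assuming the third-party cbor2 package, showing that the common Python binding round-trips the same kind of nested headers (names invented):

import cbor2  # assumes the cbor2 package is installed

extra = {"QI": "D", "TQ": 100, "TimeException": {"Count": 4}}
encoded = cbor2.dumps(extra)
assert cbor2.loads(encoded) == extra
print(len(encoded))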

I can try to have a look at the mseed2 to extra headers json idea later this week I think.

Philip

crotwell commented 7 years ago

Problem with any schema-less formats (JSON, MessagePack, Tilde separated UTF-8) is that field names waste a lot of space due to the record-oriented structure of mseed where fields must be repeated in every record.

This is true, but this is needed if you are going to have user-defined extra headers. And once you have user-defined extra headers, you might as well use the same storage for infrequently used standard stuff. Unless the volume of extra headers becomes large, I think the tradeoff is worth it. In the end, only some real world experimentation can show if this is an issue or not. Probably writing some code to convert a selection of mseed2 data and see what the outcome is would be really useful. I hope to have a go at something like that later this week if I can get some time.

andres-h commented 7 years ago

On 07/03/2017 10:58 PM, Philip Crotwell wrote:

Problem with any schema-less formats (JSON, MessagePack, Tilde
separated UTF-8) is that field names waste a lot of space due to the
record-oriented structure of mseed where fields must be repeated in
every record.

This is true, but this is needed if you are going to have user-defined extra headers.

No. I proposed a chunk type for user-defined extra headers. And there can be chunk types for JSON, MessagePack, everything. It's just an encoding, like Steim2. I would not recommend using such chunks, but it's possible.

And once you have user-defined extra headers, you might as well use the same storage for infrequently used standard stuff. Unless the volume of extra headers becomes large, I think the tradeoff is worth it.

I do expect significant usage of extra headers and I don't like special cases.

If we use JSON, let's use JSON for everything.

If we use binary in fixed header, let's use binary for everything.

If we use blockettes, let's use blockettes also for waveform data.

I really should make my own draft... But then again, what's the point of it all.

In the end, only some real world experimentation can show if this is an issue or not. Probably writing some code to convert a selection of mseed2 data and see what the outcome is would be really useful.

Even if it works for mseed2, that does not mean it will work for all future uses of mseed3.

chad-earthscope commented 7 years ago

On Jul 3, 2017, at 1:47 PM, Philip Crotwell notifications@github.com wrote:

The one thing I think is critical is that the parsing of the file not require custom code for user defined storage. I worry the custom binary structures require this.

Me too. In general I think binary blockettes/chunks are much less amenable for use with non-reserved headers/data.

Just FYI, CBOR is another binary JSON-like format, http://cbor.io/ and it has an RFC (https://tools.ietf.org/html/rfc7049). I am not sure how widespread it is, but it might be more future-proof than messagepack via the RFC, while giving some of the same advantages.

Interesting. Check out the comparison in the RFC with similar schemes (MessagePack, UBJSON and a few others): https://tools.ietf.org/html/rfc7049#appendix-E

I agree, we need to give these things real-world tries for a deeper evaluation.

I can try to have a look at the mseed2 to extra headers json idea later this week I think.

Cool, thanks.

chad-earthscope commented 7 years ago

I really should make my own draft... But then again, what's the point of it all.

It might help others understand your complete vision. Also, I find that when you document something in a near-complete description, even in rough draft, you are pushed to think through the details, and it helps identify problems that are not obvious from a very general concept view.

crotwell commented 7 years ago

@andres-h I don't think I understand what you mean by your proposal. Can you flesh it out a bit or correct my misunderstanding. From what you have said it sounds like you want the format to be an integer key followed by a sequence of bytes with numeric ranges allocated to various organizations.

But if I am parsing the file, how do I break them out? Let's say I find a file with a numeric key I have never seen before and so I want to ignore it. How do I know how many bytes to skip over to get to the next key?

If I know that the key belongs to a range controlled by GFZ and want to parse it, how do I find out how it is encoded or where to find code to parse it? Is it an fdsn approved data structure, or is it totally opaque unless I find a specification or code for that particular data key?

If the binary value can be any encoding, then a single miniseed record that has N extra headers might require N different libraries to parse it? I worry that would make miniseed parsing really bloated on the code side.

andres-h commented 7 years ago

But if I am parsing the file, how do I break them out? Let's say I find a file with a numeric key I have never seen before and so I want to ignore it. How do I know how many bytes to skip over to get to the next key?

The key is followed by the length, both of which are varints. This means the minimum length of a non-empty chunk (such as timing quality in percent) is 3 bytes:

<key><length><value>

I posted an example here, but maybe I wrongly assumed that everyone knows Python.
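To spell out the skipping part, here is a sketch reusing the decode_varint helper from that example (the known keys and the sample bytes are invented):

import io

def decode_varint(stream):
    shift, result = 0, 0
    while True:
        i = ord(stream.read(1))
        result |= (i & 0x7f) << shift
        shift += 7
        if not (i & 0x80):
            break
    return result

def iter_known_chunks(data, known_keys):
    stream = io.BytesIO(data)
    while stream.tell() < len(data):
        key = decode_varint(stream)
        length = decode_varint(stream)
        value = stream.read(length)
        if key in known_keys:
            yield key, value
        # unknown keys are simply skipped; the length field makes that possible

data = bytes((1, 1, 100)) + bytes((5, 2)) + b"??"     # known chunk (key 1) + unknown chunk (key 5)
print(list(iter_known_chunks(data, known_keys={1})))  # [(1, b'd')]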

If I know that the key belongs to a range controlled by GFZ and want to parse it, how do I find out how it is encoded or where to find code to parse it? Is it an fdsn approved data structure, or is it totally opaque unless I find a specification or code for that particular data key?

It is opaque, unless you have the specification.

If your JSON has an arbitrary field and I don't know the semantics, I also cannot use it.

If the binary value can be any encoding, then a single miniseed record that has N extra headers might require N different libraries to parse it? I worry that would make miniseed parsing really bloated on the code side.

You don't need a separate parser for each chunk, just a schema that you could download from our webpage.

Our Seiscomp 3 software has something similar. You can even convert the same binary data to JSON, BSON or XML if you like...

crotwell commented 7 years ago

OK, length helps. I missed that part.

You don't need a separate parser for each chunk, just a schema that you could download from our webpage.

How is the schema specified? Is that part of your proposal?

krischer commented 7 years ago

I think we all agree that we do need some kind of user-defined data. My question is: what is the main purpose of that? I see two possibilities:

  1. Expert user extensibility of the format, e.g. instrument manufacturers, data centers, ...
  2. Adding arbitrary stuff by end-users.

I think only the first one is desirable and feasible. MiniSEED has been (and I guess will be) a very minimal format with data archival and streaming as its main use cases. This directly opposes its use as a self-describing data format with arbitrary meta-information. There are much better (and also standardized) data formats like HDF5 out there that allow for all these things. They do so by being demanding in terms of file size and binary complexity - both of which kind of disqualify them as archiving and streaming formats. So from my point of view one has to pick one of these two and not mingle them.

If we allow end-users to add stuff it will just end in confusing data like the SAC headers which are used in all kinds of feral manners. This IMO is an anti-feature and should be discouraged. Also most user-defined data would not make a lot of sense with the inherently record based structure of MiniSEED.

Assuming we agree that the new MiniSEED only requires "expert-user extensibility", we can enforce more stringent requirements like requiring a namespace, which in turn would define the schema for the additional data, so the concern about "semantic rotting" of the additional data voiced by @andres-h (which I do share) would be lessened quite a bit. Even 20 years down the road there should (at least at one point) have been some definition of the extra data, under the responsibility of the expert users.

Note that none of this is tied to any specific serialization format, as long as some kind of schema language exists (or can be made up) for it.

I do expect significant usage of extra headers and I don't like special cases.

If we use JSON, let's use JSON for everything.

If we use binary in fixed header, let's use binary for everything.

If we use blockettes, let's use blockettes also for waveform data.

I can also share that sentiment to some degree. It would be awkward to have different serialization styles for different parts of the headers (SEG-Y has this and it's one hot mess). So if we can find something that works for both (fixed header + other stuff) that would be great, but I'm not sure this exists.

Did we have any argument against protobuf? @chad-iris You mention it being transient, but Apple for example uses it as the basis for its iWork (Apple word + excel) suite and I'm sure they did not just blindly pick it (they also appear to further compress the protobuf records with Snappy, which might also be interesting for us, but this is another discussion: http://google.github.io/snappy/).

Its binary encoding seems simple enough (https://developers.google.com/protocol-buffers/docs/encoding). Each piece of additional data could be defined in a separate .proto file which would grant some form of binary extensibility.

crotwell commented 7 years ago

Question, would my mythical taup_setmseed example be considered expert or end user? I am not sure I envision generic end users deciding to use this ad hoc, but would not be surprised to see software packages finding it very useful.

My feeling was that the extra stuff should be in a format where at least the structure, if not the meaning, was self-describing. My understanding is that protobuf does not do this. Without the .proto file you have a meaningless sequence of bits. In other words, protobuf is just a more standard way of doing a custom binary format. From that page:

the name and declared type for each field can only be determined on the decoding end by referencing the message type's definition (i.e. the .proto file).

I guess I feel like at least determining the structure, and having the keys included as strings, is a better long-term style than a format that requires external information to even parse it. It is also not clear to me how you would match an individual item in the extra stuff with its .proto metadata without some added information.

BTW, @krischer thanks for splitting out the issues. I was also getting lost in the discussions.

andres-h commented 7 years ago

My feeling was that the extra stuff should be in a format where at least the structure, if not the meaning, was self-describing. My understanding is that protobuf does not do this. Without the .proto file you have a meaningless sequence of bits.

A standard OPAQUE chunk can be defined in Protobuf. Then you can put JSON in it if you want.

Its binary encoding seems simple enough (https://developers.google.com/protocol-buffers/docs/encoding). Each piece of additional data could be defined in a separate .proto file which would grant some form of binary extensibility.

Looks good. I would consider Protobuf (or its subset) as one of the candidate encodings for the chunks.

BTW, @krischer thanks for splitting out the issues. I was also getting lost in the discussions.

:thumbsup:

crotwell commented 7 years ago

Simpler question: if protobuf was used, how would a reader decide which description (.proto) a given extra header goes with?

My understanding of protobuf is that it assumes the reader and writer have agreed out of band on the structure (ie .proto) in advance. There may be a way to use protobuf as the encoding for the extra headers, but there would have to be additional information encoded in a miniseed standard way to allow a reader to decipher the message. The alternative being a single universal .proto file that covers every need, which I think would be way too much centralization.

I think we need a more detailed proposal of how protobuf would work inside mseed before we can evaluate it.

andres-h commented 7 years ago

On 07/05/2017 07:47 PM, Philip Crotwell wrote:

Simpler question: if protobuf was used, how would a reader decide which description (.proto) a given extra header goes with?

Based on (numeric) key. Nothing would change in my Python example, except that Protobuf encoding would be used instead of Python struct (which is basically the same fixed encoding that is used by MS2 blockettes).

I'm not fully sure that Protobuf is the way to go, but it sure is much more flexible than struct...

My understanding of protobuf is that it assumes the reader and writer have agreed out of band on the structure (ie .proto) in advance. There may be a way to use protobuf as the encoding for the extra headers, but there would have to be additional information encoded in a miniseed standard way to allow a reader to decipher the message. The alternative being a single universal .proto file that covers every need, which I think would be way too much centralization.

Yes.

I think we need a more in detailed proposal of how protobuf would work inside mseed before we can evaluate it.

I am talking about using the Protobuf encoding in chunks, not using the complete Protobuf toolchain. There are other ways to describe the schema instead of .proto files if you like.

crotwell commented 7 years ago

@andres-h Can you flesh this out a bit? For example if I needed to add "QI=D" and "mydc_myobj={myword="abc", myvalue=98}", what would the protobuf encoding(s) look like, especially assuming mydc_myobj was not defined until after the mseed spec is approved?

In JSON or one of the binary JSON variants, I think it would look like: { "QI":"D", "mydc_myobj":{ "myword":"abc", "myvalue":98 } }

andres-h commented 7 years ago

@crotwell I am not an expert on the protobuf wire format. I've only used protobuf for a simple thing last year. In fact, my .proto file is right here on GitHub. That project was in Go, so I used the protobuf compiler to generate this Go file. The latter defines a class that I can use to serialize and deserialize my object. It's basically the same in other languages.

I think depending on specific tools like the protobuf compiler would be a bit dangerous. We should make sure that the format is simple enough that we can implement it ourselves.

crotwell commented 7 years ago

@andres-h Not asking for the bytes on the wire, just the big picture of how you propose protobuf be used: one big .proto file or many independent ones; one big protobuf byte array or many independent protobuf byte arrays in order; with or without some required internal or wrapping structure.

I agree too much dependence on external tools is dangerous, which is why I prefer JSON or one of the simpler binary JSON versions, as they are small enough to be an appendix to the spec. Protobuf looks really great for an over-the-wire system, but maybe not so good for an archive file format.

It sounds like you are not that strongly in support of protobuf, so I will let it be. I am just trying to understand any proposed ideas; as they say, "the devil is in the details".

chad-earthscope commented 7 years ago

Moving this discussion to the appropriate thread, from the "repeated keys" issue:

@andres-h Wrote:

On 07/06/2017 06:15 PM, Philip Crotwell wrote:

A repeated key in the object style is a little more complicated, basically uses an array as the value, so something like: { "event": [ { ...event1 stuff...}, { ...event2 stuff...} ] }

One problem with JSON is that it is difficult to add things in the processing chain and to know the length of JSON data. You need an internal representation of the data and each time you add a new blockette:

  1. Add the blockette to the internal representation.

  2. Generate JSON.

  3. Check the size of JSON data.

  4. If the size is too large:

     1. Remove the blockette from the internal representation.

     2. Generate JSON.

     3. Finalize the record.

     4. Initialize a new internal representation and add the blockette.

With chunks or blocks, you just check if size(chunks_so_far) + size(new_chunk) > size_limit
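A minimal sketch of that chunk-side check, with the record size limit as a placeholder:

RECORD_LIMIT = 512  # placeholder record size

def pack_records(chunks, limit=RECORD_LIMIT):
    # append already-encoded chunks, finalizing a record when the next
    # chunk would not fit (oversized single chunks are not handled here)
    records, current = [], b""
    for chunk in chunks:
        if len(current) + len(chunk) > limit:
            records.append(current)
            current = b""
        current += chunk
    if current:
        records.append(current)
    return records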

chad-earthscope commented 7 years ago

One problem with JSON is that it is difficult to add things in the processing chain and to know the length of JSON data. You need an internal representation of the data and each time you add a new blockette:

This is only true in your chunk/blocks model. It is not true for extra headers in some form in the termination block ("footer"), where extending the length of the field containing the extra headers is easy. You would still have to unpack/parse the headers, insert the new one and repack/serialize; not trivial, but also not that bad in my opinion.

andres-h commented 7 years ago

On 07/06/2017 07:31 PM, Philip Crotwell wrote:

I don't understand what you mean by "if the size is too large", there is not a limit unless you overflow the UInt16?

As a datacenter, we will definitely not accept arbitrary-sized records. I am even very much in favor of fixed record size, at least within one file, because it makes random access so much easier.

And we are presuming extras are small, so what is too large?

Even if you presume extras are small, you will eventually reach the limit when adding multiple extras.

And of course even in JSON you know the size of the new item, and the size of the existing items. New size is just existing + new and maybe + 1 for an extra comma.

OK, maybe you can pre-generate the JSON for the new item, but you still have to rewrite the existing JSON data to stick it in the right place.

I just think trying to fit JSON into a record-oriented structure is not clean.

chad-earthscope commented 7 years ago

As a datacenter, we will definitely not accept arbitrary-sized records. I am even very much in favor of fixed record size, at least within one file, because it makes random access so much easier.

Then you should not need to worry about adding any extra headers, because you won't be able to with a fixed size. Unless you guess at future needs and leave padding in the records...

@andres-h

I do expect significant usage of extra headers and I don't like special cases. If we use JSON, let's use JSON for everything. If we use binary in the fixed header, let's use binary for everything. If we use blockettes, let's use blockettes for waveform data as well.

@krischer

I can also share that sentiment to some degree. It would be awkward to have different serialization styles for different parts of the headers (SEG-Y has this and it's one hot mess). So if we can find something that works for both (fixed header + other stuff) that would be great, but I'm not sure this exists.

I feel exactly the same way. The simplicity of a single serialization would be great (maybe CBOR can do it all?), but there may not be one that fits all our needs. In that case we can limit it to two: blockettes + JSON|CBOR|Protobuf.

andres-h commented 7 years ago

On 07/06/2017 10:33 PM, Chad Trabant wrote:

As a datacenter, we will definitely not accept arbitrary-sized records. I am even very much in favor of fixed record size, at least within one file, because it makes random access so much easier.

Then you should not need to worry about adding any extra headers, because you won't be able to with a fixed size. Unless you guess at future needs and leave padding in the records...

Firstly, I hope data blocks will be smaller than the current 448 bytes; maybe 64 bytes (one Steim frame) would be reasonable. Then I would keep adding blocks as long as there is enough free space in the record. Finally I would add some padding.
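A rough sketch of that packing loop, purely illustrative; the record size, header size, and placeholder header bytes are assumptions, not anything from the draft.

```python
# Purely illustrative sketch of the packing strategy described above: append
# data blocks while they fit in a fixed-size record, then pad. All sizes and
# the empty placeholder header are assumptions.
RECORD_SIZE = 512  # hypothetical fixed record size
HEADER_SIZE = 64   # hypothetical fixed header size


def pack_record(blocks):
    """Return (packed_record, leftover_blocks) for one fixed-size record."""
    body = b""
    remaining = list(blocks)
    while remaining and HEADER_SIZE + len(body) + len(remaining[0]) <= RECORD_SIZE:
        body += remaining.pop(0)
    padding = b"\x00" * (RECORD_SIZE - HEADER_SIZE - len(body))
    header = b"\x00" * HEADER_SIZE  # placeholder for the real fixed header
    return header + body + padding, remaining


# Example: 10 blocks of 64 bytes (one Steim frame each) -> 7 fit, 3 left over.
record, leftover = pack_record([b"\x01" * 64] * 10)
print(len(record), len(leftover))  # 512 3
```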

Variable length records would not be the end of the world, but then each MS file would need an index for direct access, especially large files with high sample rate data.

I definitely don't want anyone to get an idea to send us files with one record per day or something (unless it's very low sample rate data).

chad-earthscope commented 7 years ago

Did we have any argument against protobuf? @chad-iris You mention it being transient, but Apple for example uses it as the basis for its iWork (Apple word + excel) suite, and I'm sure they did not just blindly pick it. (They also appear to further compress the protobuf records with Snappy, which might also be interesting for us, but that is another discussion: http://google.github.io/snappy/)

I have no doubt its use is widespread. I don't think I would consider iWork documents ready for a long-term library archive though. In 20 years those will probably be as useful as WordPerfect documents ;)

I didn't think it had an RFC but apparently it does: https://tools.ietf.org/html/draft-rfernando-protocol-buffers-00

I don't know how it would do with all the fields we need to add and (opaque) encoded data blocks, e.g. Steim frames.

If someone wants to try to map out how we would build a Protobuf miniSEED record and figure out whether it can be used like miniSEED, go for it. If we are really willing to consider that, we should probably do the same for CBOR; at first glance it looks simpler.

crotwell commented 7 years ago

My $0.02 from my reading: protobuf is too flexible and includes too much meta-information for the fixed header, yet it is too inflexible and does not include enough meta-information for the extra headers.

From wikipedia:

Canonically, messages are serialized into a binary wire format which is compact, forward- and backward-compatible, but not self-describing (that is, there is no way to tell the names, meaning, or full datatypes of fields without an external specification). There is no defined way to include or refer to such an external specification (schema) within a Protocol Buffers file.

Protobuf is a sequence of effectively <field number><length><data bytes>. This is the minimum information that allows a parser that knows about the .proto file to find and parse the fields, but it also allows a parser with an older version of the .proto to skip over fields that it doesn't know about. This kind of feels like what Andres is advocating for in his chunks or blocks, but at a much finer level: not larger blocks, but each and every int, float, and string.
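To illustrate, here is a hand-rolled sketch (not the protobuf library) of that <field number><length><data bytes> pattern for length-delimited fields; it ignores multi-byte varints, so it only works for small field numbers and lengths.

```python
# Hand-rolled sketch of the protobuf wire pattern described above, limited to
# length-delimited fields and single-byte varints for brevity.
def parse_length_delimited(buf: bytes):
    """Yield (field_number, payload) for length-delimited fields only."""
    i = 0
    while i < len(buf):
        key = buf[i]
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type != 2:  # 2 = length-delimited in the protobuf wire format
            raise ValueError("sketch only handles length-delimited fields")
        length = buf[i + 1]
        payload = buf[i + 2 : i + 2 + length]
        yield field_number, payload
        i += 2 + length


# Field 1 carrying the string "abc":
#   key = (1 << 3) | 2 = 0x0A, length = 0x03, data = b"abc"
print(list(parse_length_delimited(bytes([0x0A, 0x03]) + b"abc")))
# -> [(1, b'abc')]
```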

For our fixed header, the field number and length are just wasted bytes for all the fields that we know will be required. We don't need or want them to be optional or out of order, nor do we expect frequent additions or subtractions to the fixed header.

For the extra headers, protobuf is too restrictive (I think), as it needs all the definitions to live in a single .proto file, and both the writer and reader must be based on the same definition (or maybe different versions of the same definition), since that is how the numeric keys are defined. Moreover, because all of the structure and meta-information, like whether the data bytes are a float, int, or string, lives in the single external .proto definition, individual datacenters or users can't add fields on their own, and even if they could, another reader couldn't even parse the structure of the data bytes. While this makes a lot of sense for an over-the-wire format (protobuf's primary use case), I feel this level of centralization and lack of self-description is probably not what we want for the extra headers.

For the Apple iWork case, their use is very different in that they control both the reader and the writer, so a single master definition of the format is a benefit; but for us I think it is a handicap.

crotwell commented 7 years ago

Extremely simple example of creating CBOR, using JSON as the input. It shows the bytes and the length. Not really that useful in itself, but just letting you know about it as it might be a useful basis for more experimenting with CBOR.

https://github.com/iris-edu/mseed3-evaluation/tree/master/Crotwell/extraHeaders/cbor
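Not the code from that directory, but a comparable sketch in Python, assuming the third-party cbor2 package: read JSON, encode it to CBOR, and print the length and bytes.

```python
# A comparable sketch (not the linked example): read JSON from a file given
# on the command line, encode it as CBOR, and print the length and bytes.
# Assumes the third-party cbor2 package.
import json
import sys

import cbor2

with open(sys.argv[1], "r", encoding="utf-8") as f:
    obj = json.load(f)

encoded = cbor2.dumps(obj)
print(f"{len(encoded)} bytes")
print(encoded.hex())
```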

chad-earthscope commented 6 years ago

Just ran across this.

For CBOR there is a schema definition language on the standards track called CDDL.

Abstract

This document proposes a notational convention to express CBOR data structures (RFC 7049). Its main goal is to provide an easy and unambiguous way to express structures for protocol messages and data formats that use CBOR.

Small example is here: http://cbor.io/tools.html

crotwell commented 6 years ago

Yep, I saw that and even started to translate the json-schema to CDDL, but got distracted. Maybe I will give it another try. We probably should explore using CBOR at least as fully as we explore JSON.

The only qualm I have is that the date on the CDDL Internet-Draft is July 03, 2017, so this is pretty new. No idea what the likelihood is of changes before (or if) it becomes an RFC. My guess is that as long as we only use the basics, it should be fine. And the FDSN process is likely to take longer than the IETF's, so there is plenty of chance to catch any modifications to it.

Is the general consensus that CBOR is the most promising of the binary json formats, just in terms of where to devote our limited time and energy?

chad-earthscope commented 6 years ago

Is the general consensus that CBOR is the most promising of the binary json formats, just in terms of where to devote our limited time and energy?

It is for me, the more I look at it the more I like it. I will be evaluating both a hybrid binary-CBOR and a full CBOR implementation.

@andres-h is evaluating Protobuf, which is good to have experience with.

If we want to go with one of these binary encodings, for extras or the "full record", we'll need to decide between the two. I'm leaning towards CBOR because one of its primary goals is: "CBOR is defined in an Internet Standards Document, RFC 7049. The format has been designed to be stable for decades.", whereas I do not think that was ever a goal for Protobuf. Not that it couldn't be used, I just don't know.

andres-h commented 6 years ago

In which aspects is CDDL better than JSON schema? If CBOR is just a binary representation of JSON, one could use JSON schema for CBOR and vice versa.

The format has been designed to be stable for decades.", whereas I do not think that was ever a goal for Protobuf.

That is a terrible misunderstanding. AFAIK there have never been any changes in the Protobuf encoding since it was released to the public (2008), and it is designed to be fully forward- and backward-compatible.

As I wrote before, the Protobuf encoding should not be mixed up with the Protobuf software. It is like mixing up miniSEED and libmseed (e.g., "miniSEED is not stable, because the API of libmseed may change").

crotwell commented 6 years ago

Did some more reading; CDDL does not appear to be that useful for extra headers. We could define a CDDL "schema" for the FDSN standard parts, but there does not seem to be a way to do the equivalent of "additionalProperties" in json-schema. So there isn't an easy way to have other, as-yet-undefined key-value pairs and still validate with the schema. Or at least I didn't see one.
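For comparison, a minimal json-schema sketch showing the "additionalProperties" behavior being referred to; the schema content is hypothetical, not the FDSN definition, and it assumes the third-party jsonschema package.

```python
# Minimal sketch of json-schema "additionalProperties": keys not defined in
# the schema still validate. The schema content here is hypothetical, not the
# FDSN definition. Assumes the third-party jsonschema package.
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "QI": {"type": "string"},  # a defined key with a fixed type
    },
    "additionalProperties": True,  # undefined keys are allowed (the default)
}

extra_headers = {"QI": "D", "mydc_myobj": {"myword": "abc", "myvalue": 98}}
jsonschema.validate(instance=extra_headers, schema=schema)  # raises if invalid
print("valid")
```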

This may not be a killer issue, as an extra header spec could exist in CBOR without validation via CDDL, but it is not as simple as the json-schema case.

Also, stumbled onto some more information on the MessagePack/CBOR relationship. It looks like CBOR started as a hostile fork of MessagePack, with some bad feelings on both sides and a rush to get RFC status. I don't know that this is a reason to be for or against it, but it's not all warm and fuzzy. I am also rethinking the RFC issue a bit. An RFC is desirable if it reflects a widespread consensus and a likelihood of longevity, but there are plenty of RFCs that were approved and then fell out of use. In other words, the spec has to have merit on its own regardless of whether an RFC exists.

Just some numbers of tagged messages on stackoverflow:

Some links: https://www.ietf.org/mail-archive/web/json/current/msg00195.html

chad-earthscope commented 6 years ago

@andres-h

In which aspects is CDDL better than JSON schema? If CBOR is just a binary representation of JSON, one could use JSON schema for CBOR and vice versa.

I don't know exactly yet. One could guess that if JSON Schema were a perfect fit they would never have made CDDL. CDDL directly targets CBOR and handles its types, whereas converting CBOR to JSON is not a direct mapping and there is only non-normative advice for that operation: https://tools.ietf.org/html/rfc7049#section-4. So I'm speculating that it's a "better fit".

The format has been designed to be stable for decades.", whereas I do not think that was ever a goal for Protobuf.

That is a terrible misunderstanding. AFAIK there have never been any changes in the Protobuf encoding since it was released to the public (2008), and it is designed to be fully forward- and backward-compatible.

Please do not exaggerate what I wrote. I said it was not a goal of that encoding, and I stand by that. I did not say it was not stable or would not be stable.