iris-edu / mseed3-evaluation

A repository for technical evaluation and implementation of potential next generation miniSEED formats

JSON schema for extra headers #19

Open crotwell opened 7 years ago

crotwell commented 7 years ago

Here https://github.com/iris-edu/mseed3-evaluation/commit/058dd8d4d0f58be0e6ff4ea9cf18005abc6ecc31 is a first try at a JSON schema for the extra headers. It is based on Chad's list of things, my grouping of them, and a very simple json-schema rule for adding additional ones. Event detection and time exceptions are arrays to allow multiple entries; the others are singles.

I am not sure if my grouping is correct; it was mostly a matter of taking Chad's list, with each section becoming an object. It would be helpful for some more eyeballs to check this.

There is also a very simple example text file.

Found this site: http://www.jsonschemavalidator.net/ that allows validating a schema against the json-schema schema and validating an instance document against a general json-schema. Seems useful.

I think it may be worth revising some of the keys for brevity and thinking about items that might be required within subobjects. Json-schema allows "required" properties, but I did not set any. I also disallowed adding more fields to the predefined objects. That may be worth thinking about.

I did put in a regex pattern for ISO times, but otherwise there are no restrictions on the strings. Things that were simple keys (with no value) in Chad's list were turned into key-boolean pairs. In some cases, like the signal quality headers, it may be simpler to have an array of strings instead of each key being a boolean. But at least it is a start.
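To make the two points above concrete, here is a minimal Python sketch. The regex is an assumed ISO 8601 pattern, not the one in the committed schema, and the flag names are hypothetical placeholders rather than keys from Chad's list.

```python
import re

# Assumed pattern: extended ISO 8601 date-time in UTC with optional
# fractional seconds. The regex actually used in the schema may differ.
ISO_TIME = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z")


def is_iso_time(value):
    """Return True if value is a string matching the assumed ISO pattern."""
    return isinstance(value, str) and ISO_TIME.fullmatch(value) is not None


def flags_to_list(flags):
    """Convert the key-boolean form into a sorted array of set flag names."""
    return sorted(name for name, is_set in flags.items() if is_set)


def list_to_flags(names, known):
    """Convert an array of flag names back into the key-boolean form."""
    return {name: name in names for name in known}


# Hypothetical signal quality flags in key-boolean form.
quality = {"AmplifierSaturation": True, "DigitizerClipping": False, "Spikes": True}
as_list = flags_to_list(quality)  # the array-of-strings alternative
```

The array form only stores the flags that are set, which is shorter when most flags are false, while the key-boolean form lets a schema name each allowed key explicitly.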

Even if we do not end up using JSON, it is probably a good idea to go through this exercise to refine what things will be there and what limits there will be.

Comments please....

crotwell commented 7 years ago

Added a validate script and a couple of simple examples. Also modified the schema so that all user defined extra headers must start with a lower case letter.

https://github.com/iris-edu/mseed3-evaluation/tree/master/Crotwell/extraHeaders/jsonSchema

chad-earthscope commented 6 years ago

This looks pretty good at first glance. I think this would serve well as a way to validate reserved header information if it's in JSON or something that can be turned into JSON easily.

I do not remember how JSON Schema operates with data that is outside of the defined schema; it might make it troublesome to validate extra headers that have non-reserved entries.

Another thought, maybe "EventInProgress" should be "EventDetection". The former is meant to be able to be applied to any record where an event is occurring, like during a large earthquake the signal can span multiple records. I guess it could be put in as an empty object, but it's a little weird to treat it as a container and a flag?

> Also modified the schema so that all user defined extra headers must start with a lower case letter.

This is very subtle. We may need a more obvious way to delineate FDSN reserved versus entries generated and defined by other groups. Some ideas: a high level object, where all the reserved headers are inside an "FDSN" object and each group that makes their own headers makes their own high level object to contain them? Alternatively, all keys contain a namespace, e.g. "orfeus:wavespeed": 14 or "fdsn:TimeException": { ...}. Maybe reserved keys do not need a namespace? That latter idea gets pretty big but allows non-reserved keys inside reserved object-keys. Not sure JSON schema can be made to enforce those things though.
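As a rough illustration of the namespaced-key idea, the following Python sketch groups keys by a "prefix:" namespace. Treating a key without a prefix as reserved (namespace "fdsn") is an assumption taken from the open question above, and the function names are hypothetical.

```python
def split_namespace(key, default="fdsn"):
    """Split "ns:name" into (ns, name); keys without a prefix get the default."""
    prefix, sep, name = key.partition(":")
    if sep:
        return prefix, name
    # Assumption: an un-prefixed key is treated as an FDSN reserved key.
    return default, key


def group_by_namespace(headers, default="fdsn"):
    """Group a flat header object into one sub-object per namespace."""
    grouped = {}
    for key, value in headers.items():
        namespace, name = split_namespace(key, default)
        grouped.setdefault(namespace, {})[name] = value
    return grouped


headers = {"orfeus:wavespeed": 14, "fdsn:TimeException": {}, "QI": 68}
grouped = group_by_namespace(headers)
```

This only handles top-level keys. Namespaced keys nested inside reserved objects would need the same treatment applied recursively.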

crotwell commented 6 years ago

I intentionally tried to keep to just the existing list. Refining the names is important I think, but wanted to separate that from the schema structure. I think refining names is also easier once you can see an actual document. So +1 on any changes you think.

So for any object in json-schema, you basically have 3 levels

  1. Defined keys, exact match, so like the existing keys pulled from existing key list
  2. patternProperties where you can specify a regex to match and make other restrictions on things that match but are not in 1
  3. additionalProperties, where you say whether you allow things that don't match either 1 or 2. It can be true, meaning anything goes; false, meaning nothing is allowed; or an object where you make certain restrictions.

Then you can also make restrictions on the values of course.
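The three levels can be sketched in plain Python as follows. This is a hand-rolled illustration of how keys are sorted into the levels described above, not a JSON Schema validator, and the schema fragment (the "QI" key and the lower-case pattern) is a hypothetical example rather than the committed schema.

```python
import re

# Hypothetical schema fragment using JSON Schema keywords.
SCHEMA = {
    "properties": {"QI": {"type": "integer"}},  # level 1: exact keys
    "patternProperties": {"^[a-z]": {}},        # level 2: regex-matched keys
    "additionalProperties": False,              # level 3: everything else
}


def classify_key(key, schema):
    """Return the name of the level that a key falls under."""
    if key in schema.get("properties", {}):
        return "properties"
    for pattern in schema.get("patternProperties", {}):
        # JSON Schema patterns are not implicitly anchored, hence search().
        if re.search(pattern, key):
            return "patternProperties"
    return "additionalProperties"


def rejected_keys(doc, schema):
    """List keys that are refused when additionalProperties is false."""
    # When additionalProperties is true, absent, or a sub-schema, this sketch
    # rejects nothing; a sub-schema would need real validation of the values.
    if schema.get("additionalProperties", True) is not False:
        return []
    return [k for k in doc if classify_key(k, schema) == "additionalProperties"]


doc = {"QI": 68, "myHeader": 1, "Unknown": 2}
bad = rejected_keys(doc, SCHEMA)
```

With this fragment, "QI" is a defined key, "myHeader" is accepted through the lower-case pattern, and "Unknown" is rejected because it matches neither.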

I am not sure how one would combine 2 or more schemas, in the sense that the main schema says "additional stuff is ok" and a separate schema actually validates it. May be possible, but I have not read enough to know. You could just say it is unvalidated as far as mseed3 is concerned, and if users want an external validation of their items, this happens as an independent step after extraction. Not sure.

So that is a fair bit of control. Getting it right is a bit tricky, and more thought should go into it.

I also thought about the { "FDSN": { ... }, "other": { ... } } idea. I like it structurally, but worry that the wasted bytes may be an issue. For many records where all you need is the simplest thing, like QI=68, you would end up with { "FDSN": { "QI":68 } }. Maybe I shouldn't be so worried, or if I am worried should generate an equivalent CBOR and see how it compares. Or maybe just use "F" and "O" for the two keys?
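A quick way to put numbers on the wasted-bytes worry is to measure the compact JSON encodings directly. The Python sketch below compares only JSON text sizes; a CBOR comparison would need a third-party encoder and is not attempted here.

```python
import json


def compact_size(doc):
    """Size in bytes of the most compact JSON encoding (no whitespace)."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


flat = {"QI": 68}                    # no wrapper
wrapped_long = {"FDSN": {"QI": 68}}  # "FDSN" wrapper
wrapped_short = {"F": {"QI": 68}}    # single-letter wrapper

sizes = {
    "flat": compact_size(flat),                    # 9 bytes
    "wrapped_long": compact_size(wrapped_long),    # 18 bytes
    "wrapped_short": compact_size(wrapped_short),  # 15 bytes
}
```

For this smallest case the "FDSN" wrapper doubles the size, and shortening it to "F" saves 3 of the 9 added bytes. The overhead is a fixed cost per record, so it matters less as the number of headers grows.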

The uppercase/lowercase is simple and easy to enforce, but you are right it might be kind of subtle. With the regex, I think we could use an easy prefix and make it work, just with the burden of wasted bytes for a repeated prefix. Several options, more exploration worthwhile I am sure.

chad-earthscope commented 6 years ago

> I also thought about the { "FDSN": { ... }, "other": { ... } } idea. I like it structurally, but worry that the wasted bytes may be an issue. For many records where all you need is the simplest thing, like QI=68, you would end up with { "FDSN": { "QI":68 } }. Maybe I shouldn't be so worried, or if I am worried should generate an equivalent CBOR and see how it compares. Or maybe just use "F" and "O" for the two keys?

Just "F" or "R"eserved would be OK with me.

An alternative to "O"ther is to specify that each group that creates their own headers makes their own root object, so for South Carolina you could use "SC" and get a structure like:

{
  "F": { "QI": 68 },
  "SC": { "CrotwellDetection": "20180102T14:56:44.12Z" }
}

It means that there is no cross-over, i.e. you cannot put custom things inside of the "F"DSN area, which is both good and bad. The schema for each root object could be documented separately by the groups that created them.

crotwell commented 6 years ago

Another option would be all fdsn stuff at the top level, with an "Other", or just "O", object to encapsulate non-fdsn items. So something like:

{
  "QI": 68,
  "Other": {
      "SC": { "CrotwellDetection": "20180102T14:56:44.12Z" }
  }
}
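To compare the two layouts discussed in this thread, here is a small Python sketch that reduces each one to the same pair of reserved headers and per-group headers. The function names are hypothetical, the key names "F" and "Other" are taken from the examples in the comments, and the timestamp is written as a quoted string so that it is valid JSON.

```python
def from_root_objects(record, reserved_key="F"):
    """Layout with one root object per group, reserved headers under "F"."""
    reserved = dict(record.get(reserved_key, {}))
    groups = {k: v for k, v in record.items() if k != reserved_key}
    return reserved, groups


def from_top_level(record, other_key="Other"):
    """Layout with reserved headers at the top level, groups under "Other"."""
    reserved = {k: v for k, v in record.items() if k != other_key}
    groups = dict(record.get(other_key, {}))
    return reserved, groups


detection = {"CrotwellDetection": "20180102T14:56:44.12Z"}
layout_a = {"F": {"QI": 68}, "SC": detection}
layout_b = {"QI": 68, "Other": {"SC": detection}}

# Both layouts carry the same information once split.
same = from_root_objects(layout_a) == from_top_level(layout_b)
```

The practical difference is which side pays the wrapper cost: the first layout adds bytes to every record with reserved headers, the second only to records that carry non-FDSN items.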