iris-edu / mseed3-evaluation

A repository for technical evaluation and implementation of potential next generation miniSEED formats

miniSEED3 draft 20170622 specification target #2

Closed chad-earthscope closed 7 years ago

chad-earthscope commented 7 years ago

Attached is a draft of a miniSEED 3 specification that combines and hopefully addresses all of the feedback received from the straw man and the discussions at the "future of miniSEED" meeting in the Netherlands in February 2017.

What is drafted is a complete standard where references to the previous standard are informational only, i.e. it is stand-alone. This encourages consideration of all/most aspects; it should not be treated like a proposal, at least not until we have some agreement. Some of the included documentation, such as SEED identifiers, would necessarily be shared with an FDSN StationXML revision where these changes are incorporated.

The largest changes from the original straw man discussed last year are from the February "future of miniSEED" meeting and include:

a) Replacement of traditional identifiers (network, station, location, channel) with a URN with flexibility in length. The specification contains a definition of how the traditional identifiers could be mapped to/from a URN.

b) Re-arrangement to allow streaming of a record during generation with as much flexibility as possible. In this construction, a record is composed of a header, data payload and footer.

The changes to allow streamability only require that the length of the data payload be defined before it is sent. The entire record length does not need to be known, which allows shipment of the data payload and addition of "extra" headers such as event flags in the footer.
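A minimal sketch of the header / payload / footer idea, not the draft's actual layout: the header here is reduced to a single big-endian UINT16 payload length, the footer gets its own length prefix, and the `Event=1` footer content is a made-up placeholder. All field sizes, the byte order and the function names are assumptions for illustration only.

```python
import io
import struct


def stream_record(out, payload: bytes, make_footer) -> None:
    """Write header, payload, then footer; the footer is built last."""
    # Hypothetical header: only the payload length is known up front.
    out.write(struct.pack(">H", len(payload)))
    out.write(payload)
    # The footer can carry things learned while sending, e.g. event flags.
    footer = make_footer()
    out.write(struct.pack(">H", len(footer)))
    out.write(footer)


def read_record(inp):
    """Read one record written by stream_record()."""
    (payload_len,) = struct.unpack(">H", inp.read(2))
    payload = inp.read(payload_len)
    (footer_len,) = struct.unpack(">H", inp.read(2))
    footer = inp.read(footer_len)
    return payload, footer


buf = io.BytesIO()
stream_record(buf, b"\x01\x02\x03\x04", lambda: b"Event=1")
buf.seek(0)
payload, footer = read_record(buf)
```

The point of the sketch is only that the writer never needs the total record length before it starts sending.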

I also added the definition of a transient sub-header for transmitting the data payload in chunks, allowing multiple channels to be transmitted over the same stream-based communication link (TCP/IP). It would be a non-trivial thing to decode on the receiving end as the payload would have to be searched for the signature of the sub-header. But without something like this I don't know how multiple channels could be transmitted in a multiplexed fashion on the same communication link. The complexity added to the record structure to facilitate streaming only a single channel does not seem worth it as that would be a very limited scenario. Other ideas, of course, are welcome.

Of course, details are all up for discussion. This is an attempt to provide a target. We should try to agree on a target ASAP in order to have some time for implementation and evaluation.

miniSEED3-DRAFT20170622.pdf

krischer commented 7 years ago

Some thoughts/comments on DRAFT20170622.

Notion of “backwards-compatibility” in data formats: This is really tricky as semantic versioning as applied to software cannot be applied to data formats. The only thing that could safely be considered backwards compatible in the sense that old software can read new versions of the format are completely optional additions that do not change the semantics of the other data, i.e. it must be completely safe to ignore. I wonder what that would be in a minimal-by-design data format like the new miniSEED. The conclusion to that would be to get rid of a major/minor version number but just have a monotonically increasing integer version number. Or do I miss something here?

Fixed Header

Section 4: FDSN Identifiers

Section 5: Definition of transient data payload sub-header

Is this really necessary? This should IMHO be delegated to some lower level protocol. TCP can do multiplexing, as can HTTP/2, and who knows what will be out in 10 years.

Section 6: Definition of channel codes

I would like to define the X band code for any synthetic data. Currently X is only defined for the instrument code but this limits it to translational synthetic data. There is currently no option to for example specify synthetic rotational or strain channels. Here are some of my notes of how we currently deal with it in a new code of ours:

- Always use “X” as the band code - it's currently not used and
we just claim it for “synthetics”. The current FDSN conventions
just don’t really work for synthetics so this should be fine.

# Acoustic - final “A” would stand for acoustic

* displacement: XDA + tag “displacement”
* velocity: XVA + tag “velocity”
* acceleration: XAA + tag “acceleration”

# Elastic

## Receivers with rotation matrix:

* displacement: XD[ZNE] + tag “displacement”
* velocity: XV[ZNE] + tag “velocity”
* acceleration: XA[ZNE] + tag “acceleration”
* strain: XS[0-5] in Voigt notation + tag “strain”
* gradient: X[ZZ-EE] + tag “gradient”

## Receivers without rotation matrix:

* displacement: XD[012] + tag “displacement”
* velocity: XV[012] + tag “velocity”
* acceleration: XA[012] + tag “acceleration”
* strain: XS[0-5] in Voigt notation + tag “strain” # Not unique and the
# same as for receivers with rotation matrix - not sure what to do here.
* gradient: X[00-22] + tag “gradient”
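
The naming scheme in these notes can be sketched as a small helper. This is an illustration only, not code from the project mentioned above; the function name and the dictionary are hypothetical.

```python
# Second character of the channel code, per the notes above.
QUANTITY_CODES = {
    "displacement": "D",
    "velocity": "V",
    "acceleration": "A",
    "strain": "S",
}


def synthetic_channel(quantity: str, component: str) -> str:
    """Build a synthetic channel code that always uses band code 'X'."""
    if quantity == "gradient":
        # Gradient uses two component characters, e.g. "ZZ" or "01".
        if len(component) != 2:
            raise ValueError("gradient needs a two-character component")
        return "X" + component
    if len(component) != 1:
        raise ValueError("expected a single-character component")
    return "X" + QUANTITY_CODES[quantity] + component
```

For example, `synthetic_channel("velocity", "Z")` gives `XVZ` and `synthetic_channel("displacement", "A")` gives the acoustic code `XDA`. The strain ambiguity noted above is not resolved by this sketch.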

Section 7: Data encoding codes

Why keep the legacy codes?

Section 8: Definition of reserved, extra header fields

andres-h commented 7 years ago

On Thursday 2017-06-29 23:10, Lion Krischer wrote:

Notion of “backwards-compatibility” in data formats: This is really tricky as semantic versioning as applied to software cannot be applied to data formats. The only thing that could safely be considered backwards compatible in the sense that old software can read new versions of the format are completely optional additions that do not change the semantics of the other data, i.e. it must be completely safe to ignore. I wonder what that would be in a minimal-by-design data format like the new miniSEED. The conclusion to that would be to get rid of a major/minor version number but just have a monotonically increasing integer version number. Or do I miss something here?

It is hard to predict what will be needed in the forthcoming decades. The new SEED format should be used even for the archival of non-seismologic data. Who knows what other communities might need.

Some things are specific to manufacturers. For example, it has been complained that percentual timing quality is useless. Manufacturers could add their own specific timing quality info.

Besides, blockettes are IMO what make SEED SEED. If blockettes are replaced by another extension mechanism, which is even inferior (keys-values), the format should not be called SEED anymore.

If blockettes or similar are used, then a version number would basically not be needed, because new revisions of the standard just add new blockettes. This also follows the principles of object-oriented design (e.g., the open/closed principle).

Why keep the legacy codes?

It would be nice to have a possibility to include MS2 data without modification. We should IMO get rid of the byteorder bit, though, which is ambiguous and has caused so much pain in MS2. Use a fixed byteorder in the header and different encoding types for big-endian and little-endian variant of data encodings.
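
A rough sketch of what "byte order as part of the encoding type" could look like for a reader. The encoding numbers below are placeholders chosen for illustration, not values from the draft, and the function name is hypothetical.

```python
import struct

# Placeholder encoding codes: the byte order is implied by the code itself,
# so no separate byte-order flag is needed.
ENCODING_BYTE_ORDER = {
    3: ">",   # 32-bit integers, big-endian
    43: "<",  # 32-bit integers, little-endian
}


def decode_int32_payload(encoding: int, payload: bytes) -> list:
    """Decode a payload of 32-bit integers according to its encoding code."""
    order = ENCODING_BYTE_ORDER[encoding]
    count = len(payload) // 4
    return list(struct.unpack(f"{order}{count}i", payload))
```

With this arrangement, unmodified little-endian data keeps its byte order and is simply labelled with a different encoding code.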

andres-h commented 7 years ago

Some things are specific to manufacturers. For example, it has been complained that percentual timing quality is useless. Manufacturers could add their own specific timing quality info.

One more thing that would be nice to have is an optional "instrument ID" that unambiguously identifies the response. Could have separate IDs for sensor and datalogger that are manufacturer-specific or refer to NRL.

chad-earthscope commented 7 years ago

Notion of “backwards-compatibility” in data formats: This is really tricky as semantic versioning as applied to software cannot be applied to data formats. The only thing that could safely be considered backwards compatible in the sense that old software can read new versions of the format are completely optional additions that do not change the semantics of the other data, i.e. it must be completely safe to ignore. I wonder what that would be in a minimal-by-design data format like the new miniSEED. The conclusion to that would be to get rid of a major/minor version number but just have a monotonically increasing integer version number. Or do I miss something here?

Your logic is what I tried to capture in the field description, and that means the minor version would only be updated when new reserved extra headers are added or maybe additions of data payload encodings or maybe new namespaces for identifiers. I changed what was a monotonically increasing integer to a major.minor to allow for those cases and what @andres-h said:

It is hard to predict what will be needed in the forthcoming decades.

A future major version may have more reasons for minor versioning.

Of course, this decision also has effects for the software ecosystem supporting the format. Updating major versions will very likely break any software downstream of a producer, which would be a big ripple and probably mean we do not do a major version update often (a good thing from a format perspective). A minor version allows, for example, adding a general compressor encoding in 3.1 while allowing 3.0 readers to continue to read what they are able to read, and provides time for updating downstream software that would not immediately see the new additions anyway.
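
As a sketch of that reader behaviour, assuming a reader written against 3.0: the set of known encodings and the return labels are hypothetical, chosen only to show the three cases.

```python
# Hypothetical set of payload encodings a 3.0 reader understands.
KNOWN_ENCODINGS_3_0 = {3, 10, 11}


def classify_record(major: int, minor: int, encoding: int) -> str:
    """Decide what a reader written for version 3.0 can do with a record."""
    if major != 3:
        # A major bump may change the record layout itself.
        return "reject"
    if encoding not in KNOWN_ENCODINGS_3_0:
        # Layout is still readable, but the payload cannot be decoded.
        return "header-only"
    # Any 3.x record with a known encoding is fully readable.
    return "full"
```

Note that the minor number is not consulted at all: a 3.0 reader treats every 3.x record alike and only backs off when it meets something it does not know.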

It is a concession though, in that it reduces the major versioning from ~253 to 23 versions. I think there are legitimate arguments either way and, while I lean toward the minor version addition at the moment, would go with this group if any consensus emerges.

Maybe add a flag to indicate “Leap second present” for records that do contain a leap second but do not start at it. This would allow operators to figure out (after the fact) if the digitiser correctly recognised a leap second or not.

The extra header "TimeLeapSecond" is exactly what you describe. If you mean a bit flag in the fixed header, then I am strongly against it. That's what we had before and it is a total and utter mess because when it's in the fixed header it's a required bit and in the vast majority of cases it's set incorrectly. I never believe this bit. The solution is to make it something that is pro-actively added by a data generator.

Field 6: It could also recommend to set it to NaN if it is no time series data.

Unless there is a reason for not using 0, I think we should stick with it for two reasons: a) it's what we use now and b) NaN makes usage just a tad bit harder, as programmers would need to specify NaN in whatever language, whereas testing for zero is dead simple in every language.

An aside on where I'm coming from: over the last couple of years spent thinking about the next generation of miniSEED I have come around to a philosophy that it should be as complex as it needs to be and not any more. Put another way, keep every aspect as absolutely simple as possible to achieve the real needs. The motivating factors are usability as broadly across computing environments as possible and ensuring future use. When you see someone implementing a miniSEED parser in JavaScript (haha, that was a funny notion 10 years ago) you see all the barriers, small but present, that get in the way. Probably most would agree with the general statement; I write it so you know where I'm coming from when I say things like stick with zeros instead of NaNs.
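
To make the zero-versus-NaN point concrete, a small sketch, assuming the field holds a floating-point value (the thread does not say what Field 6 contains, and the function names are hypothetical):

```python
import math


def is_unset_zero(value: float) -> bool:
    """Sentinel 0: a plain comparison is enough."""
    return value == 0


def is_unset_nan(value: float) -> bool:
    """Sentinel NaN: needs a dedicated test, because NaN != NaN."""
    return math.isnan(value)


nan = float("nan")
naive_nan_check = (nan == nan)  # always False, a classic pitfall
```

A reader that compares against NaN with `==` silently never matches, which is the kind of small barrier described above.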

Maybe allow underscores and dashes in all the codes? Might be useful for example for somewhat semantic names of sensors arranged in a grid.

Hhmm, interesting. It could make for cluttered looking IDs, but I see the use. Curious what others think, I'll ask around the DMC.

Why are the maximum lengths of all the FDSN identifiers still kind of short? If we allow for fairly long time series identifiers, why not also give more space to the identifiers?

Practical reasons. These identifiers are in each record, lots of redundancy. In our domain, folks are used to "reading" or "saying" the identifiers, where concise is valuable. The current lengths are designed to fit the future needs we can see, 8 character networks can be much more meaningful (can include 4 character years for temporary networks), station code hasn't been a problem, location holds what is needed to identify nodes in arrays of 100,000's of sensors and channel now has room for a lot more instrument identifiers. In a way, it's now double the address space.

Also, in the future we can decide on a new namespace for identifiers, e.g. "FDSN2:", and come up with expanded or completely new schemes. This variable length identifier concept is one of the best changes we've discussed in next generation miniSEED in my opinion. Provides a lot of future proof-ness.

This question needs to be turned around: why should we give the identifiers more space?

Regarding the location identifiers: would it be possible to define some semantics for it? The current state is honestly confusing. But maybe this is not the right place to discuss this.

Yes, and I think we should define at least some convention but maybe even a required use semantics. The specification is probably the right place for this.

Suggestions welcome.

To reduce redundancy the formal “urn:” prefix is not included in the time series identifiers. - Should probably be rephrased to something that states that FDSN: is the default urn, if nothing else is specified. I would actually tend to always force a prefix.

Perhaps there is some confusion? I did not mean the "FDSN:" namespace identifier, I meant the literal "urn:" that is part of a formal URN. Similar to how you would need to add a "doi:" prefix to a value in a field that is already identified as a DOI.
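
A small sketch of the distinction, with hypothetical helper names and a placeholder identifier body (the part after the namespace is treated as opaque here, since its exact syntax is defined in the draft, not in this thread):

```python
def to_formal_urn(identifier: str) -> str:
    """Add the literal 'urn:' prefix that is omitted in the record."""
    if identifier.lower().startswith("urn:"):
        return identifier
    return "urn:" + identifier


def split_namespace(identifier: str):
    """Split 'NAMESPACE:rest' into its namespace and the remainder."""
    namespace, _, rest = identifier.partition(":")
    return namespace, rest
```

So a stored value such as `FDSN:example` keeps its namespace, and only becomes `urn:FDSN:example` when a formal URN is needed.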

Is this really necessary? This should IMHO be delegated to some lower level protocol. TCP can do multiplexing, as can HTTP/2, and who knows what will be out in 10 years.

Yeah, I agree with you and @andres-h, this was a misfire. I was attempting to provide a way to uniformly multiplex record fragments over any general communications link, but I'm happy to relegate it to the transmission protocol.

I would like to define the X band code for any synthetic data.

Sounds like a decent proposal to me. Using an X band eliminates the ability to denote the band, but it's a coarse definition anyway.

Since that is a completely new channel definition I suggest this goes to FDSN WG II as a proposal and not something we conflate with the format specification. There is already a lot of new format layout conflated with new format semantics, I suggest separating what we can.

Why keep the legacy codes?

What @andres-h said, for forward compatibility without re-encoding. At the DMC we have converted almost all of the data in those legacy encodings to Steim# as a step towards getting rid of them, but providing the path forward is still needed.

Perhaps the wording can be stronger, instead of "not recommended", it could be "deprecated, do not use for new data".

The “header” portion of the entry must be treated case insensitively - This only works if it is restricted to ASCII characters - otherwise this is not well defined. Currently only reserved headers are limited to ASCII.

Agreed. I'll remove the bit about treating them case insensitively, such that they need to match exactly.
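
The problem with case-insensitive matching outside ASCII can be shown in a few lines. The strings are arbitrary examples, not header names from the draft, and the helper names are hypothetical.

```python
def equal_lower(a: str, b: str) -> bool:
    """Naive case-insensitive comparison."""
    return a.lower() == b.lower()


def equal_casefold(a: str, b: str) -> bool:
    """Comparison using full Unicode case folding."""
    return a.casefold() == b.casefold()


def equal_exact(a: str, b: str) -> bool:
    """Exact match: the same result in every language and library."""
    return a == b
```

`equal_lower("STRASSE", "straße")` and `equal_casefold("STRASSE", "straße")` disagree, so two implementations could disagree on whether two headers are the same. Exact matching avoids the question.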

Should all of these maybe be name-spaced? E.g. MS24:TQ=

Adds a bit of bloat to them. Perhaps we should require that any non-reserved headers include a namespace? And let the default be FDSN reserved. It would be a step towards keeping potential conflicts at bay. To avoid conflicts completely we'd probably need a registry of namespaces managed by the FDSN. Thoughts?

chad-earthscope commented 7 years ago

We should IMO get rid of the byteorder bit, though, which is ambiguous and has caused so much pain in MS2. Use a fixed byteorder in the header and different encoding types for big-endian and little-endian variant of data encodings.

Good idea @andres-h. Some people may be concerned that the fixed order is not ideal given that architectures (embedded, etc.) vary, but the clarity is probably worth it.

Got a preferred byte order for the values in a fixed order?

andres-h commented 7 years ago

Good idea @andres-h. Some people may be concerned that the fixed order is not ideal given that architectures (embedded, etc.) vary, but the clarity is probably worth it.

Got a preferred byte order for the values in a fixed order?

Normally I would prefer big-endian, because that is the canonical "network byte order", however, standard varint is AFAIK little-endian, so if we use varints, we should consider using little-endian everywhere.
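
For reference, a minimal sketch of the base-128 varint as used by Protocol Buffers, assuming that is what "standard varint" refers to here. The least significant 7-bit group is emitted first, which is why it is described as little-endian.

```python
def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer as a base-128 varint."""
    if value < 0:
        raise ValueError("only unsigned values are supported")
    out = bytearray()
    while True:
        byte = value & 0x7F          # lowest 7 bits first
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a base-128 varint from the start of data."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            return result
    raise ValueError("truncated varint")
```

For example, 300 encodes to the two bytes `AC 02`.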

crotwell commented 7 years ago

Hi all, I was on vacation, so just now weighing in. Philip

Header

I am a little confused about your byte-order discussion? Do you mean that we pick a single byte order for the header and eliminate bit 0 from field 3 (flags)? Or are you talking about separating header byte order from data byte order? I do like the idea of the encoding including the byte order where it might not be the same as the header, so 3 is big endian 32 bit integer and 43 or something else is little endian 32 bit integer. Dealing with the bit flags separately is a pain in the rumpus. And byte order might not make sense for new compression types that might be added later, for example ASCII or an encoding that itself includes byte order information.

If we pick one order, then I tend to like big endian.

Field 7 should probably also say set to 0 if no data payload like Field 6 and 9.

Field 8: Maybe reserve values < 10 for raw data and qc types of things and values >= 10 for user modified data. The dividing line is whether the metadata still applies, so below 10, the response is still the response. But once the version is 10 or above, be careful as the response may have already been applied or the data modified to the extent that it no longer can be. In other words, below 10 users can proceed normally, at 10 or above "here be dragons" and you better know the history. Is 10 large enough?

Field 9: consider UINT32. It is really nice for processing data to be able to store a long continuous time series as a single record like SAC, and 65K is kind of small for that. I have no problem with a recommendation that data loggers only generate small records (~512 or 4096 bytes) or that data centers choose a maximum for acceptance or internal storage. The header allows UINT32 samples but not enough bytes to put them in.
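
A back-of-the-envelope check of how small 65K is, assuming uncompressed 32-bit samples at 100 Hz (both numbers are assumptions for illustration; compressed payloads would hold more):

```python
MAX_UINT16 = 2**16 - 1  # largest payload length a UINT16 field can express
MAX_UINT32 = 2**32 - 1  # the same for a UINT32 field


def max_seconds(max_payload_bytes: int,
                bytes_per_sample: int = 4,
                sample_rate_hz: float = 100.0) -> float:
    """Longest continuous span that fits in one record's payload."""
    samples = max_payload_bytes // bytes_per_sample
    return samples / sample_rate_hz


uint16_span = max_seconds(MAX_UINT16)  # under three minutes
uint32_span = max_seconds(MAX_UINT32)  # on the order of months
```

Under these assumptions a UINT16 length caps a record at 16383 samples, while UINT32 removes the limit for practical purposes.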

Section 6: Definition of channel codes

In the Water Current section, add a sentence to say that water current channels must NOT use SOH or LOG to avoid conflicting with existing soh and log channel names.

For synthetic data, we now have the option of longer codes, so "real" data channel codes should be limited to 3-4 characters, but synthetic or other can be longer, prefixed with X, so XBHZ or XLSN. Then even a new instrument code that was "JK" could be synthetic with BJKN mapping to XBJKN? This kind of matches Lion's idea, except make it explicit that band of X means that it is synthetic and that the rest of the channel code can be interpreted along standard channel naming conventions, or is undefined by the spec? The restriction of short codes maybe makes sense for "real" data, but maybe should be relaxed for synthetic or highly processed data, thinking of miniseed of stacked data for example.

The example on page 15 for 2 char instrument codes sets a bad precedent as it makes it seem as if the 2 char WU in LWUS is a subtype of instruments of type W. But I think at least part of the reason for expanding the code is to allow for completely new types, so WU might be a "foobar meter" and have nothing to do with a "wind speed" instrument. In any case, instrument codes should be "as specified by fdsn" and not user definable? I worry that someone will look at the example and decide to create BH1Z and BH2Z channels because they want to have 2 BHZ channels.

Along those lines, could we just have a "location code" and call all 4 code things "codes" instead of network, station and channel "codes" and location "identifier"?

Section 8 Extra Header

Make any identifier that starts with an upper case ASCII letter be reserved to be defined by FDSN. Anything that starts with a lower case or other UTF character is user-defined? All of your existing words already meet that requirement, and it provides an easy way to separate fdsn from other without prefix-bloat.
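
This rule is cheap to check. A sketch, with a hypothetical function name:

```python
import string


def is_fdsn_reserved(key: str) -> bool:
    """True if the key starts with an upper case ASCII letter."""
    return bool(key) and key[0] in string.ascii_uppercase
```

`TimeLeapSecond` would be reserved under this rule, while `myHeader` or a key starting with a non-ASCII letter such as `Übung` would be user-defined.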

+1 on a standard key-value to identify logger and sensor type and serial numbers. Although getting the sensor correct is unlikely to be automatic as the logger that builds the records may only know that it has voltages on pins and have no idea of the sensor connected to it. You probably want a "model" and a "serial number" as model is often sufficient to get nominal response while serial helps with calibrated response and with inventory control.

Leap Seconds, may not want to limit it to "during" this record. If we accept that leap seconds can only happen at month ends and have always been at the end of June or December, then it may be beneficial to add the "leap second happened" header to records that are near but not overlapping the actual leap second. For example a record starting 20150701T00:00:01 could have the leap second extra header to let the system know that this time included the leap second applied in the previous seconds even though it doesn't actually overlap. Putting the value into records preceding the leap second would also allow systems to warn "hey, leap second coming up" before the record with the leap in it actually happens. I would say the flag should be recommended to be set in records that overlap or are within a small time interval (minutes?) of the actual leap second? The meaning is then "leap second occurs near" instead of "leap second occurs inside" this record?
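
A sketch of the "near" test, with assumptions stated: the 5 minute window is an arbitrary placeholder for "minutes?", the function name is hypothetical, and because Python's `datetime` cannot represent second 60, the leap second is approximated by the first instant after it.

```python
from datetime import datetime, timedelta


def near_leap_second(start: datetime, end: datetime, leap: datetime,
                     window: timedelta = timedelta(minutes=5)) -> bool:
    """True if the record overlaps or lies within `window` of `leap`."""
    return start - window <= leap <= end + window


# The leap second at the end of June 2015, approximated as described above.
leap_2015 = datetime(2015, 7, 1, 0, 0, 0)

just_after = near_leap_second(datetime(2015, 7, 1, 0, 0, 1),
                              datetime(2015, 7, 1, 0, 0, 11), leap_2015)
hour_later = near_leap_second(datetime(2015, 7, 1, 1, 0, 0),
                              datetime(2015, 7, 1, 1, 0, 10), leap_2015)
```

The record starting at 20150701T00:00:01 from the example above is flagged, while a record an hour later is not.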

I have more thoughts on structure in extra headers, but will post to that issue.

andres-h commented 7 years ago

On 07/03/2017 06:06 PM, Philip Crotwell wrote:

I am a little confused about your byte-order discussion? Do you mean that we pick a single byte order for the header and eliminate bit 0 from field 3 (flags)?

Yes.

I do like the idea of the encoding including the byte order where it might not be the same as the header, so 3 is big endian 32 bit integer and 43 or something else is little endian 32 bit integer.

Exactly.

+1 on a standard key-value to identify logger and sensor type and serial numbers. Although getting the sensor correct is unlikely to be automatic as the logger that builds the records may only know that it has voltages on pins and have no idea of the sensor connected to it.

There can be a plug-and-play protocol, like VGA monitors were recognized by the graphics card using I2C.

You probably want a "model" and a "serial number" as model is often sufficient to get nominal response while serial helps with calibrated response and with inventory control.

Indeed, I forgot the serial number.

krischer commented 7 years ago

I opened a bunch of new issues with some of the discussions which makes it easier (at least for me) to follow them all. We can also close them once we've reached consensus.

krischer commented 7 years ago

Field 6: It could also recommend to set it to NaN if it is no time series data.

Unless there is a reason for not using 0, I think we should stick with it for two reasons: a) it's what we use now and b) NaN makes usage just a tad bit harder, as programmers would need to specify NaN in whatever language, whereas testing for zero is dead simple in every language.

An aside on where I'm coming from: over the last couple of years spent thinking about the next generation of miniSEED I have come around to a philosophy that it should be as complex as it needs to be and not any more. Put another way, keep every aspect as absolutely simple as possible to achieve the real needs. The motivating factors are usability as broadly across computing environments as possible and ensuring future use. When you see someone implementing a miniSEED parser in JavaScript (haha, that was a funny notion 10 years ago) you see all the barriers, small but present, that get in the way. Probably most would agree with the general statement; I write it so you know where I'm coming from when I say things like stick with zeros instead of NaNs.

Fair enough and I think I agree.

Along those lines, could we just have a "location code" and call all 4 code things "codes" instead of network, station and channel "codes" and location "identifier"?

:+1:

chad-earthscope commented 7 years ago

Draft 20170708 supersedes this draft.