iris-edu / mseed3-evaluation

A repository for technical evaluation and implementation of potential next generation miniSEED formats

miniSEED3 DRAFT 20170708 specification #21

Open chad-earthscope opened 7 years ago

chad-earthscope commented 7 years ago

In the attached drafts I have tried to incorporate all the items that we have discussed, according to what appeared to be consensus.

The specification of FDSN identifiers from miniSEED 3 has been split off so these can be treated separately and, ultimately, so the documentation of identifiers may be shared with StationXML.

Here are the other changes:

I tried to evenly apply the changes we all agreed on, or at least those where there were two positive votes and no push back. Please speak up if you think I got something wrong. Perhaps I'm in the fog of many hours of refactoring/combining everything, but I'm optimistic that this demonstrates a good amount of convergence.

Things that still need treatment:

miniSEED3-DRAFT20170708.pdf FDSNIdentifiers-DRAFT20170708.pdf

crotwell commented 7 years ago

First read I think this is very good. I feel like we are converging on something very workable and I am happy with the structure.

@chad-iris In your copious free time (ha ha), can you close the issues that you think are "resolved" by this new version? If any of the rest of us objects to the resolution, then we should open a new issue w.r.t. this new document and have it link back to the old issue. Otherwise our issue list just keeps growing without any closure. Seem reasonable?

andres-h commented 7 years ago

Allow dashes in station and location codes

I don't care, but there are colleagues who are religious about this...

All identifier pieces are “codes”

👍

all except channel can be 8 characters.

OK if that is the FDSN naming schema. Multiple identifiers should be allowed, though. Currently the FDSN identifier must be dropped when a different URN is used, which is perhaps not in the interest of the FDSN.

Versioning, a major version with no minor.

OK

Remove telemetry sub-header

👍

Drop legacy encodings, add general compressor and opaque encodings, add note to not use old encoding values.

OK. BTW, is Steim3 used anywhere!?

Fixed byte order in header to little endian, encodings now defined with a byte order.

👍
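
As an aside, a fixed byte order means readers never have to probe or swap. A minimal sketch of what that looks like, with purely hypothetical field names and layout (only the unconditional little-endian unpacking is the point):

```python
import struct

def read_some_header_fields(buf: bytes):
    """Illustrative only: the field names, types and offsets below are
    placeholders, not the draft's layout.  With the header fixed to
    little endian, the '<' format prefix can be used unconditionally,
    on any host CPU, with no byte-order detection step."""
    sample_rate, num_samples, data_length = struct.unpack_from("<dIH", buf, 0)
    return sample_rate, num_samples, data_length
```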

Refactor structure to header, data and termination blocks. The data block can be repeated as much as desired.

This is a restricted special case of my chunks concept. The record now consists of the following "chunks":

HEADER(MS) WFDATA(DA) TERMINATION(TE)

Instead of numeric chunk IDs, two-letter codes and implicit length are used. The blocks again suffer from "over-lumping", e.g., sample rate/period and encoding format are only applicable if data blocks are present, and sample rate/period is not applicable to all encodings.
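
To make the chunk idea concrete, here is a toy sketch of a generic chunk walker. The per-chunk layout shown (a two-byte ASCII type code followed by an explicit 4-byte little-endian length) is hypothetical and is not what the draft does; the draft's blocks rely on implicit lengths per block type.

```python
import struct

def iter_chunks(record: bytes):
    """Toy sketch of the generic chunk concept, not the draft layout.
    Hypothetical wire format per chunk: 2-byte ASCII type code,
    4-byte little-endian payload length, then the payload itself."""
    offset = 0
    while offset < len(record):
        code = record[offset:offset + 2].decode("ascii")
        (length,) = struct.unpack_from("<I", record, offset + 2)
        payload = record[offset + 6:offset + 6 + length]
        yield code, payload
        offset += 6 + length
        if code == "TE":  # a termination chunk closes the record
            break
```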

What to do if microsecond resolution turns out to be insufficient in the future? (Maybe in other domains than seismology?)

At least there is some extensibility -- new block types can be added. Maybe "SE" for sensor ID (VID, PID, serial, preset) and "DL" for datalogger ID (VID, PID, serial, preset)...

I would also add an optional gain or "GA" block, in case there is gain reduction between the sensor and datalogger, such as used with some of our EarthData units...

PS. I haven't been able to read the comments on the white paper yet, because I don't have a Google account. I hope Angelo will send me a current copy of the white paper on Monday.

chad-earthscope commented 7 years ago

OK. BTW, is Steim3 used anywhere!?

Not that I have ever seen. The Q330s apparently use some variation of it in their transfer protocol, maybe internally too.

The blocks again suffer from "over-lumping", e.g., sample rate/period and encoding format are only applicable if data blocks are present, and sample rate/period is not applicable to all encodings.

True. But, the vast majority of records will use those fields for known and anticipated uses. This is about finding the right balance for the number one goal of time series data for FDSN members while allowing as much flexibility for other uses as possible without making it worse for the number one goal. In this case, is having a separate block, with small but non-zero overhead, worth it compared to the few cases where those fields do not apply?

What to do if microsecond resolution turns out to be insufficient in the future? (Maybe in other domains than seismology?)

Change the format and increment the version. Alternatively, add more resolution as an extra header (not my favorite).

At least there is some extensibility -- new block types can be added. Maybe "SE" for sensor ID (VID, PID, serial, preset) and "DL" for datalogger ID (VID, PID, serial, preset)...

I would also add an optional gain or "GA" block, in case there is gain reduction between the sensor and datalogger, such as used with some of our EarthData units...

There is a huge amount of extensibility in the extra headers, and I think those kinds of things belong in the extra headers by default. We'd need a very good reason to create another block type in my opinion.

andres-h commented 7 years ago

On Sunday 2017-07-09 20:22, Chad Trabant wrote:

 The blocks again suffer from "over-lumping", e.g., sample rate/period and encoding format are
 only applicable if data blocks are present, and sample rate/period is not applicable to all
 encodings.

True. But, the vast majority of records will use those fields for known and anticipated uses. This is about finding the right balance for the number one goal of time series data for FDSN members while allowing as much flexibility for other uses as possible without making it worse for the number one goal.

I think chunks add a huge amount of flexibility without making it any worse for the number one goal. I'm even more confident after implementing the format in both Python and Javascript.

 What to do if microsecond resolution turns out to be insufficient in the future? (Maybe in
 other domains than seismology?)

Change the format and increment the version.

Fantastic. Then there will be at least another chance to get things right.

 At least there is some extensibility -- new block types can be added. Maybe "SE" for sensor
 ID (VID, PID, serial, preset) and "DL" for datalogger ID (VID, PID, serial, preset)...

 I would also add an optional gain or "GA" block, in case there is gain reduction between the
 sensor and datalogger, such as used with some of our EarthData units...

There is a huge amount of extensibility in the extra headers, and I think those kinds of things belong in the extra headers by default. We'd need a very good reason to create another block type in my opinion.

I don't see any reason for wasting space with extra headers. I'm sure many users agree, and you can expect lots of different blocks to be used soon. There will just be no controlled way of allocating the IDs.

krischer commented 7 years ago

The draft seems like quite some progress! What exactly is the thinking behind the opaque data encoding? Is this needed for some actual use case?

Also should the text encoding for data encoding 0 be specified? It should probably be ASCII or UTF-8.

kaestli commented 7 years ago

Hi, maybe some topics were discussed elsewhere - it is difficult to get an overview as I joined the discussion late. Here is a selection of comments:

Record header block indicator and version - collapse to one field. Typically different formats have different structures, so the version ends up being stored somewhere else anyway...

FLAGS: drop flags. They have an unclear time reference: they refer either to a point in time or to a time interval, but neither is explicitly expressed. Recoding/realigning information in different records changes the interpretation and leads to information loss. This is a very fundamental flaw in a data format and should not happen in the 21st century... The information contained in the flags needs to be stored in waveform quality metadata; we have better concepts for things like that in WFparam or Mustang.

RECORD START TIME: the precision is not sufficient. We already sample at 50 MHz even in seismology (analysis of rock samples), and even today; that is a 20 ns sample interval, far below microsecond resolution. Not future proof for a generic time series format.

Sampling rate: precision not sufficient.

Data version: different versions of data should have different stream IDs, resolving to different metadata (which tells how the data was treated to make it a different version). Data version as a separate data field without a link to metadata is pointless in data exchange. Furthermore, if not part of a globally unique ID, the same version tags may be used for different things in different places.

Data blocks: I do not see the point of having multiple blocks. With no CRC per block, you have to wait for the termination block anyway to see whether you got everything right. Tentative forward readability is possible even without sub-blocks if the compression format allows it (you know about potential inherent block structures from the encoding field). With one data block per header block, the number of samples goes into the header, the length of the data payload is derived from the overall record length minus header and footer, and no block indicator is required (it is implicit from its position after the header). What remains is pure data ...
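
A minimal sketch of that derivation, assuming a fixed-length footer; the actual header and footer sizes would of course come from the draft's fields, not from here:

```python
def data_payload_length(record_length: int,
                        header_length: int,
                        footer_length: int) -> int:
    """With exactly one data block per record, the payload length does not
    need to be stored: it is whatever remains between the (variable-length)
    header and the fixed-length footer.  All three inputs are placeholders
    for values the draft's header would provide."""
    return record_length - header_length - footer_length
```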

By the way, multiple data blocks per record make reading/searching quite inefficient: you have to walk each block in order to figure out the position of the footer, and to get to the next header.

Footer: drop the identifier (with one data block, the position of a fixed-length footer is known anyway). Drop extra headers. What is currently collected there is seismological legacy, metadata and seismological event data - nothing that belongs in a generic time series format. If somebody wants to have all that information intermingled old-style "somehow" with the time series, he/she is free to use the SEED or SAC format ;-)

On the new FDSN identifiers: I actually introduced the NSLC mapping to an opaque but unique URI as a one-way path to represent legacy identification (SEED 2.4 appendices) while leaving users all the freedom of a URI for future stream identification. It was not meant as an invitation to invent new legacy and have people parsing URIs and looking at the Nth letter after the Mth dot to get some information on the sampling rate (which is in the header anyway) or the instrument type (which is better described in metadata anyway). If you want to stick to forcing people into the NSLC hierarchy and guessing partial metadata from classification letters, one can also go ahead with fixed, but slightly larger, data fields for N, S, L, and C (however, this is not generic for all types of time series and application cases, nor future proof).

crotwell commented 7 years ago

Agree on multiple data blocks.

Disagree on extra headers. I feel a core requirement is that we be able to migrate mseed2 in a way that is lossless. Dropping extra headers makes this impossible.

I am NOT in favor of opaque URI identifiers. Network, station and channel are too fundamental.

chad-earthscope commented 6 years ago

@krischer

The draft seems like quite some progress! What exactly is the thinking behind the opaque data encoding? Is this needed for some actual use case?

To take the role of the mseed2 blockette 2000. I do not know of a specific use case currently, but it is a general way to allow packaging and transport of (presumably time series) data in a payload that is not expected to be generally understood. Of course the risk is accumulating unusable data records. Perhaps there could be a requirement to set an OpaquePayload extra header describing what it is for any records with this encoding.
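
For illustration only, such a requirement could look something like the following; the key names are hypothetical and not defined in the draft, and the extra headers are shown here simply as a Python mapping rather than in whatever serialization the draft specifies:

```python
# Hypothetical extra header accompanying an opaque-encoded payload.
# Neither "OpaquePayload" nor the keys inside it are defined by the draft;
# they only illustrate the kind of self-description that could be required.
extra_headers = {
    "OpaquePayload": {
        "ContentType": "application/x-binex",  # what the payload actually is
        "Producer": "example-datalogger",      # who or what generated it
    }
}
```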

I believe mseed2 blockette 2000 was originally created for packing GPS BINEX data blocks into miniSEED so that it could be transported in a system designed for miniSEED. As far as I know there are none of those around as the miniSEED packaging was subsequently stripped off at the data center. In this scenario, going through an approval process to define a new encoding is not worth it.
I think the hope was that blockette 2000 would be useful beyond this original case. But it's really not, just another failed blockette. I do not think we should promote the idea of opaque data, but I predict there will be other cases such as described. An opaque encoding is a minimal concession to allow such usage, where we might otherwise get bastardized scenarios like usage of not-yet defined encoding values (and potential future conflict).

chad-earthscope commented 6 years ago

@kaestli

Thanks for the comments.

Record header block indicator and version - collapse to one field. Typically different formats have different structures, so the version ends up being stored somewhere else anyway...

Makes it simpler 👍

FLAGS: drop flags. Footer: drop extra headers.

We need a way to map mseed2 data to mseed3. Losing information is a non-starter for many. If you have an alternative way to incorporate mseed2 information that would not immediately be legacy cruft, please describe it. This whole process would be a lot easier if we could dream up the best new format for current and future needs; but that is not the case: we must provide a transition path for a lot of old data.

Extra fields also provide a mechanism for data generators (operators, equipment manufacturers, etc.) to put their own values into the header. This has been requested many times over the years. What you see in the reserved extra headers now is just the mseed2 flags/blockettes, but the real value is a flexible extra header structure for things to come - future-proofing.
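
Purely as a sketch of the kind of thing a data generator could add (the namespacing and key names here are hypothetical, not reserved by the draft):

```python
# Hypothetical namespaced extra headers set by an operator or manufacturer;
# none of these keys are reserved by the draft.
extra_headers = {
    "ExampleVendor": {
        "FirmwareVersion": "1.4.2",
        "InternalTemperatureC": 31.5,
    },
    "ExampleNetworkOperator": {
        "DeploymentNote": "temporary huddle test",
    },
}
```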

Data version: different versions of data should have different stream IDs, resolving to different metadata (which tells how the data was treated to make it a different version).

The version in the URN is an interesting idea. It would need to be optional so that data could be referenced, for example in a request, without a version because the default for nearly every data center is "the latest".
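
Just to illustrate "optional": the '#' separator and the parsing below are invented for this sketch and are not part of the identifier draft.

```python
def split_version(identifier: str):
    """Illustrative only: treat a trailing '#<number>' as an optional data
    version and everything before it as the stream identifier.  The '#'
    syntax is made up for this sketch; the identifier draft defines no such
    separator."""
    base, sep, version = identifier.rpartition("#")
    if sep and version.isdigit():
        return base, int(version)
    return identifier, None  # no version given: "the latest"
```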

Data version as a separate data field without a link to metadata is pointless in data exchange. Furthermore, if not part of a globally unique ID, the same version tags may be used for different things in different places.

If defined as relative to a data center, it has value to know whether an extracted copy is still the current version later in time. It is not nearly as valuable as a link to metadata, but that requires much more change to be realizable, I would think. Could you expand on specifics of what you think would be required to actually have versioned identifiers and metadata?

By the way, multiple data blocks per record make reading/searching quite inefficient: you have to walk each block in order to figure out the position of the footer, and to get to the next header.

Agreed, that's a general problem with an arbitrary blocking-style structure: you have to walk the blocks to find anything not in the first block.

On the new FDSN identifiers: I actually introduced the NSLC mapping to an opaque,

We have a large legacy of identifiers that cannot be ignored; moving to an opaque system would be a big mistake, as many aspects of the current scheme, e.g. easy network identification, are extremely useful.

If you want to stick to forcing people into the NSLC hierarchy and guessing partial metadata from classification letters, one can also go ahead with fixed, but slightly larger, data fields for N, S, L, and C (however, this is not generic for all types of time series and application cases, nor future proof).

The size of each code can be discussed and I'm sure will be. Justification will be needed in any case as there are impacts in the real world for "slightly larger".

The namespace of "FDSN" makes it future proof in the most important way: in the future the FDSN could create a new namespace for a new identifier. If someone wants to use the format for time series not defined by the FDSN they can use another identifier. Can you explain how this is not future proof in a way that matters?