iris-edu / mseed3-evaluation

A repository for technical evaluation and implementation of potential next generation miniSEED formats

Chunks format #14

Open andres-h opened 7 years ago

andres-h commented 7 years ago

It might help others understand your complete vision. Also, I find that when you document something in a near-complete description, even in rough draft, you are pushed to think through the details, and it helps identify problems that are not obvious from a very general concept view.

Not a draft, but I wrote a couple of crude Python scripts. ms2to3 converts an MS2 file to MS3 chunks format. msi3 lists all chunks in a file. Sorry that there are not many comments, but the scripts are very short. The scripts have been tested with Python 3 only.

A few things to note:

andres-h commented 7 years ago

How is the schema specified? Is that part of your proposal?

The schema language is of course independent of the data format. See a simplistic schema in Python here. The final standard should have a language-independent schema language, maybe based on JSON, but it can be very simple. The specification of the schema language could be included in the appendix.
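For readers following along, here is a minimal sketch of what such a Python schema helper might look like. `ChunkType` is a hypothetical reconstruction inferred from the `WFDATA` line quoted in the next comment, not the actual script:

```python
import struct

# Hypothetical reconstruction of the schema helper implied by the WFDATA
# line quoted below; the actual scripts may differ in detail.
class ChunkType:
    def __init__(self, name, chunk_id, fmt, fixed_size, *fields):
        self.name = name              # human-readable chunk name
        self.id = chunk_id            # numeric chunk ID (registry key)
        self.fmt = fmt                # struct format of the fixed fields
        self.fixed_size = fixed_size  # byte size of the fixed fields
        self.fields = fields          # field names; the last is variable-length

    def unpack(self, payload):
        """Decode the fixed fields; remaining bytes are the variable field."""
        values = struct.unpack(self.fmt, payload[:self.fixed_size])
        chunk = dict(zip(self.fields, values))
        chunk[self.fields[-1]] = payload[self.fixed_size:]
        return chunk

# The chunk definition quoted in the next comment, in this style:
WFDATA = ChunkType("WFDATA", 20, "<fBH", 7,
                   "sample_rate", "encoding", "number_of_samples", "data")
```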

chad-earthscope commented 7 years ago

WFDATA = ChunkType("WFDATA", 20, "<fBH", 7, "sample_rate", "encoding", "number_of_samples", "data")

It doesn't seem like repeating `sample_rate` and `encoding` in each waveform chunk is needed. I wouldn't think those should change within a record.

MS2BLK = { 100: MS2BLK100, 200: MS2BLK200, 201: MS2BLK201, 300: MS2BLK300, 310: MS2BLK310, 320: MS2BLK320, 390: MS2BLK390, 395: MS2BLK395, 400: MS2BLK400, 405: MS2BLK405, 500: MS2BLK500 }

The old blockettes all suffer from one problem or another, wasted "reserved" bytes, over-lumping or over-requirements.

If this is just for a demo, fine. But we will miss a huge opportunity to create better extra headers in miniSEED if we just embed them; they should be mapped to something better when converting.

andres-h commented 7 years ago

On 07/06/2017 02:37 AM, Chad Trabant wrote:

WFDATA = ChunkType("WFDATA", 20, "<fBH", 7,
"sample_rate",
"encoding",
"number_of_samples",
"data")

It doesn't seem like repeating `sample_rate` and `encoding` in each waveform chunk is needed. I wouldn't think those should change within a record.

Agreed. I did not want to put `sample_rate` and `encoding` into the fixed header, because these are not applicable to all kinds of data, but there could be a separate block for those.

MS2BLK = {
100: MS2BLK100,
200: MS2BLK200,
201: MS2BLK201,
300: MS2BLK300,
310: MS2BLK310,
320: MS2BLK320,
390: MS2BLK390,
395: MS2BLK395,
400: MS2BLK400,
405: MS2BLK405,
500: MS2BLK500
}

The old blockettes all suffer from one problem or another, wasted "reserved" bytes, over-lumping or over-requirements.

If this is just for a demo, fine. But we will miss a huge opportunity to create better extra headers in miniSEED if we just embed them; they should be mapped to something better when converting.

True, but if you add all this info to MS3, you create lots of extra headers or blocks that will be rarely if ever used.

If you drop something, some guy who does use obscure MS2 blockettes will complain.

Same with MS2 flags. There should be a way to include all of those flags in MS3, even if they will never be used. I don't like discarding any information, especially if you physically convert the archive.

andres-h commented 7 years ago

I think everything is now clear in my head, so I'm ready to write a full draft. If @chad-iris wants to do another draft first, it would be OK with me, though.

chad-earthscope commented 7 years ago

True, but if you add all this info to MS3, you create lots of extra headers or blocks that will be rarely if ever used.

Turns out it's simply not that much; look at any of the posted drafts. So even if they are rarely used, who cares? They are there for conversion of data from mseed2.

If you drop something, some guy who does use obscure MS2 blockettes will complain.

Same with MS2 flags. There should be a way to include all of those flags in MS3, even if they will never be used. I don't like discarding any information, especially if you physically convert the archive.

Nothing was dropped in the 20170622 draft so I'm not sure what you mean here. There were some structural problems; as you pointed out, multiple event detection "occurrences" could not be grouped very nicely, but there was no loss of information from mseed2.

andres-h commented 7 years ago

@crotwell I ported your JavaScript to the chunks format, see this. Looks nice IMO. Saving MS3 data is now possible too (the result can be verified with msi3.py).

crotwell commented 7 years ago

It is a little hard to evaluate as code, would be easier to talk about if we had a document, but this looks like there would be a huge bloat factor due to storing the key and size for everything. This is really flexible, but the flexibility comes at a cost of wasted bytes and parsing complexity. For example when parsing, there is no order enforcement that I can see, so I may have to parse most of the record just to find the start time or identifier? It is much easier in a fixed case to know you can get the identifier at offset 25 bytes.

I think that all of the fixed header as we have defined it is stuff that really always needs to be there, and this idea of making everything key-length-value is too much flexibility and comes at too great a cost. Everything is about trade-offs.

crotwell commented 7 years ago

Also relevant to many chunks, each with its own size: see issue #25. Total record size should be easily calculated without reading/parsing many chunks. Specifically, in this design you have to search for the fdsn.NULL_CHUNK to find the end of the record.

I think it is really important to be able to load an entire record into memory or skip to the next record without having to do many reads to sum up sizes.
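To illustrate the concern, here is a minimal sketch of the skip-to-next-record problem, assuming a hypothetical layout of a 1-byte chunk ID plus a 1-byte length per chunk, terminated by a NULL chunk; the real chunk header layout in the scripts may differ:

```python
NULL_CHUNK = 0  # hypothetical terminator ID, standing in for fdsn.NULL_CHUNK

def record_length(buf, offset=0):
    """Walk the chunk chain until the NULL chunk to find where the record
    ends -- every chunk header along the way must be read and decoded."""
    pos = offset
    while buf[pos] != NULL_CHUNK:
        length = buf[pos + 1]   # assumed 1-byte length field
        pos += 2 + length       # skip header + payload
    return pos + 1 - offset     # the terminator itself has no payload

# With a fixed record length (or a length field up front), skipping a
# record would instead be a single seek.
```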

andres-h commented 7 years ago

It is a little hard to evaluate as code, would be easier to talk about if we had a document

It's option #3 in the white paper...

there would be a huge bloat factor due to storing the key and size for everything.

Much less than with JSON.

I think that all of the fixed header as we have defined it is stuff that really always needs to be there

I don't agree. A minimal fixed header would be OK, but only for stuff that is fundamental.

Also relevant to many chunks, each with its own size: see issue #25. Total record size should be easily calculated without reading/parsing many chunks.

It's not possible with Chad's proposal either.

I think it is really important to be able to load an entire record into memory or skip to the next record without having to do many reads to sum up sizes.

Totally agreed. That's why I prefer fixed length records. See comments here.

andres-h commented 7 years ago

It is a little hard to evaluate as code, would be easier to talk about if we had a document

I started to write better documentation here.

crotwell commented 7 years ago

there would be a huge bloat factor due to storing the key and size for everything.

Much less than with JSON.

Nobody is suggesting storing the header as JSON. Most records would have very minimal extra headers. That is kind of the point: fix things that have to be there, use flexible storage for things that are optional or new, without a format revision.

Have you done a byte size calculation for converting an average mseed2 record to your style?

I think that all of the fixed header as we have defined it is stuff that really always needs to be there

I don't agree. A minimal fixed header would be OK, but only for stuff that is fundamental.

What exactly do you think is not fundamental in the latest fixed header? Maybe "Flags" or "data version". I do not see the point of a miniSEED replacement where the sample rate and encoding format are optional. That would just repeat the blockette 1000 problem.

andres-h commented 7 years ago

Have you done a byte size calculation for converting an average mseed2 record to your style?

Yes, as written in the first post of this thread, the size jumped from 512 to 522 bytes, but that was with sensor and datalogger ID included. Otherwise the size would have been smaller than mseed2.

What exactly do you think is not fundamental in the latest fixed header?

• Flags
• Time -> can be absolute or relative (simulations, synthetic data)
• Time series identifier -> multiple identifiers (FDSN and non-FDSN) should be allowed
• Sample rate/period, encoding -> not applicable to non-waveform data
• etc.

I have a multi-header concept now. It's all documented here.

crotwell commented 7 years ago

Small thing, but a serial number is often not an actual number, very often a mix of letters and numbers.

Keeping a registry for vendor/product codes feels like more work than the FDSN is capable of providing on an ongoing basis. Maybe just strings for all of these would be more usable. USB collects fees from big companies; the FDSN is effectively all volunteer.

I do like the idea of the logger being able to tag data with its serial number and such. Might be good to be as compatible as possible with what is in StationXML, given byte size limitations.

andres-h commented 7 years ago

Thanks for the feedback :)

Small thing, but a serial number is often not an actual number, very often a mix of letters and numbers.

Yes, I was thinking about that, but I'd like to not waste too much space. It is unlikely that more than 65536 items of a single model are produced, so the instrument should somehow be able to put its actual serial number there.

Keeping a registry for vendor/product codes feels like more work than the FDSN is capable of providing on an ongoing basis. Maybe just strings for all of these would be more usable.

I think strings are not usable and can be even dangerous. For example, if you put Guralp something there, there might be multiple versions of Guralp something with different gains and filters.

USB collects fees from big companies; the FDSN is effectively all volunteer.

Yes, it's possible that the idea will never be realized :(

chad-earthscope commented 7 years ago

Small thing, but a serial number is often not an actual number, very often a mix of letters and numbers.

Yes, I was thinking about that, but I'd like to not waste too much space. It is unlikely that more than 65536 items of a single model are produced, so the instrument should somehow be able to put its actual serial number there.

The actual serial number is not always digits.

crotwell commented 7 years ago

USB collects fees from big companies; the FDSN is effectively all volunteer.

Yes, it's possible that the idea will never be realized :(

Kind of the same issue with the chunk ID number registry. I just don't see the FDSN being able to support that long term.

I guess I just don't understand, why design a format that depends on something external that even you don't think is likely to exist?

andres-h commented 7 years ago

Kind of the same issue with the chunk ID number registry. I just don't see the FDSN being able to support that long term.

I guess I just don't understand, why design a format that depends on something external that even you don't think is likely to exist?

What's the problem with the chunk registry? All FDSN chunks are documented in the standard. If there are other chunks, the corresponding organization or manufacturer must make the schema available. It can be a single URL from where you can download the schema.

I guess this should be clarified in the white paper.

crotwell commented 7 years ago

There has to be a registry for the "Allocation of blockette types", right? I can't just choose a random number and write chunks, can I?

If I find a chunk with ID 123456789, how do I find the URL that goes with it?

andres-h commented 7 years ago

There has to be a registry for the "Allocation of blockette types", right? I can't just choose a random number and write chunks, can I?

If I find a chunk with ID 123456789, how do I find the URL that goes with it?

Indeed there has to be a "master table" that links ID ranges to URLs from where you can download the schema. I don't think implementing this would be a problem for IRIS. We at GEOFON DC can do this for free :)

The master table would not change often (only when a new organization or manufacturer is added or a URL is changed), so it would be feasible to ship it with software. An offline copy of the schemata can be included too.
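For illustration only, such a master table could be as small as a mapping from ID ranges to schema URLs. The ranges below echo the allocation suggested earlier in this thread; the URLs are invented placeholders:

```python
# Hypothetical master table: chunk-ID ranges -> schema URLs. The ranges
# echo the allocation proposed earlier; the URLs are placeholders.
MASTER_TABLE = [
    ((100000, 199999), "https://example.org/iris/chunk-schema"),
    ((200000, 299999), "https://example.org/eida/chunk-schema"),
]

def schema_url(chunk_id):
    """Look up which organization's schema covers a given chunk ID."""
    for (low, high), url in MASTER_TABLE:
        if low <= chunk_id <= high:
            return url
    raise KeyError(f"chunk ID {chunk_id} is not allocated")
```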

chad-earthscope commented 7 years ago

I started to write better documentation here.

Thanks, this is much better to review. My comments are in regard to this version of your document: https://github.com/iris-edu/mseed3-evaluation/wiki/Chunks/385b84eda92dd48974ee7c49e8b0b4c81ed0bd37

Section 2:

"This standard documents only archive record header, which is used with MS3 files. Real-time transfer protocols may use a different header."

Section 3: Blockettes

In exceptional cases, new revisions of the standard may append fields to existing blockettes (this was a practice in miniSEED 2.x)...

*Order* For efficiency reasons, essential blockettes (e.g., time series identifier, record start time) should occur near the beginning of a record.

In this case, assuming that only one instance of a blockette per record is allowed, and knowing the record length, it would be possible to skip to next record as soon as all relevant blockettes are found.

This is an excellent example of how "all blockettes" makes doing a common operation like reading through a stream of records and skipping some of them a more difficult task compared to a fixed header of core values. So to skip through records based on identifiers and/or times I have to go about searching through the blockette chain to find the right blockettes for each record. Add in the possibility of multiple identifiers and it gets worse. This is arguably *the* most common operation for a data center and would certainly get more expensive.

100000..199999 reserved for IRIS extensions
200000..299999 reserved for EIDA extensions

Section 4: Definition of standard blockettes:

Sensor (10) Optional sensor identification.
Datalogger (11) Optional datalogger (digitizer) identification.

Gain (12)

Waveform metadata (20)
Sample rate/period FLOAT32

Waveform data (21)
Large waveform data (22)

In general

This structuring pushes more complexity onto the readers. This is very important because the records will be read much, much more often than they are written, modified or streamed in real-time. As pointed out above, even simple operations of reading through files/streams of miniSEED to subset them (something done millions of times per day at our data centers) are more complex, i.e. expensive. Other examples include needing to check for dependee/depender blockette ordering, checking for duplicated records that should not be duplicated, needing to know the structure & schema of extra headers to even print anything about them, needing to do varints, two blockette types for time series data, potentially multiple versions of the same blockette types, potentially multiple blockettes with the same information (if we keep re-defining them to try and get it right) and probably more. If size were the main driver then this structure has an advantage over other options we've discussed. Although even in that case I would look for better waveform compression before making the header even more complex to save bytes.

The specification is also missing quite some detail, in particular there are not enough details to losslessly convert mseed2 to this format. Previously it has been suggested that mseed2 blockettes would be inserted verbatim into mseed3 blockettes. I think this would be a mistake for two main reasons: 1) many mseed2 blockettes are terribly constructed and all suffer from some problems and 2) most but not all of them that exist today are big endian, so we'd have little-endian structuring containing mostly big-endian blockettes (but some little-endian) with no byte order flag to know the difference.

P.S. Minor, but statements like "a sane value" (General structure) expose bias and judgement and do not belong in format specifications. Other values are insane? In the text that you copied I had written actual reasons for recommending records less than ~4096 bytes.

crotwell commented 6 years ago

Indeed there has to be a "master table" that links ID ranges to URLs from where you can download the schema.

I find this troubling. The chunks are opaque unless you have the schema, which might be maintained outside of the FDSN. And so if a company goes out of business and the link dies, the information could easily become unparsable, unreadable and essentially garbage? Do we really want to create a data format that encourages data to be stored in a way that might not even be able to be parsed decades from now? This is just asking for bit rot!

andres-h commented 6 years ago

Many thanks for the exhaustive review. I appreciate this.

Everything becomes a blockette. While this seems simple and would be quite flexible, it invokes a number of subtle problems. For any blockettes that include more than one field there is a very real problem of over-grouping, as demonstrated in mseed2. Such grouping is totally rigid, making fields required that should not necessarily be if the blockette is desired for any one of the fields. An alternative approach is to make each field a blockette, which incurs more overhead of the structuring (possibly minor), and really just becomes a smaller, binary version of the tilde-separated string headers with the problems we've identified with that.

I don't really see a problem. Grouping would be used where it is natural; other blockettes would have single values (only 2 bytes overhead per blockette).

A possible solution is to create alternatives of different blockettes that group fields differently; that could be really messy indeed.

Another alternative would be using an encoding where fields can be unset (e.g., Protobuf).

Binary blockettes for extra, non-FDSN headers provide complete flexibility, possibly too much because the content is absolutely opaque without a definition of the structure and the meaning. The other structuring options for extra headers we've discussed at least give the reader basic data types; this would allow, for example, printing of the extra headers without knowing anything about the contents. Any of the approaches would need some sort of schema definition for full use of any extra headers, but wide-open binary blockettes force small parsing engines for each different header.

You mean type 127? I wouldn't recommend using that, but it could be used for things like SeedLink INFO packets (XML). I guess a future SeedLink would use 126 and JSON, though.

"This standard documents only archive record header, which is used
with MS3
files. Real-time transfer protocols may use a different header."

• Can you expand on what advantage there would be to use a different header? Also, do you mean "real-time" transfer could potentially use an alternate header? Or a header in addition to the standard header? If the former, should there be a requirement that a standard header be added at some later time? If the latter then maybe this doesn't belong in the spec at all.

I think including the archive header in real-time transfer is pointless and problematic, because the record length is not known in advance. Moreover, a real-time header needs other fields to correctly assemble the records on the receiving side. I think a future SeedLink packet would look like this:

<SeedLink header (incl. stream ID)>...

The packet would not contain a whole record, just one or more blockettes.

I will clarify this in the next revision.

I do not think padding should be allowed. It's a waste and is only used as a kludge to address other problems as far as I can tell. If you want padding in a file for some kind of storage/access pattern, add it between records; it should not be allowed within, as that forces everyone handling that record to pay the penalty.

OK, I think I can agree with that.

varints save a few bytes but add a bit of complexity. Their value increases if there are *lots* of blockettes or blockette IDs are really huge numbers. I think we want a format that is as easy to use as possible and this trade-off is not worth it.

In the allocation that I suggested, no ID would take more than 3 bytes and I think IDs that are larger than 1 byte would be rare.
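For reference, here is a sketch of the base-128 varint scheme (the same one protobuf uses), which keeps IDs up to 127 at one byte, up to 16383 at two bytes, and up to 2097151 at three bytes:

```python
def encode_varint(n):
    """Base-128 varint: 7 bits per byte, high bit set on all but the last."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def decode_varint(buf, pos=0):
    """Return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

assert len(encode_varint(99999)) == 3        # FDSN range fits in 3 bytes
assert len(encode_varint(299999)) == 3       # EIDA range fits in 3 bytes
assert decode_varint(encode_varint(20)) == (20, 1)  # common IDs: 1 byte
```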

  Section 3: Blockettes

In exceptional cases, new revisions of the standard may append fields to existing blockettes (this was a practice in miniSEED 2.x)...

  • Versions of the same blockettes would be terrible, suggest dropping that entirely. This is one of the lessons of mseed2 we do not want to re-learn.

Like I said, "in exceptional cases". I think there should be a way to amend blockettes if we forget something. Like "git commit --amend". It is discouraged, but can be used if needed.

The problem with mseed2 was that blockette lengths were not defined. You could only guess the length from "next blockette's byte number" or "beginning of data", both of which can be unset.

*Order* For efficiency reasons, essential blockettes (e.g., time series identifier, record start time) should occur near the beginning of a record.

  • I think you meant usability or something other than "efficiency", or I don't understand which characteristic is more/less efficient depending on blockette order. Definitely not size.

    In this case, assuming that only one instance of a blockette per record is allowed, and knowing the record length, it would be possible to skip to next record as soon as all relevant blockettes are found.

This is an excellent example of how "all blockettes" makes doing a common operation like reading through a stream of records and skipping some of them a more difficult task compared to a fixed header of core values. So to skip through records based on identifiers and/or times I have to go about searching through the blockette chain to find the right blockettes for each record. Add in the possibility of multiple identifiers and it gets worse. This is arguably *the* most common operation for a data center and would certainly get more expensive.

This is exactly what I mean by efficiency. The core values would be at the beginning of the record and only one instance would be allowed, so you don't have to search through the blockette chain.

100000..199999 reserved for IRIS extensions
200000..299999 reserved for EIDA extensions
  • I do not think that is a good idea. We don't even know if those organizations will be around for the expected lifetime of the format and what about other groups?

How many big datacentres/federations are there? I think there are enough IDs for everyone.

I guess datacentres may want to define their quality control blockettes or things like that, but the ID ranges could be allocated on demand. Probably many are not even interested.

  Section 4: Definition of standard blockettes:
  • Multiple time series identifiers? I see downstream problems identifying the data, e.g. at data centers are we expected to track unlimited aliases *per-record* that may vary over time and allow requests for any of the aliases? Without a preferred/primary ID, are systems that report on data supposed to provide all IDs? The first in the record could be identified as the preferred/primary. I understand the desire, but I'm not sure we can justify putting it in every record. I think aliases for time series identifiers fit much better in external metadata.

There would be one FDSN identifier allowed and FDSN web services like dataselect would use only that.

Alternative identifiers could be used by groups like ETH who need opaque URI identifiers, for example.

Sensor (10) Optional sensor identification.
Datalogger (11) Optional datalogger (digitizer) identification.
  • Both of these are over-grouped. Those fields will not all be appropriate for every case where some of this information is known and would otherwise be useful.

You know vendor ID, but not product ID? I don't think this would be useful.

Or you know vendor ID, product ID, but have no idea which filter and gain settings are in effect? IMO not useful either.

The serial number can be set to 0 if not known. I don't think it would be that bad.

We also need a channel ID, though. See below.

Also does not handle serial numbers that are not all digits or prefixed with zeros or have dashes, etc. etc.

I think there are two options:

  1. Serial number would be a variable length string.

  2. (preferred) Manufacturers provide numeric serial number in addition to the fancy one.

Gain (12)
  • I like this in concept, could be used in scenarios where instrument gain changes dynamically (not sure how common that is) but there are some problems. It's probably worth clarifying whether this is the total sensitivity (that's how SEED refers to total system gain), and if it is, maybe units could be reported also? It may be challenging to define the standard for each combination. Also, I do not understand how "The value 1.0 corresponds to standard gain" would be used; why not just put in the gain so it's directly usable?

It's not the total sensitivity.

The problem with total sensitivity is that it is only valid for a given frequency, and in the case of a polynomial response there is no gain at all (you need to specify the polynomial).

E.g. (EarthData digitizer): temperature = counts / 10.0 - 50.0

Instead of specifying all that (and units), I'd prefer to refer to the devices.
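A sketch of the point being made: a single gain value cannot express a response like the EarthData example above, but a short coefficient list can. The helper below is purely illustrative:

```python
def counts_to_units(counts, coefficients):
    """Evaluate a polynomial response: sum of c_i * counts**i."""
    return sum(c * counts ** i for i, c in enumerate(coefficients))

# temperature = counts / 10.0 - 50.0 is the polynomial [-50.0, 0.1]:
temperature = counts_to_units(600, [-50.0, 0.1])
assert abs(temperature - 10.0) < 1e-9
```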

In fact, we must add channel ID (Z, N, E, voltage, temperature, etc.).

Waveform metadata (20)
Sample rate/period FLOAT32
  • Might need to be FLOAT64 if we are pushing the time resolution up.

Agreed.

The specification is also missing quite some detail, in particular there are not enough details to losslessly convert mseed2 to this format. Previously it has been suggested that mseed2 blockettes would be inserted verbatim into mseed3 blockettes. I think this would be a mistake for two main reasons: 1) many mseed2 blockettes are terribly constructed and all suffer from some problems and 2) most but not all of them that exist today are big endian, so we'd have little-endian structuring containing mostly big-endian blockettes (but some little-endian) with no byte order flag to know the difference.

The specification is a work in progress and is missing many details. Regarding mseed2 blockettes, I agree that creating mseed3 versions of them instead of copying the data verbatim would make sense. At least the blockettes would have to be converted to little-endian.

P.S. Minor, but statements like "a sane value" (General structure) expose bias and judgement and do not belong in format specifications. Other values are insane? In the text that you copied I had written actual reasons for recommending records less than ~4096 bytes.

Of course. My English is not perfect and the wording can be improved a lot. Probably I copied the text from the 20170622 draft (I had both drafts open on my screen), which only said that "typical record lengths are between 256 and 4096 bytes".

andres-h commented 6 years ago

On 07/13/2017 05:00 PM, Andres Heinloo wrote:

Another alternative would be using an encoding where fields can be unset (e.g., Protobuf).

In fact, I think we should give Protobuf another look, because it is quite similar to chunks/blockettes, but more flexible.

• Each field has an ID (varint) -> similar to chunks/blockettes.
• A field can be present or not -> similar to chunks/blockettes.
• A bunch of fields can be concatenated -> similar to chunks/blockettes.
• A field can be a single value or an embedded message (e.g., a chunk/blockette).
• The embedded message again has fields with IDs -> fields of a chunk/blockette can be optional/unset.

Besides, the encoding looks quite simple and there is even an RFC.
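To make the parallel concrete, here is a sketch of the wire format (field numbers and payloads are invented for illustration): every field starts with a varint key packing the field number and a wire type, and wire type 2 carries a length-delimited payload, which is exactly how an embedded message, i.e. a chunk/blockette, is nested:

```python
def _varint(n):
    """Base-128 varint, as in the sketch earlier in this thread."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def encode_field(field_number, payload):
    """One length-delimited protobuf field: key, length, payload.
    The key packs the field number with wire type 2 (length-delimited)."""
    key = (field_number << 3) | 2
    return _varint(key) + _varint(len(payload)) + payload

# A hypothetical chunk as an embedded message: its inner fields are
# protobuf fields too, so any of them can simply be left unset.
inner = encode_field(4, b"\x01\x02\x03\x04")  # field 4: raw data bytes (invented)
chunk = encode_field(20, inner)               # field 20 wraps the whole chunk
```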

andres-h commented 6 years ago

New revision submitted.

Edit: removed record terminator, because it does not make much sense without padding now.

andres-h commented 6 years ago

I committed an implementation of protobuf encoding in Python, no .proto files needed.

Might do JavaScript as well.

andres-h commented 6 years ago

I've committed a JavaScript implementation, including a .proto file. The result can be verified using msi3-protobuf.py.

andres-h commented 6 years ago

Note that there are two things that should not be confused with each other. It wasn't fully clear even to myself in the beginning.

One is the protobuf encoding, which is very simple and fits perfectly with my chunks concept. The specification of protobuf encoding takes just a couple of pages.

Two is various protobuf toolsets whose purpose is to provide highly efficient encoders and decoders for various languages. There are two versions of them developed by Google (proto2 and proto3), both of which use the same encoding!

Using those toolsets is not a requirement to parse MS3 records, but they can be used to implement highly efficient parsers.

My Python implementation (not the most efficient one) uses small bits of Google's code, just because I was too lazy to write all code myself.

My JavaScript implementation is based on protobufjs. I just love this package! It's so elegant and easy to parse miniSEED now, not to mention that all the problems with the miniSEED format that I can think of are solved.

PS. If microsecond resolution is needed in most use cases, it might make sense to have a required microseconds field and an optional nanoseconds (0..999) (or picoseconds) field.

crotwell commented 6 years ago

@andres-h You should change the author in the package.json in both of your JavaScript examples to be you instead of me.

crotwell commented 6 years ago

Also, please remove my name from the JavaScript as well. What you have done is such a huge change from what I did that I don't feel there is any reason to keep my name on it.

andres-h commented 6 years ago

OK, I haven't posted this anywhere else. Will do the changes with the next update.

Your work served as a very useful base actually. I probably wouldn't have done Javascript without it.

crotwell commented 6 years ago

OK, thanks.

I am happy you were able to reuse my code. No bad feelings, I just think having my name there was confusing.