Version Number and Backwards-Compatibility

krischer commented 7 years ago

Discussion branched off #2. Concerns DRAFT20170622.

@krischer

Notion of “backwards-compatibility” in data formats: This is really tricky, as semantic versioning as applied to software cannot be applied to data formats. The only changes that could safely be considered backwards compatible, in the sense that old software can read new versions of the format, are completely optional additions that do not change the semantics of the other data, i.e. they must be completely safe to ignore. I wonder what that would be in a minimal-by-design data format like the new miniSEED. The conclusion would be to get rid of the major/minor version number and just have a monotonically increasing integer version number. Or am I missing something here?

@andres-h

It is hard to predict what will be needed in the forthcoming decades. The new SEED format should be usable even for archiving non-seismological data. Who knows what other communities might need.

Some things are specific to manufacturers. For example, people have complained that timing quality expressed as a percentage is useless. Manufacturers could add their own specific timing quality info.

Besides, blockettes are IMO what make SEED SEED. If blockettes are replaced by another extension mechanism, and an inferior one at that (key-value pairs), the format should not be called SEED anymore.

If blockettes or similar are used, then a version number would basically not be needed, because new revisions of the standard just add new blockettes. This also follows the principles of object-oriented design (e.g., the open/closed principle).

@chad-iris

Notion of “backwards-compatibility” in data formats: This is really tricky, as semantic versioning as applied to software cannot be applied to data formats. The only changes that could safely be considered backwards compatible, in the sense that old software can read new versions of the format, are completely optional additions that do not change the semantics of the other data, i.e. they must be completely safe to ignore. I wonder what that would be in a minimal-by-design data format like the new miniSEED. The conclusion would be to get rid of the major/minor version number and just have a monotonically increasing integer version number. Or am I missing something here?

Your logic is what I tried to capture in the field description. It means the minor version would only be updated when new reserved extra headers are added, or maybe when data payload encodings or new namespaces for identifiers are added. I changed what was a monotonically increasing integer to a major.minor to allow for those cases and for what @andres-h said:

It is hard to predict what will be needed in the forthcoming decades.

A future major version may have more reasons for minor versioning.

Of course, this decision also has effects for the software ecosystem supporting the format. Updating major versions will very likely break any software downstream of a producer, which would cause a big ripple and probably means we would not do a major version update often (a good thing from a format perspective). A minor version allows, for example, adding a general compressor encoding in 3.1 while allowing 3.0 readers to continue to read what they are able to read, and provides time for updating downstream software that would not immediately see the new additions anyway.

It is a concession though, in that it reduces the major versioning from ~253 to 23 versions. I think there are legitimate arguments either way and, while I lean toward the minor version addition at the moment, would go with this group if any consensus emerges.
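To make the reader-side consequence of a major.minor scheme concrete, here is a minimal Python sketch. It is not taken from the draft: the names `FormatVersion`, `can_attempt_read` and `may_contain_unknown_additions` are hypothetical, and nothing is assumed about how the two numbers are stored in the header. It only encodes the rule described above: the same major version means a reader may try, and a newer minor version means there may be additions the reader does not know about.

```python
from typing import NamedTuple


class FormatVersion(NamedTuple):
    """Hypothetical major.minor pair; header storage is not modelled here."""
    major: int
    minor: int


def can_attempt_read(reader: FormatVersion, record: FormatVersion) -> bool:
    # A differing major version is treated as a breaking change.
    return reader.major == record.major


def may_contain_unknown_additions(reader: FormatVersion,
                                  record: FormatVersion) -> bool:
    # Same major but newer minor: the record may use additions (for
    # example a new encoding) that this reader has never heard of.
    return reader.major == record.major and record.minor > reader.minor


reader = FormatVersion(3, 0)
print(can_attempt_read(reader, FormatVersion(3, 1)))               # True
print(may_contain_unknown_additions(reader, FormatVersion(3, 1)))  # True
print(can_attempt_read(reader, FormatVersion(4, 0)))               # False
```

Note that even in the "newer minor" case the reader still has to detect and reject each individual addition it cannot handle; the minor number alone does not tell it which records are affected.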

krischer commented 7 years ago

If blockettes or similar are used, then a version number would basically not be needed, because new revisions of the standard just add new blockettes. This also follows the principles of object-oriented design (e.g., the open/closed principle).

This only fully works for blockettes that contain optional information and don't alter the meaning of any other data. A parser not aware of, for example, Blockette 500 could not correctly parse any MiniSEED file containing it, as Blockette 500 potentially affects the timestamps of the samples. Ignoring it would thus result in wrongly interpreted data. This is IMHO the most dangerous situation when reading data, as users might not notice, and we should really aim to avoid that if possible.

This was always the danger with the Blockette mechanism and many things that ended up in blockettes should have been in the fixed header in my opinion.
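A small Python sketch of the problem, under stated assumptions: `HANDLED_BLOCKETTES`, `UnknownBlocketteError` and `check_blockettes` are hypothetical names, and the set of handled types is purely illustrative. The point is that a reader cannot know whether a blockette type it has never seen is safe to skip, so the only way to rule out silently wrong data is to refuse.

```python
from typing import Iterable, List, Set


class UnknownBlocketteError(ValueError):
    """Raised when a record carries a blockette type the reader cannot interpret."""


# Illustrative only: the blockette types this hypothetical reader understands.
HANDLED_BLOCKETTES: Set[int] = {1000}


def check_blockettes(types: Iterable[int],
                     handled: Set[int] = HANDLED_BLOCKETTES,
                     strict: bool = True) -> List[int]:
    """Return the unknown blockette types, or raise if strict is set."""
    unknown = sorted(set(types) - set(handled))
    if unknown and strict:
        # Refuse instead of guessing: an unknown blockette might change
        # how the rest of the record has to be interpreted.
        raise UnknownBlocketteError(f"unknown blockette type(s): {unknown}")
    return unknown


print(check_blockettes([1000]))                     # []
print(check_blockettes([1000, 500], strict=False))  # [500]
```

With `strict=False` the caller gets the list of skipped types and can at least warn the user; with the default, a record containing Blockette 500 is rejected by a reader that does not implement it.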

It is hard to predict what will be needed in the forthcoming decades.

A future major version may have more reasons for minor versioning.

A future version could just add a minor version field if required ;-)

Of course, this decision also has effects for the software ecosystem supporting the format. Updating major versions will very likely break any software downstream of a producer, which would cause a big ripple and probably means we would not do a major version update often (a good thing from a format perspective). A minor version allows, for example, adding a general compressor encoding in 3.1 while allowing 3.0 readers to continue to read what they are able to read, and provides time for updating downstream software that would not immediately see the new additions anyway. It is a concession though, in that it reduces the major versioning from ~253 to 23 versions. I think there are legitimate arguments either way and, while I lean toward the minor version addition at the moment, would go with this group if any consensus emerges.

Adding a new compressor is indeed a valid use case for minor version numbers. But this could be worked around by defining the encoding field as forward-compatible, i.e. an encoding valid in a future version is also valid in the current one.

Given the speed at which our standardization process works, I would not be worried about having only 23 version numbers available ;-)

All in all I personally would drop the minor version number because there is so very little that could safely be done with it.

crotwell commented 7 years ago

I lean towards a simple integer version. Given the complexity of the FDSN officially adopting a revision and the simplicity of the format, I can't really imagine a case where there would be a huge advantage to "3.1" over just calling the next version "4".

I also think that we should explicitly state that the list of data encodings is "append only" so that a new encoding being added does not require a new mseed version. The list of official encoding codes is kept as a separate document and a reader that encounters an unknown encoding type must be written to recognize that the encoding type is new/unknown and fail appropriately. This is not a big requirement as most readers will probably also have to fail appropriately for older encodings that they do not support. I certainly don't want to support all the old gain range styles.
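A minimal Python sketch of what "fail appropriately" could look like in a reader, assuming an append-only code list. The names `DECODERS`, `decode_payload` and `UnsupportedEncodingError` are hypothetical, and the single registered code and its little-endian 32-bit integer layout are illustrative rather than taken from the official encoding list.

```python
import struct
from typing import Callable, Dict, List


class UnsupportedEncodingError(ValueError):
    """Raised for encoding codes this reader does not implement."""


def _decode_int32(payload: bytes) -> List[int]:
    # Illustrative decoder: little-endian 32-bit signed integers.
    count = len(payload) // 4
    return list(struct.unpack(f"<{count}i", payload[:count * 4]))


# Registry of the codes this reader supports. Because the official list
# is append-only, an existing entry never changes meaning; new codes
# simply do not appear here until the reader is updated.
DECODERS: Dict[int, Callable[[bytes], List[int]]] = {
    3: _decode_int32,  # code number chosen for illustration
}


def decode_payload(encoding: int, payload: bytes) -> List[int]:
    try:
        decoder = DECODERS[encoding]
    except KeyError:
        # Covers both codes added after this reader was written and
        # old codes it deliberately does not support.
        raise UnsupportedEncodingError(
            f"encoding {encoding} is not supported by this reader"
        ) from None
    return decoder(payload)


print(decode_payload(3, struct.pack("<3i", 1, -2, 3)))  # [1, -2, 3]
```

The same code path handles a brand-new encoding and an old one the reader chose not to implement, which is why the requirement costs readers almost nothing.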

chad-earthscope commented 7 years ago

A future version could just add a minor version field if required ;-)

Indeed :|

@krischer

Adding a new compressor is indeed a valid use case for minor version numbers. But this could be worked around by defining the encoding field as forward-compatible, i.e. an encoding valid in a future version is also valid in the current one.

@crotwell

I also think that we should explicitly state that the list of data encodings is "append only" so that a new encoding being added does not require a new mseed version.

I think you are both saying a similar thing: we should include a note that encodings may be added without incrementing the version.

I don't really think there is any alternative to readers checking the encoding explicitly. There is no "default" encoding. So this seems safe to me.

I don't think we need a separate document of encodings. The specification document will need its own version (a date is my preference) that is separate from the format version anyway. We can release a new document with a new encoding without changing the format version.

krischer commented 7 years ago

Sounds good to me.

crotwell commented 7 years ago

OK with me too.

chad-earthscope commented 7 years ago

Added in DRAFT 20170708:

Format version number changed to a single integer in DRAFT 20170708.
Notification that encodings may be added in the future without updating the format version.
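For reader implementers, the single-integer scheme reduces the version check to set membership. A minimal Python sketch, with hypothetical names (`SUPPORTED_FORMAT_VERSIONS`, `check_format_version`, `UnsupportedVersionError`) and assuming a reader written for format version 3 only:

```python
class UnsupportedVersionError(ValueError):
    """Raised when a record declares a format version the reader does not know."""


# Assumption for this sketch: the reader implements format version 3 only.
SUPPORTED_FORMAT_VERSIONS = frozenset({3})


def check_format_version(version: int) -> None:
    # With a single integer there is no "compatible minor" case:
    # every increment is treated as potentially breaking.
    if version not in SUPPORTED_FORMAT_VERSIONS:
        raise UnsupportedVersionError(
            f"format version {version} is not supported "
            f"(supported: {sorted(SUPPORTED_FORMAT_VERSIONS)})"
        )


check_format_version(3)  # passes silently
```

Since encodings can be added without a version bump, this check does not replace the separate check of the encoding code in each record.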

Another thought is to version the documentation with major.minor (in addition to updating a date), leaving the format to only have the major version. So when we add "Crotwell3" compression in 2022 we can say we added it in version 3.3 instead of version 3 revision 20220413.

krischer commented 7 years ago

Another thought is to version the documentation with major.minor (in addition to updating a date), leaving the format to only have the major version. So when we add "Crotwell3" compression in 2022 we can say we added it in version 3.3 instead of version 3 revision 20220413.

This might be a tad confusing, if I understand your proposal correctly. How about having two separate documents: one for the actual MiniSEED format and one to document the data encodings?

iris-edu / mseed3-evaluation

Version Number and Backwards-Compatibility #3