Data Version Field

krischer commented 7 years ago

Discussion branched off #2. Concerns DRAFT20170622.

@crotwell

Field 8 Maybe reserve values < 10 for raw data and qc types of things and values >= 10 for user modified data. The dividing line is whether the metadata still applies, so below 10, the response is still the response. But once the version is above 10, be careful as the response may have already been applied or the data modified to the extent that it no longer can be. In other words, below 10 users can proceed normally, above 10 "here be dragons" and you better know the history. Is 10 large enough?

krischer commented 7 years ago

Field 8 Maybe reserve values < 10 for raw data and qc types of things and values >= 10 for user modified data. The dividing line is whether the metadata still applies, so below 10, the response is still the response. But once the version is above 10, be careful as the response may have already been applied or the data modified to the extent that it no longer can be. In other words, below 10 users can proceed normally, above 10 "here be dragons" and you better know the history. Is 10 large enough?

I feel like this would be a bit dangerous. I think only data centers should set this field and we should recommend that all processing software sets this field to 0 upon writing.

crotwell commented 7 years ago

Question, does this change after quality control procedures if nothing about the data was changed? In other words, is this a version of the data that increments on change, or a state of processing that indicates whether it was retrieved before or after qc?

I am ok with either but maybe it should be explicit what causes the value to increment.

I also now think that any user modification versioning should go into the extra stuff. Would be good to have a standard key for it, but not part of fixed header.

chad-earthscope commented 7 years ago

This is what was in the 20170622 draft:

Recommended values: 1 for raw data, 2 for data following quality control procedures, and the value is incremented for each later revision. A value of 0 indicates unknown version such as when data are converted to miniSEED from another format.
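A minimal sketch of how a reader might interpret the draft's recommended values. The function name and the returned labels are hypothetical, and the 0-255 range is an assumption based on the field being discussed as a single unsigned byte; none of this is part of the draft itself.

```python
def classify_data_version(version: int) -> str:
    """Map a data version value to the classes recommended in the 20170622 draft.

    Assumption: the field is an unsigned 8-bit integer (0-255).
    """
    if not 0 <= version <= 255:
        raise ValueError(f"data version out of range: {version}")
    if version == 0:
        return "unknown"  # e.g. converted to miniSEED from another format
    if version == 1:
        return "raw"
    if version == 2:
        return "quality-controlled"
    return "later revision"  # incremented for each later revision


if __name__ == "__main__":
    for v in (0, 1, 2, 5):
        print(v, classify_data_version(v))
```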

@crotwell

In other words, is this a version of the data that increments on change, or a state of processing that indicates whether it was retrieved before or after qc?

I would think it increments on change or, when incremented, implies that something might have changed.

The motivation for IRIS is that we have been using the quality indicator as a crude form of versioning: when we deliver data, we look for M, Q, D, R qualities in that order and deliver the first that we find (merging other, lower qualities as needed). This was much more useful for versioning than as an indication of quality, given that the meaning of D and Q changes from operator to operator, making it more or less useless to the user for any real "qc" indication. Furthermore, it's too limiting for versioning: we've received multiple copies of Q, replacing a whole data set each time, and we have no way to differentiate the copies of Q in the data.

So I would like a "version" with the primary goal of identifying a later copy of the data, and try to mostly steer away from semantic meaning beyond a very few classes: 0=converted, 1=raw, and >1=later.
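For illustration, the quality-ordered lookup described above could be sketched roughly as follows. The names are hypothetical, and the merging of lower qualities is left out; this is not the DMC's actual delivery code.

```python
from typing import Iterable, Optional

# Preference order as stated in the comment above: M first, then Q, D, R.
QUALITY_PREFERENCE = ("M", "Q", "D", "R")


def pick_preferred_quality(available: Iterable[str]) -> Optional[str]:
    """Return the first quality code, in preference order, that is available."""
    available_set = set(available)
    for quality in QUALITY_PREFERENCE:
        if quality in available_set:
            return quality
    return None  # none of the known quality codes were found


if __name__ == "__main__":
    print(pick_preferred_quality(["R", "D"]))
```

Note that two copies both marked Q look identical to this lookup, which is the limitation being described.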

The meaning of any of the versions for a particular time series is probably best kept outside of the miniSEED. The version would provide a reference.

I feel like this would be a bit dangerous. I think only data centers should set this field and we should recommend that all processing software sets this field to 0 upon writing.

Agreed, except we should be able to robustly separate the case of 0=converted, 1=raw. So sac2mseed should write version 0 so it can be identified separately from Nanometrics equipment writing the raw data with version 1.

In the DMC's case we would let the operator/owner set the version and we would only change it when the operator is no longer available or in coordination with the operator when changing the data for some reason.

crotwell commented 7 years ago

Small thing, but maybe the "unknown" value should be 255 and the original data logger raw should be 0. I can envision confusion, as it looks like data might transition from 0 to 1, but 255 is obviously different.

chad-earthscope commented 7 years ago

Small thing, but maybe the "unknown" value should be 255 and the original data logger raw should be 0. I can envision confusion, as it looks like data might transition from 0 to 1, but 255 is obviously different.

Problem with "unknown" = 255 is that it's bigger than any other version, so the straightforward test for identifying "later" would always need a special case check. I don't think most people will see this, so it won't have a chance to be obvious, it'll be a program doing a check instead.

There will probably be some muddling between versions 0 and 1 by data generators that did not adhere to the recommendations, and that's OK. The key principle, that a larger number means later and more preferred data in any exchange scenario, remains in effect.
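A small sketch of the point about 255, with hypothetical function names: when "unknown" is 0, picking the preferred copy is a plain maximum, whereas an "unknown" of 255 has to be filtered out first.

```python
from typing import Sequence

UNKNOWN_AS_255 = 255  # the alternative "unknown" value proposed above


def preferred_version(versions: Sequence[int]) -> int:
    """With unknown = 0, the largest version is simply the preferred one."""
    return max(versions)


def preferred_version_unknown_255(versions: Sequence[int]) -> int:
    """With unknown = 255, a special-case check is needed before comparing."""
    known = [v for v in versions if v != UNKNOWN_AS_255]
    # Fall back to the unknown value only when nothing else is available.
    return max(known) if known else UNKNOWN_AS_255


if __name__ == "__main__":
    print(preferred_version([0, 1, 3]))
    print(preferred_version_unknown_255([255, 1, 3]))
```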

crotwell commented 7 years ago

But isn't unknown always a special case? Shouldn't "user modified" be > "raw"? Not sure I understand: if I download the raw data, apply the response and save it, we said I was supposed to set the version to the "unknown" value, but the data is later than the original data?

Maybe we should have 2 special cases, 0 for unknown as converted on input for use by datacenters, and 255 for modified by the end user post-datacenter, meaning the QC trail no longer applies?

krischer commented 7 years ago

Sounds good to me. We should just clearly specify that this value is for data-center operational use only and the semantics can differ per data-center. Also the spec should specify that all data converters and processing software should set this value to zero.

chad-earthscope commented 7 years ago

But isn't unknown always a special case? Shouldn't "user modified" be > "raw"? Not sure I understand: if I download the raw data, apply the response and save it, we said I was supposed to set the version to the "unknown" value, but the data is later than the original data?

For me it's a data center thing, which is good for user <-> data center coordination but otherwise not for users beyond informational. If users do use it, and how would we prevent them, they need to keep track of their versions and what they mean. I would recommend that users write extra header(s) to keep track of some processing steps.
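As a rough sketch of tracking processing steps in extra headers: this assumes the extra headers are carried as a JSON object, and the key `ProcessingHistory` is purely hypothetical, not a key defined by the draft.

```python
import json

HISTORY_KEY = "ProcessingHistory"  # hypothetical key, not defined by the draft


def append_processing_step(extra_headers_json: str, step: str) -> str:
    """Return extra headers (as JSON text) with one more processing step recorded.

    Assumption: extra headers are a JSON object; an empty string means none yet.
    """
    headers = json.loads(extra_headers_json) if extra_headers_json else {}
    headers.setdefault(HISTORY_KEY, []).append(step)
    return json.dumps(headers)


if __name__ == "__main__":
    headers = append_processing_step("", "removed instrument response")
    headers = append_processing_step(headers, "bandpass 0.1-1.0 Hz")
    print(headers)
```

The data version in the fixed header is left untouched here; only the user's own bookkeeping changes.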

crotwell commented 7 years ago

@chad-iris So we should drop the "user sets to unknown value on write"? If it is a data center thing, and the end user is not supposed to use it, then they should not change it, ever. So even after processing steps, it stays the same. The meaning then is not "data version" but "the QC level of the data at the time it was retrieved from the datacenter".

This also allows a user, even after lots of processing steps, to see if the datacenter has issued a new version of the same data and to choose to reprocess in that case.

If that is what you mean, then 👍 from me.
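The reprocessing check described in this comment could look something like the sketch below. The function name is hypothetical, and it assumes both values come from the same data center, since version semantics can differ between centers.

```python
def datacenter_has_newer_version(local_version: int, datacenter_version: int) -> bool:
    """Return True if the data center now holds a later version than the local copy.

    Assumptions: local_version is the value preserved unchanged in the user's
    processed file, and both values come from the same data center.
    """
    return datacenter_version > local_version


if __name__ == "__main__":
    # Data retrieved at version 2, then processed locally without touching the field.
    print(datacenter_has_newer_version(2, 3))
    print(datacenter_has_newer_version(2, 2))
```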

chad-earthscope commented 7 years ago

@crotwell That's mostly what I mean; the exception is:

"the QC level of the data at the time it was retrieved from the datacenter"

I'm very hesitant to attach much notion of quality, unless the definition of quality is "last is best" with no deeper meaning.

crotwell commented 7 years ago

@chad-iris Right, my typo, meaning is "the data version at the time it was retrieved". Larger is better, meaning is up to the individual data center. 👍

crotwell commented 7 years ago

We might need a better name for this; "data version" makes it sound like this is the current version of the data, which is what I was trying to say it is not.

Maybe something like "publication version"?

iris-edu / mseed3-evaluation

Data Version Field #12