Closed: nshakoor closed this issue 7 years ago
We shouldn't keep unusable data. Mostly we've been deleting data that are not usable.
At the same time we are processing data with the knowledge that we will update and rerun them before each release.
All alpha users have been informed that the alpha and beta release data are not intended for review; only the v1 data are intended for research applications.
BETYdb does have a 'checked' flag that we use to indicate that the data have been independently reviewed and are considered scientifically valid. But we can come up with a more sophisticated scheme for the dataset as a whole.
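One way a more sophisticated scheme could look is a graded quality level instead of a binary flag. A minimal sketch (the level names and thresholds below are illustrative assumptions, not BETYdb's actual schema):

```python
from enum import IntEnum

class QualityLevel(IntEnum):
    """Hypothetical graded extension of the binary 'checked' flag.
    Names and ordering are assumptions for discussion only."""
    UNCHECKED = 0       # ingested, no review of any kind
    AUTOMATED_PASS = 1  # passed automated range/format checks
    REVIEWED = 2        # independently reviewed by a person
    VALIDATED = 3       # compared against ground-truth measurements

def is_release_ready(level: QualityLevel) -> bool:
    # Under this sketch, only reviewed or validated data
    # would be eligible for a v1 release.
    return level >= QualityLevel.REVIEWED
```

Because the levels are ordered, release criteria become a simple threshold comparison rather than a set of ad hoc flags.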
I think the key here is to clearly state the assumptions and uncertainties associated with each dataset.
We should start with a list of all sources of error and uncertainty that we can think of, a review of best practices, examples from NEON and NASA, and a draft protocol.
Let's discuss at the next Thursday and Tuesday meetings.
Sounds good. As an extension of this, we would like to be able to answer, at any given time, how much good data we have available for each sensor.
@nshakoor the challenge is defining 'good'.
Here is the NEON Algorithm Theoretical Basis Document for their hyperspectral imaging sensor. See the sections on uncertainty and validation: neon_hyperspectral_atbd.pdf
I'm tagging @craig-willis here as well. My first reaction is that the same funnel that cleans metadata of incoming files could perform some basic checks to flag datasets as good or not at that point.
That way we can also keep data we don't want out of Clowder, so it doesn't cause headaches when files are deleted from the filesystem.
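The basic checks at the funnel could be as simple as validating a few metadata fields on ingest and quarantining anything that fails. A sketch, assuming hypothetical field names (not the actual Clowder/extractor schema):

```python
def basic_quality_checks(metadata):
    """Return a list of problems found in a dataset's metadata.
    Field names ('sensor_id', 'timestamp', 'file_size') are
    illustrative assumptions, not the real ingest schema."""
    problems = []
    if not metadata.get("sensor_id"):
        problems.append("missing sensor_id")
    if metadata.get("timestamp") is None:
        problems.append("missing timestamp")
    if metadata.get("file_size", 0) == 0:
        problems.append("empty file")
    return problems

# An empty problem list means the dataset can be flagged 'good'
# at ingest; anything else is held back before reaching Clowder.
```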
Does anyone have examples of databases that do this well? Should the value be qualitative (i.e., good vs. bad) or quantitative (i.e., an uncertainty value)? Should we use error propagation from calculations or comparison to ground truth?
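For the error-propagation option, the standard first-order formula for a product of two independently measured quantities is easy to compute per record. A sketch (the function and its use here are illustrative, not an existing pipeline step):

```python
import math

def propagated_uncertainty_product(a, sigma_a, b, sigma_b):
    """First-order uncertainty of f = a * b with independent errors:
    sigma_f = |f| * sqrt((sigma_a/a)**2 + (sigma_b/b)**2).
    Assumes a and b are nonzero measured values."""
    f = a * b
    return abs(f) * math.sqrt((sigma_a / a) ** 2 + (sigma_b / b) ** 2)
```

Storing a value like this alongside each derived trait would give downstream users a quantitative handle, whereas the qualitative route only tells them whether a record cleared review.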
Next steps: Flesh out a protocol for assessing dataset quality and uncertainty. Closing this and moving to #138
https://docs.google.com/document/d/1hWqkowvopYqGkeckSWg-_JzN3-DIS36rCBCFI09Sqyk/edit#
Over the last year, we have learned that some of the sensors were not collecting data properly or were uncalibrated at specific times.
Is it possible to clearly tag the dates on which the data from each sensor are usable/unusable? The methods for getting at this are up to you and your team @dlebauer, but we need to get a sense of how much of the data is usable for downstream analysis. The usable data also needs to be readily separable from the uncalibrated/unusable data.
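One way to make those tags queryable is a per-sensor table of usable date spans that any downstream extraction can filter against. A minimal sketch, with entirely made-up sensor names and dates standing in for whatever the sensor team determines:

```python
from datetime import date

# Hypothetical calibration log: date spans during which each
# sensor was calibrated and collecting properly. The sensors
# and spans below are placeholders, not real determinations.
USABLE_SPANS = {
    "hyperspectral": [(date(2016, 4, 15), date(2016, 9, 30))],
    "thermal":       [(date(2016, 6, 1),  date(2016, 12, 31))],
}

def is_usable(sensor, day):
    """True if `sensor` has a recorded usable span covering `day`."""
    spans = USABLE_SPANS.get(sensor, [])
    return any(start <= day <= end for start, end in spans)
```

With the spans in one place, "how much good data do we have for each sensor" reduces to counting files whose capture date passes this check.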