Closed: nshakoor closed this issue 7 years ago
We shouldn't keep unusable data. Mostly we've been deleting data that are not usable.
At the same time we are processing data with the knowledge that we will update and rerun them before each release.
All alpha users have been informed that the alpha and beta release data are not intended for review; only the v1 data are intended for research applications.
BETYdb does have a 'checked' flag that we use to indicate that the data have been independently reviewed and are considered scientifically valid. But we can come up with a more sophisticated scheme for the dataset as a whole.
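One way a more sophisticated scheme could look is a graded quality level instead of a binary flag. A minimal sketch (the level names and thresholds below are illustrative assumptions, not BETYdb's actual schema):

```python
from enum import IntEnum

class QualityLevel(IntEnum):
    """Hypothetical graded extension of the binary 'checked' flag.
    Names and ordering are assumptions for discussion only."""
    UNCHECKED = 0       # ingested, no review of any kind
    AUTOMATED_PASS = 1  # passed automated range/format checks
    REVIEWED = 2        # independently reviewed by a person
    VALIDATED = 3       # compared against ground-truth measurements

def is_release_ready(level: QualityLevel) -> bool:
    # Under this sketch, only reviewed or validated data
    # would be eligible for a v1 release.
    return level >= QualityLevel.REVIEWED
```

Because the levels are ordered, release criteria become a simple threshold comparison rather than a set of ad hoc flags.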
I think the key here is to clearly state the assumptions and uncertainties associated with each dataset.
We should start with a list of all sources of error and uncertainty that we can think of, a review of best practices, examples from NEON and NASA, and a draft protocol.
Let's discuss at the next Thursday and Tuesday meetings.
Sounds good. As an extension of this, we would like to be able to answer, at any given time, how much good data we have available for each sensor.
@nshakoor the challenge is defining 'good'.
Here is the NEON Algorithm Theoretical Basis Document for their hyperspectral imaging sensor. See the sections on uncertainty and validation: neon_hyperspectral_atbd.pdf
I'm tagging @craig-willis here as well. My first reaction is that the same funnel that cleans metadata of incoming files could perform some basic checks to flag datasets as good or not at that point.
That way we can also keep data we don't want out of Clowder, so it doesn't cause headaches when files are deleted from the filesystem.
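The basic checks at the funnel could be as simple as validating a few metadata fields on ingest and quarantining anything that fails. A sketch, assuming hypothetical field names (not the actual Clowder/extractor schema):

```python
def basic_quality_checks(metadata):
    """Return a list of problems found in a dataset's metadata.
    Field names ('sensor_id', 'timestamp', 'file_size') are
    illustrative assumptions, not the real ingest schema."""
    problems = []
    if not metadata.get("sensor_id"):
        problems.append("missing sensor_id")
    if metadata.get("timestamp") is None:
        problems.append("missing timestamp")
    if metadata.get("file_size", 0) == 0:
        problems.append("empty file")
    return problems

# An empty problem list means the dataset can be flagged 'good'
# at ingest; anything else is held back before reaching Clowder.
```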
Does anyone have examples of databases that do this well? Should the value be qualitative (i.e., good vs. bad) or quantitative (i.e., an uncertainty value)? Should we use error propagation from calculations or comparison to ground truth?
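For the error-propagation option, the standard first-order formula for a product of two independently measured quantities is easy to compute per record. A sketch (the function and its use here are illustrative, not an existing pipeline step):

```python
import math

def propagated_uncertainty_product(a, sigma_a, b, sigma_b):
    """First-order uncertainty of f = a * b with independent errors:
    sigma_f = |f| * sqrt((sigma_a/a)**2 + (sigma_b/b)**2).
    Assumes a and b are nonzero measured values."""
    f = a * b
    return abs(f) * math.sqrt((sigma_a / a) ** 2 + (sigma_b / b) ** 2)
```

Storing a value like this alongside each derived trait would give downstream users a quantitative handle, whereas the qualitative route only tells them whether a record cleared review.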
Next steps: Flesh out a protocol for assessing dataset quality and uncertainty. Closing this and moving to #138
https://docs.google.com/document/d/1hWqkowvopYqGkeckSWg-_JzN3-DIS36rCBCFI09Sqyk/edit#
Over the last year, we have learned that some of the sensors were not collecting data properly or were uncalibrated at specific times.
Is it possible to clearly tag the dates on which the data from each sensor are usable/unusable? The methods for getting at this are up to you and your team @dlebauer, but we need to get a sense of how much of the data is usable for downstream analysis. The usable data also needs to be readily separable from the uncalibrated/unusable data.
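One way to make those tags queryable is a per-sensor table of usable date spans that any downstream extraction can filter against. A minimal sketch, with entirely made-up sensor names and dates standing in for whatever the sensor team determines:

```python
from datetime import date

# Hypothetical calibration log: date spans during which each
# sensor was calibrated and collecting properly. The sensors
# and spans below are placeholders, not real determinations.
USABLE_SPANS = {
    "hyperspectral": [(date(2016, 4, 15), date(2016, 9, 30))],
    "thermal":       [(date(2016, 6, 1),  date(2016, 12, 31))],
}

def is_usable(sensor, day):
    """True if `sensor` has a recorded usable span covering `day`."""
    spans = USABLE_SPANS.get(sensor, [])
    return any(start <= day <= end for start, end in spans)
```

With the spans in one place, "how much good data do we have for each sensor" reduces to counting files whose capture date passes this check.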