terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License

Marking error datasets and warnings/caveats #575

Closed: max-zilla closed this issue 5 years ago

max-zilla commented 5 years ago

Have an ERROR.txt or something in the dataset & on disk to indicate dataset should be skipped for processing.

dlebauer commented 5 years ago

See https://github.com/terraref/reference-data/issues/218

For data that a human has recognized as being in error (e.g. w/ blurry FLIR data, point clouds clipped at some height)

  1. add a text file named "ERROR" with optional content / explanation / pointer to a GitHub issue; the content could be a set of key: value pairs, perhaps in YAML, that get parsed directly to JSON metadata
  2. have an extractor that finds these files and adds a tag "quality": { "ERROR": "TRUE", "description": "", "key2": "value2" }
  3. add a general rule, perhaps at the level of extractors or perhaps at the level of RabbitMQ, that says any time an error (file or flag) is found, skip processing the dataset.
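Taken together, the three steps above could look something like the following sketch (function names and the simple key: value parser are illustrative, not the actual extractor code):

```python
import os

def parse_error_file(text):
    """Parse simple 'key: value' lines (a YAML-like subset) into a dict."""
    meta = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

def check_for_error_file(dataset_dir):
    """Step 2: return a 'quality' metadata tag if an ERROR file is present."""
    for name in ("ERROR", "ERROR.yml", "ERROR.txt"):
        path = os.path.join(dataset_dir, name)
        if os.path.exists(path):
            with open(path) as f:
                meta = parse_error_file(f.read())
            meta.setdefault("ERROR", "TRUE")
            return {"quality": meta}
    return None

def should_process(dataset_dir):
    """Step 3: skip processing any dataset flagged with an error file."""
    return check_for_error_file(dataset_dir) is None
```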
dlebauer commented 5 years ago

For example (consider this a draft; should probably use a consistent / standard way of encoding this information), every FLIR dataset following May 2017 could have a file named "ERROR.yml" that contains:

quality:
  status: ERROR
  description: Sand and water contaminated FLIR Camera lens so temperature values are invalid  
  url: https://github.com/terraref/reference-data/issues/182
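If PyYAML is available in the extractor environment (an assumption here), a file like the one above can be converted directly to the JSON metadata body, along the lines of:

```python
import json
import yaml  # PyYAML, assumed available; parses the ERROR.yml contents

error_yml = """\
quality:
  status: ERROR
  description: Sand and water contaminated FLIR Camera lens so temperature values are invalid
  url: https://github.com/terraref/reference-data/issues/182
"""

# safe_load gives a plain dict, which serializes straight to JSON metadata
metadata = yaml.safe_load(error_yml)
print(json.dumps(metadata, indent=2))
```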
max-zilla commented 5 years ago

duplicate of #557

max-zilla commented 5 years ago

there are different severities of error - FLIR 2017 is a prominent example, but the stair-stepping in the laser3D data is not so cut-and-dried and may still contain valuable data

max-zilla commented 5 years ago

Proposed script will add a new metadata entry from the Maricopa Field user with a body like:

{
  "quality": "ERROR",
  "description": "Sand and water contaminated FLIR Camera lens so temperature values are invalid",
  "url": "https://github.com/terraref/reference-data/issues/182"
}

Other statuses could be WARNING, ADVISORY, etc. We would also write a corresponding YAML file to the Globus directory with those contents as suggested, perhaps at the day level rather than repeated for the entire dataset? Or do we want it repeated at the dataset (timestamp) level?

dlebauer commented 5 years ago

We should also add a file named "ERROR" that contains the description and url to the affected dataset. I think repeating this at the dataset level will be good. There may be use for having a tag at a higher level, but that would be in addition to the dataset level flag.
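Writing the flag at the dataset (timestamp) level under a day directory might be sketched as follows (paths, the function name, and the hard-coded body are illustrative):

```python
import os

# Illustrative ERROR file body for the FLIR 2017 case discussed above
ERROR_BODY = """\
quality: ERROR
description: Sand and water contaminated FLIR Camera lens so temperature values are invalid
url: https://github.com/terraref/reference-data/issues/182
"""

def flag_day(day_dir):
    """Write an ERROR file into every timestamp dataset under a day directory."""
    written = []
    for entry in sorted(os.listdir(day_dir)):
        dataset_dir = os.path.join(day_dir, entry)
        if os.path.isdir(dataset_dir):
            path = os.path.join(dataset_dir, "ERROR")
            with open(path, "w") as f:
                f.write(ERROR_BODY)
            written.append(path)
    return written
```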

max-zilla commented 5 years ago

My script is prepared to generate the YAML files & metadata; however, because the raw_data directories are owned by dlebauer, I am unable to write into them. We can discuss how to handle this during the meeting... probably one of:

dlebauer commented 5 years ago

I am glad that the raw data folder is locked down. I am not sure it makes sense for me to be the folder owner (as opposed to a user or group like 'terraref'), but ... the idea is that we don't touch the raw_data folder.

In the end, the key requirement is that any data that have known errors (or other issues) are clearly labeled as such.

It makes sense (at this point) to have to use sudo to touch the raw_data folder, if we should ever touch it at all. But maybe there is a 'better' way to handle this. Certainly none of the existing files should be touched, but allowing the same user that transfers the files to create a new file would also seem reasonable.

For the FLIR, we did process the data to Level 1. Is the plan to also add an error file to the level 1 data?

max-zilla commented 5 years ago

Script is running now. Will close this when completed.

I would argue that in the FLIR case we don't add an error flag to the Level 1 data; the goal was to flag the raw data so that in the future we don't even process these erroneous datasets. I would also argue for deleting the Level 1+ data from this time period for FLIR, to be consistent with that.

max-zilla commented 5 years ago

standardized taxonomy for different error cases - an ERROR that should not be sent through processing vs. a WARNING vs. other classes
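One possible shape for such a taxonomy (a proposal sketch, not an agreed standard; the class names and skip policy are assumptions):

```python
from enum import Enum

class QualityStatus(Enum):
    """Proposed severity classes for flagged datasets."""
    ERROR = "ERROR"        # data invalid; skip processing entirely (e.g. FLIR 2017)
    WARNING = "WARNING"    # data suspect but may still hold value (e.g. laser3D stair-stepping)
    ADVISORY = "ADVISORY"  # informational caveat; process normally

def should_skip(status):
    """Only ERROR blocks processing; other classes proceed with the flag attached."""
    return status is QualityStatus.ERROR
```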

max-zilla commented 5 years ago

Max will write up some documentation/wiki once this is done to propose a standard approach to handling this.

max-zilla commented 5 years ago

support the ability to explicitly define affected files (vs. the entire dataset) - 'all' could be the default value.
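For example, the ERROR.yml could carry a hypothetical `files` key (the key name is an assumption), with 'all' as the default when it is omitted:

```yaml
quality: WARNING
description: Stair-stepping artifact in a subset of scans
url: https://github.com/terraref/reference-data/issues/218
files: all   # default; or list specific filenames instead:
# files:
#   - scan_0001.ply
#   - scan_0002.ply
```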

max-zilla commented 5 years ago

Created #589 to follow this.