nationalparkservice / QCkit

QCkit provides useful functions for data quality control and manipulation including updating data to DarwinCore standards, unit conversions, and data flagging.
https://nationalparkservice.github.io/QCkit/
The Unlicense

Add a function to help with multiple missing value codes in a single column #107

Closed RobLBaker closed 3 months ago

RobLBaker commented 5 months ago

Data may be missing for multiple reasons: it may not have been collected for a valid reason, it may be missing for an invalid reason, or there may be several different valid or invalid reasons for data to be absent.

It can be important to document the reasons that data are missing or not present. At the analysis level, it may also be important to translate these reasons into the single missing value code used by whatever analysis language or program is in use. The problem then becomes how to preserve the information about why two values that share the same missing value code are missing for different reasons.

One solution is to generate a "helper" column to hold this information. These columns can be difficult to generate by hand, so a function to help generate them would be handy. The function should scan a given column within a given file for a list of "missing value codes" and replace each of them with NA (or whatever user-defined value is supplied). It should then generate a second column, named after the first column, that contains the user-supplied missing value code wherever one occurred in the original column and fills in all other rows with something along the lines of "data present".
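To make the proposed behavior concrete, here is a minimal sketch of the idea in plain Python. It is not QCkit's implementation: the function name `flag_missing_codes`, its parameters, and the `_missing_code` suffix for the helper column are all hypothetical, and the sketch works on an in-memory table (a dict of column lists) rather than reading a file.

```python
from typing import Any, Dict, List, Sequence


def flag_missing_codes(
    table: Dict[str, List[Any]],
    column: str,
    missing_codes: Sequence[Any],
    replacement: Any = None,              # stands in for NA; user may override
    present_label: str = "data present",  # filler for rows that hold real data
    suffix: str = "_missing_code",        # hypothetical helper-column suffix
) -> Dict[str, List[Any]]:
    """Return a copy of `table` in which missing value codes in `column` are
    replaced by `replacement`, plus a helper column recording which code
    (if any) each row originally held."""
    if column not in table:
        raise KeyError(f"column {column!r} not found")
    helper_name = column + suffix
    if helper_name in table:
        raise ValueError(f"helper column {helper_name!r} already exists")

    codes = list(missing_codes)
    values = table[column]

    # Replace every listed code with the single replacement value.
    cleaned = [replacement if v in codes else v for v in values]
    # Keep the original code in the helper column; label everything else.
    helper = [v if v in codes else present_label for v in values]

    # Copy the columns so the caller's table is left untouched.
    result = {name: list(col) for name, col in table.items()}
    result[column] = cleaned
    result[helper_name] = helper
    return result


# Illustrative data only; the codes below are made up for this example.
survey = {
    "site": ["A", "B", "C", "D"],
    "count": [12, "NotCollected", "EquipmentFailure", 7],
}
flagged = flag_missing_codes(
    survey, "count", ["NotCollected", "EquipmentFailure"]
)
```

Returning a copy rather than modifying the input keeps the original codes recoverable if the helper column needs to be regenerated with a different code list. Note that matching uses ordinary equality, so a numeric code such as `-999` would also match `-999.0`.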