An important step before analysing or publishing data is to check that the dataset does not contain major integrity errors, such as missing dates or coordinates, values that do not meet controlled vocabularies, or invalid relationships between tables. Although validation is possible with the Python Frictionless Framework, the returned error messages are hard to parse for most users.
camtraptor (or the frictionless R package) could offer some basic data validation (easier to implement than the full metadata and data validation the Frictionless Framework offers).
Users can then fix issues by retrieving the data (#232), correcting the errors and updating the package (#248).
A user-facing `validate()` function could make use of a number of `check_` helper functions. Those helpers could also be run by other functions, e.g. when updating data (#248).
Suggestions for functions:
- [ ] `validate(package)`
- [ ] `check_relations(package)`: relationships are valid
- [ ] `check_identifiers(package, "table name")`: IDs are unique
- [ ] `check_required(package, "table name")`: required fields are populated
- [ ] `check_vocabularies(package, "table name")`: values meet factor levels. Note that `read_resource()`/readr converts these to factors and might throw `problems()`
- [ ] `check_data_types(package, "table name")`: note that `read_resource()`/readr will throw `problems()` but otherwise does a best attempt at converting
- [ ] `check_timestamps(package, "table name")`: has timezone, start <= end (specific to camtraptor, not a frictionless thing)
- [ ] `check_durations(package)`: observation & media timestamps fall within the deployment (specific to camtraptor, not a frictionless thing)
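As a sketch of how these could fit together: `validate()` could be a thin wrapper that runs all `check_` helpers and collects their messages. None of these functions exist yet; the assumption here is that each helper returns a data frame of issues (possibly zero rows), so results can simply be combined.

```r
# Sketch only: proposed API, not existing camtraptor code.
# Assumes each check_*() helper returns a data frame of issues.
validate <- function(package) {
  issues <- dplyr::bind_rows(
    check_relations(package),
    check_identifiers(package, "deployments"),
    check_required(package, "deployments"),
    check_vocabularies(package, "observations"),
    check_data_types(package, "observations"),
    check_timestamps(package, "deployments"),
    check_durations(package)
  )
  if (nrow(issues) == 0) {
    message("No integrity errors found.")
  } else {
    warning(sprintf("Found %d integrity error(s).", nrow(issues)), call. = FALSE)
  }
  # Return issues invisibly so they can be inspected programmatically
  invisible(issues)
}
```

Returning the issues invisibly would let other functions (e.g. the update functions in #248) reuse the same helpers without printing.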
While it would be useful if these were functions of the frictionless R package, that might not be what we need for camtraptor. Frictionless would run its validation on resources (i.e. CSV files + schemas). Returned data frames lose the connection with their schema, so relationships or uniqueness cannot be validated there, as that information is lost. Camtraptor, on the other hand, wants to validate the (already read) data frames.
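To illustrate a camtraptor-specific check on already-read data frames, a `check_timestamps()` helper might look like the sketch below. The column names (`start`, `end`) and the location of the data frames in the package object are assumptions, not settled API.

```r
# Sketch only: hypothetical check_timestamps() on a read deployments
# data frame. Column names and package structure are assumptions.
check_timestamps <- function(package, table = "deployments") {
  df <- package$data[[table]]
  issues <- data.frame(table = character(), message = character())
  # A POSIXct vector without a tzone attribute has no explicit timezone
  tzone <- attr(df$start, "tzone")
  if (is.null(tzone) || tzone == "") {
    issues <- rbind(issues, data.frame(
      table = table, message = "start has no timezone"
    ))
  }
  # start must not be later than end
  bad_rows <- which(df$start > df$end)
  if (length(bad_rows) > 0) {
    issues <- rbind(issues, data.frame(
      table = table,
      message = sprintf("start > end in row(s): %s",
                        paste(bad_rows, collapse = ", "))
    ))
  }
  issues
}
```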
Suggested in camtraptor July 2023 coding sprint