An important step before analysing or publishing data is to check that the dataset does not contain major integrity errors, such as missing dates or coordinates, values that do not meet controlled vocabularies, or invalid relationships between tables. Although validation is possible with the Python Frictionless Framework, the returned error messages are hard to parse for most users.
camtraptor (or the frictionless R package) could offer some basic data validation (easier to implement than the full metadata and data validation the Frictionless Framework offers).
Users can then fix issues by retrieving the data (#232), correcting the errors and updating the package (#248).
A user-facing `validate()` function could make use of a number of `check_` helper functions. Those helpers could also be run by other functions, e.g. when updating data (#248).
Suggestions for functions:
- [ ] `validate(package)`
- [ ] `check_relations(package)`: relationships are valid
- [ ] `check_identifiers(package, "table name")`: IDs are unique
- [ ] `check_required(package, "table name")`: required fields are populated
- [ ] `check_vocabularies(package, "table name")`: values meet factor levels. Note that `read_resource()`/readr converts these to factors and might throw `problems()`
- [ ] `check_data_types(package, "table name")`: note that `read_resource()`/readr will throw `problems()` but otherwise does a best attempt at converting
- [ ] `check_timestamps(package, "table name")`: has timezone, start <= end (specific to camtraptor, not a frictionless thing)
- [ ] `check_durations(package)`: observation & media timestamps fall within the deployment (specific to camtraptor, not a frictionless thing)
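As a sketch of how these could fit together: `validate()` could be a thin wrapper that runs all `check_` helpers and collects their messages. None of these functions exist yet; the assumption here is that each helper returns a data frame of issues (possibly zero rows), so results can simply be combined.

```r
# Sketch only: proposed API, not existing camtraptor code.
# Assumes each check_*() helper returns a data frame of issues.
validate <- function(package) {
  issues <- dplyr::bind_rows(
    check_relations(package),
    check_identifiers(package, "deployments"),
    check_required(package, "deployments"),
    check_vocabularies(package, "observations"),
    check_data_types(package, "observations"),
    check_timestamps(package, "deployments"),
    check_durations(package)
  )
  if (nrow(issues) == 0) {
    message("No integrity errors found.")
  } else {
    warning(sprintf("Found %d integrity error(s).", nrow(issues)), call. = FALSE)
  }
  # Return issues invisibly so they can be inspected programmatically
  invisible(issues)
}
```

Returning the issues invisibly would let other functions (e.g. the update functions in #248) reuse the same helpers without printing.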
While it would be useful if these were functions of the frictionless R package, that might not be what we need for camtraptor. Frictionless would run its validation on resources (i.e. CSV files + schemas). Returned data frames lose the connection with their schema, so relationships or uniqueness cannot be validated there, as that information is lost. Camtraptor, on the other hand, wants to validate the (already read) data frames.
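To illustrate a camtraptor-specific check on already-read data frames, a `check_timestamps()` helper might look like the sketch below. The column names (`start`, `end`) and the location of the data frames in the package object are assumptions, not settled API.

```r
# Sketch only: hypothetical check_timestamps() on a read deployments
# data frame. Column names and package structure are assumptions.
check_timestamps <- function(package, table = "deployments") {
  df <- package$data[[table]]
  issues <- data.frame(table = character(), message = character())
  # A POSIXct vector without a tzone attribute has no explicit timezone
  tzone <- attr(df$start, "tzone")
  if (is.null(tzone) || tzone == "") {
    issues <- rbind(issues, data.frame(
      table = table, message = "start has no timezone"
    ))
  }
  # start must not be later than end
  bad_rows <- which(df$start > df$end)
  if (length(bad_rows) > 0) {
    issues <- rbind(issues, data.frame(
      table = table,
      message = sprintf("start > end in row(s): %s",
                        paste(bad_rows, collapse = ", "))
    ))
  }
  issues
}
```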
Suggested in camtraptor July 2023 coding sprint