traitecoevo / traits.build

Source for the traits.build R package, used to build AusTraits
https://traitecoevo.github.io/traits.build/

[traits.build adding studies functions] Better options for checking data quality #137

Open ehwenk opened 11 months ago

ehwenk commented 11 months ago

After updating adding_data_long in the traits.build-book repo, I realised that there are a number of issues (e.g. issue 71) and standing requests from AusTraits users for a category of function I'll call dataset_check functions.

These are functions that look for suspicious patterns in data (outliers, duplicates within or across datasets) or work out the substitutions/taxonomic updates required. But they can't be part of dataset_test, because there is no assumption that the issues they flag can or will be "solved".

The code for some of the "tricks" is still in adding_data_long (generating a list of needed categorical trait value substitutions or taxonomic alignments). But I wonder if this should be a standalone "chapter" in traits.build-book that goes through each of these:

- dataset_check_categorical_substitutions: outputs the substitutions required (based on excluded data)
- dataset_check_numeric_values: outputs out-of-range numeric values (i.e. those in excluded data)
- dataset_check_taxonomic_updates: outputs a list of original names not in taxon_list
- dataset_check_not_pivoting: identifies the measurements that are preventing a dataset from pivoting
- dataset_check_outlier_by_genus: for numeric values, looks for values that are x% higher/lower than other values within the genus; only runs if n for the trait and genus is > 10 (or some other number)
- dataset_check_outlier_by_species: for numeric values, looks for values that are x% higher/lower than other values within the species; only runs if n for the trait and species is > 10 (or some other number)
- dataset_check_duplicates_across_datasets: searches for duplicates across datasets; it is tricky to decide how many significant figures to compare, and for some traits duplication is likely anyway (e.g. %N, where there is a narrow range)
- dataset_check_duplicates_within_dataset: raises a flag if all values for a given taxon*trait are identical (e.g. growth form)
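To make the outlier idea concrete, here is a rough base R sketch of what a genus-level check might look like. It is not the existing code mentioned in this issue. The function name, the column names (trait_name, genus, value), the choice to compare each value against the group median, and the defaults for pct and min_n are all assumptions made for illustration.

```r
# Hypothetical sketch: flag numeric values that sit far from the median of
# their trait x genus group. Column names and defaults are assumptions.
check_outlier_by_genus <- function(data, pct = 200, min_n = 10) {
  # One group per trait x genus combination
  group <- interaction(data$trait_name, data$genus, drop = TRUE)
  # Group size and group median, repeated for every row of the group
  n <- ave(data$value, group, FUN = length)
  med <- ave(data$value, group, FUN = median)
  # Percentage deviation of each value from its group median
  deviation <- abs(data$value - med) / abs(med) * 100
  # Only consider groups with more than min_n values; which() drops NA results
  flagged <- which(n > min_n & deviation > pct)
  out <- data[flagged, , drop = FALSE]
  out$group_median <- med[flagged]
  out$pct_deviation <- deviation[flagged]
  out
}

# Made-up example data: ten identical values and one much larger value
example <- data.frame(
  trait_name = "leaf_N_per_dry_mass",
  genus = "Acacia",
  value = c(rep(20, 10), 100)
)
flagged <- check_outlier_by_genus(example)
# The same check on a group that is too small to be tested
too_small <- check_outlier_by_genus(example[c(1:5, 11), ])
```

Returning the flagged rows rather than raising an error fits the point above that these checks only surface suspicious values for a person to review. A species-level version could reuse the same logic with a different grouping column.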

@yangsophieee or I have code for all of them except dataset_check_duplicates_across_datasets (old, in-need-of-rewrite code from a few years ago).

What is the best way to share these functions/code? Just in the book chapter/vignette? Or also as an additional R file somewhere?

ehwenk commented 11 months ago

See https://github.com/traitecoevo/traits.build-book/commit/cbe6cebc8026074cf524b28197e3f19d2eee38eb