After updating adding_data_long in the traits.build-book repo I realised that there are a number of issues (e.g. issue 71) and standing requests from AusTraits users for a category of function I'll call dataset_check functions.
These are functions to look for suspicious patterns in data (outliers, duplicates within or across datasets) or to work out substitutions/taxonomic updates required. But they can't be part of dataset_test because there isn't the assumption that they can/will be "solved".
The code for some of the "tricks" is still in adding_data_long (generating a list of needed categorical trait value substitutions or taxonomic alignments). But I wonder if this should be a standalone "chapter" in traits.build-book that goes through each of these:
dataset_check_categorical_substitutions - outputs substitutions required (based on excluded data)
dataset_check_numeric_values - outputs out-of-range numeric values (i.e. those in excluded data)
dataset_check_taxonomic_updates - outputs list of original names not in taxon_list
dataset_check_not_pivoting - identifies the measurements that are preventing a dataset from pivoting
dataset_check_outlier_by_genus - for numeric values, looks for values that are x% higher/lower than other values within the genus; only runs if n for trait*genus > 10 (or some other threshold)
dataset_check_outlier_by_species - for numeric values, looks for values that are x% higher/lower than other values within the species; only runs if n for trait*species > 10 (or some other threshold)
dataset_check_duplicates_across_datasets - searches for duplicates across datasets; it is tricky to decide how many significant figures to compare, and for some traits duplication is likely anyway (e.g. %N, where there is a narrow range)
dataset_check_duplicates_within_dataset - raises a flag if all values for a given taxon*trait are identical (e.g. growth form)
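To make the outlier checks concrete, here is a rough base-R sketch of the shape they could take. This is not the existing code that @yangsophieee or I have. The function name check_outlier_by_group, the column names trait_name and value, and the choice to compare each value against the group median are all assumptions made for illustration, and value is assumed to already be numeric.

```r
# Hypothetical sketch, not the existing AusTraits/traits.build code.
# Flags numeric values that deviate from their trait*group median by more than
# `threshold` (a proportion, e.g. 0.5 = 50%), but only for trait*group
# combinations with more than `min_n` values.
# Assumed columns: `trait_name`, `value` (numeric), plus the grouping column
# named in `group_col` (e.g. "genus" or "taxon_name").
check_outlier_by_group <- function(data, group_col, threshold = 0.5, min_n = 10) {
  # One key per trait*group combination
  key <- interaction(data$trait_name, data[[group_col]], drop = TRUE)

  # Per-row group size and group median
  group_n <- ave(data$value, key, FUN = length)
  group_median <- ave(data$value, key, FUN = median)

  # Relative deviation from the group median (assumption: the median is the
  # reference point; the check could instead use the mean or a quantile)
  rel_dev <- abs(data$value - group_median) / abs(group_median)

  flagged <- group_n > min_n & rel_dev > threshold

  # which() drops NA comparisons (e.g. a zero median or a missing value)
  data[which(flagged), , drop = FALSE]
}

# Made-up example data: one extreme value in a well-sampled genus and one in a
# poorly sampled genus
example <- data.frame(
  trait_name = "leaf_length",
  genus = c(rep("Acacia", 12), rep("Eucalyptus", 3)),
  value = c(rep(10, 11), 100, 10, 10, 100)
)

flagged <- check_outlier_by_group(example, "genus", threshold = 0.5, min_n = 10)
```

With this made-up data only the Acacia value of 100 is returned; the Eucalyptus value of 100 is skipped because that genus has only 3 values. Passing a species-level column as group_col would give the by-species version.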
@yangsophieee or I have code for all of them except dataset_check_duplicates_across_datasets, for which there is only old, in-need-of-rewrite code from a few years ago.
What is the best way to share these functions/code? Just in the book chapter/vignette? Or also as an additional R file somewhere?